Re: [rfc] small bioq patch

From: Maksim Yevmenkin <maksim.yevmenkin_at_gmail.com>
Date: Tue, 15 Oct 2013 11:15:24 -0700
On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney <jmg_at_funkthat.com> wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
>> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <jmg_at_funkthat.com> wrote:
>> >
>> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> >> i would like to submit the attached bioq patch for review and
>> >> comments. this is proof of concept. it helps with smoothing disk read
>> >> service times and appears to eliminate outliers. please see attached
>> >> pictures (about a week's worth of data)
>> >>
>> >> - c034 "control" unmodified system
>> >> - c044 patched system
>> >
>> > Can you describe how you got this data?  Were you using the gstat
>> > code or some other code?
>>
>> Yes, it's basically gstat data.
>
> The reason I ask this is that I don't think the data you are getting
> from gstat is what you think it is...  It accumulates time for a set
> of operations and then divides by the count...  So I'm not sure if the
> stat improvements you are seeing are as meaningful as you might think
> they are...

yes, i'm aware of it. however, i'm not aware of "better" tools. we
also use dtrace and PCM/PMC. ktrace is not particularly useable for us
because it does not really work well when we push system above 5 Gbps.
in order to actually see any "issues" we need to push system to 10
Gbps range at least.
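
The averaging concern quoted above can be illustrated with a small sketch. It assumes only what the thread states (time is accumulated over a set of operations and divided by the count); the sample values are hypothetical and this is not gstat's actual code.

```python
def interval_average_ms(service_times_ms):
    """Average the way an accumulate-then-divide counter would."""
    return sum(service_times_ms) / len(service_times_ms)


# Hypothetical interval: 99 reads at 10 ms plus one stalled read at 500 ms.
samples = [10.0] * 99 + [500.0]

avg = interval_average_ms(samples)
worst = max(samples)

# The average barely moves even though one read was 50x slower than the rest.
print(f"interval average: {avg:.1f} ms, worst single read: {worst:.1f} ms")
```

A per-request histogram or percentile (for example from a dtrace aggregation) would expose the stalled read that the interval average smooths away.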

>> >> graphs show max/avg disk read service times for both systems across 36
>> >> spinning drives. both systems are relatively busy serving production
>> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> >> represent time when systems are refreshing their content, i.e. disks
>> >> are both reading and writing at the same time.
>> >
>> > Can you describe why you think this change makes an improvement?  Unless
>> > you're running 10k or 15k RPM drives, 128 seems like a large number.. as
>> > that's about half the number of IOPs that a normal HD handles in a second..
>>
>> Our (Netflix) load is basically random disk io. We have tweaked the
>> system to ensure that our io path is "wide" enough, i.e. we read 1 MB
>> per disk io for the majority of the requests. However, the offsets we
>> read from are all over the place. It appears that we are getting into
>> a situation where larger offsets are getting delayed because smaller
>> offsets are "jumping" ahead of them. Forcing a bioq insert tail
>> operation, and effectively moving the insertion point, seems to help
>> avoid getting into this situation. And, no, we don't use 10k or 15k
>> drives. Just regular enterprise 7200 RPM SATA drives.
>
> I assume that the 1mb reads are then further broken up into 8 128kb
> reads? so it's more like every 16 reads in your work load that you
> insert the "ordered" io...

i'm not sure where 128kb comes from. are you referring to
MAXPHYS/DFLTPHYS? if so, then, no, we have increased *PHYS to 1MB.

> I want to make sure that we choose the right value for this number..
> What number of IOPs are you seeing?

generally we see < 100 IOPs per disk on a system pushing 10+ Gbps.
i've experimented with different numbers on our system and i did not
see much of a difference on our workload. i'm up to a value of 1024 now.
higher numbers seem to produce slightly bigger difference between
average and max time, but i do not think it's statistically meaningful.
general shape of the curve remains smooth for all tried values so far.
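
The effect described in this thread (small offsets jumping ahead of large ones, and a periodic forced tail insert that moves the insertion point) can be sketched with a toy model. Everything here is an assumption made for illustration: the class name `ToyBioq`, the plain ascending sort with no elevator sweep, the backlog of 64, and the batch limit of 16. It is not the attached patch and not the kernel's bioq code.

```python
import bisect
import random


class ToyBioq:
    """Offset-sorted queue with an optional periodic forced tail insert."""

    def __init__(self, batch_limit=None):
        self.queue = []            # (offset, arrival_tick) tuples
        self.barrier = 0           # entries before this index cannot be overtaken
        self.batch_limit = batch_limit
        self.sorted_inserts = 0

    def insert(self, offset, tick):
        item = (offset, tick)
        if self.batch_limit is not None and self.sorted_inserts >= self.batch_limit:
            # Forced tail insert: everything queued so far is now protected.
            self.queue.append(item)
            self.barrier = len(self.queue)
            self.sorted_inserts = 0
            return
        # Sorted insert, but only among entries after the barrier.
        pos = bisect.bisect_right(self.queue, item, self.barrier)
        self.queue.insert(pos, item)
        self.sorted_inserts += 1

    def dispatch(self):
        if self.barrier > 0:
            self.barrier -= 1
        return self.queue.pop(0)


def max_wait(batch_limit, ticks=2000, backlog=64, seed=1):
    """Worst queueing delay (in ticks) with one arrival and one dispatch per tick."""
    rng = random.Random(seed)
    q = ToyBioq(batch_limit)
    for _ in range(backlog):
        q.insert(rng.randrange(1_000_000), 0)
    worst = 0
    tick = 0
    for tick in range(1, ticks + 1):
        q.insert(rng.randrange(1_000_000), tick)
        _, arrived = q.dispatch()
        worst = max(worst, tick - arrived)
    while q.queue:                 # drain whatever is left
        tick += 1
        _, arrived = q.dispatch()
        worst = max(worst, tick - arrived)
    return worst


plain_max = max_wait(batch_limit=None)
batched_max = max_wait(batch_limit=16)
print(f"max wait without forced tail inserts: {plain_max} ticks")
print(f"max wait with a forced tail insert every 16 sorted inserts: {batched_max} ticks")
```

In this model the worst-case wait with a batch limit is bounded by roughly the batch limit plus the queue depth, while without it a large-offset request can sit in the queue for most of the run.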

[...]

>> > Also, do you see a similar throughput of the system?
>>
>> Yes. We do see almost identical throughput from both systems.  I have
>> not pushed the system to its limit yet, but having much smoother disk
>> read service times is important for us because we use it as one of the
>> components of our system health metrics. We also need to ensure that a
>> disk io request is actually dispatched to the disk in a timely manner.
>
> Per above, have you measured at the application layer that you are
> getting better latency times on your reads?  Maybe by doing a ktrace
> of the io, and calculating times between read and return or something
> like that...

ktrace is not particularly useful. i can see if i can come up with
dtrace probe or something. our application (or rather clients) are
_very_ sensitive to latency. having read service times outliers is not
very good for us.
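
One way to approach the application-layer measurement suggested above is to time each read from user space and look at the tail, not only the average. The following is a minimal sketch under stated assumptions: the helper names `measure_read_latency_ms` and `summarize` are hypothetical, the 1 MB read size mirrors the figure mentioned earlier in the thread, and the demo reads a small temporary file that will be served from the page cache, so its numbers say nothing about a real disk.

```python
import os
import random
import tempfile
import time


def measure_read_latency_ms(path, read_size=1 << 20, count=100, seed=0):
    """Time `count` reads of `read_size` bytes at random offsets in `path`."""
    rng = random.Random(seed)
    span = max(1, os.path.getsize(path) - read_size + 1)
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            offset = rng.randrange(span)
            start = time.perf_counter()
            os.pread(fd, read_size, offset)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Return average, 99th percentile (nearest rank) and maximum in ms."""
    ordered = sorted(latencies)
    p99 = ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))]
    return {"avg": sum(ordered) / len(ordered), "p99": p99, "max": ordered[-1]}


# Demo on a 4 MB temporary file (hypothetical data, cached reads).
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(os.urandom(4 << 20))
    tmp.flush()
    latencies = measure_read_latency_ms(tmp.name, count=16)

stats = summarize(latencies)
print(f"avg {stats['avg']:.3f} ms, p99 {stats['p99']:.3f} ms, max {stats['max']:.3f} ms")
```

Pointing the same helper at a file on the disks under test, with the cache bypassed or cold, would give per-read times that can be compared between the control and patched systems.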

> Have you looked at the geom disk schedulers work that Luigi did a few
> years back?  There have been known issues w/ our io scheduler for a
> long time...  If you search the mailing lists, you'll see lots of
> reports from some processes starving out others, probably due to a
> similar issue...  I've seen similar unfair behavior between processes,
> but haven't spent time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

> It does look like a good improvement though...
>
> Thanks for the work!

ok :) i'm interested to hear from people who have different workload
profile. for example lots of iops, i.e. very small file reads or
something like that.

thanks,
max
Received on Tue Oct 15 2013 - 16:15:26 UTC
