Re: [rfc] small bioq patch

From: John-Mark Gurney <jmg_at_funkthat.com>
Date: Fri, 11 Oct 2013 17:14:10 -0700

Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <jmg_at_funkthat.com> wrote:
> > 
> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> >> i would like to submit the attached bioq patch for review and
> >> comments. this is proof of concept. it helps with smoothing disk read
> >> service times and appears to eliminate outliers. please see attached
> >> pictures (about a week's worth of data)
> >> 
> >> - c034 "control" unmodified system
> >> - c044 patched system
> > 
> > Can you describe how you got this data?  Were you using the gstat
> > code or some other code?
> 
> Yes, it's basically gstat data. 

The reason I ask is that I don't think the data you are getting from
gstat is what you think it is...  It accumulates time for a set of
operations and then divides by the count...  So I'm not sure the stat
improvements you are seeing are as meaningful as you might think they
are...
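
To illustrate the averaging concern, here is a minimal sketch. It models only the description above (accumulate time over a set of operations, then divide by the count), not gstat's actual code, and the sample values are made up for illustration:

```python
# Hypothetical per-request service times (ms) within one sampling interval.
# These values are invented for illustration; they are not measured data.
service_ms = [8.0, 9.0, 7.5, 8.5, 250.0, 9.0, 8.0, 7.0]


def interval_average(samples):
    """Accumulate total time and divide by the operation count."""
    return sum(samples) / len(samples)


avg = interval_average(service_ms)
worst = max(service_ms)

# One slow request is diluted by the fast ones in the same interval.
print(f"interval average: {avg:.1f} ms, worst single request: {worst:.1f} ms")
```

A "max" plotted from per-interval averages like this would be the largest average, which can sit well below the slowest individual request.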

> > Also, was your control system w/ the patch, but w/ the sysctl set to
> > zero to possibly eliminate any code alignment issues?
> 
> Both systems use the same code base and build. Patched system has patch included, "control" system does not have the patch. I can rerun my tests with sysctl set to zero and use it as "control". So, the answer to your question is "no". 

I don't believe code alignment would make a difference; I mostly wanted
to know what the control was...

> >> graphs show max/avg disk read service times for both systems across 36
> >> spinning drives. both systems are relatively busy serving production
> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
> >> represent time when systems are refreshing their content, i.e. disks
> >> are both reading and writing at the same time.
> > 
> > Can you describe why you think this change makes an improvement?  Unless
> > you're running 10k or 15k RPM drives, 128 seems like a large number.. as
> > that's about half the number of IOPS that a normal HD handles in a second..
> 
> Our (Netflix) load is basically random disk io. We have tweaked the system to ensure that our io path is "wide" enough, i.e. we read 1 MB per disk io for the majority of requests. However, the offsets we read from are all over the place. It appears that we are getting into a situation where larger offsets are delayed because smaller offsets are "jumping" ahead of them. Forcing a bioq insert tail operation, and effectively moving the insertion point, seems to help avoid this situation. And no, we don't use 10k or 15k drives, just regular enterprise 7200 RPM SATA drives. 

I assume that the 1 MB reads are then further broken up into 8 128 KB
reads?  So it's more like every 16 reads in your workload that you
insert the "ordered" io...
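
The arithmetic behind that estimate, as a small sketch. The 128 KB split size is the assumption stated above, not something confirmed in the thread, and the constant names are made up for illustration:

```python
APP_READ_BYTES = 1024 * 1024   # 1 MB application read, as described in the thread
CHUNK_BYTES = 128 * 1024       # assumed size each read is split into
ORDERED_EVERY = 128            # request count used by the patch

# Number of smaller requests produced by one application read.
chunks_per_read = APP_READ_BYTES // CHUNK_BYTES

# Application-level reads that pass between forced "ordered" inserts.
app_reads_per_ordered = ORDERED_EVERY // chunks_per_read

print(chunks_per_read, app_reads_per_ordered)
```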

I want to make sure that we choose the right value for this number..
What number of IOPs are you seeing?

> > I assume you must be regularly seeing queue depths of 128+ for this
> > code to make a difference, do you see that w/ gstat?
> 
> No, we don't see large (128+) queue sizes in gstat data. The way I see it, we don't need a deep queue here. We could just have a steady stream of io requests where new, smaller offsets consistently "jump" ahead of older, larger offsets. In fact, gstat data shows a shallow queue of 5 or fewer items.

Sorry, I misread the patch the first time...  After rereading it,
the short summary is that if there hasn't been an ordered bio
(bioq_insert_tail) after 128 requests, the next request will be
"ordered"...
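
A toy user-space model of that summary, as a sketch only. This is not the patch and not the kernel's bioq code: the class, its method names, and the `batch_limit` parameter are hypothetical, and the sorting is simplified to ascending offsets after a barrier:

```python
import bisect


class BioQueueModel:
    """Toy model of a sorted I/O queue; not the kernel's bioq."""

    def __init__(self, batch_limit=128):
        self.offsets = []        # queued request offsets, head at index 0
        self.barrier = 0         # sorted inserts may only land at or after this index
        self.batch_limit = batch_limit
        self.since_ordered = 0   # sorted inserts since the last tail insert

    def insert_tail(self, offset):
        """Append at the tail; later sorted inserts cannot pass this request."""
        self.offsets.append(offset)
        self.barrier = len(self.offsets)
        self.since_ordered = 0

    def insert(self, offset):
        """Sorted insert, unless batch_limit sorted inserts have gone by."""
        if self.since_ordered >= self.batch_limit:
            self.insert_tail(offset)
            return "ordered"
        pos = bisect.bisect_right(self.offsets, offset, self.barrier)
        self.offsets.insert(pos, offset)
        self.since_ordered += 1
        return "sorted"

    def take_first(self):
        """Dispatch the request at the head of the queue."""
        offset = self.offsets.pop(0)
        self.barrier = max(0, self.barrier - 1)
        return offset


# Small limit so the effect is visible; the thread discusses a limit of 128.
q = BioQueueModel(batch_limit=3)
kinds = [q.insert(o) for o in (500, 100, 300, 50, 200)]
print(kinds)
print(q.offsets)
```

In this run the fourth request is forced to the tail, so the later request at offset 200 can no longer be placed ahead of the older request at offset 500.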

> > Also, do you see a similar throughput of the system?
> 
> Yes. We do see almost identical throughput from both systems.  I have not pushed the system to its limit yet, but having much smoother disk read service times is important for us because we use them as one of the components of our system health metrics. We also need to ensure that each disk io request is actually dispatched to the disk in a timely manner. 

Per above, have you measured at the application layer that you are
getting better latency times on your reads?  Maybe by doing a ktrace
of the io, and calculating times between read and return or something
like that...
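
One way such an application-level measurement could look, as a rough sketch. It times each read directly in the process instead of using ktrace, and it reads from a temporary file only so that it is runnable. On a real system the reads would go to the data disks, and the page cache would have to be taken into account. The helper names are hypothetical:

```python
import os
import tempfile
import time


def timed_pread(fd, length, offset):
    """Return (bytes_read, elapsed_seconds) for a single pread() call."""
    start = time.perf_counter()
    data = os.pread(fd, length, offset)
    return len(data), time.perf_counter() - start


def summarize(latencies):
    """Report average, 99th percentile, and maximum of per-request latencies."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "avg": sum(ordered) / len(ordered),
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }


MB = 1024 * 1024

# Stand-in data file; a real test would open files on the disks under study.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"\0" * (4 * MB))
    tmp.flush()
    fd = os.open(tmp.name, os.O_RDONLY)
    try:
        # 1 MB reads at scattered offsets, in non-sequential order.
        results = [timed_pread(fd, MB, off) for off in (3 * MB, 0, 2 * MB, MB)]
    finally:
        os.close(fd)

latencies = [elapsed for _, elapsed in results]
stats = summarize(latencies)
print(stats)
```

Keeping per-request samples, instead of a running total, is what makes the maximum and high percentiles available for comparing the two systems.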

Have you looked at the GEOM disk scheduler work that Luigi did a few
years back?  There have been known issues w/ our io scheduler for a
long time...  If you search the mailing lists, you'll see lots of
reports of some processes starving out others, probably due to a
similar issue...  I've seen similar unfair behavior between processes,
but never spent time tracking it down...

It does look like a good improvement though...

Thanks for the work!

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."