Re: [rfc] small bioq patch

From: Maksim Yevmenkin <maksim.yevmenkin_at_gmail.com>
Date: Fri, 11 Oct 2013 15:39:53 -0700
> On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <jmg_at_funkthat.com> wrote:
> 
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> i would like to submit the attached bioq patch for review and
>> comments. this is a proof of concept. it helps smooth disk read
>> service times and appears to eliminate outliers. please see the
>> attached pictures (about a week's worth of data)
>> 
>> - c034 "control" unmodified system
>> - c044 patched system
> 
> Can you describe how you got this data?  Were you using the gstat
> code or some other code?

Yes, it's basically gstat data. 

> Also, was your control system w/ the patch, but w/ the sysctl set to
> zero to possibly eliminate any code alignment issues?

Both systems use the same code base and build. The patched system includes the patch; the "control" system does not. I can rerun my tests with the sysctl set to zero and use that as the "control". So, the answer to your question is "no". 

>> graphs show max/avg disk read service times for both systems across 36
>> spinning drives. both systems are relatively busy serving production
>> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> represent time when systems are refreshing their content, i.e. disks
>> are both reading and writing at the same time.
> 
> Can you describe why you think this change makes an improvement?  Unless
> you're running 10k or 15k RPM drives, 128 seems like a large number.. as
> that's about half the number of IOPs that a normal HD handles in a second..

Our (Netflix) load is basically random disk io. We have tweaked the system to ensure that our io path is "wide" enough, i.e., we read 1 MB per disk io for the majority of requests. However, the offsets we read from are all over the place. It appears that we are getting into a situation where larger offsets are delayed because smaller offsets keep "jumping" ahead of them. Forcing a bioq tail insert operation, effectively moving the insertion point, seems to help avoid this situation. And, no, we don't use 10k or 15k drives. Just regular enterprise 7200 RPM SATA drives. 

> I assume you must be regularly seeing queue depths of 128+ for this
> code to make a difference, do you see that w/ gstat?

No, we don't see large (128+) queue sizes in gstat data. The way I see it, we don't need a deep queue here. We could just have a steady stream of io requests where new, smaller offsets consistently "jump" ahead of older, larger offsets. In fact, gstat data show a shallow queue of 5 or fewer items.

> Also, do you see a similar throughput of the system?

Yes. We see almost identical throughput from both systems. I have not pushed the system to its limit yet, but having much smoother disk read service times is important for us, because we use them as one of the components of our system health metrics. We also need to ensure that a disk io request is actually dispatched to the disk in a timely manner. 

Thanks
Max
Received on Fri Oct 11 2013 - 20:39:56 UTC