Re: access to hard drives is "blocked" by writes to a flash drive

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Sun, 3 Mar 2013 23:12:40 -0800 (PST)
On  4 Mar, Konstantin Belousov wrote:
> On Sun, Mar 03, 2013 at 07:01:27PM -0800, Don Lewis wrote:
>> On  3 Mar, Poul-Henning Kamp wrote:
>> 
>> > For various reasons (see: Lemming-syncer) FreeBSD will block all I/O
>> > traffic to other disks too, when these pileups get too bad.
>> 
>> The Lemming-syncer problem should have mostly been fixed by 231160 in
>> head (231952 in stable/9 and 231967 in stable/8) a little over a year
>> ago. The exceptions are atime updates, mmaped files with dirty pages,
>> and quotas. Under certain workloads I still notice periodic bursts of
>> seek noise. After thinking about it for a bit, I suspect that it could
>> be atime updates, but I haven't tried to confirm that.
> I never got a definition of what the Lemming-syncer term means. The
> (current) syncer model is to iterate over the list of the active vnodes,
> i.e. vnodes for which an open file exists, or a mapping is established,
> and initiate the necessary writes. The iteration over the active list is
> performed several times during the same sync run over the filesystem;
> this is considered acceptable.

Prior to 231160, the syncer thread would call sync_vnode() for the
syncer vnode of each mountpoint every 30 seconds, depending on where the
syncer vnode was in the worklist.  sync_vnode() would in turn call
VOP_FSYNC(-, MNT_LAZY, -) on the syncer vnode, which got mapped to
sync_fsync().  sync_fsync() would then call vfs_msync() to perform an
msync on all vnodes under the mount point, and then call VFS_SYNC(-,
MNT_LAZY), which maps to a call to ffs_sync().  ffs_sync() would then
ignore the MNT_LAZY flag and just blindly iterate over all the vnodes
owned by the mountpoint, calling ffs_syncvnode() on any of them that had
an IN_* flag set or had any dirty buffers.  The result is that once
every 30 seconds all of the dirty files under a mount point would be
flushed to disk in one big blast.
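
In outline, the old path looked roughly like this.  This is only a
pseudo-C sketch built from the routine names above; the vnode iterator
and the two has_*() helpers are invented placeholders, not the actual
kernel code:

    /* Pseudo-C sketch of the pre-231160 per-mountpoint flush. */
    static int
    sync_fsync(struct vnode *syncvp, int waitfor /* MNT_LAZY */)
    {
            struct mount *mp = syncvp->v_mount;

            /* Push dirty mmap()ed pages for every vnode on the mount. */
            vfs_msync(mp, MNT_LAZY);

            /* VFS_SYNC(mp, MNT_LAZY) resolves to ffs_sync() for FFS. */
            return (VFS_SYNC(mp, MNT_LAZY));
    }

    static int
    ffs_sync(struct mount *mp, int waitfor)
    {
            struct vnode *vp;

            /*
             * MNT_LAZY was effectively ignored: every vnode with an IN_*
             * flag set or with dirty buffers got flushed, so all dirty
             * files under the mount point hit the disk in one burst
             * every 30 seconds.
             */
            FOREACH_VNODE_ON_MOUNT(vp, mp) {    /* placeholder iterator */
                    if (has_inode_flags(vp) || has_dirty_buffers(vp))
                            (void)ffs_syncvnode(vp, waitfor);
            }
            return (0);
    }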

There have been a lot of improvements with 231160 and later changes, but
I still notice periodic increases in seek noise under some workloads;
I haven't had a chance to investigate yet.

> (Mostly) independently, the syncer thread iterates over the list of the
> dirty buffers and writes them.

It did that as well, prior to 231160.

> The "wdrain" wait is independend from the syncer model used. It is entered
> by a thread which intends to write in some future, but the wait is performed
> before the entry into VFS is performed, in particular, before any VFS
> resources are acquired. The wait sleeps when the total amount of the
> buffer space for which the writes are active (runningbufspace counter)
> exceeds the hirunningbufspace threshold. This way buffer cache tries to
> avoid creating too long queue of the write requests.
> 
> If there is some device with high write-completion latency, then it is
> easy to see that, for a load which creates an intensive queue of writes
> to that device, regardless of the amount of writes to other devices,
> runningbufspace quickly gets populated with buffers targeted at the
> slow device.  Then the "wdrain" wait mechanism kicks in, slowing all
> writers until the queue is processed.

Reserving at least some space for each device might be beneficial to
prevent one slow device from blocking writes to others, but in this case
it sounds like reads are also getting blocked.
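
Roughly what I have in mind, as a sketch only: the per-device accounting
and the per_device_reserve knob are hypothetical, and the variable and
lock names are simplified relative to the real vfs_bio.c code:

    /*
     * Sketch of the "wdrain" throttle described above, plus a
     * hypothetical per-device reservation.  Illustrative only.
     */
    static void
    wait_runningbufspace(struct bufobj *bo, long bufsize)
    {
            mtx_lock(&runningbufspace_lock);
            while (runningbufspace > hirunningspace) {
                    /*
                     * Hypothetical escape hatch: if this writer's target
                     * device still has headroom within its own reserve,
                     * let it proceed instead of penalizing it for a queue
                     * that piled up on some other (slow) device.
                     */
                    if (bo_running_space(bo) + bufsize < per_device_reserve)
                            break;
                    /* Otherwise sleep until the queue drains. */
                    msleep(&runningbufspace, &runningbufspace_lock, PVM,
                        "wdrain", 0);
            }
            mtx_unlock(&runningbufspace_lock);
    }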

> It could be argued that the current typical value of 16MB for
> hirunningspace is too low, but experiments with increasing it did not
> provide any measurable change in the throughput or latency for some
> loads.

The correct value is probably proportional to the write bandwidth
available.
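
For example (numbers purely illustrative): at roughly 150 MB/s of
sequential write bandwidth, 16 MB is only about 0.1 s of in-flight
writes, while a flash stick sustaining 5 MB/s needs more than 3 s to
drain the same 16 MB.  A bandwidth-proportional threshold could be as
simple as the following (a hypothetical helper, not an existing tunable):

    /* Size the running-write threshold to a target drain time. */
    static long
    hirunningspace_for(long write_bw_bytes_per_sec)
    {
            const long target_drain_ms = 100;   /* arbitrary target */

            return (write_bw_bytes_per_sec / 1000 * target_drain_ms);
    }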

> And, just to wrestle with the misinformation, the unmapped buffer work
> has nothing to do with either syncer or runningbufspace.
> 
>> 
>> When using TCQ or NCQ, perhaps we should limit the number of outstanding
>> writes per device to leave some slots open for reads.  We should
>> probably also prioritize reads over writes unless we are under memory
>> pressure.
> 
> Reads are allowed to start even when runningbufspace has overflowed.

I think reads are probably correctly self-limiting in low-memory
situations because they'll be blocked when trying to allocate buffer
space to read the data into.
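
That is, the blocking point is the buffer allocation itself, before any
request is queued to the device.  A rough sketch of the shape of the
read path (issue_read() is a placeholder and the getblk() details are
simplified):

    /* Sketch: a read has to obtain a buffer before it can be issued. */
    static struct buf *
    bread_sketch(struct vnode *vp, daddr_t blkno, int size)
    {
            struct buf *bp;

            /*
             * getblk() must find or allocate buffer space first; under
             * memory pressure a reader naturally blocks here, before any
             * request reaches the device queue.
             */
            bp = getblk(vp, blkno, size, 0, 0, 0);
            if ((bp->b_flags & B_CACHE) == 0)
                    issue_read(bp);     /* placeholder for the real I/O */
            return (bp);
    }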