On Sun, 6 Jan 2008, Ivan Voras wrote:

> Robert Watson wrote:
>
>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>> the general kernel map.  Pipe buffer memory and a few other things are
>> still allocated from separate maps, however.  In fact, this was one of
>> the known issues with the introduction of large cluster sizes without
>> resource limits: address space and memory use were potentially
>> unbounded, so Randall recently properly implemented the resource
>> limits on mbuf clusters of large sizes.
>
> Is this related to reported panics with ZFS and a heavy network load
> (NFS mostly)?

Handling resource exhaustion is a tricky issue, because sometimes it
takes resources to make resources available.  In the presence of a
really greedy (that is to say, effectively leaking) subsystem, there
isn't really any way to recover.  There are really only two
alternatives: deadlock (no resources are available, so no progress can
be made) or panic (no resources are available, so do the only thing we
can).  Subsystems are relied upon to impose their own limits, or at
least provide those limits to UMA so that UMA can impose them, as
"appropriate" limits are entirely dependent on context.

It's indeed the case that the more load the system is under, the more
resources are in use, and therefore the lower the threshold for any
particular subsystem to contribute to a potential exhaustion of
resources.  If the network is at a very high watermark, then indeed ZFS
has to use less to exhaust it.

Normally, subsystems like the network stack and file systems rely on
"back pressure" to cause them to release memory -- the network stack
largely allocates using UMA, so the VM low memory event frees up its
caches, and it also implements its own per-protocol low memory
handlers, doing things like discarding TCP reassembly buffers, etc.  VM
also knows to discard un-dirtied pages.  Pawel has a patch to make ZFS
call the low memory event handlers more aggressively when it gets a bit
too greedy, which I saw in the re@ MFC queue yesterday, so you might
find this improves behavior a bit more.

However, things do get quite tricky when you're low on resources:
waiting indefinitely for resources rather than panicking may actually
be worse, because the system may never recover.  That's why
constraining initial resource use and responding to back pressure early
is critical, in order to avoid getting into situations where the only
possible response is to hang or panic.

There's an interesting paper by Gibson et al. from CMU on economic
models for "investing" memory pages in different sorts of cache --
prefetch, read-ahead, buffer cache, etc.; it is a good read for getting
a grasp of just how tricky the balance is to find.

Robert N M Watson
Computer Laboratory
University of Cambridge
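For concreteness, a minimal sketch of the arrangement described above: a
kernel subsystem that caps its own UMA zone and registers a vm_lowmem
event handler so its cache shrinks in response to back pressure.  The
uma_zcreate(), uma_zone_set_max() and EVENTHANDLER_REGISTER(vm_lowmem,
...) interfaces are the stock FreeBSD kernel ones; the subsystem "foo",
its foo_zone, struct foo_item and foo_trim_cache() are hypothetical
names invented only for the example.

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/kernel.h>
        #include <sys/eventhandler.h>
        #include <vm/uma.h>

        /* Hypothetical item type cached by the "foo" subsystem. */
        struct foo_item {
                uint64_t        fi_payload;
        };

        static uma_zone_t foo_zone;

        /*
         * Hypothetical cache-trimming routine: walk the subsystem's
         * cache and uma_zfree() anything that can be rebuilt later.
         */
        static void
        foo_trim_cache(void)
        {
                /* ... release expendable cached items ... */
        }

        /*
         * vm_lowmem handler: called when the VM system signals memory
         * pressure, giving the subsystem a chance to give memory back
         * before the situation degenerates into a hang or panic.
         */
        static void
        foo_lowmem(void *arg __unused, int flags __unused)
        {
                foo_trim_cache();
        }

        /*
         * Would be called from the subsystem's normal initialization
         * path (a SYSINIT or module event handler, say).
         */
        static void
        foo_zone_init(void)
        {
                foo_zone = uma_zcreate("foo_item", sizeof(struct foo_item),
                    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);

                /*
                 * Impose the subsystem's own limit up front so use
                 * cannot grow unbounded; 1024 is an arbitrary figure
                 * chosen for the example.
                 */
                uma_zone_set_max(foo_zone, 1024);

                /* Ask to be notified when the system is short of memory. */
                EVENTHANDLER_REGISTER(vm_lowmem, foo_lowmem, NULL,
                    EVENTHANDLER_PRI_ANY);
        }

The point is simply that the limit is imposed when the zone is created
and the cache can be released when the VM system applies back pressure,
rather than the subsystem competing with everyone else for the last
available page.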