On Sun, 6 Jan 2008, Ivan Voras wrote:

> Robert Watson wrote:
>
>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>> the general kernel map.  Pipe buffer memory and a few other things are
>> still allocated from separate maps, however.  In fact, this was one of
>> the known issues with the introduction of large cluster sizes without
>> resource limits: address space and memory use were potentially
>> unbounded, so Randall recently properly implemented the resource
>> limits on mbuf clusters of large sizes.
>
> Is this related to reported panics with ZFS and a heavy network load
> (NFS mostly)?

Handling resource exhaustion is a tricky issue, because sometimes it
takes resources to make resources available.  In the presence of a
really greedy (that is to say, effectively leaking) subsystem, there
isn't really any way to recover.  There are really only two
alternatives: deadlock (no resources are available, so no progress can
be made) or panic (no resources are available, so do the only thing we
can).  Subsystems are relied upon to impose their own limits, or at
least provide those limits to UMA so that UMA can impose them, as
"appropriate" limits are entirely dependent on context.

It's indeed the case that the more load the system is under, the more
resources are in use, and therefore the lower the threshold for any
particular subsystem to contribute to a potential exhaustion of
resources.  If the network is at a very high watermark, then indeed ZFS
has to use less to exhaust it.

Normally, subsystems like the network stack and file systems rely on
"back pressure" to cause them to release memory -- the network stack
largely allocates using UMA, so the VM low memory event frees up its
caches, and it also implements its own per-protocol low memory
handlers, doing things like discarding TCP reassembly buffers, etc.  VM
also knows to discard un-dirtied pages.  Pawel has a patch to make ZFS
call the low memory event handlers more aggressively when it gets a bit
too greedy, which I saw in the re@ MFC queue yesterday, so you might
find this improves behavior a bit more.

However, things do get quite tricky when you're low on resources:
waiting indefinitely for resources rather than panicking may actually
be worse, because the system may never recover.  That's why
constraining initial resource use and responding to back pressure early
is critical, in order to avoid getting into situations where the only
possible response is to hang or panic.

There's an interesting paper by Gibson et al. from CMU on economic
models for "investing" memory pages in different sorts of cache --
prefetch, read-ahead, buffer cache, etc.; it is a good read for getting
a grasp of just how tricky the balance is to find.

Robert N M Watson
Computer Laboratory
University of Cambridge
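For concreteness, a minimal sketch of the arrangement described above: a
kernel subsystem that caps its own UMA zone and registers a vm_lowmem
event handler so its cache shrinks in response to back pressure.  The
uma_zcreate(), uma_zone_set_max() and EVENTHANDLER_REGISTER(vm_lowmem,
...) interfaces are the stock FreeBSD kernel ones; the subsystem "foo",
its foo_zone, struct foo_item and foo_trim_cache() are hypothetical
names invented only for the example.

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/kernel.h>
        #include <sys/eventhandler.h>
        #include <vm/uma.h>

        /* Hypothetical item type cached by the "foo" subsystem. */
        struct foo_item {
                uint64_t        fi_payload;
        };

        static uma_zone_t foo_zone;

        /*
         * Hypothetical cache-trimming routine: walk the subsystem's
         * cache and uma_zfree() anything that can be rebuilt later.
         */
        static void
        foo_trim_cache(void)
        {
                /* ... release expendable cached items ... */
        }

        /*
         * vm_lowmem handler: called when the VM system signals memory
         * pressure, giving the subsystem a chance to give memory back
         * before the situation degenerates into a hang or panic.
         */
        static void
        foo_lowmem(void *arg __unused, int flags __unused)
        {
                foo_trim_cache();
        }

        /*
         * Would be called from the subsystem's normal initialization
         * path (a SYSINIT or module event handler, say).
         */
        static void
        foo_zone_init(void)
        {
                foo_zone = uma_zcreate("foo_item", sizeof(struct foo_item),
                    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);

                /*
                 * Impose the subsystem's own limit up front so use
                 * cannot grow unbounded; 1024 is an arbitrary figure
                 * chosen for the example.
                 */
                uma_zone_set_max(foo_zone, 1024);

                /* Ask to be notified when the system is short of memory. */
                EVENTHANDLER_REGISTER(vm_lowmem, foo_lowmem, NULL,
                    EVENTHANDLER_PRI_ANY);
        }

The point is simply that the limit is imposed when the zone is created
and the cache can be released when the VM system applies back pressure,
rather than the subsystem competing with everyone else for the last
available page.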