Yesterday I sat down to run some benchmarks on phk's changes to the process time measurement system for scheduling, and discovered SMP boxes were wedging in [zonelimit] when running netperf tests. I quickly tracked this down to an mbuf cluster leak:

    769/641/1410 mbufs in use (current/cache/total)
    768/204/972/25600 mbuf clusters in use (current/cache/total/max)

    769/4991/5760 mbufs in use (current/cache/total)
    4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)

    769/8456/9225 mbufs in use (current/cache/total)
    7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)

    769/11786/12555 mbufs in use (current/cache/total)
    11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)

    769/15236/16005 mbufs in use (current/cache/total)
    14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)

    769/18566/19335 mbufs in use (current/cache/total)
    17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)

Result of:

    /zoo/rwatson/netperf/bin/netserver
    while (1)
        echo ""
        netstat -m | grep mbuf
        /zoo/rwatson/netperf/bin/netperf -l 30 >& /dev/null
    end

The reason for the wedge is that NFS based systems don't like running out of mbuf clusters. It turns out that the reason I likely didn't notice this previously was that I was running the test boxes in question without ACPI, and for whatever reason, the race becomes many times more serious with ACPI turned on. It was leaking without ACPI, but since it was slower, I wasn't noticing since I had the machines up for much shorter tests.

Here's a sampling of kernel dates and whether or not the leak was present in a kernel from the date, as well as the dates of a few changes I was worried were likely causes:

    CVS Date                 Description                     Leak?
    2005/12/3                sample                          yes
    2005/11/28-2005/11/29    rwatson sosend changes          -
    2005/11/25               sample                          yes
    2005/11/15               sample                          yes
    2005/11/02-2005/11/05    andre cluster changes           -
    2005/10/25               sample                          no
    2005/10/15               sample                          no
    2005/10/1                sample                          no
    2005/09/27               rwatson removes mbuf counters   -
    2005/09/16               sample                          no

I've not really had a chance to investigate the details of the leak -- the number of used (allocated) mbufs remains low, but the cache number grows steadily. However, the dates suggest that it was the mbuf cluster cleanup work you did that introduced the problem (although they don't guarantee it).

Thanks,

Robert N M Watson

Received on Sat Dec 03 2005 - 21:25:06 UTC
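For anyone wanting to quantify the growth in the netstat -m samples quoted in the message, here is a minimal sketch in Python. It is not part of the original post. The helper names (parse, deltas) are hypothetical, and the "runs left" figure assumes the leak stays roughly linear per 30-second netperf run, which the message does not claim.

```python
import re

# The six "netstat -m | grep mbuf" samples quoted in the message,
# one mbufs/clusters pair per pass of the test loop.
SAMPLES = """\
769/641/1410 mbufs in use (current/cache/total)
768/204/972/25600 mbuf clusters in use (current/cache/total/max)
769/4991/5760 mbufs in use (current/cache/total)
4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)
769/8456/9225 mbufs in use (current/cache/total)
7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)
769/11786/12555 mbufs in use (current/cache/total)
11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)
769/15236/16005 mbufs in use (current/cache/total)
14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)
769/18566/19335 mbufs in use (current/cache/total)
17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)
"""

# Matches the slash-separated counters and the kind of object they describe.
LINE = re.compile(r"^([\d/]+) (mbufs|mbuf clusters) in use")


def parse(text):
    """Group the counters by kind, one tuple of ints per sample line."""
    out = {"mbufs": [], "mbuf clusters": []}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            out[m.group(2)].append(tuple(int(n) for n in m.group(1).split("/")))
    return out


def deltas(values):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(values, values[1:])]


stats = parse(SAMPLES)
mbuf_current = [s[0] for s in stats["mbufs"]]
mbuf_cache = [s[1] for s in stats["mbufs"]]
cluster_current = [s[0] for s in stats["mbuf clusters"]]
cluster_max = stats["mbuf clusters"][-1][3]

cluster_growth = deltas(cluster_current)
avg_growth = sum(cluster_growth) / len(cluster_growth)
# Assumption: linear growth continues until the configured cluster maximum.
runs_left = (cluster_max - cluster_current[-1]) / avg_growth

print("mbufs current per sample:   ", mbuf_current)
print("mbuf cache growth per run:  ", deltas(mbuf_cache))
print("cluster growth per run:     ", cluster_growth)
print("average cluster growth/run: ", avg_growth)
print("runs until cluster max:      %.1f" % runs_left)
```

On these samples the mbufs "current" figure stays at 769 while the mbuf cache and the cluster "current" figures each climb by a few thousand per run, which fits the wedge appearing only on longer-running tests.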