mbuf cluster leaks in -CURRENT

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Sat, 3 Dec 2005 22:25:03 +0000 (GMT)

Yesterday I sat down to run some benchmarks on phk's changes to the process 
time measurement system for scheduling, and discovered SMP boxes were wedging 
in [zonelimit] when running netperf tests.  I quickly tracked this down to an 
mbuf cluster leak:

769/641/1410 mbufs in use (current/cache/total)
768/204/972/25600 mbuf clusters in use (current/cache/total/max)

769/4991/5760 mbufs in use (current/cache/total)
4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)

769/8456/9225 mbufs in use (current/cache/total)
7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)

769/11786/12555 mbufs in use (current/cache/total)
11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)

769/15236/16005 mbufs in use (current/cache/total)
14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)

769/18566/19335 mbufs in use (current/cache/total)
17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)

Result of:

   /zoo/rwatson/netperf/bin/netserver
   while (1)
           echo ""
           netstat -m | grep mbuf
           /zoo/rwatson/netperf/bin/netperf -l 30 >& /dev/null
   end

The reason for the wedge is that NFS-based systems don't like running out of 
mbuf clusters.  It turns out that the reason I likely didn't notice this 
previously was that I was running the test boxes in question without ACPI, and 
for whatever reason, the race becomes many times more serious with ACPI turned 
on.  It was leaking without ACPI, but since it was slower, I wasn't noticing 
since I had the machines up for much shorter tests.  Here's a sampling of 
kernel dates and whether or not the leak was present in a kernel from that 
date, as well as the dates of a few changes I was worried were likely causes:

CVS Date                Description                             Leak?
2005/12/3               sample                                  yes
2005/11/28-2005/11/29   rwatson sosend changes                  -
2005/11/25              sample                                  yes
2005/11/15              sample                                  yes
2005/11/02-2005/11/05   andre cluster changes                   -
2005/10/25              sample                                  no
2005/10/15              sample                                  no
2005/10/1               sample                                  no
2005/09/27              rwatson removes mbuf counters           -
2005/09/16              sample                                  no
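
For anyone who wants the trend in the netstat -m samples above as numbers, 
here is a minimal Python sketch that parses those lines and prints the 
per-run change in each counter.  The helper names (parse, deltas) are made up 
for illustration, and the "runs left" figure assumes the growth stays roughly 
linear, which is only an assumption, not something that was measured.

```python
import re

# Samples copied from the netstat -m output quoted in this message,
# one mbuf line and one cluster line per netperf run.
SAMPLES = """\
769/641/1410 mbufs in use (current/cache/total)
768/204/972/25600 mbuf clusters in use (current/cache/total/max)
769/4991/5760 mbufs in use (current/cache/total)
4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)
769/8456/9225 mbufs in use (current/cache/total)
7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)
769/11786/12555 mbufs in use (current/cache/total)
11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)
769/15236/16005 mbufs in use (current/cache/total)
14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)
769/18566/19335 mbufs in use (current/cache/total)
17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)
"""

LINE_RE = re.compile(r"^([\d/]+) (mbufs|mbuf clusters) in use")


def parse(text):
    """Return {"mbufs": [...], "mbuf clusters": [...]} as tuples of ints."""
    series = {"mbufs": [], "mbuf clusters": []}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            counts = tuple(int(n) for n in m.group(1).split("/"))
            series[m.group(2)].append(counts)
    return series


def deltas(values):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(values, values[1:])]


series = parse(SAMPLES)

mbuf_current = [s[0] for s in series["mbufs"]]
mbuf_cache = [s[1] for s in series["mbufs"]]
cluster_current = [s[0] for s in series["mbuf clusters"]]
cluster_totals = [s[2] for s in series["mbuf clusters"]]
cluster_max = series["mbuf clusters"][-1][3]

growth = deltas(cluster_totals)
avg_growth = sum(growth) / len(growth)
# Assumption: linear extrapolation of the observed growth up to the limit.
runs_left = (cluster_max - cluster_totals[-1]) / avg_growth

print("mbufs current, per-run change:   ", deltas(mbuf_current))
print("mbufs cache, per-run change:     ", deltas(mbuf_cache))
print("clusters current, per-run change:", deltas(cluster_current))
print("clusters total, per-run change:  ", growth)
print("runs until cluster limit (est.): ", round(runs_left, 1))
```

Pasting fresh output from the loop into SAMPLES is enough to rerun the 
comparison on another kernel.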

I've not really had a chance to investigate the details of the leak -- the 
number of used (allocated) mbufs remains low, but the cache number grows 
steadily.  However, the dates suggest that it was the mbuf cluster cleanup 
work you did that introduced the problem (although they don't guarantee it).
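
The date reasoning can be written down as a small Python sketch: find the 
last sampled kernel without the leak and the first one with it, then list 
the candidate changes that fall inside that window.  The function names are 
hypothetical and the data is just the kernel-date table from this message; 
it narrows the window, it does not prove which commit is responsible.

```python
from datetime import date

# (CVS date of sampled kernel, leak observed?) from the kernel-date table.
SAMPLES = [
    (date(2005, 9, 16), False),
    (date(2005, 10, 1), False),
    (date(2005, 10, 15), False),
    (date(2005, 10, 25), False),
    (date(2005, 11, 15), True),
    (date(2005, 11, 25), True),
    (date(2005, 12, 3), True),
]

# Changes suspected as possible causes, with their commit date ranges.
CANDIDATES = {
    "rwatson removes mbuf counters": (date(2005, 9, 27), date(2005, 9, 27)),
    "andre cluster changes": (date(2005, 11, 2), date(2005, 11, 5)),
    "rwatson sosend changes": (date(2005, 11, 28), date(2005, 11, 29)),
}


def regression_window(samples):
    """Return (last good date, first bad date) for a clean good-then-bad split."""
    good = [d for d, leaks in samples if not leaks]
    bad = [d for d, leaks in samples if leaks]
    if not good or not bad or max(good) >= min(bad):
        raise ValueError("samples do not split cleanly into good then bad")
    return max(good), min(bad)


def suspects(window, candidates):
    """Candidates committed after the last good and by the first bad sample."""
    last_good, first_bad = window
    return [
        name
        for name, (start, end) in candidates.items()
        if last_good < start and end <= first_bad
    ]


window = regression_window(SAMPLES)
print("leak introduced between", window[0], "and", window[1])
print("candidate changes in that window:", suspects(window, CANDIDATES))
```

Sampling a kernel from inside the window (for example, just before and just 
after the cluster changes) would tighten this further.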

Thanks,

Robert N M Watson