Re: FreeBSD 8.0 - network stack crashes?

From: Gavin Atkinson <gavin_at_FreeBSD.org>
Date: Tue, 03 Nov 2009 15:13:34 +0000
On Tue, 2009-11-03 at 08:32 -0500, Weldon S Godfrey 3 wrote:
> 
> If memory serves me right, sometime around Yesterday, Gavin Atkinson told me:
> 
> Gavin, thank you A LOT for helping us with this; I have answered as much
> as I can from the most recent crash below.  We did hit max mbufs.  It is
> at 25K clusters, which is the default.  I have upped it to 32K because a
> rather old article mentioned that as the top end, and since I need to get
> into work I am not going to push it higher over a remote console.  I have
> already set it to reboot next with 64K clusters.  I already have kmem
> maxed at 4GB, which is (or at least at one time was) the most that will
> boot in 8.0; how high can I safely go?  This is an NFS server running ZFS
> with sustained 5-minute averages of 120-200Mb/s, acting as a store for a
> mail system.
> 
> > Some things that would be useful:
> >
> > - Does "arp -da" fix things?
> 
> no, it hangs like ssh, route add, etc
> 
> > - What's the output of "netstat -m" while the networking is broken?
> Tue Nov  3 07:02:11 CST 2009
> 36971/2033/39004 mbufs in use (current/cache/total)
> 24869/731/25600/25600 mbuf clusters in use (current/cache/total/max)
> 24314/731 mbuf+clusters out of packet secondary zone in use (current/cache)
> 0/35/35/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
> 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
> 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
> 58980K/2110K/61091K bytes allocated to network (current/cache/total)
> 0/201276/90662 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0/0/0 sfbufs in use (current/peak/max)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines

OK, at least we've figured out what is going wrong: the mbuf cluster
zone is exhausted (all 25600 clusters allocated against a maximum of
25600, with over 200,000 cluster requests denied).  As a workaround to
keep the machine up longer, you should be able to set
kern.ipc.nmbclusters=256000 in /boot/loader.conf, but hopefully we can
resolve this properly soon.
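
For reference, the tunable goes in /boot/loader.conf and takes effect on
the next boot; a minimal sketch (the exact value is only a suggestion,
size it to your workload):

  # /boot/loader.conf - raise the mbuf cluster limit at boot
  kern.ipc.nmbclusters="256000"

After rebooting you can confirm the new limit with:

  sysctl kern.ipc.nmbclusters
  netstat -m | grep "mbuf clusters"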

Firstly, what kernel was the above output from?  And what network card
are you using?  In your initial post you mentioned testing both bce(4)
and em(4) cards; be aware that em(4) had a bug that would cause exactly
this problem, which was fixed with a commit on September 11th (r197093).
Make sure your kernel is from after that date if you are using em(4).  I
guess it is also possible that bce(4) has the same problem; I'm not aware
of any recent fixes to it.
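
If you're not sure whether your kernel predates that fix, something like
the following should tell you (assuming you build from a Subversion
checkout of /usr/src; svnversion is part of the subversion port):

  uname -v              # kernel version string, includes the build date
  svnversion /usr/src   # revision the source tree is at; want >= r197093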

So, from here, I think the best thing would be to just use the em(4) NIC
and an up-to-date kernel, and see if you can reproduce the issue.

How important is this machine?  If em(4) works, are you able to help
debug the issues with the bce(4) driver?

Thanks,

Gavin