Re: 6-CURRENT Network stack issues w/SMP? (Was: Re: TreeList failed: Network write failure: ChannelMux.ProtocolError)

From: Andre Guibert de Bruet <andy_at_siliconlandmark.com>
Date: Sun, 12 Sep 2004 12:25:49 -0400 (EDT)
On Sun, 12 Sep 2004, Robert Watson wrote:
> On Sun, 12 Sep 2004, Andre Guibert de Bruet wrote:
>> On Sun, 12 Sep 2004, Kris Kennaway wrote:
>>> On Sun, Sep 12, 2004 at 02:42:03AM -0400, Andre Guibert de Bruet wrote:
>>>
>>>>> I've also noticed data corruption in the form of failed CRCs (And hence
>>>>> dropped SSH connections) while transferring large amounts of data via SSH
>>>>> over gige to a machine on its subnet. These problems started occurring
>>>>> after the giant-less networking megacommit. Older kernels check out
>>>>> without any such issues.
>>>
>>> Does it go away if you turn off debug.mpsafenet?  If not, it's
>>> probably not related to that commit.
>>
>> Setting debug.mpsafenet to 0 allows the SSH transfers to complete. The
>> MD5 checksums and sizes match. Where do we go from here?
>
> I think I'd look at the following next:
>
> - Does your network interface driver support checksum offload?  If so,
>  what happens if you disable that?

It appears that it does, based on the options field reported by ifconfig:
nge0: flags=108843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
         options=13<RXCSUM,TXCSUM,VLAN_HWTAGGING>

I can still reproduce the problem after bringing the interface up with
-rxcsum and -txcsum passed to ifconfig.
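
To confirm that the offload flags really were cleared, the options value
printed by ifconfig can be decoded by hand. The sketch below treats it as a
hexadecimal bitmask. The bit assignments are an assumption based on the
IFCAP_* capability constants as I recall them from net/if.h, restricted to
the three names shown in the output above; decode_options is a hypothetical
helper, not part of any tool.

```python
# Assumed capability bits (recalled from the IFCAP_* constants in net/if.h);
# only the three names that appear in the ifconfig output are listed.
IFCAP_BITS = {
    0x01: "RXCSUM",
    0x02: "TXCSUM",
    0x10: "VLAN_HWTAGGING",
}


def decode_options(value):
    """Return the names of the known capability bits set in value."""
    return [name for bit, name in sorted(IFCAP_BITS.items()) if value & bit]


if __name__ == "__main__":
    # ifconfig prints the options field in hexadecimal, so "13" is 0x13.
    print(decode_options(0x13))
```

If the mapping holds, an interface with both checksum offloads disabled
should show only the VLAN_HWTAGGING bit in its options field.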

> - Is the network interface driver marked as INTR_MPSAFE and/or not
>  IFF_NEEDSGIANT.  If either, try setting the driver to run with Giant by
>  removing INTR_MPSAFE and adding IFF_NEEDSGIANT.

dev/nge/if_nge.c has the interface marked as IFF_NEEDSGIANT, with no
trace of INTR_MPSAFE. My dmesg confirms this: "nge0: [GIANT-LOCKED]".

> After that I think we want to try and produce a non-SSH reproduction
> scenario using a very simple test program...

Bringing a local FreeBSD repo up to date triggers the issue. So does a
portupgrade run that execs fetch for a large tarball from a fast mirror
(100KB/s+).

I cannot draw firm conclusions yet, but so far large bursts of network
traffic on this gige interface seem to make the problem easier to
reproduce. The machine in question acts as a DNS resolver for my small
home network and appears to handle light traffic without any issues.
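
A non-SSH reproduction along the lines suggested above might look like the
sketch below: one side pushes a burst of random data over a plain TCP
socket, the other reads until EOF, and both compute an MD5 digest to
compare. This is only an illustration, not a tested reproduction of the
bug. The function names, the 64 KB chunk size, and the 8 MB default
transfer size are all arbitrary choices of mine.

```python
import hashlib
import os
import socket
import threading

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def receive_digest(listener, result):
    """Accept one connection, read until EOF, record MD5 and byte count."""
    conn, _ = listener.accept()
    md5 = hashlib.md5()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            md5.update(data)
            total += len(data)
    result["digest"] = md5.hexdigest()
    result["bytes"] = total


def send_burst(host, port, total_bytes):
    """Send total_bytes of random data back to back; return its MD5."""
    md5 = hashlib.md5()
    sent = 0
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            block = os.urandom(min(CHUNK, total_bytes - sent))
            md5.update(block)
            sock.sendall(block)
            sent += len(block)
    return md5.hexdigest()


def run_loopback_check(total_bytes=8 * 1024 * 1024):
    """Run both ends in one process over 127.0.0.1 as a self-test."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_digest, args=(listener, result))
    receiver.start()
    sent_digest = send_burst("127.0.0.1", port, total_bytes)
    receiver.join()
    listener.close()
    return sent_digest, result["digest"], result["bytes"]


if __name__ == "__main__":
    sent, received, count = run_loopback_check()
    print("bytes received:", count)
    print("digests match:", sent == received)
```

The loopback run only checks the program itself. To exercise the gige
interface, the receiving half would need to run on the affected machine
and the sending half on another host on the same subnet, with the two
digests compared afterwards.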

Thanks for the help,
Andy

| Andre Guibert de Bruet | Enterprise Software Consultant >
| Silicon Landmark, LLC. | http://siliconlandmark.com/    >
Received on Sun Sep 12 2004 - 14:25:54 UTC
