On Tue, Jun 09, 2009 at 02:12:09AM +0200, Thomas Lotterer wrote:
> I need advice hunting down a network problem which I suspect to be
> a bug in the vge(4) driver. After spending a lot of time on
> investigation, I'm out of ideas.
>
> My recently built new home server running FreeBSD 8.0-CURRENT as of
> 2009-06-07 on a VIA ARTiGO A2000 [1] exhibits network problems when
> sending more than a couple of dozen kilobytes of TCP traffic.
>
> The server application is "Dovecot" [2] Secure IMAP server.
> The client application is "Thunderbird" [3] running on Windows XP.
>
> The high-level view of the problem is that the client seems to stall
> downloading messages or even a complex structure of IMAP folder names.
> When using STARTTLS the client often prints the infamous generic and
> misleading error "Thunderbird received a message with incorrect Message
> Authentication Code. If the error occurs frequently, contact the website
> administrator". The origin of this message is the SSL library that ships
> with Thunderbird. The same library is used for Firefox where the hint
> might actually make sense when the user is attempting to access a broken
> HTTPS server. After lots of debugging I found out that the same error is
> printed not only for TLS/SSL issues but also for broken TCP
> streams, be it wrong TCP checksums or a server process dumping core.
> So I tried IMAP without TLS just to see the same issue with the
> misleading SSL error replaced by an application hang. I ran truss(1)
> against Dovecot, placed Thunderbird in debug mode [4] and found out that
> during a stall condition the server did write(2) all the data to the TCP
> socket but some data did not arrive at the client.
>
> The low-level view of the problem is that Wireshark on the client side
> sooner or later - not for the first few dozen packets - sees a packet
> with an incorrect TCP checksum. Usually the next packet is from the
> server again, continuing the stream.
> What follows is the expected but
> fruitless attempt of the client sending duplicate ACKs for the last good
> packet, with the server incorrectly retransmitting more TCP packets with
> bad checksums.
>
> To me it sounds like a broken implementation of hardware generated
> checksums. Trying to disable all the "-tso" "-lro" "-txcsum" "-rxcsum"
> options and using the "polling" option on the server side network interface
> did not help. So either something deeper is broken or maybe just the
> ability to disable these features needs fixing. Btw, the client is using the
> "VMware Accelerated AMD PCNet Adapter" driver with "TCP/IP Offload=off"
> and "TsoEnable=0".
>
> Sorry to bother you with more details but here's why I believe it's a
> hardware/driver issue. Before I purchased the hardware I tried a dry
> run. Installed FreeBSD 7.1-RELEASE as VM guest, then upgraded to FreeBSD
> 8.0-CURRENT using FreeBSD Administration Toolkit [5]. Built OS and apps
> from source, loaded my data - worked! Used the same client that has
> problems with the real hardware today. Then used that VM as build host
> to create the NanoBSD [6] Flash image for the ARTiGO. Both use exactly
> the same sources. The VM works, the metal is broken. One of the few
> differences is the NIC and its driver. As a workaround I copied the VM
> to a usual PC equipped with a fxp(4) NIC - worked! So it really looks
> like an OS/HW compatibility issue on the ARTiGO.
>
> In case you are considering a hardware defect please note that before I
> loaded the OS, apps and my data to this new hardware I thoroughly tested
> what I could. One week filling the disks to the max using repetitive
> copies of a file created from /dev/random and, after manually breaking
> and rebuilding the ZFS mirror, checking data integrity using message
> digests. No problems with disks, albeit poor SATA performance, but
> that's another story. One day running memtest86 [7]. No problems with
> memory.
> One hour NIC test copying /dev/zero to /dev/null over the wire
> using "scp -o compression=no". No hangs or hiccups here.
>
> Hope you can help me.
>

I already know there are possible edge cases in vge(4), but your issue
looks quite different from anything reported so far. Unfortunately, the
vge(4) hardware I had was broken, so I couldn't complete overhauling
vge(4). The code at the following URLs is the latest WIP version, but I
don't know whether it fixes the issue, as it wasn't tested at all on real
hardware.

http://people.freebsd.org/~yongari/vge/if_vge.c
http://people.freebsd.org/~yongari/vge/if_vgereg.h
http://people.freebsd.org/~yongari/vge/if_vgevar.h

Received on Wed Jun 10 2009 - 00:47:44 UTC
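
For anyone who wants to double-check the "incorrect TCP checksum" that Wireshark flags, a captured segment can be checked independently. The following is only a minimal Python sketch, assuming IPv4 and that the raw TCP segment bytes (header plus payload) have already been exported from the capture. The function names are made up for illustration and are not part of vge(4), Wireshark, or any FreeBSD tool.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def _pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    # IPv4 pseudo header: source, destination, zero, protocol 6 (TCP), TCP length
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 6, tcp_length))


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """True if the checksum carried in the segment matches its contents."""
    pseudo = _pseudo_header(src_ip, dst_ip, len(segment))
    # Summing over data that includes a correct checksum field yields zero.
    return internet_checksum(pseudo + segment) == 0


def fill_tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> bytes:
    """Return a copy of the segment with a freshly computed checksum (bytes 16-17)."""
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    pseudo = _pseudo_header(src_ip, dst_ip, len(zeroed))
    csum = internet_checksum(pseudo + zeroed)
    return zeroed[:16] + struct.pack("!H", csum) + zeroed[18:]
```

Note that this only tells you whether the bytes seen at the capture point are self-consistent. It cannot tell whether the payload or the checksum field was damaged, nor where along the path that happened.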