Re: suspect bug in vge(4)

From: Pyun YongHyeon <pyunyh_at_gmail.com>
Date: Wed, 10 Jun 2009 11:49:59 +0900
On Tue, Jun 09, 2009 at 02:12:09AM +0200, Thomas Lotterer wrote:
> I need advice hunting down a network problem which I suspect to be
> a bug in the vge(4) driver. After spending a lot of time on
> investigation, I'm out of ideas.
> 
> My recently built new home server running FreeBSD 8.0-CURRENT as of
> 2009-06-07 on a VIA ARTiGO A2000 [1] exhibits network problems when
> sending more than a couple of dozen kilobytes of TCP traffic.
> 
> The server application is "Dovecot" [2] Secure IMAP server.
> The client application is "Thunderbird" [3] running on WindowsXP.
> 
> The high-level view of the problem is that the client seems to stall
> downloading messages or even a complex structure of IMAP folder names.
> When using STARTTLS the client often prints the infamous generic and
> misleading error "Thunderbird received a message with incorrect Message
> Authentication Code. If the error occurs frequently, contact the website
> administrator". The origin of this message is the SSL library that ships
> with Thunderbird. The same library is used for Firefox where the hint
> might actually make sense when the user is attempting to access a broken
> HTTPS server. After lots of debugging I found out that the same error is
> printed not only for TLS/SSL issues but also for any broken TCP
> stream, be it wrong TCP checksums or a server process dumping core.
> So I tried IMAP without TLS just to see the same issue with the
> misleading SSL error replaced by an application hang. I ran truss(1)
> against Dovecot, placed Thunderbird in debug mode [4] and found out that
> during a stall condition the server did write(2) all the data to the TCP
> socket but some data did not arrive at the client.
> 
> The low-level view of the problem is that Wireshark on the client side
> sooner or later - not for the first few dozen packets - sees a packet
> with an incorrect TCP checksum. Usually the next packet is from the
> server again, continuing the stream. What follows is an expected but
> fruitless attempt by the client to recover by sending duplicate ACKs
> for the last good packet, while the server incorrectly retransmits
> more TCP packets with bad checksums.
> 
> To me it sounds like a broken implementation of hardware generated
> checksums. Trying to disable all the "-tso" "-lro" "-txcsum" "-rxcsum"
> options and using "polling" option on the server side network interface
> did not help. So either something deeper is broken or maybe just the
> ability to disable these features needs fixing. Btw, the client is
> using the "VMware Accelerated AMD PCNet Adapter" driver with
> "TCP/IP Offload=off" and "TsoEnable=0".
> 
> Sorry to bother you with more details but here's why I believe it's a
> hardware/driver issue. Before I purchased the hardware I tried a dry
> run. Installed FreeBSD 7.1-RELEASE as VM guest, then upgraded to FreeBSD
> 8.0-CURRENT using FreeBSD Administration Toolkit [5]. Built OS and apps
> from source, loaded my data - worked! Used the same client that has
> problems with the real hardware today. Then used that VM as build host
> to create the NanoBSD [6] Flash image for the ARTiGO. Both use exactly
> the same sources. The VM works, the metal is broken. One of the few
> differences is the NIC and its driver. As a workaround I copied the VM
> to a usual PC equipped with a fxp(4) NIC - worked! So it really looks
> like an OS/HW compatibility issue on the ARTiGO.
> 
> In case you are considering a hardware defect please note that before I
> loaded the OS, apps and my data to this new hardware I thoroughly tested
> what I could. One week filling the disks to the max using repetitive
> copies of a file created from /dev/random and, after manually breaking
> and rebuilding ZFS mirror, checking data integrity using message
> digests. No problems with disks, albeit poor SATA performance, but
> that's another story. One day running memtest86 [7]. No problems with
> memory. One hour NIC test copying /dev/zero to /dev/null over the wire
> using "scp -o compression=no". No hangs or hiccups here.
> 
> Hope you can help me.
> 

I already know there are possible edge cases in vge(4), but your
issue looks quite different from anything reported so far.
Unfortunately, the vge(4) hardware I had was broken, so I couldn't
complete the overhaul of the driver. The code at the following URLs
is the latest WIP version, but I don't know whether it fixes the
issue, as it hasn't been tested at all on real hardware.
http://people.freebsd.org/~yongari/vge/if_vge.c
http://people.freebsd.org/~yongari/vge/if_vgereg.h
http://people.freebsd.org/~yongari/vge/if_vgevar.h
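
When testing a driver change against a symptom like this, it can help to
verify TCP checksums from a client-side capture independently of the
capture tool. The following is only a minimal sketch of the standard
Internet checksum (RFC 1071) applied to a TCP segment with an IPv4
pseudo-header. The function names internet_checksum and tcp_checksum_ok
are made up for illustration and are not part of vge(4) or any existing
tool; extracting the raw segment bytes from a capture is assumed to
happen elsewhere.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Check a TCP segment (header + payload) carried over IPv4.

    src_ip and dst_ip are the 4-byte addresses from the IP header.
    The segment must still contain its checksum field as received.
    """
    # IPv4 pseudo-header: source, destination, zero, protocol 6, TCP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    # With a valid checksum in place, the complemented sum is zero.
    return internet_checksum(pseudo + segment) == 0
```

A segment for which tcp_checksum_ok returns False on the receiving side
was either corrupted in transit or sent with a wrong checksum, which is
the case described above.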
Received on Wed Jun 10 2009 - 00:47:44 UTC
