suspect bug in vge(4)

From: Thomas Lotterer <thomas+freebsd_at_lotterer.net>
Date: Tue, 09 Jun 2009 02:12:09 +0200

I need advice on hunting down a network problem that I suspect is
a bug in the vge(4) driver. After spending a lot of time on the
investigation, I'm out of ideas.

My recently built home server, running FreeBSD 8.0-CURRENT as of
2009-06-07 on a VIA ARTiGO A2000 [1], exhibits network problems when
sending more than a few dozen kilobytes of TCP traffic.

The server application is the "Dovecot" [2] Secure IMAP server.
The client application is "Thunderbird" [3] running on Windows XP.

The high-level view of the problem is that the client seems to stall
while downloading messages or even a complex structure of IMAP folder
names. When using STARTTLS the client often prints the infamous generic
and misleading error "Thunderbird received a message with incorrect
Message Authentication Code. If the error occurs frequently, contact the
website administrator". The origin of this message is the SSL library
that ships with Thunderbird. The same library is used for Firefox, where
the hint might actually make sense when the user is attempting to access
a broken HTTPS server. After lots of debugging I found out that the same
error is printed not only for TLS/SSL issues but also for any broken TCP
stream, be it wrong TCP checksums or a server process dumping core.

So I tried IMAP without TLS, only to see the same issue with the
misleading SSL error replaced by an application hang. I ran truss(1)
against Dovecot, placed Thunderbird in debug mode [4] and found out that
during a stall condition the server did write(2) all the data to the TCP
socket but some data did not arrive at the client.

The low-level view of the problem is that Wireshark on the client side
sooner or later - not within the first few dozen packets - sees a packet
with an incorrect TCP checksum. Usually the next packet is again from
the server, continuing the stream. What follows is the expected but
fruitless attempt by the client to recover by sending duplicate ACKs for
the last good packet, while the server keeps retransmitting TCP packets
that again carry bad checksums.
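
To double-check what Wireshark flags, a captured segment can be
re-verified offline. The following is a minimal sketch, assuming IPv4
and the standard Internet checksum (RFC 1071) over the TCP pseudo-header
plus segment. The function names, addresses and sample payload are
hypothetical and not taken from the capture described above.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Fold data into a 16-bit one's-complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header followed by the TCP segment."""
    pseudo_header = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment))
    )
    return ~ones_complement_sum(pseudo_header + segment) & 0xFFFF


def segment_is_intact(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """A segment carrying a correct checksum field sums to 0xFFFF,
    so the complemented result is zero."""
    return tcp_checksum(src_ip, dst_ip, segment) == 0


if __name__ == "__main__":
    # Hypothetical 20-byte TCP header with the checksum field set to zero.
    header = struct.pack("!HHIIBBHHH", 143, 50000, 1, 1, 5 << 4, 0x18, 65535, 0, 0)
    payload = b"* OK IMAP ready\r\n"
    csum = tcp_checksum("192.0.2.1", "192.0.2.2", header + payload)
    # The checksum field occupies bytes 16..17 of the TCP header.
    good = header[:16] + struct.pack("!H", csum) + header[18:] + payload
    bad = good[:-1] + bytes([good[-1] ^ 0x01])  # flip one bit in the payload
    print(segment_is_intact("192.0.2.1", "192.0.2.2", good))
    print(segment_is_intact("192.0.2.1", "192.0.2.2", bad))
```

Note that a capture taken on the sending host itself may show bad
checksums whenever transmit checksum offload is active, because the NIC
fills in the field after the capture point. A capture on the receiving
side, as here, does not have that ambiguity.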

To me it sounds like a broken implementation of hardware-generated
checksums. Setting the "-tso" "-lro" "-txcsum" "-rxcsum" options to turn
these features off, and adding the "polling" option, on the server-side
network interface did not help. So either something deeper is broken or
maybe just the ability to disable these features needs fixing. By the
way, the client uses the "VMware Accelerated AMD PCNet Adapter" driver
with "TCP/IP Offload=off" and "TsoEnable=0".
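
One way to confirm which offload features the interface still reports
is to decode the hexadecimal "options=" word that ifconfig prints. The
sketch below assumes the capability bit values as I recall them from
the IFCAP_* definitions in FreeBSD's <net/if.h>; they should be checked
against the header of the running source tree. The helper names are
hypothetical.

```python
# Assumed IFCAP_* bit values (recalled from FreeBSD's <net/if.h>;
# verify against the actual source tree before relying on them).
IFCAP_BITS = {
    0x0001: "RXCSUM",
    0x0002: "TXCSUM",
    0x0004: "NETCONS",
    0x0008: "VLAN_MTU",
    0x0010: "VLAN_HWTAGGING",
    0x0020: "JUMBO_MTU",
    0x0040: "POLLING",
    0x0080: "VLAN_HWCSUM",
    0x0100: "TSO4",
    0x0200: "TSO6",
    0x0400: "LRO",
}

# Capabilities that move checksum or segmentation work onto the NIC.
OFFLOAD_NAMES = {"RXCSUM", "TXCSUM", "TSO4", "TSO6", "LRO"}


def decode_options(word):
    """Return the capability names set in an ifconfig options word."""
    return [name for bit, name in sorted(IFCAP_BITS.items()) if word & bit]


def offloads_enabled(word):
    """Return only the offload-related capabilities that are still set."""
    return [name for name in decode_options(word) if name in OFFLOAD_NAMES]


if __name__ == "__main__":
    for word in (0x1B, 0x18):  # the two values reported for vge0 below
        print(hex(word), decode_options(word), offloads_enabled(word))
```

An empty list from offloads_enabled() only shows that the flags are
cleared at the interface level. It says nothing about whether the
driver actually stops programming the hardware to generate checksums.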

Sorry to bother you with more details, but here's why I believe it's a
hardware/driver issue. Before I purchased the hardware I did a dry
run. I installed FreeBSD 7.1-RELEASE as a VM guest, then upgraded to
FreeBSD 8.0-CURRENT using the FreeBSD Administration Toolkit [5]. Built
OS and apps from source, loaded my data - worked! I used the same client
that has problems with the real hardware today. Then I used that VM as
the build host to create the NanoBSD [6] flash image for the ARTiGO.
Both use exactly the same sources. The VM works, the metal is broken.
One of the few differences is the NIC and its driver. As a workaround I
copied the VM to an ordinary PC equipped with an fxp(4) NIC - worked! So
it really looks like an OS/HW compatibility issue on the ARTiGO.

In case you are considering a hardware defect, please note that before I
loaded the OS, apps and my data onto this new hardware I thoroughly
tested what I could. One week filling the disks to the max with repeated
copies of a file created from /dev/random and, after manually breaking
and rebuilding the ZFS mirror, checking data integrity using message
digests. No problems with the disks, albeit poor SATA performance, but
that's another story. One day running memtest86 [7]. No problems with
memory. One hour of NIC testing, copying /dev/zero to /dev/null over the
wire using "scp -o compression=no". No hangs or hiccups here.
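
The digest-based integrity check on the disk copies can be sketched
roughly as follows. This is an illustration only: the choice of SHA-256,
the chunk size and the function names are assumptions, not the exact
procedure that was used.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(reference, copies, algorithm="sha256"):
    """Return the paths whose digest differs from the reference file's."""
    expected = file_digest(reference, algorithm)
    return [path for path in copies if file_digest(path, algorithm) != expected]
```

Calling find_corrupt_copies() with the original random file and the list
of its copies returns an empty list when every copy is intact.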

Hope you can help me.

     **** manually trimmed/shaped server details ****

# uname -a
FreeBSD [...] 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Sun Jun  7 13:09:44
CEST 2009     root_at_[...]:/usr/obj/nanobsd/usr/src/sys/VIAARTIGOA2000  i386

# dmesg
CPU: VIA C7-D Processor 1500MHz (1499.85-MHz 686-class CPU)
   Origin = "CentaurHauls"  Id = 0x6d0  Stepping = 0
Features=0xa7c9bbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,CMOV,PAT,CLFLUSH,ACPI,MMX,FXSR,SSE,SSE2,TM,PBE>
   Features2=0x4001<SSE3,xTPR>
   VIA Padlock Features=0xffcc<RNG,AES,AES-CTR,SHA1,SHA256,RSA>
real memory  = 2147483648 (2048 MB)
avail memory = 2031333376 (1937 MB)
ACPI APIC Table: <VX800  AWRDACPI>
ioapic0 <Version 0.3> irqs 0-23 on motherboard
ioapic1 <Version 0.3> irqs 24-47 on motherboard
acpi0: <VX800 AWRDACPI> on motherboard
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> mem
0xd8000000-0xdbffffff,0xde000000-0xdeffffff,0xc0000000-0xcfffffff at
device 1.0 on pci0
pcib1: <ACPI PCI-PCI bridge> irq 27 at device 2.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 31 at device 3.0 on pci0
pci2: <ACPI PCI bus> on pcib2
vge0: <VIA Networking Gigabit Ethernet> port 0xec00-0xecff mem
0xdf7ff000-0xdf7ff0ff irq 28 at device 0.0 on pci2
miibus0: <MII bus> on vge0
ip1000phy0: <IC Plus IP1001 10/100/1000 media interface> PHY 22 on miibus0
ip1000phy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
vge0: WARNING: using obsoleted if_watchdog interface
vge0: Ethernet address: 00:40:63:xx:xx:xx

# after boot
# ifconfig vge0
vge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST>
       metric 0 mtu 1500
         options=1b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING>
         ether 00:40:63:xx:xx:xx
         inet [...]
         media: Ethernet autoselect
                (1000baseT <full-duplex,flag0,flag1,flag2>)
         status: active

# after adding options "-tso" "-lro" "-txcsum" "-rxcsum" "polling" and
# trying after each one, the final result is
# ifconfig vge0
vge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST>
       metric 0 mtu 1500
         options=18<VLAN_MTU,VLAN_HWTAGGING>
         ether 00:40:63:xx:xx:xx
         inet [...]
         media: Ethernet autoselect
                (1000baseT <full-duplex,flag0,flag1,flag2>)
         status: active

# pciconf -lbv
vge0_at_pci0:2:0:0:        class=0x020000 card=0x01101106 chip=0x31191106
rev=0x82 hdr=0x00
     vendor     = 'VIA Technologies Inc'
     device     = ''Velocity' Gigabit Ethernet Controllers
(VT6120/VT6121/VT6122)'
     class      = network
     subclass   = ethernet
     bar   [10] = type I/O Port, range 32, base 0xec00, size 256, enabled
     bar   [14] = type Memory, range 64, base 0xdf7ff000, size 256, enabled

# vmstat -i
interrupt                          total       rate
irq28: vge0                       328436         23

     **** references ****

[1] VIA ARTiGO A2000 is a storage-oriented compact barebone PC
-> http://www.via.com.tw/en/products/embedded/artigo/a2000/

[2] Dovecot Secure IMAP server, version 1.1.15
-> http://www.dovecot.org/

[3] Mozilla's Thunderbird email application, version 2.0.0.21 (20090302)
-> http://www.mozillamessaging.com/en-US/thunderbird/

[4] run Thunderbird in debug mode
set NSPR_LOG_MODULES=IMAP:5
set NSPR_LOG_FILE=C:\thunderbird.txt
start /d "C:\Program Files\Mozilla Thunderbird\" thunderbird.exe
-> http://wiki.Dovecot.org/Debugging/Thunderbird

[5] Convenient FreeBSD Administration Toolkit
-> http://people.freebsd.org/~rse/adm/

[6] NanoBSD Howto
-> http://www.freebsd.org/doc/en_US.ISO8859-1/articles/nanobsd/

[7] Memory Diagnostic
-> http://www.memtest86.com/memtest86-3.5.iso.zip

     **** related ****

No 1000baseTX on VIA Artigo A2000
-> http://apps.sourceforge.net/phpbb/freenas/viewtopic.php?f=9&t=851

kern/130846: [vge] vge0 not autonegotiating to 1000baseTX full duplex in 7.1
-> http://www.freebsd.org/cgi/query-pr.cgi?pr=130846

FreeNAS on the ARTiGO A2000
-> http://www.logicsupply.com/blog/2008/12/29/freenas-on-the-artigo-a2000/

-- 
http://thomas.lotterer.net
Received on Tue Jun 09 2009 - 23:39:55 UTC
