CURRENT/11-STABLE: Realtek NICs crash FreeBSD

From: O. Hartmann <ohartman_at_zedat.fu-berlin.de>
Date: Sat, 8 Oct 2016 12:59:33 +0200
I will start with a short problem description for the impatient;
afterwards I'll describe the situation in more detail.

Running 11-CURRENT, 11-STABLE and now 12-CURRENT on hosts equipped with
Realtek NIC chipsets brings the system down (crash) in a reproducible
manner. Plugging and unplugging the network cable is one method; having
a more sophisticated switch with green power management does the same,
but in an unpredictable way.

With the hosts attached directly to a "smart" switch, the crashes can
be reproduced by plugging and unplugging the cord or by having some
traffic. Then, it seems to me from an observer's point of view, the
switch does some arbitrary stuff like link up / link down or power
saving or something I can't check, and the systems go down anyway.
With a dumb switch as an intermediate device, like the Netgear GS105, a
5-port GBit switch, the connection is stable as long as the cabling is
untouched.

The problems also occur on my private Netgear GS110TPv2 8-port GBit "smart managed"
switch, also in "unsmart" mode (meaning: Eco mode off, no sophisticated stuff enabled, no
powersaving/short cabling enabled, no snmp traps and so on, just factory settings).

Switching on some Eco mode features (power saving, short cable, etc.) brings the hosts in
question down very quickly, even with the cabling untouched.

The NICs in question are:

1) Server/Host
from dmesg:
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow

from pciconf -lvceb

re0@pci0:5:0:0: class=0x020000 card=0x81681849 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type I/O Port, range 32, base 0xd000, size 256, enabled
    bar   [18] = type Prefetchable Memory, range 64, base 0xf2104000, size 4096, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xf2100000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint MSI 1 max data 128(128)
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 01000000684ce000

2)
The second system, a Lenovo E540 laptop, has (from dmesg):

re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0x3000-0x30ff mem 0xf0d04000-0xf0d04fff,0xf0d00000-0xf0d03fff at device 0.0 on pci2
re0: Using 1 MSI-X message
re0: ASPM disabled
re0: Chip rev. 0x50800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 28:d2:44:79:87:32
re0: netmap queues/slots: TX 1/256, RX 1/256

and from pciconf:

re0@pci0:3:0:0: class=0x020000 card=0x502817aa chip=0x816810ec rev=0x10 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit 
    cap 10[70] = PCI-Express 2 endpoint MSI 1 max data 128(128) RO
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 01000000684ce000
    ecap 0018[170] = LTR 1
    ecap 001e[178] = unknown 1
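
For readers comparing the two pciconf listings: the chip= and card= values each pack two 16-bit PCI IDs, with the device (or subsystem) ID in the high half and the vendor ID in the low half. The following is a minimal Python sketch of that split, not something from the report itself; the function names and the small vendor table are illustrative assumptions, and the three vendor IDs are the ones assigned in the public PCI ID database.

```python
import re

# Small illustrative vendor table (assumption: only the IDs seen in the listings above).
VENDORS = {0x10EC: "Realtek", 0x1849: "ASRock", 0x17AA: "Lenovo"}


def split_pci_id(value):
    """Split a 32-bit pciconf ID into (device_id, vendor_id).

    The high 16 bits hold the device (or subsystem) ID,
    the low 16 bits hold the vendor (or subsystem vendor) ID.
    """
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def parse_pciconf_header(line):
    """Extract the chip= and card= fields from a `pciconf -l` header line."""
    # Collect every name=0x... pair on the line (class, card, chip, rev, hdr).
    fields = {name: int(hexval, 16)
              for name, hexval in re.findall(r"(\w+)=0x([0-9a-fA-F]+)", line)}
    return {
        "chip": split_pci_id(fields["chip"]),
        "card": split_pci_id(fields["card"]),
    }


if __name__ == "__main__":
    server = ("re0@pci0:5:0:0: class=0x020000 card=0x81681849 "
              "chip=0x816810ec rev=0x06 hdr=0x00")
    ids = parse_pciconf_header(server)
    for key, (device, vendor) in ids.items():
        name = VENDORS.get(vendor, "unknown")
        print(f"{key}: device=0x{device:04x} vendor=0x{vendor:04x} ({name})")
```

Read this way, both machines report the same chip (device 0x8168, vendor 0x10ec), and only the subsystem vendor in card= differs (0x1849 on the server board, 0x17aa on the laptop).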



The longer story:

As described above, the problem seems to affect only the Realtek chips I have. The
host/server box is also equipped with an Intel NIC, and the problem doesn't occur on it. This
specific host also has the same Realtek NIC as the crashing host (it's a crappy ASROCK
board, sorry).

At the campus lab, I realised that on the laptop, plugging and unplugging the wired LAN
brought down the system very quickly. That was the case with 11-CURRENT a couple of months ago,
it was the case with 11-STABLE in between, and it is now the case with 12-CURRENT
(recent update, all boxes have FreeBSD 12.0-CURRENT #3 r306839: Sat Oct  8 11:16:48 CEST
2016). Since this laptop also has an Intel WiFi i7260 device which had severe problems in
the past (iwm driver), I did not pay much attention to the wired network problem. CURRENT
always has some issues, so I tried not to plug and unplug while the system was running.

Now, at the lab/office and at "home", the network infrastructure has changed: CISCO or
HP switches as the backbone infrastructure at the lab (I do not know the office's
infrastructure), and the Netgear GS110TPv2 smart managed switch seems to cause more trouble than
anticipated. With both hosts attached to the GS110TPv2 and some Eco mode enabled, the
systems go down predictably. Logging in on the interface of the switch is also a deadly
mission. Leaving the switch at factory settings, plugging/unplugging is also
deadly to FreeBSD. Or simply waiting some time (while I do not know what the
switch is doing then) makes the systems crash.

At the moment, the systems with Realtek NICs (3) are unusable with this smart switch and,
as I observed today after the GS110TPv2 got installed, there are problems with
other switches as well. I do not think it's a problem with the switch, but some switches
seem to perform actions that bring FreeBSD down in a predictable manner (Eco
mode/powersaving), as does simply unplugging the cabling.

I am now starting to replace the NICs with separate Intel-based ones to get rid of this Realtek crap.
So the only device left that is capable of debugging will be the laptop, and I'd appreciate some
tips on how to give you more information. Right now, I do not have crashdumps or
screenshots; the laptop shows some information, but it vanishes fast. I'll configure a
debugging kernel as soon as possible.

Kind regards,
oh

Received on Sat Oct 08 2016 - 08:59:42 UTC
