I will start with a short problem description for the impatient; afterwards I'll describe the situation in more detail.

Running 11-CURRENT, 11-STABLE and now 12-CURRENT on hosts equipped with Realtek NIC chipsets brings the system down (crash) in a reproducible manner. Plugging and unplugging the network cable is one method; a more sophisticated switch with green power management does the same, but in an unpredictable way.

With the hosts attached directly to a "smart" switch, the crashes can be reproduced by plugging and unplugging the cord or by having some traffic. From an observer's point of view, it seems to me that the switch then does some arbitrary stuff like link up / link down or power saving or something I can't check, and the systems go down anyway.

With a dumb switch as an intermediate device, like the Netgear GS105, a 5-port GBit switch, the connection is stable as long as the cabling is untouched.

The problems also occur on my private Netgear GS110TPv2 8-port GBit "smart managed" switch, even in "unsmart" mode (meaning: Eco mode off, no sophisticated stuff enabled, no power saving/short cabling enabled, no SNMP traps and so on, just factory settings). Switching on some Eco mode facilities (power saving, short cable etc.) brings the hosts in question down really rapidly, even with the cabling untouched.

The NICs in question are:

1) Server/Host

from dmesg:

rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow

from pciconf -lvceb:

re0@pci0:5:0:0: class=0x020000 card=0x81681849 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type I/O Port, range 32, base 0xd000, size 256, enabled
    bar   [18] = type Prefetchable Memory, range 64, base 0xf2104000, size 4096, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xf2100000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D1 D2 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint MSI 1 max data 128(128) link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 01000000684ce000

2) The second system, a Lenovo E540 laptop, has

re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0x3000-0x30ff mem 0xf0d04000-0xf0d04fff,0xf0d00000-0xf0d03fff at device 0.0 on pci2
re0: Using 1 MSI-X message
re0: ASPM disabled
re0: Chip rev. 0x50800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 28:d2:44:79:87:32
re0: netmap queues/slots: TX 1/256, RX 1/256

and from pciconf:

re0@pci0:3:0:0: class=0x020000 card=0x502817aa chip=0x816810ec rev=0x10 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    cap 01[40] = powerspec 3 supports D0 D1 D2 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint MSI 1 max data 128(128) RO link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 01000000684ce000
    ecap 0018[170] = LTR 1
    ecap 001e[178] = unknown 1

The longer story:

As described above, the problem seems to affect only the Realtek chips I have. The Host/server box is also equipped with an Intel NIC, and the problem doesn't occur with that one. This specific host also has the same Realtek NIC as the crashing host (it's a crappy ASRock board, sorry).

At the campus lab, I realised that on the laptop, plugging and unplugging the wired LAN brought down the system very quickly. That was with 11-CURRENT a couple of months ago, it was also the case in between with 11-STABLE, and it is now the case with 12-CURRENT (recent update, all boxes have FreeBSD 12.0-CURRENT #3 r306839: Sat Oct 8 11:16:48 CEST 2016). Since this laptop also has an Intel WiFi i7260 device which had severe problems in the past (iwm driver), I did not pay much attention to the wired network problem - CURRENT always has some issues - so I tried not to plug and unplug while the system was running.

Now the network infrastructure has changed both at the lab/office and at "home": Cisco or HP switches form the backbone infrastructure at the lab (I do not know the office's infrastructure), and the Netgear GS110TPv2 smart managed switch seems to cause more trouble than anticipated. With both hosts attached to the GS110TPv2 and some Eco mode enabled, the systems go down predictably. Logging in on the interface of the switch is also a deadly mission.
Leaving the switch in factory settings untouched, plugging/unplugging is also deadly for FreeBSD. Or simply waiting some time - while I do not know what the switch is doing then - and the systems crash.

At the moment, the three systems with Realtek NICs are unusable with this smart switch and, as a result of my observation today after the GS110TPv2 got installed, they have problems with other switches as well. I do not think it's a problem with the switch, but some switches seem to perform actions (Eco mode/power saving) that bring down FreeBSD in a predictable manner, as does even unplugging the cabling.

I am now starting to replace the NICs with separate Intel-based ones to get rid of this Realtek crap. So the only device left that is capable of debugging will be the laptop, and I'd appreciate some tips on how to give you more information. Right now, I do not have crash dumps or screenshots - the laptop shows some information, but it vanishes fast. I'll configure a debugging kernel as soon as possible.

Kind regards,

oh