Re: CURRENT: re(4) crashing system

From: Hartmann, O. <ohartman_at_zedat.fu-berlin.de>
Date: Tue, 25 Oct 2016 07:03:38 +0200
On Tue, 25 Oct 2016 11:05:38 +0900
YongHyeon PYUN <pyunyh_at_gmail.com> wrote:

> On Mon, Oct 24, 2016 at 02:03:37PM +0200, O. Hartmann wrote:
> > On Mon, 24 Oct 2016 14:14:00 +0900
> > YongHyeon PYUN <pyunyh_at_gmail.com> wrote:
> >   
> > > On Sun, Oct 23, 2016 at 01:25:38PM +0200, Hartmann, O. wrote:  
> > > > I tried to report earlier here that CURRENT does have some
> > > > serious problems right now and one of those problems seems to
> > > > be triggered by the recent re(4) driver. The problem is also
> > > > present in recent 11-STABLE!
> > > > 
> > > > Below, you'll find pciconf-output regarding the device on a
> > > > Lenovo E540 Laptop I can test on and trigger the problem.
> > > > 
> > > > The phenomenon is that this NIC does not negotiate 1000baseTX,
> > > > it is always falling back to 100baseTX although the device
> > > > claims to be a 1 GBit capable device.
> > > > 
> > > > When I try to put the device manually into 1000baseTX mode via
> > > > 
> > > > ifconfig re0 media 1000baseTX mediaopt full-duplex (with re(4)
> > > > driver)
> > > > 
> > > > it is possible to crash the system. The system also crashes when
> > > > plugging/unplugging the LAN cord - I guess the renegotiation is
> > > > triggering this crash immediately.
> > > > 
> > > > I tried with several switches and routers capable of 1 GBit and
> > > > it seems to be independent from the network hardware in use.
> > > > 
> > > > I tried to capture a backtrace when the kernel crashes, but I
> > > > do not know how to save the kernel debugger output.
> > > > Although I configured debugging according to the handbook,
> > > > there is no coredump at all.
> > > > 
> > > > Advice is appreciated - if anybody is interested in solving
> > > > this. 
> > > 
> > > There were several instability reports on re(4).  I vaguely guess
> > > it would be related with some missing initializations for certain
> > > controllers.  Unfortunately, there is no publicly available
> > > datasheet for those controllers and it's not likely to get access
> > > to it in near future.  It seems vendor's FreeBSD driver accesses
> > > lots of magic registers as well as loading DSP fixups.  I have no
> > > idea what it wants to do and re(4) used to heavily rely on
> > > power-on default register values.  Engineering samples I have do
> > > not show instabilities so it wouldn't be easy to identify the
> > > issue.
> > > 
> > > Probably the first step to address the issue would be identifying
> > > those chips and narrowing down the scope of guessing.  Would you
> > > show me the dmesg output (re(4) and rgephy(4) only)?  pciconf(8)
> > > output is useless here since RealTek uses the same PCI id for
> > > PCIe variants.
> > > 
> > > BTW, I was told that the vendor's FreeBSD driver seems to work
> > > fine for normal usage pattern.  The vendor's driver triggered an
> > > instant panic and lacked H/W offloading features in the past.  It
> > > might have changed though.  
> > 
> > The problems with the re(4) driver arose again when I bought some
> > "green" equipment, mainly switches, which reduce power output on
> > short cables or non-connected ports. This brought down some servers
> > with re(4) chipsets immediately and I had no clue what happened. I
> > do not know whether this is a  
> 
> I'm not sure but it's likely the issue is related to EEE/Green
> Ethernet handling. EEE is a feature negotiated with the link partner.
> If you connect your laptop directly, without switches, to a non-EEE
> capable link partner such as another re(4) box, you may be able to
> tell whether the issue is EEE/Green Ethernet related or not.

I am not sure either. When I first discovered a problem with CURRENT,
on the Friday before last week's Friday, there was an unlucky
coincidence: I got the new switch, FreeBSD introduced a serious bug,
and I changed the NICs.

The laptop, the last of the re(4)-equipped systems on which I still
use the Realtek NIC, now does well with Green IT technology, but
crashes on plugging/unplugging - not on every event, but at least one
time in ten.
I guess the Green IT issue was more an unlucky guess of mine that went
hand in hand with the problem I face with CURRENT right now on some
older, non-UEFI machines.

> 
> > single fate so to speak, or this problem will arise for others,
> > too. On server hardware we replaced all Realtek NICs with ones
> > from Intel, and luckily some server mainboards already have Intel
> > PHYs or NICs. The Broadcom devices we have on some older Fujitsu
> > hardware also work like a charm, even with the new power
> > saving switches.
> 
> bge(4) also lacks EEE support (the publicly available datasheet is
> too sanitized).  bge(4) firmware probably does not announce EEE
> capability by default in link establishment while recent re(4)
> devices seem to unconditionally announce EEE.  Generally EEE
> handling requires a kind of handshake for link state change from
> MAC/PHY.
> 
> > While we can swap the NIC on server or workstation platforms, it is
> > almost impossible on laptops, and the number of laptops with Realtek
> > chips seems to grow. It is a pity that the vendor of the chipsets
> > refuses to support OSes other than Windows - or in some rare cases
> > only Linux. After you wrote the answer, I checked on the net who has
> > suitable drivers, and the situation seems bad for almost all OSes
> > apart from commercial ones like Windooze and Apple OS X.
> > 
> > As soon as I get my hands on the laptop again, I'll send the
> > requested information. I know that I played around with re(4) and
> > rgephy(4) in the kernel; rgephy(4) showed up in the dmesg, but I
> > didn't see any effect - except that it offered some additional
> > "media xxx-options-xxx" mostly appended with "flow" - but trying
> > them also brought down the system, just as plugging or unplugging
> > does.
> 
> rgephy(4) will show the recognized PHY H/W model. Another piece of
> information I'd like to know is the OUI of the PHY.  The OUI can be
> obtained with `devinfo -rv | grep rgephy`.
> 
> The "flow" output of media indicates it negotiated ethernet
> flow-control with link partner.  rgephy(4) used to announce
> autonegotiation even when manual setting is requested with
> ifconfig.  This was to work around HW issues seen in the past.
> You can disable the use of autonegotiation in manual media
> selection with flag0 option. See rgephy(4) for more information.
> Not sure whether that option helps though.
> 
> > The last kernel I compiled was then without rgephy(4) - the NIC
> > worked as expected, but plugging/unplugging or having some
> > power-down activities on a Netgear SoHo green-power switch brings
> > the system down as usual.
> 
> If you use re(4) without rgephy(4) it will use ukphy(4) which is
> completely dumb on link state detection of re(4) controller. Link
> state detection requires non-PHY register access on re(4) so using
> ukphy(4) is not recommended.

As requested, here is the information about re0 and rgephy0 on the
laptop (Lenovo E540):

[...]

rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX,
100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master,
1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow

re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0x3000-0x30ff mem 0xf0d04000-0xf0d04fff,0xf0d00000-0xf0d03fff at device 0.0 on pci2
re0: Using 1 MSI-X message
re0: ASPM disabled
re0: Chip rev. 0x50800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 28:d2:44:79:87:32
re0: netmap queues/slots: TX 1/256, RX 1/256
re0: link state changed to DOWN
re0: link state changed to UP

[...]
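
To pull just the identifying details out of a dmesg listing like the
one above, a minimal sketch in sh follows. The sample lines are copied
from the output above; the variable names are made up for illustration,
and on a live system the here-document would be replaced by the output
of dmesg(8).

```sh
#!/bin/sh
# Sample lines copied from the dmesg output quoted in this mail.
# On a live system, use instead:  dmesg_text=$(dmesg)
dmesg_text=$(cat <<'EOF'
rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
re0: Chip rev. 0x50800000
re0: MAC rev. 0x00100000
re0: link state changed to DOWN
re0: link state changed to UP
EOF
)

# PHY model: the text between < and > on the rgephy attach line.
phy_model=$(printf '%s\n' "$dmesg_text" |
    sed -n 's/^rgephy[0-9]*: <\(.*\)> PHY.*/\1/p')

# Chip and MAC revision: fourth whitespace-separated field of each line.
chip_rev=$(printf '%s\n' "$dmesg_text" |
    awk '/^re[0-9]+: Chip rev\./ { print $4 }')
mac_rev=$(printf '%s\n' "$dmesg_text" |
    awk '/^re[0-9]+: MAC rev\./ { print $4 }')

echo "PHY model: $phy_model"
echo "Chip rev:  $chip_rev"
echo "MAC rev:   $mac_rev"
```

The chip and MAC revision values are what distinguish the RealTek PCIe
variants here, since pciconf(8) reports the same PCI id for all of them.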

I use options netmap in kernel config, but the problem is also present
without this option - just for the record.
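
Regarding the missing coredump mentioned at the start of the thread,
a minimal sketch of the usual /etc/rc.conf settings follows. It assumes
a swap partition large enough to hold the dump; /var/crash is the stock
default directory, not something verified on this laptop.

```sh
# /etc/rc.conf fragment (rc.conf is read as plain sh assignments).
# Let the boot scripts pick a swap device for kernel dumps.
dumpdev="AUTO"
# Directory where savecore(8) stores the dump on the next boot.
dumpdir="/var/crash"
```

After a reboot, `dumpon -l` should list the selected dump device; if it
reports none, no dump can be written when the kernel panics.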

Kind regards,

oh
Received on Tue Oct 25 2016 - 03:03:42 UTC
