Re: CURRENT: re(4) crashing system

From: O. Hartmann <ohartman_at_zedat.fu-berlin.de>
Date: Sun, 20 Nov 2016 10:03:35 +0100
On Sun, 20 Nov 2016 16:43:52 +0900
YongHyeon PYUN <pyunyh_at_gmail.com> wrote:

> On Sat, Nov 19, 2016 at 07:44:35PM +0100, O. Hartmann wrote:
> > On Mon, 7 Nov 2016 11:16:23 +0900
> > YongHyeon PYUN <pyunyh_at_gmail.com> wrote:
> >   
> > > On Sun, Nov 06, 2016 at 01:20:36PM +0100, Hartmann, O. wrote:  
> > > > On Mon, 31 Oct 2016 11:12:22 +0900
> > > > YongHyeon PYUN <pyunyh_at_gmail.com> wrote:
> > > >     
> > > > > On Fri, Oct 28, 2016 at 09:21:13PM +0200, Hartmann, O. wrote:    
> > > > > > On Thu, 27 Oct 2016 10:00:04 +0900
> > > > > > YongHyeon PYUN <pyunyh_at_gmail.com> wrote:
> > > > > >       
> > > > > > > On Tue, Oct 25, 2016 at 07:03:38AM +0200, Hartmann, O. wrote:      
> > > > > > > > On Tue, 25 Oct 2016 11:05:38 +0900
> > > > > > > > YongHyeon PYUN <pyunyh_at_gmail.com> wrote:
> > > > > > > >         
> > > > > > > 
> > > > > > > [...]
> > > > > > >       
> > > > > > > > > I'm not sure but it's likely the issue is related with
> > > > > > > > > EEE/Green Ethernet handling. EEE is negotiated feature with
> > > > > > > > > link partner. If you directly connect your laptop to non-EEE
> > > > > > > > > capable link partner like other re(4) box without switches
> > > > > > > > > you may be able to tell whether the issue is EEE/Green
> > > > > > > > > Ethernet related one or not.        
> > > > > > > > 
> > > > > > > > Me neither, since when I first discovered a problem with
> > > > > > > > CURRENT, on the Friday before last week's Friday, there
> > > > > > > > was an unlucky coincidence: I got the new switch, FreeBSD
> > > > > > > > introduced a serious bug, and I changed the NICs.
> > > > > > > > 
> > > > > > > > The laptop, the last in the row of re(4)-equipped systems on
> > > > > > > > which I use the Realtek NIC, now does well with Green IT
> > > > > > > > technology, but crashes on plugging/unplugging - not on every
> > > > > > > > event, but at least in one of ten.
> > > > > > > 
> > > > > > > Hmm, it seems you know how to trigger the issue. When you unplugged
> > > > > > > the UTP cable, was there active network traffic on the re(4) device?
> > > > > > > It would be helpful to know which event triggers the crash (e.g.
> > > > > > > unplugging or plugging). And would you show me the backtrace of the
> > > > > > > panic?
> > > > > > > > I guess the Green IT issue was more of an unlucky guess of mine and
> > > > > > > > went hand in hand with the problem I face with CURRENT right
> > > > > > > > now on some older, non-UEFI machines.
> > > > > > > >         
> > > > > > > 
> > > > > > > Ok.
> > > > > > > 
> > > > > > > [...]      
> > > > > > > > 
> > > > > > > > As requested, the information about re0 and rgephy0 on the
> > > > > > > > laptop (Lenovo E540)
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
> > > > > > > > rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow,
> > > > > > > > 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX,
> > > > > > > > 1000baseT-FDX-master, 1000baseT-FDX-flow,
> > > > > > > > 1000baseT-FDX-flow-master, auto, auto-flow
> > > > > > > > 
> > > > > > > > re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet>
> > > > > > > > port 0x3000-0x30ff mem
> > > > > > > > 0xf0d04000-0xf0d04fff,0xf0d00000-0xf0d03fff at device 0.0 on pci2
> > > > > > > > re0: Using 1 MSI-X message
> > > > > > > > re0: ASPM disabled
> > > > > > > > re0: Chip rev. 0x50800000
> > > > > > > > re0: MAC rev. 0x00100000
> > > > > > > 
> > > > > > > This looks like an 8168GU controller.
> > > > > > > 
> > > > > > > [...]
> > > > > > >       
> > > > > > > > I use options netmap in kernel config, but the problem is also
> > > > > > > > present without this option - just for the record.
> > > > > > > >         
> > > > > > > 
> > > > > > > Yup, netmap(4) has nothing to do with the crash.
> > > > > > > 
> > > > > > > Thanks.      
> > > > > > 
> > > > > > Attached, you'll find the backtrace of the crash. This time it was
> > > > > > really easy - just one pull of the LAN cabling - and we are
> > > > > > happy :-/
> > > > > > 
> > > > > > Please let me know if you need something else. I will return to
> > > > > > normal operations (disabling debugging) because CURRENT is very
> > > > > > unstable at the moment on other hosts beyond r307157.
> > > > > >       
> > > > > 
> > > > > It seems the attachment was stripped.    
> > > > 
> > > > This time I hope I got it right!
> > > > 
> > > > Attached you'll find the latest CURRENT's backtrace on the provoked
> > > > crash (plug and unplug).
> > > > 
> > > > I also saved the kernel and coredump, so if you need me to do further
> > > > investigations, please let me know.
> > > >     
> > > 
> > > Thanks a lot for the backtrace.  This backtrace is not the one I
> > > expected, and I guess the issue is related to cached route removal
> > > on interface down.  A quick look over the code didn't reveal the
> > > cause of the crash (I'm not familiar with that part of the code).
> > > Probably gnn_at_ has a better idea of what's going on here (CCed).
> > > 
> > > Thanks.  
> > 
> > In another thread I complained about permanent crashes on several "older" Intel
> > architectures (Ivy Bridge and older). It turned out that
> > 
> > option FLOWTABLE
> > 
> > in the kernel, which has been part of my custom kernels for a long time, has been
> > identified as the culprit on those systems. Commenting out that option solved
> > the problem!
> > 
> > Interestingly, after also commenting out this option in the kernel config of the
> > laptop in question in this thread, I have not been able - as of this writing - to
> > reproduce the crashes, so it might be that the same issue with FLOWTABLE was
> > triggered by plugging and/or unplugging the LAN cord.
> >   
> 
> I'm not sure yet whether it's triggered by FLOWTABLE, since it has
> been there for a long time.  I suspected r297225 and r301217, which
> re-added route caching for TCP.  The panic you encountered
> indicates an invalid access to a destroyed lock, which in turn
> suggests a reference counting problem in lltable.
> I've CCed glebius_at_ and melifaro_at_, who are more familiar with the
> routing code than I am.

I understand. The problem with FLOWTABLE occurred with r307234; r307233 was all right,
but everything beyond that crashed on a certain type of our computers. I got this "hint"
from ae_at_ and disabled FLOWTABLE - and everything was all right then, on ALL systems.

And as I reported: the laptop on which I have this plugging/unplugging problem has
seemingly also been relieved of it now. Maybe I misunderstand and option FLOWTABLE is
only the trigger of another problem. Just for the record.
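
For anyone who wants to check whether a custom kernel config still enables the
option, here is a minimal sketch. It assumes the config file uses plain
"option NAME" or "options NAME" lines with "#" for comments; the helper name
has_kernel_option and the MYKERNEL path in the usage comment are placeholders,
not anything shipped with FreeBSD.

```shell
#!/bin/sh
# has_kernel_option FILE NAME  (hypothetical helper, not part of FreeBSD)
# Succeeds if FILE contains an active, i.e. not commented out,
# "option NAME" or "options NAME" line.
has_kernel_option() {
    grep -Eq "^[[:space:]]*options?[[:space:]]+$2([[:space:]=#]|\$)" "$1"
}

# Usage (the path is only an example):
#   has_kernel_option /usr/src/sys/amd64/conf/MYKERNEL FLOWTABLE \
#       && echo "FLOWTABLE is still enabled"
```

A line such as "#options FLOWTABLE" is not reported, since the match has to
start at the keyword itself.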

> 
> > Usually I was able to trigger the coredump after two or three rounds, this time I
> > tried it over ten times with no effect.
> > 
> > On the other hand, the laptop's NIC doesn't negotiate 1 GBit/s with my
> > switch; it stays at 100 MBit/s. The switch is a Netgear GS110TP V2.
> >   
> 
> This would be a re(4) driver problem. When you see it negotiated
> 100Mbps after unplugging/plugging, would you try to negotiate with
> link partner again like below and let me know the result?
> 
> # ifconfig re0 media auto
> 
> Does the behavior change if you physically unplug/plug the UTP cable on
> the laptop rather than forcing the port down/up on the switch?

I physically plug and unplug the LAN cord to test it. On that specific laptop/switch
port I use one of these new fancy flat ribbon cables, supposedly capable of GBit/CAT6.
Well, using a usual one (stiff, traditional) seems to solve this "problem" - physics
and fancy seem to be mutually exclusive ;-)
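
To compare the negotiated speed before and after swapping cables, the media
line of ifconfig can be picked apart with a small helper. This is only a
sketch: negotiated_media is a made-up name, and it assumes the usual FreeBSD
output format, e.g. "media: Ethernet autoselect (1000baseT <full-duplex>)".
The sleep in the usage comment is a guess at how long autonegotiation needs.

```shell
#!/bin/sh
# negotiated_media  (hypothetical helper)
# Reads ifconfig(8) output on stdin and prints the first word inside the
# parentheses of the "media:" line, e.g. 100baseTX or 1000baseT.
negotiated_media() {
    sed -n 's/^[[:space:]]*media:.*(\([^ )]*\).*/\1/p'
}

# Usage, as root on the laptop:
#   ifconfig re0 media auto
#   sleep 5                      # assumed wait for renegotiation
#   ifconfig re0 | negotiated_media
```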

> 
> Thanks.

Kind regards,
Oliver
[...]
Received on Sun Nov 20 2016 - 08:03:40 UTC
