Re: was: CURRENT [r308087] still crashing: Backtrace provided

From: O. Hartmann <ohartmann_at_walstatt.org>
Date: Sun, 6 Nov 2016 11:13:56 +0100
On Sat, 5 Nov 2016 13:37:48 -0700,
Mark Johnston <markj_at_FreeBSD.org> wrote:

> On Sat, Nov 05, 2016 at 06:45:09PM +0100, O. Hartmann wrote:
> > On Sun, 30 Oct 2016 11:25:09 -0700,
> > Mark Johnston <markj_at_FreeBSD.org> wrote:
> >   
> > > On Sun, Oct 30, 2016 at 06:55:00PM +0100, O. Hartmann wrote:  
> > > > On Sun, 30 Oct 2016 09:39:34 -0700,
> > > > Mark Johnston <markj_at_FreeBSD.org> wrote:  
> > > > > Based on the stack trace and affected range of revisions, it may be that
> > > > > reverting r307887 or r307234 helps, but I have no specific evidence for
> > > > > this without the requested output.    
> > > > 
> > > > I had the crashing also with > r307300 until now, so that leaves me with
> > > > r307233 ... I will go further with that revision and report so far.     
> > > 
> > > Hm, I don't see why this excludes r307887? In any case, r307234 looks to
> > > be the more likely culprit.  
> > 
> > Here I am again.
> > 
> > This time, it was r308329 or r308331. WITHOUT the debug options compiled into the
> > kernel, it took approximately 5 minutes to provoke the crash. WITH the debug options
> > set, it took more than 45 minutes until the system dumped core. I really hope we can
> > fix the problem this time. For now, I have put the system back to r307233 to
> > see whether r307234 is causing the crash, as you suspect.  
> 
> Sorry, I don't quite follow - are you able to provoke the crash at
> r307233? Or are you still testing that revision?

Yesterday, I ran r307233 the whole day (> 9 hours) without the reported
crash.

This morning I got brave and tried r307234, and had a crash within an hour.

> 
> > 
> > Attached, you'll find the backtrace report as last time. I had to type in "dump"
> > blindly on the system as a dark screen or a stuck X11 screen blocked the console (I
> > use vt() and nVidia BLOB with my nVidia GPUs - and this is still broken on FBSD).
> > 
> > Please let me know how I can assist further. I saved both the core AND this time the
> > culprit kernel.  
> 
> Great, thank you. I would first like to confirm that r307234 is indeed
> causing the crash - since it appears to be easy to trigger, that should
> be faster. If not, the core will help track down the real problem.

I was under the impression that the kernel config option

makeoptions    DEBUG=-g

would make debugging symbols available, but I have been proven wrong.

I also tried

FreeBSD 12.0-CURRENT #15 r308329: Sat Nov  5 08:52:24 CET 2016

which crashed as well. From that crash I saved the kernel and the vmcore, as well as
the text of the backtrace as provided in an earlier mail (see [core.txt.0] below). When
I run the suggested command sequence, I get this:

ohartmann_at_thor [kernel_debug]: kgdb ./kernel vmcore.0 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
Attempt to extract a component of a value that is not a structure pointer.
Attempt to extract a component of a value that is not a structure pointer.
#0  0xffffffff807b8d83 in doadump ()
(kgdb) frame 12
#12 0xffffffff80923a74 in ip_output ()
(kgdb) p *ifp
No symbol table is loaded.  Use the "file" command.
(kgdb) p *ro
No symbol table is loaded.  Use the "file" command.
(kgdb)

Again, this is the very first time I am doing this kind of debugging, and I am probably
missing something here; my apologies for that.
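
A possible explanation, stated as an assumption and not verified on this machine: the
./kernel given to kgdb above is a stripped copy, while the full debug information lives
in a separate file. The [core.txt.0] backtrace below does resolve file names and line
numbers, so symbols were available to whatever produced it. The sketch below only looks
for such a file and builds the kgdb command line. The candidate paths and the GENERIC
directory name are typical locations I am assuming, not paths confirmed in this thread,
and the helper names are hypothetical.

```python
import os
from typing import Iterable, List, Optional

# Assumed, typical locations of a kernel image with full debug info.
# None of these paths is confirmed by this thread; adjust KERNCONF and
# the object directory to match the local build.
CANDIDATES = (
    "/usr/lib/debug/boot/kernel/kernel.debug",
    "/boot/kernel/kernel.symbols",
    "/usr/obj/usr/src/sys/GENERIC/kernel.debug",
)


def find_debug_kernel(candidates: Iterable[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that exists as a regular file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def kgdb_command(vmcore: str,
                 candidates: Iterable[str] = CANDIDATES) -> Optional[List[str]]:
    """Build the argument list 'kgdb <debug kernel> <vmcore>', or None
    if no debug kernel was found among the candidates."""
    kernel = find_debug_kernel(candidates)
    if kernel is None:
        return None
    return ["kgdb", kernel, vmcore]


if __name__ == "__main__":
    cmd = kgdb_command("vmcore.0")
    print(" ".join(cmd) if cmd else "no debug kernel found in the assumed locations")
```

If a debug kernel is found, "frame 12" followed by "p *ifp" and "p *ro" should then
have a symbol table to work with, provided the file matches the kernel that crashed.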

Sorry about the redundancy.

The curious thing to me is that this bug is triggered on systems with Intel CPU
architectures of the Ivy Bridge generation or older. The very same /etc/make.conf
and /etc/src.conf, as well as the very same kernel config apart from some local
hardware-dependent modifications, are spread across my servers and workstations. In
particular, my office box is a Haswell Xeon with almost exactly the same config, running
CURRENT (recent as of Thursday) without problems, while the box I'm reporting this error
from is crashing (an i3-3220). The server, which is also crashing here, is an E3-1245 V2.
Another crashing system is a 2009 C2D Xeon 5XXX two-socket server, crashing the same
way, but with a different kernel config.
I also tried GENERIC on the crashing systems, with the same results.

I'm using IPFW as the firewall on all systems.

Please tell me if you revert any commits; I'll then check out the sources up to recent
CURRENT and try again.

This is just for the sake of completeness.


Kind regards and thanks in advance,

Oliver

[...]
[core.txt.0]
...
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff807b44fb
stack pointer           = 0x28:0xfffffe0238f7c290
frame pointer           = 0x28:0xfffffe0238f7c310
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 521 (nslcd)

Reading symbols from /boot/modules/nvidia-modeset.ko...done.
Loaded symbols for /boot/modules/nvidia-modeset.ko
Reading symbols from /boot/modules/nvidia.ko...done.
Loaded symbols for /boot/modules/nvidia.ko
#0  doadump (textdump=0) at pcpu.h:222
222     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump (textdump=0) at pcpu.h:222
#1  0xffffffff8049e1eb in db_dump (dummy=<value optimized out>, dummy2=false, 
    dummy3=0, dummy4=0x0) at /usr/src/sys/ddb/db_command.c:546
#2  0xffffffff8049dfe9 in db_command (cmd_table=<value optimized out>)
    at /usr/src/sys/ddb/db_command.c:453
#3  0xffffffff8049dd44 in db_command_loop ()
    at /usr/src/sys/ddb/db_command.c:506
#4  0xffffffff804a11af in db_trap (type=<value optimized out>, 
    code=<value optimized out>) at /usr/src/sys/ddb/db_main.c:248
#5  0xffffffff807fd3e3 in kdb_trap (type=<value optimized out>, 
    code=<value optimized out>, tf=<value optimized out>)
    at /usr/src/sys/kern/subr_kdb.c:654
#6  0xffffffff80afeaf1 in trap_fatal (frame=0xfffffe0238f7c1d0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:796
#7  0xffffffff80afe7df in trap (frame=0xfffffe0238f7c1d0)
    at /usr/src/sys/amd64/amd64/trap.c:198
#8  0xffffffff80adf4a1 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:236
#9  0xffffffff807b44fb in __rw_wlock_hard (c=<value optimized out>, 
    tid=<value optimized out>, file=<value optimized out>, 
    line=<value optimized out>) at /usr/src/sys/kern/kern_rwlock.c:830
#10 0xffffffff807b437c in _rw_wlock_cookie (c=0xfffff80070538310, 
    file=0xffffffff80ca31b2 "/usr/src/sys/net/if_ethersubr.c", line=304)
    at /usr/src/sys/kern/kern_rwlock.c:296
#11 0xffffffff808d1e07 in ether_output (ifp=0xfffff800036e7800, 
    m=<value optimized out>, dst=0xfffff8003d980e60, ro=0xfffff8003d980e40)
    at /usr/src/sys/net/if_ethersubr.c:304
#12 0xffffffff80923a74 in ip_output (m=0xfffff8000a24a500, 
    opt=<value optimized out>, ro=<value optimized out>, flags=0, imo=0x0, 
    inp=<value optimized out>) at /usr/src/sys/netinet/ip_output.c:664
#13 0xffffffff8099a7ee in tcp_output (tp=<value optimized out>)
    at /usr/src/sys/netinet/tcp_output.c:1432
#14 0xffffffff809a7c88 in tcp_usr_send (so=<value optimized out>, 
    flags=<value optimized out>, m=0xfffff8003d837800, nam=0x0, 
    control=<value optimized out>, td=0xfffff8000a24a500)
    at /usr/src/sys/netinet/tcp_usrreq.c:956
#15 0xffffffff808567b4 in sosend_generic (so=<value optimized out>, 
    addr=<value optimized out>, uio=<value optimized out>, 
    top=0xfffff8003d837800, control=<value optimized out>, 
    flags=<value optimized out>, td=<value optimized out>)
    at /usr/src/sys/kern/uipc_socket.c:1359
#16 0xffffffff8082d672 in soo_write (fp=<value optimized out>, 
    uio=0xfffffe0238f7c900, active_cred=<value optimized out>, 
    flags=<value optimized out>, td=<value optimized out>)
    at /usr/src/sys/kern/sys_socket.c:146
#17 0xffffffff80823d84 in dofilewrite (td=0xfffff8000a24a500, fd=7, 
    fp=0xfffff8000a0421e0, auio=0xfffffe0238f7c900, 
    offset=<value optimized out>, flags=0) at file.h:311
#18 0xffffffff80823ac8 in kern_writev (td=0xfffff8000a24a500, fd=7, 
    auio=0xfffffe0238f7c900) at /usr/src/sys/kern/sys_generic.c:508
#19 0xffffffff80823a54 in sys_write (td=0xfffff800705382f8, 
    uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:421
#20 0xffffffff80aff33f in amd64_syscall (td=0xfffff8000a24a500, 
    traced=<value optimized out>) at subr_syscall.c:135
#21 0xffffffff80adf78b in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:396
#22 0x0000000801261f5a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb) 
[...]
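
For reading a backtrace like the one above, a minimal sketch that pulls the frame
numbers and function names out of the kgdb text and reports the frame directly after
calltrap, which is where the trap was taken. In this dump that is frame #9,
__rw_wlock_hard, whose address matches the instruction pointer printed with the fatal
trap. The function names parse_frames and faulting_frame are hypothetical helpers, not
part of kgdb or any FreeBSD tool.

```python
import re

# Matches the first line of a frame, e.g.
#   "#9  0xffffffff807b44fb in __rw_wlock_hard (c=..."  or  "#0  doadump (textdump=0) ..."
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(")


def parse_frames(backtrace):
    """Return a list of (frame_number, function_name) in order of appearance.

    Continuation lines are ignored, and a frame number that is printed twice
    (as frame #0 is in the dump above) is kept only once.
    """
    frames = []
    seen = set()
    for line in backtrace.splitlines():
        line = line.strip()
        if line.startswith("(kgdb) "):
            line = line[len("(kgdb) "):]
        match = FRAME_RE.match(line)
        if match and int(match.group(1)) not in seen:
            seen.add(int(match.group(1)))
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def faulting_frame(frames):
    """Return the frame that follows calltrap, or None if there is none."""
    names = [name for _, name in frames]
    if "calltrap" not in names:
        return None
    idx = names.index("calltrap")
    return frames[idx + 1] if idx + 1 < len(frames) else None
```

Usage would be along the lines of
faulting_frame(parse_frames(open("core.txt.0").read())).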
-- 
O. Hartmann

I object to the use or transfer of my data for
advertising purposes or for market or opinion research (§ 28 Abs. 4 BDSG).

Received on Sun Nov 06 2016 - 09:14:13 UTC
