Re: panic: Bad link elm, nfsd related?

From: Matthew West <mwest_at_l.zeeb.org>
Date: Wed, 29 Apr 2009 19:24:40 +0100

FreeBSD 8-CURRENT, built from sources around 27/02/2009:

FreeBSD foo.internal 8.0-CURRENT FreeBSD 8.0-CURRENT #5: Fri Apr 17 18:33:02 BST 2009 mwest_at_foo.internal:/usr/obj/usr/src/sys/DEBUGLOCK amd64

The system is AMD64 with 16GB of RAM, serving a few hundred clients via
NFS (v2 and v3) and Samba from an 800GB ZFS pool on hardware RAID
(aac controller), not RAID-Z.

Running a GENERIC kernel, but with the following options enabled:

options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options DIAGNOSTIC
options NFS_LEGACYRPC

The last option is per Rick Macklem's suggestion
(http://lists.freebsd.org/pipermail/freebsd-current/2009-March/005074.html).

While I don't think it's related, I also have Jaakko Heinonen's patch to
zfs_znode.c applied, from: http://www.freebsd.org/cgi/query-pr.cgi?pr=132068

After almost 12 days of active use (uptime 11d23h), the system panicked.
I did manage to get a crash dump:

----------
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: Bad link elm 0xffffff00074ef400 next->prev != elm
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
panic() at panic+0x182
xprt_inactive_locked() at xprt_inactive_locked+0x78
svc_vc_rendezvous_recv() at svc_vc_rendezvous_recv+0x335
svc_run_internal() at svc_run_internal+0x347
svc_run() at svc_run+0x94
nlm_syscall() at nlm_syscall+0x826
syscall() at syscall+0x1e7
Xfast_syscall() at Xfast_syscall+0xab
--- syscall (154, FreeBSD ELF64, nlm_syscall), rip = 0x8008b7c6c, rsp = 0x7fffffffecf8, rbp = 0x7fffffffee20 ---
KDB: enter: panic
Uptime: 11d23h3m42s
Physical memory: 3056 MB
Dumping 1757 MB: 1742 1726 1710 1694 1678 1662 1646 1630 1614 1598 1582 1566 1550 1534 1518 1502 1486 1470 1454 1438 1422 1406 1390 1374 1358 1342 1326 1310 1294 1278 1262 1246 1230 1214 1198 1182 1166 1150 1134 1118 1102 1086 1070 1054 1038 1022 1006 990 974 958 942 926 910 894 878 862 846 830 814 798 782 766 750 734 718 702 686 670 654 638 622 606 590 574 558 542 526 510 494 478 462 446 430 414 398 382 366 350 334 318 302 286 270 254 238 222 206 190 174 158 142 126 110 94 78 62 46 30 14

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /boot/kernel/pf.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/pf.ko
#0  doadump () at pcpu.h:196
196		__asm __volatile("movq %%gs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:196
#1  0xffffffff805428c3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:420
#2  0xffffffff80542d6c in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:576
#3  0xffffffff8071be38 in xprt_inactive_locked (xprt=Variable "xprt" is not available.
) at /usr/src/sys/rpc/svc.c:380
#4  0xffffffff8071f915 in svc_vc_rendezvous_recv (xprt=0xffffff00074ef400, msg=Variable "msg" is not available.
) at /usr/src/sys/rpc/svc_vc.c:352
#5  0xffffffff8071da17 in svc_run_internal (pool=0xffffff0007bd7600, ismaster=1) at /usr/src/sys/rpc/svc.c:787
#6  0xffffffff8071e174 in svc_run (pool=0xffffff0007bd7600) at /usr/src/sys/rpc/svc.c:1223
#7  0xffffffff8070b666 in nlm_syscall (td=Variable "td" is not available.
) at /usr/src/sys/nlm/nlm_prot_impl.c:1573
#8  0xffffffff8080bcd7 in syscall (frame=0xfffffffe9b8bec90) at /usr/src/sys/amd64/amd64/trap.c:898
#9  0xffffffff807e8e8b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:338
#10 0x00000008008b7c6c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) list *0xffffffff8070b666
0xffffffff8070b666 is in nlm_syscall (/usr/src/sys/nlm/nlm_prot_impl.c:1577).
1572	
1573		svc_run(pool);
1574		error = 0;
1575	
1576	#ifdef NFSCLIENT
1577		nfs_advlock_p = old_nfs_advlock;
1578		nfs_reclaim_p = old_nfs_reclaim;
1579	#endif
1580	
1581	out:
(kgdb) list *0xffffffff8071e174
0xffffffff8071e174 is in svc_run (/usr/src/sys/rpc/svc.c:1225).
1220			svc_new_thread(pool);
1221		}
1222	
1223		svc_run_internal(pool, TRUE);
1224	
1225		mtx_lock(&pool->sp_lock);
1226		while (pool->sp_threadcount > 0)
1227			msleep(pool, &pool->sp_lock, 0, "svcexit", 0);
1228		mtx_unlock(&pool->sp_lock);
1229	}
----------
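
For context on the panic string itself: "Bad link elm ... next->prev != elm"
looks like the message emitted by the list consistency checks in
sys/queue.h, which are compiled in when the kernel has INVARIANTS. The
sketch below is a userland approximation of that check, not the kernel
code. The names struct elm, struct head, next_link_ok(), remove_checked()
and double_remove_detected() are made up for illustration, and the
double-removal scenario is only one assumed way to trip the check, not a
diagnosis of this crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical element; the real transport structure is far larger. */
struct elm {
	struct elm *next;	/* next element, NULL at the tail */
	struct elm **prevp;	/* address of the previous "next" pointer */
};

/* Hypothetical list head, laid out like a tail queue. */
struct head {
	struct elm *first;
	struct elm **lastp;
};

static void list_init(struct head *h)
{
	h->first = NULL;
	h->lastp = &h->first;
}

static void insert_tail(struct head *h, struct elm *e)
{
	e->next = NULL;
	e->prevp = h->lastp;
	*h->lastp = e;
	h->lastp = &e->next;
}

/* True when the successor's back pointer still refers to e. */
static bool next_link_ok(struct elm *e)
{
	return e->next == NULL || e->next->prevp == &e->next;
}

/*
 * Unlink e. Where the kernel would panic on a bad link, this sketch
 * returns false and leaves the list untouched.
 */
static bool remove_checked(struct head *h, struct elm *e)
{
	if (!next_link_ok(e))
		return false;
	if (e->next != NULL)
		e->next->prevp = e->prevp;
	else
		h->lastp = e->prevp;
	*e->prevp = e->next;
	return true;
}

/* Assumed failure mode: the same element is removed twice. */
static bool double_remove_detected(void)
{
	struct head h;
	struct elm a, b, c;

	list_init(&h);
	insert_tail(&h, &a);
	insert_tail(&h, &b);
	insert_tail(&h, &c);
	assert(next_link_ok(&a) && next_link_ok(&b) && next_link_ok(&c));

	bool first = remove_checked(&h, &b);	/* legitimate removal */
	bool second = remove_checked(&h, &b);	/* stale links in b */
	return first && !second;
}
```

After the first removal, c's back pointer refers to a, while b still holds
a stale pointer to c, so the second removal fails the check. Any other
corruption of the same links (for example a use after free) would fail it
the same way.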

Any suggestions?  Should I go back to the newer RPC implementation?

Thanks,

Matthew
Received on Wed Apr 29 2009 - 16:25:57 UTC