> On Sat, Jun 20, 2009 at 12:11 AM, Thomas Backman<serenity_at_exscape.org> wrote:
>> I just ran into this tonight. Not sure exactly what triggered it - the box
>> stopped responding to pings at 02:07AM and it has a cron backup job using
>> zfs send/recv at 02:00, so I'm guessing it's related, even though the backup
>> probably should have finished before then... Hmm. Anyway.
>>
>> r194478.
>>
>> kernel trap 12 with interrupts disabled
>>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 0; apic id = 00
>> fault virtual address  = 0x288
>> fault code             = supervisor read data, page not present
>> instruction pointer    = 0x20:0xffffffff805a4989
>> stack pointer          = 0x28:0xffffff803e8b57e0
>> frame pointer          = 0x28:0xffffff803e8b5840
>> code segment           = base 0x0, limit 0xfffff, type 0x1b
>>                        = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags       = resume, IOPL = 0
>> current process        = 57514 (zpool)
>> panic: from debugger
>> cpuid = 0
>> Uptime: 10h22m13s
>> Physical memory: 2027 MB
>>
>> (kgdb) bt
>> #0  doadump () at pcpu.h:223
>> #1  0xffffffff8059c409 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:419
>> #2  0xffffffff8059c85c in panic (fmt=Variable "fmt" is not available.) at /usr/src/sys/kern/kern_shutdown.c:575
>> #3  0xffffffff801f1377 in db_panic (addr=Variable "addr" is not available.) at /usr/src/sys/ddb/db_command.c:478
>> #4  0xffffffff801f1781 in db_command (last_cmdp=0xffffffff80c38620, cmd_table=Variable "cmd_table" is not available.) at /usr/src/sys/ddb/db_command.c:445
>> #5  0xffffffff801f19d0 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498
>> #6  0xffffffff801f3969 in db_trap (type=Variable "type" is not available.) at /usr/src/sys/ddb/db_main.c:229
>> #7  0xffffffff805ce465 in kdb_trap (type=12, code=0, tf=0xffffff803e8b5730) at /usr/src/sys/kern/subr_kdb.c:534
>> #8  0xffffffff8088715d in trap_fatal (frame=0xffffff803e8b5730, eva=Variable "eva" is not available.) at /usr/src/sys/amd64/amd64/trap.c:847
>> #9  0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/src/sys/amd64/amd64/trap.c:345
>> #10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
>> #11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50, tid=18446742975830720512, opts=Variable "opts" is not available.) at /usr/src/sys/kern/kern_sx.c:575
>> #12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not available.) at sx.h:155
>> #13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/zfs.ko
>> #14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38, a=0xffffff0043557d50) at vnode_if.c:1926
>> #15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at vnode_if.h:830
>> #16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0, flags=0, td=0xffffff0061528000) at /usr/src/sys/kern/vfs_subr.c:2450
>> #17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko
>> #18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000, flags=1626513408, td=Variable "td" is not available.) at /usr/src/sys/kern/vfs_mount.c:1287
>> #19 0xffffffff80624975 in unmount (td=0xffffff0061528000, uap=0xffffff803e8b5c00) at /usr/src/sys/kern/vfs_mount.c:1172
>> #20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/src/sys/amd64/amd64/trap.c:984
>> #21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
>> #22 0x000000080104e49c in ?? ()
>> Previous frame inner to this frame (corrupt stack?)
>>
>> BTW, I got a (one) "force unmount is experimental" on the console. On
>> regular shutdown I usually get one per filesystem, it seems (at least 10),
>> and this pool should contain exactly as many filesystems as the root pool
>> since it's a copy of it. On running the backup script manually post-crash,
>> though, I didn't get any.
>>
>> Also worth noting is that I was running DTrace all night to test its
>> stability. I'm pretty sure the script was
>> dtrace -n 'syscall::open:entry { @a[copyinstr(arg0)] = count(); }'
>>
>> 0 swap was used and 277700 pages (~1084 MB or 50%) RAM was free, according
>> to the core.txt.
>>
>> Regards,
>> Thomas
>>

On Jun 20, 2009, at 10:23 AM, Kip Macy wrote:
> A force unmount causes all vnodes' dirty data to be flushed and then
> the vnodes to be released. A regular unmount will not complete if
> there are open files referencing it (hence the use of force unmount on
> shutdown). Pawel had previously disabled forced unmount because of
> known "issues". I chose to enable it because it guarantees that dirty
> data is flushed to disk before shutdown. It looks like we may be
> reclaiming a vnode that has already been freed. I haven't seen this
> issue myself, so if you can identify more specifics on how to
> reproduce it, that would greatly increase the likelihood of my being
> able to fix it in the near future.
>
> Thanks,
> Kip
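[Just so we're talking about the same thing: I read "force unmount" above as
the MNT_FORCE / -f path through unmount, i.e. roughly the difference between

    zfs umount slave/usr/src      # regular unmount: refuses while something under it is still open
    zfs umount -f slave/usr/src   # forced: flushes dirty data and reclaims the vnodes anyway

where the dataset name is just an example, and I assume the forced variant is
also what prints the "force unmount is experimental" warning on the console.]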
(First off, sorry if the quoting is screwed up.)
I got the same panic again, same backtrace, and at the exact same point in the
backup script (and with a force unmount message as well, which as I said
before I've never gotten during backups before, only during shutdown).

[...]
receiving incremental stream of tank/test/fs15935@backup-20090624-2054 into slave/test/fs15935@backup-20090624-2054
received 312B stream in 1 seconds (312B/sec)
found clone origin slave/usr/src_r194478-UNTOUCHED@r194478-UNTOUCHED
receiving incremental stream of tank/usr/src@backup-20090624-2054 into slave/usr/src@backup-20090624-2054
received 162MB stream in 68 seconds (2.38MB/sec)
receiving incremental stream of tank/usr/src_old@backup-20090624-2054 into slave/usr/src_old@backup-20090624-2054
received 108KB stream in 1 seconds (108KB/sec)
local fs slave/usr/src does not have fromsnap (backup-20090624-0200 in stream); must have been deleted locally; ignoring
Exporting pool
Read from remote host 192.168.1.10: Operation timed out
Connection to 192.168.1.10 closed.

I think it has something to do with the clones. Between this backup and the
previous one, I did

zfs rename tank/usr/src tank/usr/src_old
zfs clone tank/usr/src_r194478-UNTOUCHED@... tank/usr/src

Then an svn update, build{world,kernel} and install, and post-install I ran
the backup and got the panic.
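For reference, the backup script itself is basically just an incremental zfs
send/recv into the second pool followed by an export. What follows is a
from-memory sketch of what it does rather than the actual script, so the exact
flags and the variable names are approximations:

    TODAY=backup-20090624-2054      # snapshot names as seen in the output above
    PREV=backup-20090624-0200
    zfs snapshot -r tank@$TODAY
    zfs send -R -I @$PREV tank@$TODAY | zfs recv -dvF slave
    zpool export slave              # the script's "Exporting pool" step; the export
                                    # unmounts every slave/* filesystem, and that is
                                    # where it panics (zfs_umount in the backtrace)

The "found clone origin" line in the recv output is where the clone comes into
play, which is part of why I suspect the clones.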
Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address  = 0x288
fault code             = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff805ad7d9
stack pointer          = 0x28:0xffffff803e9557e0
frame pointer          = 0x28:0xffffff803e955840
code segment           = base 0x0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags       = resume, IOPL = 0
current process        = 1278 (zpool)

Locked vnodes (neat, I took a picture with a camera, didn't realize they'd end
up here! Also, no clue if it helps at all):

0xffffff000280f3b0: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags ()
    v_object 0xffffff0012a68510 ref 0 pages 0
    lock type zfs: SHARED (count 1)
0xffffff0002ded000: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags ()
    lock type zfs: SHARED (count 1)
0xffffff002ec5c000: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 1 mountedhere 0xffffff002ec3b5e0
    flags ()
    lock type zfs: EXCL by thread 0xffffff002ec14ab0 (pid 1278)
0xffffff002ecae588: tag zfs, type VDIR
    usecount 0, writecount 0, refcount 1 mountedhere 0
    flags (VI_DOOMED)
    VI_LOCKed
    lock type zfs: EXCL by thread 0xffffff002ec14ab0 (pid 1278)
0xffffff000280f3b0: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags ()
    v_object 0xffffff0012a68510 ref 0 pages 0
    lock type zfs: SHARED (count 1)
0xffffff0002ded000: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 2 mountedhere 0
    flags ()
    lock type zfs: SHARED (count 1)
0xffffff002ec5c000: tag zfs, type VDIR
    usecount 1, writecount 0, refcount 1 mountedhere 0xffffff002ec3b5e0
    flags ()
    lock type zfs: EXCL by thread 0xffffff002ec14ab0 (pid 1278)
0xffffff002ecae588: tag zfs, type VDIR
    usecount 0, writecount 0, refcount 1 mountedhere 0
    flags (VI_DOOMED)
    VI_LOCKed
    lock type zfs: EXCL by thread 0xffffff002ec14ab0 (pid 1278)

panic: from debugger
cpuid = 0
Uptime: 8m39s
Physical memory: 2027 MB

...
#10 0xffffffff80879647 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
#11 0xffffffff805ad7d9 in _sx_xlock_hard (sx=0xffffff002ea63480, tid=18446742974982343344, opts=Variable "opts" is not available.) at /usr/src/sys/kern/kern_sx.c:575
#12 0xffffffff805ae14e in _sx_xlock (sx=Variable "sx" is not available.) at sx.h:155
#13 0xffffffff80fe19e5 in zfs_freebsd_reclaim () from /boot/kernel/zfs.ko
#14 0xffffffff808da87a in VOP_RECLAIM_APV (vop=0xffffff002ea63468, a=0xffffff002ea63480) at vnode_if.c:1926
#15 0xffffffff80630e7e in vgonel (vp=0xffffff002ecae588) at vnode_if.h:830
#16 0xffffffff80635438 in vflush (mp=0xffffff002ec3b5e0, rootrefs=0, flags=0, td=0xffffff002ec14ab0) at /usr/src/sys/kern/vfs_subr.c:2450
#17 0xffffffff80fdc3f8 in zfs_umount () from /boot/kernel/zfs.ko
#18 0xffffffff8062e11a in dounmount (mp=0xffffff002ec3b5e0, flags=784578016, td=Variable "td" is not available.) at /usr/src/sys/kern/vfs_mount.c:1287
#19 0xffffffff8062e885 in unmount (td=0xffffff002ec14ab0, uap=0xffffff803e955c00) at /usr/src/sys/kern/vfs_mount.c:1172
#20 0xffffffff80892edf in syscall (frame=0xffffff803e955c90) at /usr/src/sys/amd64/amd64/trap.c:984
#21 0xffffffff808798d0 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
#22 0x000000080104e06c in ?? ()
Previous frame inner to this frame (corrupt stack?)

Regards,
Thomas

Received on Wed Jun 24 2009 - 17:23:30 UTC