Re: "New" ZFS crash on FS (pool?) unmount/export

From: Thomas Backman <serenity@exscape.org>
Date: Thu, 9 Jul 2009 14:36:19 +0200
On Jul 9, 2009, at 11:01, Thomas Backman wrote:

> On Jun 20, 2009, at 09:11, Thomas Backman wrote:
>
>> I just ran into this tonight. Not sure exactly what triggered it -  
>> the box stopped responding to pings at 02:07AM and it has a cron  
>> backup job using zfs send/recv at 02:00, so I'm guessing it's  
>> related, even though the backup probably should have finished  
>> before then... Hmm. Anyway.
>>
>> r194478.
>>
>> kernel trap 12 with interrupts disabled
>>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 0; apic id = 00
>> fault virtual address   = 0x288
>> fault code              = supervisor read data, page not present
>> instruction pointer     = 0x20:0xffffffff805a4989
>> stack pointer           = 0x28:0xffffff803e8b57e0
>> frame pointer           = 0x28:0xffffff803e8b5840
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                       = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = resume, IOPL = 0
>> current process         = 57514 (zpool)
>> panic: from debugger
>> cpuid = 0
>> Uptime: 10h22m13s
>> Physical memory: 2027 MB
>>
>> (kgdb) bt
>> #0  doadump () at pcpu.h:223
>> #1  0xffffffff8059c409 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:419
>> #2  0xffffffff8059c85c in panic (fmt=Variable "fmt" is not available.) at /usr/src/sys/kern/kern_shutdown.c:575
>> #3  0xffffffff801f1377 in db_panic (addr=Variable "addr" is not available.) at /usr/src/sys/ddb/db_command.c:478
>> #4  0xffffffff801f1781 in db_command (last_cmdp=0xffffffff80c38620, cmd_table=Variable "cmd_table" is not available.) at /usr/src/sys/ddb/db_command.c:445
>> #5  0xffffffff801f19d0 in db_command_loop () at /usr/src/sys/ddb/db_command.c:498
>> #6  0xffffffff801f3969 in db_trap (type=Variable "type" is not available.) at /usr/src/sys/ddb/db_main.c:229
>> #7  0xffffffff805ce465 in kdb_trap (type=12, code=0, tf=0xffffff803e8b5730) at /usr/src/sys/kern/subr_kdb.c:534
>> #8  0xffffffff8088715d in trap_fatal (frame=0xffffff803e8b5730, eva=Variable "eva" is not available.) at /usr/src/sys/amd64/amd64/trap.c:847
>> #9  0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/src/sys/amd64/amd64/trap.c:345
>> #10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:223
>> #11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50, tid=18446742975830720512, opts=Variable "opts" is not available.) at /usr/src/sys/kern/kern_sx.c:575
>> #12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not available.) at sx.h:155
>> #13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/zfs.ko
>> #14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38, a=0xffffff0043557d50) at vnode_if.c:1926
>> #15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at vnode_if.h:830
>> #16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0, flags=0, td=0xffffff0061528000) at /usr/src/sys/kern/vfs_subr.c:2450
>> #17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko
>> #18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000, flags=1626513408, td=Variable "td" is not available.) at /usr/src/sys/kern/vfs_mount.c:1287
>> #19 0xffffffff80624975 in unmount (td=0xffffff0061528000, uap=0xffffff803e8b5c00) at /usr/src/sys/kern/vfs_mount.c:1172
>> #20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/src/sys/amd64/amd64/trap.c:984
>> #21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:364
>> #22 0x000000080104e49c in ?? ()
>> Previous frame inner to this frame (corrupt stack?)
>
>
> Here's the script in its relevant entirety [...]
>
> #!/bin/bash
>
> PATH="$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin"
>
> function die() {
> 	echo "$@" >&2
> 	zpool export slave > /dev/null 2>&1
> 	ggatel destroy -u 666 > /dev/null 2>&1
> 	exit 1
> }
>
> function mount_unmount {
> 	if [ -z "$1" ]; then
> 		die 'Invalid argument given to mount_unmount'
> 	elif [[ "$1" == "mount" ]]; then
> 		zpool list | grep -q slave
> 		if [ "$?" = "0" ]; then
> 			echo Already mounted
> 			return 0
> 		fi
> 		echo Creating ggate device
> 		ggatel create -u 666 /mnt/backup/chaos/slavefile || die 'Unable to create GEOM provider from file'
> 		echo 'Sleeping for 5 seconds...'
> 		sleep 5
> 		echo Importing pool
> 		zpool import -R /slave slave || die 'Unable to import slave pool'
> 	elif [[ "$1" == "unmount" ]]; then
> 		echo Exporting pool
> 		zpool export slave || die 'Unable to export slave pool'
> 		ggatel destroy -u 666 || die 'Unable to destroy GEOM provider'
> 	fi
> }
>
> if [ ! -z "$1" ]; then
> 	case $1 in
> 	mount) mount_unmount mount; exit 0;;
> 	unmount) mount_unmount unmount; exit 0;;
> 	initial) initial; exit 0;;
> 	backup) ;;
> 	*) help;;
> 	esac
> else
> 	help
> fi
>
> if [ ! -f "/mnt/backup/chaos/slavefile" ]; then
> 	echo 'Backup error! slavefile does not exist!' | mail -s "Backup error" root
> 	echo 'Slavefile does not exist!'
> 	exit 1
> fi
>
> mount_unmount mount
>
> CURR=$(date +"backup-%F-%H%M")
>
> echo Taking snapshots
> zfs snapshot -r tank@$CURR || die "Unable to create $CURR snapshot"
>
> echo Starting backup...
> LAST=$(cat /root/.last-backup)
> zfs send -R -I $LAST tank@$CURR | zfs recv -Fvd slave
>
> echo $CURR > /root/.last-backup
>
> mount_unmount unmount
>
> echo Running rsync
> rsync -av --delete /bootdir/boot exscape::backup-freebsd/chaos
> rsync -av --delete /root exscape::backup-freebsd/chaos
> rsync -av --delete ~serenity exscape::backup-freebsd/chaos
>
> echo 'All done!'
>
> -------------------
Sorry for the monologue, but I ran into this again, this time with another  
panic. Similar but not identical to the old one.

OK, so I figured I would update my "untouched" src clone (used to save  
bandwidth from the FBSD SVN server when I feel I need to start with a  
*clean* source tree), now that there have been quite a few changes  
since that revision.

I pretty much did this (in case other shells differ, !$ in bash is the  
last argument to the previous command; the !$ steps are also spelled out after the list):
1) Clean up /usr/src from "my" stuff
2) svn update
3) svn diff and svn status, to make sure it was clean
4) zfs promote tank/usr/src ## usr/src was a clone of the untouched, read-only fs "tank/usr/src_r194478-UNTOUCHED"
5) zfs destroy -r tank/usr/src_r194478-UNTOUCHED ## The old one, obviously
6) zfs snapshot tank/usr/src@r195488_UNTOUCHED
7) zfs clone !$ tank/usr/src_r195488-UNTOUCHED
8) zfs promote !$
9) zfs set readonly=on !$
10) And, in case it may matter, I slightly modified the contents of /usr/src afterwards (applied two patches that aren't merged into HEAD (yet?)).
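
Spelled out with the !$ expansions, the rotation amounts to this (a condensed sketch from memory, so the exact names may be slightly off; error checking omitted):

zfs promote tank/usr/src
zfs destroy -r tank/usr/src_r194478-UNTOUCHED
zfs snapshot tank/usr/src@r195488_UNTOUCHED
zfs clone tank/usr/src@r195488_UNTOUCHED tank/usr/src_r195488-UNTOUCHED
zfs promote tank/usr/src_r195488-UNTOUCHED
zfs set readonly=on tank/usr/src_r195488-UNTOUCHED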

... I THINK that's it. Since my bash_history got killed in the panic,  
I may be slightly wrong. In any case, usr/src is now a clone of the  
readonly UNTOUCHED fs:
[root@chaos ~]# zfs get origin tank/usr/src
NAME          PROPERTY  VALUE                                              SOURCE
tank/usr/src  origin    tank/usr/src_r195488_UNTOUCHED@r195488_UNTOUCHED  -

I then ran the backup script just posted in this thread:

[root@chaos ~]# bash backup.sh backup
Creating ggate device
Sleeping for 5 seconds...
Importing pool
Taking snapshots
Starting backup...
attempting destroy slave/usr/src_r194478-UNTOUCHED@backup-20090709-1250
success
attempting destroy slave/usr/src_r194478-UNTOUCHED@r194478-UNTOUCHED
failed - trying rename to slave/usr/src_r194478-UNTOUCHED@recv-38883-1
failed (2) - will try again on next pass
attempting destroy slave/usr/src_r194478-UNTOUCHED@backup-20090709-1235
success
attempting destroy slave/usr/src_r194478-UNTOUCHED
failed - trying rename to slave/recv-38883-2
failed (2) - will try again on next pass
promoting slave/usr/src
another pass:
attempting destroy slave/usr/src_r194478-UNTOUCHED
success
attempting destroy slave/usr/src@r194478-UNTOUCHED
success
attempting rename slave/usr/src to slave/usr/src_r195488_UNTOUCHED
success
receiving incremental stream of tank@backup-20090709-1328 into slave@backup-20090709-1328
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of tank/tmp@backup-20090709-1328 into slave/tmp@backup-20090709-1328
received 119KB stream in 1 seconds (119KB/sec)
receiving incremental stream of tank/var@backup-20090709-1328 into slave/var@backup-20090709-1328
received 211KB stream in 1 seconds (211KB/sec)
receiving incremental stream of tank/var/log@backup-20090709-1328 into slave/var/log@backup-20090709-1328
received 468KB stream in 1 seconds (468KB/sec)
receiving incremental stream of tank/var/crash@backup-20090709-1328 into slave/var/crash@backup-20090709-1328
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of tank/root@backup-20090709-1328 into slave/root@backup-20090709-1328
received 156KB stream in 1 seconds (156KB/sec)
receiving incremental stream of tank/usr@backup-20090709-1328 into slave/usr@backup-20090709-1328
received 302KB stream in 1 seconds (302KB/sec)
receiving incremental stream of tank/usr/obj@backup-20090709-1328 into slave/usr/obj@backup-20090709-1328
received 8.52MB stream in 8 seconds (1.07MB/sec)
receiving incremental stream of tank/usr/src_r195488_UNTOUCHED@r195488_UNTOUCHED into slave/usr/src_r195488_UNTOUCHED@r195488_UNTOUCHED
received 112MB stream in 43 seconds (2.60MB/sec)
receiving incremental stream of tank/usr/src_r195488_UNTOUCHED@backup-20090709-1328 into slave/usr/src_r195488_UNTOUCHED@backup-20090709-1328
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of tank/usr/ports@backup-20090709-1328 into slave/usr/ports@backup-20090709-1328
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of tank/usr/ports/distfiles@backup-20090709-1328 into slave/usr/ports/distfiles@backup-20090709-1328
received 312B stream in 1 seconds (312B/sec)
found clone origin slave/usr/src_r195488_UNTOUCHED@r195488_UNTOUCHED
receiving incremental stream of tank/usr/src@backup-20090709-1328 into slave/usr/src@backup-20090709-1328
received 216KB stream in 1 seconds (216KB/sec)
local fs slave/usr/src does not have fromsnap (backup-20090709-1250 in stream); must have been deleted locally; ignoring
Exporting pool
Read from remote host 192.168.1.10: Operation timed out

... and the DDB output (first part copy/paste from kgdb, second part  
handwritten since the kgdb BT was totally broken, as I expected.)

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x70
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff8036e855
stack pointer           = 0x28:0xffffff803ea637d0
frame pointer           = 0x28:0xffffff803ea637e0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 38905 (zpool)

_sx_slock()
dmu_buf_update_user()+0x47
zfs_znode_dmu_fini()
zfs_freebsd_reclaim()
VOP_RECLAIM_APV()
vgonel()
vflush()
zfs_umount()
dounmount()
unmount()
syscall()
Xfast_syscall()

BTW, what's with the "local fs slave/usr/src does not have fromsnap  
(backup-20090709-1250 in stream); must have been deleted locally;  
ignoring" part?

This is what I get when I try an incremental backup now:

[root@chaos ~]# bash backup.sh backup
Creating ggate device
Sleeping for 5 seconds...
Importing pool
Taking snapshots
Starting backup...
local fs slave/usr/src does not have fromsnap (backup-20090709-1250 in stream); must have been deleted locally; ignoring
receiving incremental stream of tank@backup-20090709-1328 into slave@backup-20090709-1328
snap slave@backup-20090709-1328 already exists; ignoring
local fs slave/usr/src does not have fromsnap (backup-20090709-1250 in stream); must have been deleted locally; ignoring
warning: cannot send 'tank/tmp@backup-20090709-1328': Broken pipe
warning: cannot send 'tank/tmp@backup-20090709-1406': Broken pipe
Exporting pool
Running rsync
... rsync runs, no panic, but no ZFS backup, either. Guess it's time  
for another "initial" backup, i.e. start all over with 0 snapshots...

The initial backup worked just fine; it found the clone/origin etc.  
and handled it correctly.
Stripped of comments and echo statements, the function is simply:

function initial {
     for BACK in $(zfs list -t snapshot -H -r tank | awk '{print $1}'); do zfs destroy $BACK; done
     zpool destroy slave 2>/dev/null; sleep 3; ggatel destroy -u 666 2>/dev/null; sleep 3 # Clean up if needed
     ggatel create -u 666 /mnt/backup/chaos/slavefile; sleep 3
     zpool create -f -R /slave slave ggate666 && NOW=$(date +"backup-%Y%m%d-%H%M") || die 'Unable to create pool'
     zfs snapshot -r tank@$NOW || die 'Unable to snapshot'
     zfs send -R tank@$NOW | zfs recv -vFd slave
     mount_unmount unmount
     echo $NOW > /root/.last-backup
}
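
(For reference, it is reached through the same case statement as the other modes, i.e. it is run as "bash backup.sh initial".)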

After that, incrementals are fine again.

Regards,
Thomas
Received on Thu Jul 09 2009 - 10:36:28 UTC
