Re: "New" ZFS crash on FS (pool?) unmount/export

From: Thomas Backman <serenity_at_exscape.org> Date: Thu, 9 Jul 2009 11:01:51 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:39:51 UTC

On Jun 20, 2009, at 09:11, Thomas Backman wrote:

> I just ran into this tonight. Not sure exactly what triggered it -  
> the box stopped responding to pings at 02:07AM and it has a cron  
> backup job using zfs send/recv at 02:00, so I'm guessing it's  
> related, even though the backup probably should have finished before  
> then... Hmm. Anyway.
>
> r194478.
>
> kernel trap 12 with interrupts disabled
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x288
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff805a4989
> stack pointer           = 0x28:0xffffff803e8b57e0
> frame pointer           = 0x28:0xffffff803e8b5840
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                        = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = resume, IOPL = 0
> current process         = 57514 (zpool)
> panic: from debugger
> cpuid = 0
> Uptime: 10h22m13s
> Physical memory: 2027 MB
>
> (kgdb) bt
> #0  doadump () at pcpu.h:223
> #1  0xffffffff8059c409 in boot (howto=260) at /usr/src/sys/kern/ 
> kern_shutdown.c:419
> #2  0xffffffff8059c85c in panic (fmt=Variable "fmt" is not available.
> ) at /usr/src/sys/kern/kern_shutdown.c:575
> #3  0xffffffff801f1377 in db_panic (addr=Variable "addr" is not  
> available.
> ) at /usr/src/sys/ddb/db_command.c:478
> #4  0xffffffff801f1781 in db_command (last_cmdp=0xffffffff80c38620,  
> cmd_table=Variable "cmd_table" is not available.
> ) at /usr/src/sys/ddb/db_command.c:445
> #5  0xffffffff801f19d0 in db_command_loop () at /usr/src/sys/ddb/ 
> db_command.c:498
> #6  0xffffffff801f3969 in db_trap (type=Variable "type" is not  
> available.
> ) at /usr/src/sys/ddb/db_main.c:229
> #7  0xffffffff805ce465 in kdb_trap (type=12, code=0,  
> tf=0xffffff803e8b5730) at /usr/src/sys/kern/subr_kdb.c:534
> #8  0xffffffff8088715d in trap_fatal (frame=0xffffff803e8b5730,  
> eva=Variable "eva" is not available.
> ) at /usr/src/sys/amd64/amd64/trap.c:847
> #9  0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/ 
> src/sys/amd64/amd64/trap.c:345
> #10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/ 
> exception.S:223
> #11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50,  
> tid=18446742975830720512, opts=Variable "opts" is not available.
> )
>    at /usr/src/sys/kern/kern_sx.c:575
> #12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not  
> available.
> ) at sx.h:155
> #13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/ 
> zfs.ko
> #14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38,  
> a=0xffffff0043557d50) at vnode_if.c:1926
> #15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at  
> vnode_if.h:830
> #16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0,  
> flags=0, td=0xffffff0061528000)
>    at /usr/src/sys/kern/vfs_subr.c:2450
> #17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko
> #18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000,  
> flags=1626513408, td=Variable "td" is not available.
> )
>    at /usr/src/sys/kern/vfs_mount.c:1287
> #19 0xffffffff80624975 in unmount (td=0xffffff0061528000,  
> uap=0xffffff803e8b5c00)
>    at /usr/src/sys/kern/vfs_mount.c:1172
> #20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/ 
> src/sys/amd64/amd64/trap.c:984
> #21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/ 
> amd64/exception.S:364
> #22 0x000000080104e49c in ?? ()
> Previous frame inner to this frame (corrupt stack?)

I might have hit the same thing again... only it didn't *crash* this  
time, just freeze! I got a "GEOM_GATE: Device ggateX destroyed" on my  
console, and it stopped responding to pings, keyboard input, etc.
(NOTE: The GEOM_GATE message MAY have been an old one. I *think* it  
was from the night before but can't be sure...)

It obviously happened while running my ugly-hack backup script this  
time too, since it stopped responding to pings ~02:02AM with the  
script running at 02:00.
I'm not sure where it crashed, since snapshots were NOT taken on the  
"local box". It usually crashes during export, long after taking the  
snapshots.

BTW, current system is BETA1 r195422M (dtrace timestamp patch +  
libzfs_sendrecv.c patch ( http://lists.freebsd.org/pipermail/freebsd-current/2009-May/006814.html 
  ).

Here's the script in its relevant entirety (I removed the "initial  
backup" part since it never runs using cron anyway).
I realize it's an ugly hack (and that my bash-fu could be stronger),  
but what the heck. Essentially, it creates a GEOM provider of a file,  
containing a zpool, imports the pool and creates a clone to it, and  
then exports the pool. The export appears to be what causes all the  
trouble - usually, not this time around. Every time it crashes it  
seems to be during or very soon after the export - only this time it  
didn't even take the snapshots.

#!/bin/bash

PATH="$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/ 
sbin"

function die() {
	echo "$_at_" 2>&1
	zpool export slave 2>&1 > /dev/null
	ggatel destroy -u 666 2>&1 > /dev/null
	exit 1
}

function mount_unmount {
	if [ -z "$1" ]; then
		die 'Invalid argument given to mount_unmount'
	elif [[ "$1" == "mount" ]]; then
		zpool list | grep -q slave
		if [ "$?" = "0" ]; then
			echo Already mounted
			return 0
		fi
		echo Creating ggate device
		ggatel create -u 666 /mnt/backup/chaos/slavefile || die 'Unable to  
create GEOM provider from file'
		echo 'Sleeping for 5 seconds...'
		sleep 5
		echo Importing pool
		zpool import -R /slave slave || die 'Unable to import slave pool'
	elif [[ "$1" == "unmount" ]]; then
		echo Exporting pool
		zpool export slave || die 'Unable to export slave pool'
		ggatel destroy -u 666 || die 'Unable to destroy GEOM provider'
	fi
}

f [ ! -z "$1" ]; then
	case $1 in
	mount) mount_unmount mount; exit 0;;
	unmount) mount_unmount unmount; exit 0;;
	initial) initial; exit 0;;
	backup) ;;
	*) help;;
	esac
else
	help
fi

if [ ! -f "/mnt/backup/chaos/slavefile" ]; then
	echo 'Backup error! slavefile does not exist!' | mail -s "Backup  
error" root
	echo 'Slavefile does not exist!'
	exit 1
fi

mount_unmount mount

CURR=$(date +"backup-%F-%H%M")

echo Taking snapshots
zfs snapshot -r tank_at_$CURR || die 'Unable to create $CURR snapshot'

echo Starting backup...
LAST=$(cat /root/.last-backup)
zfs send -R -I $LAST tank_at_$CURR | zfs recv -Fvd slave

echo $CURR > /root/.last-backup

mount_unmount unmount

echo Running rsync
rsync -av --delete /bootdir/boot exscape::backup-freebsd/chaos
rsync -av --delete /root exscape::backup-freebsd/chaos
rsync -av --delete ~serenity exscape::backup-freebsd/chaos

echo 'All done!'

-------------------

So, in *normal* cases, everything runs just fine. This is the case  
perhaps 90% of the time. In normal *crashes*, it hangs on export with  
the above backtrace. This time, all I know is that it crashes soon  
after starting the backup... during import, perhaps?

Regards,
Thomas