Re: Fatal trap 12: page fault while in kernel mode -- Stopped at atomic_subtract_int+0x4

From: Fabian Keil <freebsd-listen_at_fabiankeil.de>
Date: Wed, 19 Oct 2011 23:09:15 +0200
Fabian Keil <freebsd-listen_at_fabiankeil.de> wrote:

> I pretty reproducible get the following (handtranscribed) panic
> when sending an zfs snapshot to geli provider based on an USB
> stick that disappears (due to a bug, or because it's unplugged): 
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0: apic id = 00
> fault virtual address = 0x288
> fault code	      = supervisor write data, page not present
> instruction pointer   = 0x20:0xffffffff808e2984
> stack pointer         = 0x28:0xffffff800023fba0
> frame pointer         = 0x28:0xffffff800023fbb0
> code segment          = base 0x0, limit 0xfffff, type 0x1b
>                       = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags      = interrupt enabled, resume, IOPL = 0
> current process       = 13 (g_up)
> [ thread pid 13 tid 100010 ]
> Stopped at    atomic_subtract_int+0x4: lock subl %esi,(%rdi)
> db> where
> Tracing pid 13 tid 100010 td 0xfffffe00027998c0
> atomic_subtract_int() at atomic_subtract_int+0x4
> g_io_schdule_up() at g_io_schedule_up+0xa6
> g_up_procbody() at g_up_procbody+0x5c
> fork_exit() at fork_exit+0x11f
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff800023fd00, rbp 0 ---
> 
> It seems to be important that ZFS is actually writing to the stick.
> If the stick is unplugged while the operation is stalled for other
> reasons, this particular panic doesn't seem to occur.
> 
> While I end up in the debugger, dumping core doesn't work
> and produces a double fault and a bunch of duplicated
> messages (again handtranscribed):

The duplicated messages have been recently fixed.
 
> db> dump
> Dumping 443 out of 1974 MB: Dumping 443 out of 1974 MB
> 
> Fatal double fault
> Fatal double fault
> rip = 0xffffffff8066a9e0
> rip = 0xffffffff8066a9e0
> rsp = 0xffffff800023c000
> rsp = 0xffffff800023c000
> rbp = 0xffffff800023c040
> rbp = 0xffffff800023c040
> cpuid = 0; cpuid = 0; apic id = 00
> apic id = 00
> panic: double fault
> panic: double fault
> cpuid = 0
> cpuid = 0
> KDB: stack backtrace:
> KDB: stack backtrace:
> db_trac_self_wrapper() at db_trace_self_wrapper+0x2a
> kdb_backtrace() at kdb_backtrace+0x37
> panic() at panic+0x187
> dblfault_handler() at dblfault_handler+0xa4
> Xdblfault() at Xdblfault+0xa8
> --- trap 0x17, rip = 0xffffffff8066a9e8, rsp = 0xffffffff80e56158, rbp = 0xffffff800023c040 ---
> mi_switch() at mi_switch+0x270
> critical_exit() at critical_exit+0x9b
> spinlock_exit() at spinlock_exit+0x17
> mi_switch() at mi_switch+0x275
> critical_exit() at critical_exit+0x9b
> spinlock_exit() at spinlock_exit+0x17
> [several pages of the previous three lines skipped]
> mi_switch() at mi_switch+0x275
> critical_exit() at critical_exit+0x9b
> spinlock_exit() at spinlock_exit+0x17
> intr_even_schedule_thread() at intr_event_schedule_thread+0xbb
> ahci_end_transaction() at ahci_end_transaction+0x398
> ahci_ch_intr() at ahci_ch_intr+0x2b5
> ahcipoll() at ahcipoll+0x15
> xpt_polled_action() at xpt_polled_action+0xf7
> 
> I first noticed the problem with CURRENT from a week ago,
> but given that USB sticks don't usually disappear for me
> while sending snapshots to them, the problem might not
> be new.
> 
> I'm using amd64, the panic above is from a custom kernel
> without WITNESS and INVARIANTS, but enabling them doesn't
> seem to affect the trace before the double fault.
> 
> I wasn't able to reproduce the panic by unplugging the stick
> while writing to the pool using dd (but only tried once).

Here's another one, again with recent HEAD.

This time the USB stick disappeared while the pool was
being scrubbed and dumping actually worked. The stick
seems to reproducibly disappear after scrubbing it for
a while and the panic seems to be reproducible as well.

The stack trace looks a bit different, but I'm not sure if
this is because it's a slightly different situation or because
of changes in HEAD.

fk_at_r500 /usr/crash $kgdb kernel.3/kernel vmcore.3
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
ugen7.2: <Generic> at usbus7 (disconnected)
umass0: at uhub7, port 2, addr 2 (disconnected)
(da0:umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): Selection timeout
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): Selection timeout
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): Selection timeout
(da0:umass-sim0:0:0:0): Retrying command
(pass2:umass-sim0:0:0:0): lost device
(pass2:umass-sim0:0:0:0): removing device entry
(da0:umass-sim0:0:0:0): lost device - 1 outstanding
(da0:umass-sim0:0:0:0): Error 6, Retries exhausted
(da0:umass-sim0:0:0:0): oustanding 0
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2958844928, length=6144)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2958851584, length=21504)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2390255616, length=18944)]
GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=270336, length=8192)]


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x284
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81314512
stack pointer           = 0x28:0xffffff8000244b90
frame pointer           = 0x28:0xffffff8000244bc0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (g_down)

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done.
done.
[...]
Loaded symbols for /boot/kernel/fdescfs.ko
#0  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:260
260             if (textdump && textdump_pending) {
(kgdb) where
#0  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:260
#1  0xffffffff80333380 in db_dump (dummy=Variable "dummy" is not available.
) at /usr/src/sys/ddb/db_command.c:537
#2  0xffffffff80332cb1 in db_command (last_cmdp=0xffffffff80dfd100, cmd_table=Variable "cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:448
#3  0xffffffff80332f00 in db_command_loop () at /usr/src/sys/ddb/db_command.c:501
#4  0xffffffff80335039 in db_trap (type=Variable "type" is not available.
) at /usr/src/sys/ddb/db_main.c:229
#5  0xffffffff806a1061 in kdb_trap (type=12, code=0, tf=0xffffff8000244ae0) at /usr/src/sys/kern/subr_kdb.c:625
#6  0xffffffff80903bbd in trap_fatal (frame=0xffffff8000244ae0, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:813
#7  0xffffffff80903f26 in trap_pfault (frame=0xffffff8000244ae0, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:734
#8  0xffffffff8090448f in trap (frame=0xffffff8000244ae0) at /usr/src/sys/amd64/amd64/trap.c:473
#9  0xffffffff808ee1e3 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:228
#10 0xffffffff81314512 in g_eli_start (bp=0xfffffe000c44cd98) at /usr/src/sys/modules/geom/geom_eli/../../../geom/eli/g_eli.c:320
#11 0xffffffff80600304 in g_io_schedule_down (tp=Variable "tp" is not available.
) at /usr/src/sys/geom/geom_io.c:632
#12 0xffffffff8060060c in g_down_procbody (arg=Variable "arg" is not available.
) at /usr/src/sys/geom/geom_kern.c:110
#13 0xffffffff806399ef in fork_exit (callout=0xffffffff806005b0 <g_down_procbody>, arg=0x0, frame=0xffffff8000244c50) at /usr/src/sys/kern/kern_fork.c:995
#14 0xffffffff808ee70e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:602
#15 0x0000000000000000 in ?? ()
#16 0x0000000000000000 in ?? ()
#17 0x0000000000000001 in ?? ()
#18 0x0000000000000000 in ?? ()
[...]
#39 0xffffffff80e398f0 in sleepq_chains ()
#40 0xfffffe0002713888 in ?? ()
#41 0x0000000000000000 in ?? ()
#42 0xfffffe0002713460 in ?? ()
#43 0xffffff8000244ad0 in ?? ()
#44 0xffffff8000244a78 in ?? ()
#45 0xfffffe00027138c0 in ?? ()
#46 0xffffffff80693ca0 in sched_switch (td=0xffffffff806005b0, newtd=0x0, flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1853
Previous frame inner to this frame (corrupt stack?)
(kgdb) f 10
#10 0xffffffff81314512 in g_eli_start (bp=0xfffffe000c44cd98) at /usr/src/sys/modules/geom/geom_eli/../../../geom/eli/g_eli.c:320
320     }
(kgdb) p *bp
$1 = {bio_cmd = 1 '\001', bio_flags = 0 '\0', bio_cflags = 0 '\0', bio_pflags = 255 'ÿ', bio_dev = 0x0, bio_disk = 0x0, bio_offset = 4022607872, bio_bcount = 0,
  bio_data = 0xffffff8005747000 "\200*ø\f", bio_error = 0, bio_resid = 0, bio_done = 0xffffffff811a72d0 <vdev_geom_io_intr>, bio_driver1 = 0xfffffe0004c83828, bio_driver2 = 0x0,
  bio_caller1 = 0xfffffe001cf5ba50, bio_caller2 = 0x0, bio_queue = {tqe_next = 0xfffffe000c44b3a0, tqe_prev = 0xffffffff80e1c840}, bio_attribute = 0x0, bio_from = 0xfffffe000c782a80,
  bio_to = 0xfffffe000cc8ca00, bio_length = 8192, bio_completed = 0, bio_children = 1, bio_inbed = 0, bio_parent = 0x0, bio_t0 = {sec = 243, frac = 14970883905652654531},
  bio_task = 0, bio_task_arg = 0x0, bio_classifier1 = 0x0, bio_classifier2 = 0x0, bio_pblkno = 0}

Fabian

Received on Wed Oct 19 2011 - 19:13:33 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:19 UTC