gvinum and RAID-5, page-fault on lost disc?

From: Daniel Eriksson <daniel_k_eriksson_at_telia.com>
Date: Thu, 12 Aug 2004 00:44:48 +0200

As I reported on the "Vinum status" thread a few days ago, gvinum is not
very graceful when a disc disappears/dies in a RAID-5 array during
operation. The machine I was testing this on was an SMP machine, but when I
recompiled the kernel with ddb/kdb support I removed "options SMP" just to
take that out of the equation.

The system:
-----------
ASUS P2B-DS mobo, 2 x P3/700, 1GB ECC RAM, latest BIOS (1014, beta 003)
Some old 9GB ATA disc for the OS on the onboard 440BX UDMA33 chipset
4 x 120GB Maxtor SATA discs on a HighPoint RocketRAID 1540 (HPT374 chipset)
The 4 120GB discs were put together in a RAID-5 array using "classic" vinum
(so they have overlapping slices and VINUM slicetypes).
Kernel/userland dated: 2004.08.10.21.00.00 (no CFLAGS/COPTFLAGS, but I use
"CPUTYPE?=p3")
gvinum started from /boot/loader.conf (geom_vinum_load="YES")

The crash:
----------
1. Stop all non-essential processes (to protect against unnecessary file
   corruption; normally I've got Apache+PHP4+MySQL running, among other
   things).
2. Set up one process to write to the RAID-5 array (I used a simple sftp
   to another machine, pulling down big files).
3. Pull one of the SATA cables.
4. *boom*

I have a vmcore dump and a kernel.debug, but I can't seem to get gdb53 to do
what I want (not very familiar with gdb), so here's the output I took down
on paper from within ddb. The first 5 lines are exactly what I would expect
to happen (and what "classic" vinum also did), but then something goes wrong
and the machine page-faults:

ad8: TIMEOUT - READ_DMA retrying (2 retries left) LBA = 196849498
ad8: WARNING - removed from configuration
gvinum: lost drive 'vinumdrive2'
FOO: sd raid5.p0.s2 is down
FOO: plex raid5.p0 is degraded

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x64
fault code            = supervisor read, page not present
instruction pointer   = 0x8:0xc08580fe
stack pointer         = 0x10:0xdcf9dc00
frame pointer         = 0x10:0xdcf9dc20
code segment          = base 0x0, limit 0xfffff, type 0x1b
                        DPL 0, pres 1, def32 1, gran 1
processor eflags      = interrupt enabled, resume, IOPL = 0
current process       = 2 (g_event)
[thread 100038]
Stopped at gv_drive_access+0x1e : movl 0x64(%eax),%ecx
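
A note on reading the trap message: the faulting instruction reads from
0x64(%eax) and the fault virtual address is 0x64, so %eax must have been 0.
In other words, gv_drive_access appears to have dereferenced a NULL pointer
to some structure with a member at offset 0x64. Which structure that is
cannot be told from the trap alone. A small Python sketch of the arithmetic
follows; the helper name base_pointer is made up for illustration and is not
part of any FreeBSD tool.

```python
import re

def base_pointer(fault_addr: int, operand: str, bits: int = 32) -> int:
    """Return the base-register value implied by a faulting disp(%reg) operand.

    Hypothetical helper: it only handles the simple AT&T form disp(%reg),
    with no index register or scale factor.
    """
    m = re.fullmatch(r"\s*(-?(?:0x[0-9a-fA-F]+|\d+))?\((%\w+)\)\s*", operand)
    if m is None:
        raise ValueError(f"not a disp(%reg) operand: {operand!r}")
    disp = int(m.group(1), 0) if m.group(1) else 0
    # fault address = base + displacement, so base = fault address - displacement
    return (fault_addr - disp) % (1 << bits)

# Values copied from the trap message above.
base = base_pointer(0x64, "0x64(%eax)")
print(f"%eax = {base:#x}" + (" (NULL pointer)" if base == 0 else ""))
```

With the values from the trap this prints "%eax = 0x0 (NULL pointer)".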

db> where
gv_drive_access(c2608b00,ffffffff,ffffffff,0,0) at gv_drive_access+0x1e
g_access(c260a8c0,ffffffff,ffffffff,0,c260a900) at g_access+0x16b
gv_plex_orphan(c260a8c0, .......)
g_orphan_register
one_event
g_run_events
g_event_procbody
fork_exit
fork_trampoline
--- trap 0x1, eip = 0, esp = 0xdcf9dd7c, ebp = 0 ---
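
The argument words in the backtrace above can be decoded a little further.
Assuming the usual GEOM signature g_access(cp, dcr, dcw, dce), the three
words after the consumer pointer are deltas for the read, write and exclusive
access counts, and ddb prints them as unsigned 32-bit hex. The sketch below
(as_signed32 is a made-up helper name) converts them back to signed values.
Note that ddb on i386 simply prints stack words, so they may not match the
real arguments exactly.

```python
def as_signed32(word: str) -> int:
    """Interpret a hex word printed by ddb as a signed 32-bit integer."""
    value = int(word, 16)
    return value - (1 << 32) if value & 0x80000000 else value

# First four argument words of the g_access frame in the backtrace above.
cp, dcr, dcw, dce = "c260a8c0 ffffffff ffffffff 0".split()
print(as_signed32(dcr), as_signed32(dcw), as_signed32(dce))
```

This prints "-1 -1 0". If the assumption holds, gv_plex_orphan was dropping
one read and one write reference on the consumer of the lost drive when
gv_drive_access faulted.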

Attached are the kernel config file and the dmesg. Let me know if there is
anything you want me to do with gdb to further track this down. I could
probably even arrange for a guest account on this machine if someone wants
to take a closer look at the vmcore file.

/Daniel Eriksson