Re: arcmsr crash

From: Scott Long <scottl_at_samsco.org>
Date: Fri, 13 Jul 2007 14:19:13 -0600

John Baldwin wrote:
> On Tuesday 05 June 2007 05:22:38 pm Matt Reimer wrote:
>> Once a week or so we're seeing a panic with a -current kernel built
>> just before the gcc 4.2 import (maybe three weeks ago). The box has a
>> Supermicro X7DBE/X7DBE+ motherboard with two Xeon 5160s, 16G RAM, and
>> an Areca 1220 controller with eight 500G disks connected.
>>
>> Does this indicate that the arcmsr driver is at fault:
>>
>> Tracing command irq16: arcmsr0 pid 26 tid 100018 td 0xffffff040fc5b000
>> cpustop_handler() at cpustop_handler+0x35
>> ipi_nmi_handler() at ipi_nmi_handler+0x2e
>> trap() at trap+0x365
>> nmi_calltrap() at nmi_calltrap+0x8
>> --- trap 0x13, rip = 0xffffffff8041ab11, rsp = 0xffffffffab59eff0, rbp = 0xffffffffac0a37d0 ---
>> siocnclose() at siocnclose+0x21
>> sio_cnputc() at sio_cnputc+0x89
>> cnputc() at cnputc+0x6a
>> putchar() at putchar+0x5f
>> kvprintf() at kvprintf+0xd45
>> printf() at printf+0xe1
>> panic() at panic+0x145
>> xpt_done() at xpt_done+0x14a
>> arcmsr_interrupt() at arcmsr_interrupt+0x2df
>> ithread_loop() at ithread_loop+0x108
>> fork_exit() at fork_exit+0xaa
>> fork_trampoline() at fork_trampoline+0xe
>> --- trap 0, rip = 0, rsp = 0xffffffffac0a3d30, rbp = 0 ---
> 
> Looks like it has panic'd here:
> 
>                 switch (done_ccb->ccb_h.path->periph->type) {
>                 case CAM_PERIPH_BIO:
>                         mtx_lock(&cam_bioq_lock);
>                         TAILQ_INSERT_TAIL(&cam_bioq, &done_ccb->ccb_h,
>                                           sim_links.tqe);
>                         done_ccb->ccb_h.pinfo.index = CAM_DONEQ_INDEX;
>                         mtx_unlock(&cam_bioq_lock);
>                         swi_sched(cambio_ih, 0);
>                         break;
>                 default:
>                         panic("unknown periph type %d",
>                             done_ccb->ccb_h.path->periph->type);
>                 }
> 
> which would seem to indicate that, yes, it is a driver bug.
> 

The doneq has gotten corrupted somehow.  The only real way that this
could happen is if xpt_done() was called twice on the same ccb.  Whether
this is a hardware bug (hardware completing the same command twice) or
a driver bug is unknown.  I'll try to add some seatbelts to CAM to
detect this kind of condition.  But yes, it's ultimately something in
the arcmsr subsystem that is at fault.
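
As a rough illustration of the kind of seatbelt meant here, the following
is a minimal userland sketch, not the actual CAM code. It assumes a
per-ccb index field that records whether the ccb is already on the done
queue, and refuses a second insertion instead of letting it damage the
queue links. All names (model_ccb, model_done, MODEL_DONEQ_INDEX and so
on) are hypothetical stand-ins.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical markers, modelled on the pinfo.index usage quoted above. */
#define MODEL_UNQUEUED_INDEX (-1)  /* ccb is not on the done queue */
#define MODEL_DONEQ_INDEX    (-2)  /* ccb is currently on the done queue */

/* Hypothetical stand-in for the parts of a ccb header that matter here. */
struct model_ccb {
	int index;                       /* models ccb_h.pinfo.index */
	TAILQ_ENTRY(model_ccb) links;    /* models ccb_h.sim_links.tqe */
};

TAILQ_HEAD(model_doneq, model_ccb);

static void
model_ccb_init(struct model_ccb *ccb)
{
	ccb->index = MODEL_UNQUEUED_INDEX;
}

/*
 * Queue a completed ccb.  Returns 0 on success, or -1 if the ccb is
 * already on the done queue (a double completion).  The queue is left
 * untouched in the error case.
 */
static int
model_done(struct model_doneq *q, struct model_ccb *ccb)
{
	if (ccb->index == MODEL_DONEQ_INDEX)
		return (-1);
	TAILQ_INSERT_TAIL(q, ccb, links);
	ccb->index = MODEL_DONEQ_INDEX;
	return (0);
}

/* Take the next completed ccb off the queue, or NULL if it is empty. */
static struct model_ccb *
model_dequeue(struct model_doneq *q)
{
	struct model_ccb *ccb = TAILQ_FIRST(q);

	if (ccb != NULL) {
		TAILQ_REMOVE(q, ccb, links);
		ccb->index = MODEL_UNQUEUED_INDEX;
	}
	return (ccb);
}
```

In a kernel, the error path would more likely be a panic or a KASSERT
naming the offending driver, so that the report points at the second
completion rather than at whatever trips over the damaged queue later.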

Scott
Received on Fri Jul 13 2007 - 18:19:30 UTC
