On 7/13/07, Scott Long <scottl_at_samsco.org> wrote:
> John Baldwin wrote:
> > On Tuesday 05 June 2007 05:22:38 pm Matt Reimer wrote:
> >> Once a week or so we're seeing a panic with a -current kernel built
> >> just before the gcc 4.2 import (maybe three weeks ago). The box has a
> >> Supermicro X7DBE/X7DBE+ motherboard with two Xeon 5160s, 16G RAM, and
> >> an Areca 1220 controller with eight 500G disks connected.
> >>
> >> Does this indicate that the arcmsr driver is at fault:
> >>
> >> Tracing command irq16: arcmsr0 pid 26 tid 100018 td 0xffffff040fc5b000
> >> cpustop_handler() at cpustop_handler+0x35
> >> ipi_nmi_handler() at ipi_nmi_handler+0x2e
> >> trap() at trap+0x365
> >> nmi_calltrap() at nmi_calltrap+0x8
> >> --- trap 0x13, rip = 0xffffffff8041ab11, rsp = 0xffffffffab59eff0, rbp
> >> = 0xffffffffac0a37d0 ---
> >> siocnclose() at siocnclose+0x21
> >> sio_cnputc() at sio_cnputc+0x89
> >> cnputc() at cnputc+0x6a
> >> putchar() at putchar+0x5f
> >> kvprintf() at kvprintf+0xd45
> >> printf() at printf+0xe1
> >> panic() at panic+0x145
> >> xpt_done() at xpt_done+0x14a
> >> arcmsr_interrupt() at arcmsr_interrupt+0x2df
> >> ithread_loop() at ithread_loop+0x108
> >> fork_exit() at fork_exit+0xaa
> >> fork_trampoline() at fork_trampoline+0xe
> >> --- trap 0, rip = 0, rsp = 0xffffffffac0a3d30, rbp = 0 ---
> >
> > Looks like it has panic'd here:
> >
> >         switch (done_ccb->ccb_h.path->periph->type) {
> >         case CAM_PERIPH_BIO:
> >                 mtx_lock(&cam_bioq_lock);
> >                 TAILQ_INSERT_TAIL(&cam_bioq, &done_ccb->ccb_h,
> >                                   sim_links.tqe);
> >                 done_ccb->ccb_h.pinfo.index = CAM_DONEQ_INDEX;
> >                 mtx_unlock(&cam_bioq_lock);
> >                 swi_sched(cambio_ih, 0);
> >                 break;
> >         default:
> >                 panic("unknown periph type %d",
> >                     done_ccb->ccb_h.path->periph->type);
> >         }
> >
> > which should seem to indicate that, yes, it is a driver bug.
> >
>
> The doneq has gotten corrupted somehow. The only real way that this
> could happen is if xpt_done() was called twice on the same ccb. Whether
> this is a hardware bug (hardware completing the same command twice) or
> a driver bug is unknown. I'll try to add some seatbelts to CAM to
> detect this kind of condition. But yes, it's ultimately something in
> the arcmsr subsystem that is at fault.

Do you have any suggestions of instrumentation printfs I could add to
zero in on what part of the driver is at fault?

Matt

Received on Fri Jul 13 2007 - 18:46:24 UTC
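
One possible way to narrow down a suspected double completion is to keep a
small state field next to each outstanding command and check it at every
place that completes a command. What follows is only a userland sketch of
that idea, not arcmsr or CAM code: struct request, request_start(),
request_complete() and the double_completions counter are all hypothetical
names invented for illustration. In a driver, the same kind of check would
presumably sit just before each xpt_done() call and report through the
kernel printf.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a driver's per-command block.
 * This is NOT the arcmsr or CAM data layout.
 */
enum req_state { REQ_FREE = 0, REQ_ACTIVE, REQ_DONE };

struct request {
	int id;                /* illustrative command identifier */
	enum req_state state;  /* where the command is in its life cycle */
	const char *done_by;   /* call site that completed it first */
};

/* Number of completions seen for commands that were not outstanding. */
int double_completions;

/* Mark a request as handed to the hardware. */
void
request_start(struct request *r)
{
	r->state = REQ_ACTIVE;
	r->done_by = NULL;
}

/*
 * Complete a request on behalf of the call site named by 'where'.
 * Returns 1 for a legitimate completion. Returns 0, counts the event
 * and logs both call sites if the request was not outstanding, which
 * is the situation a second completion of the same command produces.
 */
int
request_complete(struct request *r, const char *where)
{
	if (r->state != REQ_ACTIVE) {
		double_completions++;
		fprintf(stderr,
		    "req %d: bad completion from %s (state %d, first done by %s)\n",
		    r->id, where, (int)r->state,
		    r->done_by != NULL ? r->done_by : "nobody");
		return 0;
	}
	r->state = REQ_DONE;
	r->done_by = where;
	return 1;
}
```

Passing a distinct string from each completion path (interrupt handler,
timeout, abort, and so on) means the message names both the first and the
second completer, which is the information needed to tell which part of the
driver is involved.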