Re: Custom kernels causing Promise ATA RAID to go down

From: Alastair G. Hogge <agh_at_tpg.com.au>
Date: Fri, 11 Jun 2004 00:14:42 +1000
On Thursday 10 June 2004 14:28, Alastair G. Hogge wrote:

[I know I'm replying to myself here. But I've got some more info.]
[See arse end of email]

> On Tuesday 08 June 2004 15:56, Allan Fields wrote:
> > On Sun, Jun 06, 2004 at 07:40:15PM +1000, Alastair G. Hogge wrote:
> > > For a couple of weeks now I've been having problems with my custom
> > > kernel crashing the system. I've re-cvsup'd, nuked /usr/obj and
> > > rebuilt world.
> > >
> > > The problem is that my kernel keeps causing ATA DMA READ/WRITE
> > > errors and then eventually causing my RAID array to go down, thus
> > > needing deletion and re-definition through the BIOS, plus countless
> > > fsck runs.
> >
> > Yup, it sucks.. basically if your RAID goes bad, with most Promise
> > controllers you need to reboot into BIOS and wait a long time for
> > it to rebuild.  I found the Promise BIOS a little lacking.  I'm not
> > a fan of oblique menu-based tools, especially when working w/ disks.
> >
> > Online rebuild is available on some ATA controllers but can also be
> > slow.
> >
> > > I don't know how to capture and store the output, as the system just
> > > basically hangs and freezes the keyboard. Most of the time I've been in X,
> > > which can only be solved with a hard reboot.
> >
> > Also, just curious, but are you swapping off the RAID?
>
> Well, not sure if there's any swapping going on. I have 1024M of system
> memory, and the swap partition is located on the array.
>
> > If your RAID has read/write errors and you use it for swap, it is
> > likely that it will cause the system to lock, possibly including
> > the console.
> >
> > Do you have a second machine to use as a serial console?
>
> Unfortunately not. I'm working on getting one set up though.
>
> > Another thing to try: try pinging the host and see if it responds.
>
> Yes I can still ping the machine.
>
> > I use a null-modem cable and tip(1): When I was having problems w/
> > my Promise controller, I'd typically capture the output using
> > script(1) or screen(1).
>
> Ahhh very handy. Thanks :-)
>
> > > Running a GENERIC kernel (with debugging things removed) is so slow.
> > > X/KDE performs so poorly now.
> >
> > What's interesting is why this only happens w/ your custom kernels.
>
> Actually, I think a GENERIC kernel just lasts longer than a custom one. I left a
> GENERIC running for 6+ hours the other day while I went out, when I came
> back the system had locked up.
>
> > I've also experienced instability with Promise RAID controllers in
> > the past but didn't ever use a GENERIC kernel.  I'm interested in
> > this issue, but don't know if it's related.
> >
> > Also: Perhaps your Promise controller or drives are overheating?
>
> Thought about this, but I don't think it is the case. I've had the 2 HDs for
> some time now, and they used to run 24/7. I have 3 fans running in my tower
> case.
>
> I've just re-built world again recently and I'm still getting problems.
>
> I need to get that other machine going.

When the system goes down now (well, drops into the kernel debugger), I can no
longer ping the host. I've also been trying to use telnet on a WindowsXP box,
but that hangs when the system goes down, or I can't connect.

Anyways I wrote down the following:
ad6: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=128
ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=255

ad6: FAILURE - ATA_IDENTIFY no interrupt
ad6: FAILURE - ATA_IDENTIFY no interrupt
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ad4: FAILURE - ATA_IDENTIFY no interrupt
ad4: FAILURE - ATA_IDENTIFY no interrupt
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
ar1073679450: unknown array type in ar_done
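
Since I had to copy these by hand, here is a rough sketch of how console messages like the ones above could be tallied once they are captured to a file (for example with script(1) over a serial console). The function name `tally` and the line pattern are my own assumptions about the message layout, not anything taken from the ata(4) driver.

```python
import re
from collections import Counter

# Assumed layout: "<device><unit>: <message>", e.g. "ad6: TIMEOUT - ...".
LINE_RE = re.compile(r"^(?P<dev>[a-z]+\d+):\s+(?P<msg>.+)$")


def tally(lines):
    """Count (device, message) pairs; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            # Drop a trailing LBA so retries on different blocks group together.
            msg = re.sub(r"\s+LBA=\d+$", "", m.group("msg"))
            counts[(m.group("dev"), msg)] += 1
    return counts
```

Passing an open capture file to `tally()` and printing `most_common()` on the result would show which disk and which message dominate before the trap.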

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x3fff0c62
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc05782be
stack pointer           = 0x10:0xdb642b60
frame pointer           = 0x10:0xdb642c84
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, iopl=0
current process         = 28 (swi8: tty:sio clock)
kernel: type 12 trap, code 0
stopped at      bcmp+0x16:      repe cmpsl      (%esi),%es:(%edi)
db> trace
bcmp(c235b5800) at bcmp+0x16
in6_purgeaddr(c235b5800) at in6_purgeaddr+0x72
nd6_timer(0) at nd6_timer+0x272
softclock(0) at softclock+0x176
ithread_loop(c272c480,db64248,c227a480,e04723b4,0) at ithread_loop+0x134
fork_exit(c04723b4, c227a480, db642d48) at fork_exit+0x98
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xdb642d7c, ebp = 0 ---
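
For anyone reading the "fault code" line: as far as I understand it, the text is just the low three bits of the i386 page-fault error code spelled out. A minimal sketch of that decoding follows; the helper name `decode_page_fault` is made up for illustration and only the three architectural bits (present, write, user) are handled.

```python
def decode_page_fault(code):
    """Describe an i386 page-fault error code using its low three bits."""
    mode = "user" if code & 0x4 else "supervisor"   # bit 2: user/supervisor
    access = "write" if code & 0x2 else "read"      # bit 1: write/read
    # bit 0: set for a protection violation, clear if the page was not present
    cause = "protection violation" if code & 0x1 else "page not present"
    return f"{mode} {access}, {cause}"
```

With the "code 0" reported in the trap, all three bits are clear, which agrees with the "supervisor read, page not present" text in the dump.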
Received on Thu Jun 10 2004 - 12:14:21 UTC