RE: escalation stage 2 [was:RE: Big and ugly bug in 5.1-release]

From: Harald Schmalzbauer <h_at_schmalzbauer.de>
Date: Wed, 16 Jul 2003 13:42:34 +0200
Lukas Ertl wrote:
> On Wed, 16 Jul 2003, Harald Schmalzbauer wrote:
>
> > Now after resetting the machine which was hung by "sysinstall" it claims
> > that ad4 (one of two mirrored 30GB 2.5" disks) was absent (see dmesg
> > below).
> > Now the controller warns me that one drive is bad (which in fact it
> > definitely is not) and allows me to select "continue boot".
>
> Did you try to replace that defective drive?

If it were defective I'd replace it. But I'm absolutely sure it isn't
defective; it's a bug.

>
> > That's what I do, and after kernel probing the machine reboots with the
> > following error (well, it takes some time to type it in from my
> > monochrome screen):
> >
> > Fatal trap 12: page fault while in kernel mode
> > fault virtual address   = 0x10
> > fault code              = supervisor read, page not present
> > instruction pointer     = 0x8:0xc014a0a6
> > stack pointer           = 0x10:0xcce65bd8
> > frame pointer           = 0x10:0xcce65c58
> > code segment            = base 0x0, limit 0xfffff type 0x1b
> >                         = DPL 0, pres 1, def32 1, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL=0
> > current process         = 4 (g_down)
> > trap number             = 12
> > panic: page fault
> >
> > Then it reboots!
>
> Can you get a coredump and a backtrace? That would be very helpful in
> debugging.

Not on that machine, since I can't install a new kernel. I can build a ddb
kernel to track down the initial crash, but not its consequence, namely
an inaccessible RAID1.

>
> > Now please give me a hint what to do. This is my brand new fileserver,
> > which collected all important data from the last decade, and since it's
> > brand new I didn't manage any backup.
>
> Funny, there's always an excuse why there are no backups.

I really never had any backup when it was needed *grumpf*

>
> > When testing the hardware (unplugging one drive while the machine was
> > running) I had the same error but I thought that would never happen
> > under normal circumstances.
>
> Well, if you did run tests and saw the errors, why did you think it
> wouldn't happen "under normal circumstances"?

Because I accepted the risk that if one drive really fails I have to
manually recover it, assuming I have a backup to be sure. But this time
it's definitely not a hardware failure.

>
> IMHO it would be better if you start over with a clean machine and two new
> disks. Sounds very much like you damaged the drive.

I'll now release the RAID1 and create a new one, duplicating the data from
ad6 with the controller's BIOS. I'm sure this will work, but that's not the
point.
The machine should boot with one drive detached!

Perhaps Soren has an idea?

Best regards,

-Harry

>
> regards,
> le
>
> --
> Lukas Ertl                             eMail: l.ertl_at_univie.ac.at
> UNIX-Systemadministrator               Tel.:  (+43 1) 4277-14073
> Zentraler Informatikdienst (ZID)       Fax.:  (+43 1) 4277-9140
> der Universität Wien                   http://mailbox.univie.ac.at/~le/
>
Received on Wed Jul 16 2003 - 02:42:52 UTC
