Re: SATA disks suddenly stop working

From: Gary Jennejohn <gary.jennejohn_at_freenet.de>
Date: Sat, 28 Feb 2009 22:05:45 +0100
On Sat, 28 Feb 2009 14:48:52 -0500
Elliot Schlegelmilch <elliot+list_at_schlegelmilch.org> wrote:

> On Thu, Feb 26, 2009 at 12:22:12PM +0100, Gary Jennejohn wrote:
> > On Wed, 25 Feb 2009 21:56:38 +0200
> > Alexander Motin <mav_at_FreeBSD.org> wrote:
> > 
> > > Gary Jennejohn wrote:
> > > > I've been having lots of problems with SATA drives attached to higher
> > > > port numbers, namely ata5 and ata6.
> > > > 
> > > > I was installing Linux under qemu today and it had been running for
> > > > several hours and had installed multi-gigabytes of data when qemu
> > > > just stopped.
> > > > 
> > > > I noticed that all I/O to the disk had ceased.
> > > > 
> > > > Doing "atacontrol reinit" on the port (ata5) resulted in a message
> > > > that the device was not configured, which was patently false since
> > > > qemu had just been merrily writing to it.
> > > > 
> > > > This with a kernel made from sources updated today at about 2 PM (GMT+1).
> > > > 
> > > > I've also seen problems with a disk attached to ata6.  It just sort
> > > > of disappears after a while.
> > > > 
> > > > Disks attached to ata2, ata3 and ata4 don't exhibit any problems.
> > > 
> > > You have said a lot, yet given nothing that can be used.
> > > 
> > 
> > I was only interested in whether others have seen this problem.  I was
> > not looking for a solution.
> > 
> > > What controller do you have? What drives on what channels? Are there any
> > > kernel messages about the problem? Have you tried to enable verbose
> > > messages to get additional details?
> > > 
> > 
> > atapci0_at_pci0:0:17:0:    class=0x010601 card=0xb0021458 chip=0x43911002 rev=0x00 hdr=0x00
> >     vendor     = 'ATI Technologies Inc'
> >     class      = mass storage
> >     subclass   = SATA
> > 
> > There were no kernel messages at all, the drive simply hung.
> > 
> > I'll do a verbose boot and try to reproduce the disk hang later.
> > 
> > > Reinit could return ENXIO if it was already in progress. Disappearing
> > > drives can also be related to that reinit. Couldn't it just be a real
> > > hardware problem?
> > > 
> > 
> > I should have mentioned that the error returned was about some IOCTL.
> > Can't remember which one right now, but the error message did include
> > that the device was not configured.
> > 
> > I've also noticed several times in the past when the problem occurred
> > that the BIOS could not enumerate the AHCI disks anymore.  I had to
> > do a POR.  Seems that the controller was completely hosed such that
> > a simple reset didn't reinitialize it sufficiently for it to work.
> > 
> > This morning I booted the box and started a cvsup.  My repository is
> > on a ZFS mirror with the disks on ata3 and ata4.  The system hung after
> > the data from the server were received, although all the data were
> > successfully written to the disks.
> > 
> > I couldn't do anything at all - it looked like the root disk was not
> > responding and the disk light was on solid red.  I had to do a hard
> > reset.
> > 
> > This is the first time I've seen a problem with this port.  The root
> > disk is on ata2.
> > 
> > I rebooted and turned off MSI.  I'll monitor the situation to see
> > whether that helps.
> 
> I don't mean to hijack your thread, but I've had problems with one of
> my SATA disks falling off the bus.  I could usually retrieve it with
> an atacontrol detach / reattach.  However, with a recent kernel all I'm
> getting is this:
> 
> ata2: <ATA channel 0> on atapci1
> ata2: AHCI reset...: 2
> ata2: SATA connect time=0ms
> ata2: ready wait time=0ms52 (12272 MB)
> ata2: software reset port 15...
> ata2: ahci_issue_cmd timeout: 100 of 100ms, status=00000001
> ata2: software reset set timeout
> ata2: software reset port 0...
> ata2: ahci_issue_cmd timeout: 100 of 100ms, status=00000001
> ata2: software reset set timeout
> ata2: SIGNATURE: ffffffff
> ata2: Unknown signature, assuming disk device
> ata2: AHCI reset done: devices=00000001
> ata2: [MPSAFE]
> ata2: [ITHREAD]
> 
> One for each channel, up to ata7. 
> 

This is what I see when e.g. ata6 is hosed.  Interesting to see that
not just ATI (780G) has problems.
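
For anyone who wants to check a verbose dmesg for the same pattern, here is
a rough Python sketch.  It assumes the log lines look like the excerpt quoted
above; the function name and the choice of "suspect" markers are mine, not
anything the ata driver defines.

```python
import re

# Log excerpt copied from the verbose boot output quoted above.
SAMPLE_LOG = """\
ata2: AHCI reset...: 2
ata2: software reset port 15...
ata2: ahci_issue_cmd timeout: 100 of 100ms, status=00000001
ata2: software reset set timeout
ata2: SIGNATURE: ffffffff
ata2: Unknown signature, assuming disk device
ata2: AHCI reset done: devices=00000001
"""

CHANNEL_LINE = re.compile(r"^(ata\d+): (.*)$")
# Assumption: these substrings are treated as signs of a failed reset.
TIMEOUT_MARKERS = ("ahci_issue_cmd timeout", "software reset set timeout")
BAD_SIGNATURE = "SIGNATURE: ffffffff"


def find_suspect_channels(log_text):
    """Return {channel: sorted list of reasons} for channels that look hosed."""
    suspects = {}
    for raw in log_text.splitlines():
        # Drop leading '>' and spaces so quoted mail text can be pasted in.
        line = raw.lstrip("> ").strip()
        match = CHANNEL_LINE.match(line)
        if not match:
            continue
        channel, message = match.groups()
        if any(marker in message for marker in TIMEOUT_MARKERS):
            suspects.setdefault(channel, set()).add("timeout")
        if BAD_SIGNATURE in message:
            suspects.setdefault(channel, set()).add("bad signature")
    return {channel: sorted(reasons) for channel, reasons in suspects.items()}


if __name__ == "__main__":
    for channel, reasons in find_suspect_channels(SAMPLE_LOG).items():
        print(channel, ", ".join(reasons))
```

Feeding it the whole dmesg should list every channel that shows the timeouts
or the all-ones signature, which makes it easier to compare boots.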

I tried a detach/reattach once, but interesting things happened because
the disk was mounted and I was (stupidly) cd'd to it.
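
To avoid repeating that mistake, something like the following Python sketch
could check the working directory before touching the channel.  It only
builds the command list and does not run anything.  The helper name is made
up, and the final "mount" step assumes the file system has an fstab entry;
"atacontrol detach" and "atacontrol attach" take the channel name as shown.

```python
import os


def plan_reattach(channel, mount_point, cwd=None):
    """Return the commands to run, in order, for a detach/attach cycle.

    Raises RuntimeError if the working directory is on the mounted disk,
    since the unmount would fail (or worse) in that case.
    """
    cwd = os.path.realpath(cwd or os.getcwd())
    mount_point = os.path.realpath(mount_point)
    prefix = mount_point.rstrip(os.sep) + os.sep
    if cwd == mount_point or cwd.startswith(prefix):
        raise RuntimeError(
            "cwd %s is on %s; cd elsewhere first" % (cwd, mount_point)
        )
    return [
        ["umount", mount_point],
        ["atacontrol", "detach", channel],
        ["atacontrol", "attach", channel],
        # Assumption: mount_point is listed in /etc/fstab.
        ["mount", mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical channel and mount point, for illustration only.
    for command in plan_reattach("ata6", "/mnt/qemu", cwd="/root"):
        print(" ".join(command))
```

Each returned list could be handed to subprocess.run() once the plan looks
right.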

I tried mounting the disk with the sync option today, which may have helped.
Hard to say.  I was able to do an online update of the openSUSE installation
that I run from a qemu image on the affected disk, and it succeeded.

> atapci0_at_pci0:0:31:1:    class=0x01018a card=0x948115d9 chip=0x269e8086 rev=0x09 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '631xESB/632xESB/3100 Ultra ATA Storage Controller'
>     class      = mass storage
>     subclass   = ATA
> 
> The last known working kernel was from Dec 17, but a kernel rebuilt from
> that date doesn't see the SATA disks either (and with the kernel that does
> see the disks, zfs doesn't work).  Or perhaps I'm csup'ing
> incorrectly.  I'm still trying to back up far enough so it will work.


---
Gary Jennejohn
Received on Sat Feb 28 2009 - 20:05:49 UTC