Re: SATA disks suddenly stop working

From: Gary Jennejohn <gary.jennejohn_at_freenet.de>
Date: Thu, 26 Feb 2009 12:22:12 +0100

On Wed, 25 Feb 2009 21:56:38 +0200
Alexander Motin <mav_at_FreeBSD.org> wrote:

> Gary Jennejohn wrote:
> > I've been having lots of problems with SATA drives attached to higher
> > port numbers, namely ata5 and ata6.
> > 
> > I was installing Linux under qemu today and it had been running for
> > several hours and had installed multi-gigabytes of data when qemu
> > just stopped.
> > 
> > I noticed that all I/O to the disk had ceased.
> > 
> > Doing "atacontrol reinit" on the port (ata5) resulted in a message
> > that the device was not configured, which was patently false since
> > qemu had just been merrily writing to it.
> > 
> > This with a kernel made from sources updated today at about 2 PM (GMT+1).
> > 
> > I've also seen problems with a disk attached to ata6.  It just sort
> > of disappears after a while.
> > 
> > Disks attached to ata2, ata3 and ata4 don't exhibit any problems.
> 
> You have said a lot, but at the same time given nothing that can be used.
> 

I was only interested in whether others have seen this problem.  I was
not looking for a solution.

> What controller do you have? What drives on what channels? Are there any 
> kernel messages about the problem? Have you tried to enable verbose 
> messages to get additional details?
> 

atapci0@pci0:0:17:0:    class=0x010601 card=0xb0021458 chip=0x43911002 rev=0x00 hdr=0x00
    vendor     = 'ATI Technologies Inc'
    class      = mass storage
    subclass   = SATA
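
For anyone reading the IDs above: as far as I know, the class value packs
base class, subclass and programming interface into three bytes, and the
chip value packs the device ID (high 16 bits) and vendor ID (low 16 bits).
A minimal Python sketch of that decoding follows. The function name is
purely illustrative and not part of any tool.

```python
def decode_pciconf_ids(class_code: int, chip: int) -> dict:
    """Split pciconf-style class and chip values into their fields.

    Assumes the usual PCI layout: class = base << 16 | sub << 8 | prog_if,
    chip = device_id << 16 | vendor_id. Hypothetical helper for illustration.
    """
    return {
        "base_class": (class_code >> 16) & 0xFF,  # 0x01 = mass storage
        "subclass": (class_code >> 8) & 0xFF,     # 0x06 = SATA
        "prog_if": class_code & 0xFF,             # 0x01 = AHCI
        "vendor_id": chip & 0xFFFF,               # low half of chip
        "device_id": (chip >> 16) & 0xFFFF,       # high half of chip
    }


# Values taken from the pciconf output above.
ids = decode_pciconf_ids(0x010601, 0x43911002)
for name, value in ids.items():
    print(f"{name} = {value:#06x}")
```

So the controller reports vendor 0x1002 (ATI), device 0x4391, and is
running in AHCI mode.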

There were no kernel messages at all, the drive simply hung.

I'll do a verbose boot and try to reproduce the disk hang later.

> Reinit could return ENXIO if it already was in progress. Disappearing 
> drives can also be related to that reinit. Couldn't it just be a real 
> hardware problem?
> 

I should have mentioned that the error returned was about some ioctl
failing.  I can't remember which one right now, but the error message did
say that the device was not configured.
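
That wording is consistent with ENXIO, whose message text on FreeBSD is
"Device not configured".  A small Python sketch for looking up an errno
name and the local message text is below.  The helper name is made up for
illustration, and the message string depends on the platform's libc, so
other systems will word it differently.

```python
import errno
import os


def describe_errno(code: int) -> tuple:
    """Return (symbolic name, local message text) for an errno value.

    Hypothetical helper; the message comes from the local C library,
    so its wording differs between operating systems.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_errno(errno.ENXIO)
print(f"{name} ({errno.ENXIO}): {message}")
```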

I've also noticed several times in the past when the problem occurred
that the BIOS could not enumerate the AHCI disks anymore.  I had to
do a POR (power-on reset, i.e. a full power cycle).  Seems that the
controller was completely hosed such that a simple reset didn't
reinitialize it sufficiently for it to work.

This morning I booted the box and started a cvsup.  My repository is
on a ZFS mirror with the disks on ata3 and ata4.  The system hung after
the data from the server were received, although all the data were
successfully written to the disks.

I couldn't do anything at all - it looked like the root disk was not
responding and the disk light was on solid red.  I had to do a hard
reset.

This is the first time I've seen a problem with this port.  The root
disk is on ata2.

I rebooted and turned off MSI.  I'll monitor the situation to see
whether that helps.

---
Gary Jennejohn
Received on Thu Feb 26 2009 - 10:22:15 UTC
