More ATA interrupt(?) oddities

From: Daniel Eriksson <daniel_k_eriksson_at_telia.com>
Date: Thu, 12 Aug 2004 19:21:03 +0200
Sometime during the last 3 weeks, something has changed in the kernel that
makes one of my servers barf when hit with lots of I/O on multiple
filesystems (like when starting "fsck" on all the filesystems). This used to
work in the past, but I have unfortunately not run this stress-test in the
last 3 weeks despite upgrading the kernel multiple times (I know, regression
testing is your friend, I'll be more careful in the future). The problem
predates the last major commit to the ACPI subsystem, so it might not be
related to that. But since it seems to be related to resources handled by
ACPI, I also CC:ed Nate on this.

Attached is the kernel config, the output of "acpidump -t -d" and a file
containing the "boot -v" output. The "boot -v" file also contains a log of
the failure plus some additional info ("vmstat -i", "atacontrol list",
"shutdown -r now").

System:
-------
ASUS A7V600-X (VIA KT600 based) mobo, Athlon XP2500+ CPU, 1.25GB DDR RAM
Tons of discs hooked up to the onboard UDMA133 controller, the onboard SATA
controller, two HighPoint RocketRAID454 controllers and an Adaptec 29160
controller.
5-CURRENT built from sources dated 2004.08.12.13.30.00 (no CFLAGS/COPTFLAGS,
but I use "CPUTYPE?=athlon-xp" in make.conf)
The discs are used either as single discs, striped ataraid arrays, striped
gvinum arrays or mirrored gvinum arrays.

The problem:
------------
* With only /, /usr, /tmp and /var mounted and the machine running in
multi-user mode, I start multiple concurrent
"fsck -f -t ffs /dev/somefilesystem" processes, one for each of the other
12 filesystems.
* Almost immediately the server starts to spit out timeout errors like this:
ad5: TIMEOUT - READ_DMA retrying (2 retries left) LBA=22858569
* The bootlog_plus_info.txt contains a more detailed log of the errors (it
keeps resetting the ATA controller).
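
For anyone who wants to try to reproduce this, the concurrent run can be
sketched roughly as below. This is only an illustration: the device names are
hypothetical placeholders (not the filesystems on this machine), and the
DRY_RUN switch is an addition of the sketch so the commands can be reviewed
before anything touches a disc.

```sh
#!/bin/sh
# Sketch: start one "fsck -f -t ffs" per filesystem, all at the same time.
# The device names below are HYPOTHETICAL; substitute the real ones.
DEVICES="/dev/ad4s1e /dev/ad5s1e /dev/ar0s1e"

# DRY_RUN=1 (the default here) only prints the commands.
# Set DRY_RUN=0 to actually start the fsck processes.
: "${DRY_RUN:=1}"

run_parallel_fsck() {
    for dev in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "fsck -f -t ffs $dev"
        else
            # "&" puts each fsck in the background so they run concurrently.
            fsck -f -t ffs "$dev" &
        fi
    done
    # Block until every background fsck has exited.
    wait
}

# $DEVICES is left unquoted on purpose so it splits into one word per device.
run_parallel_fsck $DEVICES
```

Running the filesystems through the same loop without the "&" gives the
one-after-another case for comparison.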

More info:
----------
* Twice I have let the fsck processes continue despite the errors, and after
about a minute the machine locked up hard (nothing on the console, no ddb).
* If I abort the fsck processes, the machine seems to continue to work
properly.
* If I run fsck on the filesystems one after another, I get no errors at all.
* It does not seem to be related to SATA, since it happens whether or not I
run fsck on the SATA disc (I unhooked the second SATA disc because I've had
major problems with the machine when both are connected).
* I've tried this with and without "device apic" in the kernel config - it
seems to make no difference.
* A kernel from 2004.08.09.13.00.00 shows the same symptoms, so the bug was
introduced before that date.
* I will try to build a few more kernels from even older sources to see if I
can more closely pinpoint the date when it was introduced. This might
however take a few days.
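
The search for the offending date could be organised as a simple bisection
over source dates, sketched below. Everything specific in it is an assumption
made for illustration: offsets count days back from 2004.08.09 (the oldest
kernel known to fail so far), offset 18 stands in for a known-good kernel from
roughly three weeks before this mail, and is_bad is a stub for "build a kernel
from that date and run the concurrent fsck test".

```sh
#!/bin/sh
# Sketch: bisect over day offsets counted back from 2004-08-09.
# Offset 0 is known bad; the known-good offset is an ASSUMPTION.

# is_bad OFFSET: stub for building and testing a kernel from that date.
# It pretends the bug first appeared 5 days before 2004-08-09, purely
# so the sketch can run on its own.
is_bad() {
    [ "$1" -le 5 ]
}

# bisect_days BAD GOOD: print the largest (oldest) offset that still fails.
bisect_days() {
    bad=$1
    good=$2
    while [ $(( good - bad )) -gt 1 ]; do
        mid=$(( (bad + good) / 2 ))
        if is_bad "$mid"; then
            bad=$mid      # still fails: the bug is at least this old
        else
            good=$mid     # works: the bug was introduced after this date
        fi
    done
    echo "$bad"
}

bisect_days 0 18
```

With an 18-day window this needs at most five kernel builds, compared with
stepping back one day at a time.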

/Daniel Eriksson

Received on Thu Aug 12 2004 - 15:21:11 UTC
