Sometime during the last 3 weeks, something has changed in the kernel that makes one of my servers barf when hit with lots of I/O on multiple filesystems (like when starting "fsck" on all the filesystems). This used to work in the past, but I have unfortunately not run this stress-test in the last 3 weeks despite upgrading the kernel multiple times (I know, regression testing is your friend, I'll be more careful in the future).

The problem predates the last major commit to the ACPI subsystem, so it might not be related to that. But since it seems to be related to resources handled by ACPI, I also CC:ed Nate on this.

Attached are the kernel config, the output of "acpidump -t -d" and a file containing the "boot -v" output. The "boot -v" file also contains a log of the failure plus some additional info ("vmstat -i", "atacontrol list", "shutdown -r now").

System:
-------
ASUS A7V600-X (VIA KT600 based) mobo, Athlon XP2500+ CPU, 1.25GB DDR RAM

Tons of discs hooked up to the onboard UDMA133 controller, the onboard SATA controller, two HighPoint RocketRAID454 controllers and an Adaptec 29160 controller.

5-CURRENT built from sources dated 2004.08.12.13.30.00 (no CFLAGS/COPTFLAGS, but I use "CPUTYPE?=athlon-xp" in make.conf)

The discs are used either as single discs, striped ataraid arrays, striped gvinum arrays or mirrored gvinum arrays.

The problem:
------------
* With only /, /usr, /tmp and /var mounted running in multi-user mode, I start multiple concurrent "fsck -f -t ffs /dev/somefilesystem" on the other 12 filesystems.
* Almost immediately the server starts to spit out timeout errors like this:
  ad5: TIMEOUT - READ_DMA retrying (2 retries left) LBA=22858569
* The bootlog_plus_info.txt contains a more detailed log of the errors (it keeps resetting the ATA controller).

More info:
----------
* Twice I have let the fsck processes continue despite the errors, and after about a minute the machine locked up hard (nothing on the console, no ddb).
* If I abort the fsck processes, the machine seems to continue to work properly.
* If I run fsck on the filesystems one after another, I get no errors at all.
* It does not seem to be related to SATA, since it happens whether or not I run fsck on the SATA disc (I unhooked the second SATA disc because I've had major problems with the machine when both are connected).
* I've tried this with and without "device apic" in the kernel config - it seems to make no difference.
* A kernel from 2004.08.09.13.00.00 shows the same symptoms, so the bug was introduced before that date.
* I will try to build a few more kernels from even older sources to see if I can more closely pinpoint the date when it was introduced. This might however take a few days.

/Daniel Eriksson
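For anyone who wants to try reproducing the concurrent load described under "The problem", here is a minimal sketch. It is not the procedure actually used for this report, only one way to start several forced checks at the same time. The helper names (fsck_command, run_concurrently) are made up for illustration, and the devices to check are taken from the command line, since the report does not list them.

```python
import subprocess
import sys


def fsck_command(device):
    """Build the forced-check command quoted in the report for one device."""
    return ["fsck", "-f", "-t", "ffs", device]


def run_concurrently(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as the commands.
    """
    # Popen returns immediately, so all processes run in parallel.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Only act when devices are given explicitly; nothing is checked by default.
if __name__ == "__main__" and len(sys.argv) > 1:
    devices = sys.argv[1:]
    codes = run_concurrently([fsck_command(dev) for dev in devices])
    for dev, code in zip(devices, codes):
        print(f"{dev}: fsck exit status {code}")
```

Saved under a hypothetical name such as stress_fsck.py, it would be run as root with the unmounted filesystems as arguments, for example "python stress_fsck.py /dev/somefilesystem /dev/someotherfilesystem". Running the same checks one at a time (the case that shows no errors) only needs a plain loop over the same commands.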