interrupt throttling stepping in too soon?

From: Stijn Hoop <stijn_at_win.tue.nl>
Date: Thu, 6 Oct 2005 09:53:12 +0200

Hi,

On 6.0-BETA5 I was initializing a gvinum RAID-5 array of 565GB. This
took about 28 hours, and completed without errors.  Because of a rocky
start while creating this volume, and because I've been bitten by
parity errors in the past, I decided to run a complete 'checkparity'
right after the array came up. This was yesterday afternoon.

This morning I came in to find the system unresponsive. On the console
there were multiple DMA_TIMEOUT messages for the disks of the array,
and just above those was a line about an 'interrupt storm for atapci0,
throttling'. Sure enough, this is the controller that all the disks
of the array are on.

Doubly unfortunate, however, was that upon attempting to reboot
(CTRL+ALT+DEL still worked), the system dropped into the debugger
after having 'synced all disks'.  Being naive, I thought 'oh well, I'll
debug after reboot' and rebooted. Apparently this was a stupid thing
to do, because my whole / had gone missing. While this has no real
relevance to the problem at hand, it had an unfortunate side effect:
because I had to reinstall, I no longer have the actual log messages
:( What I do have are the dmesg lines for the setup; see below.

Is it possible that the interrupt storm detection and subsequent
throttling is triggering too early, leading to lost interrupts, and in
this case DMA_TIMEOUTS?
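
To make the question concrete, here is a toy model of threshold-based
storm detection. It is not the FreeBSD kernel's code: the class name,
the threshold of 500, and the rule that the count resets whenever the
handler goes idle are all assumptions made for illustration.

```python
class StormDetector:
    """Toy model: count interrupts handled back to back and
    throttle once the count reaches a threshold (hypothetical)."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # assumed tuning knob, not a real kernel value
        self.count = 0              # interrupts seen since the last idle period

    def interrupt(self) -> bool:
        """Record one interrupt; return True if it would be throttled."""
        if self.count >= self.threshold:
            return True
        self.count += 1
        return False

    def idle(self) -> None:
        """The handler ran out of work, so the burst is over."""
        self.count = 0


# A sustained burst with no idle gaps trips the detector.
busy = StormDetector(threshold=500)
throttled = sum(busy.interrupt() for _ in range(600))
print(throttled)  # 100

# The same number of interrupts with regular idle gaps does not.
calm = StormDetector(threshold=500)
throttled = 0
for i in range(600):
    throttled += calm.interrupt()
    if i % 100 == 99:
        calm.idle()
print(throttled)  # 0
```

In a model like this, a legitimately busy controller that never goes
idle looks the same as a real storm, which is the scenario the
question above is asking about.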

As far as I can tell, the system worked beautifully while initializing
and for roughly 8 hours of checking parity. It is now rebuilding again
(*twiddle thumbs*) and it still seems to work perfectly. I therefore
do not primarily suspect the hardware (although that is of course
possible). It could be that the daily periodic job triggered more I/O,
enough to overload the system (although the relevant controller has
no mounted drives, of course).

Relevant dmesg:

atapci0: <Promise PDC20269 UDMA133 controller> port 0xd400-0xd407,0xd000-0xd003,0xb800-0xb807,0xb400-0xb403,0xb000-0xb00f mem 0xed800000-0xed803fff irq 11 at device 13.0 on pci0
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
atapci1: <VIA 8233A UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xa400-0xa40f at device 17.1 on pci0
ata0: <ATA channel 0> on atapci1
ata1: <ATA channel 1> on atapci1
ad0: 38166MB <WDC WD400BB-75AUA1 18.20D18> at ata0-master UDMA100
ad1: 117246MB <Maxtor 6Y120L0 YAR41BW0> at ata0-slave UDMA133
ad2: 117246MB <Maxtor 4G120J6 GAK819K0> at ata1-master UDMA133
ad3: 117246MB <Maxtor 6Y120L0 YAR41BW0> at ata1-slave UDMA133
ad4: 194481MB <Maxtor 6Y200P0 YAR41BW0> at ata2-master UDMA133
ad5: 194481MB <Maxtor 6Y200P0 YAR41BW0> at ata2-slave UDMA133
ad6: 194481MB <Maxtor 6Y200P0 YAR41BW0> at ata3-master UDMA133
ad7: 239372MB <Maxtor 6Y250P0 YAR41BW0> at ata3-slave UDMA133

The array is on disks ad4, ad5, ad6 and ad7 (yes, I know that having
only one disk per channel is more effective; speed is not really an
issue here).

gvinum setup (now reinitializing as you can see):

V data                  State: down     Plexes:       1 Size:        565 GB
P data.p0            R5 State: down     Subdisks:     4 Size:        565 GB
S data.p0.s0            State: I 2%     D: pluto        Size:        188 GB
S data.p0.s1            State: I 2%     D: donald       Size:        188 GB
S data.p0.s2            State: I 2%     D: goofy        Size:        188 GB
S data.p0.s3            State: I 2%     D: mickey       Size:        188 GB
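
As a sanity check on the sizes listed above, a RAID-5 plex gives up one
subdisk's worth of space to parity. The sketch below also estimates the
average per-subdisk write rate, assuming (this is an assumption, not
something the listing shows) that each 188 GB subdisk was written start
to finish over the 28 hours and that GB means 1024 MB. The function
names are made up for this example.

```python
def raid5_usable_gb(subdisks: int, subdisk_gb: float) -> float:
    """Usable capacity of a RAID-5 plex: (n - 1) subdisks hold data."""
    if subdisks < 3:
        raise ValueError("RAID-5 needs at least three subdisks")
    return (subdisks - 1) * subdisk_gb


def init_rate_mb_s(subdisk_gb: float, hours: float) -> float:
    """Average MB/s needed to write one subdisk in the given time."""
    return subdisk_gb * 1024 / (hours * 3600)


# 4 subdisks of 188 GB, as in the gvinum listing.
print(raid5_usable_gb(4, 188))            # 564
print(round(init_rate_mb_s(188, 28), 1))  # 1.9
```

The 564 GB result is in line with the 565 GB that gvinum reports, given
that the 188 GB subdisk size is itself rounded.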

Thanks for any answers.

--Stijn

-- 
A "No" uttered from deepest conviction is better and greater than a
"Yes" merely uttered to please, or what is worse, to avoid trouble.
		-- Mahatma Gandhi