Hi,

On 6.0-BETA5 I was initializing a gvinum RAID-5 array of 565GB. This took about 28 hours and completed without errors. Because of a rocky start while creating this volume, and because I've been bitten by parity errors in the past, I decided to run a complete 'checkparity' right after the array came up. This was yesterday afternoon.

This morning I came in to find the system unresponsive. On the console there were multiple DMA_TIMEOUT messages for the disks of the array, and just above those was a line about an 'interrupt storm for atapci0, throttling'. Sure enough, this is the controller that all the disks of the array are on.

Doubly unfortunate was that upon attempting to reboot (CTRL+ALT+DEL still worked), the system dropped into the debugger after having 'synced all disks'. Being naive, I thought 'oh well, I'll debug after reboot' and rebooted. Apparently this was a stupid thing to do, because my whole / had gone missing. While this has no real relevance to the problem at hand, it had an unfortunate side effect: because I had to reinstall, I no longer have the actual log messages :( What I do have are the dmesg lines for the setup, see below.

Is it possible that the interrupt storm detection and subsequent throttling is triggering too early, leading to lost interrupts and, in this case, DMA_TIMEOUTS?

As far as I can tell the system worked beautifully while initializing, and for about 8 hours of checking parity. It is now rebuilding again (*twiddle thumbs*) and it still seems to work perfectly. I therefore do not primarily suspect the hardware (although that is of course possible). It could be that the daily periodic job triggered more I/O, so much so that the system overloaded (although the relevant controller has no mounted drives, of course).

Relevant dmesg:

atapci0: <Promise PDC20269 UDMA133 controller> port 0xd400-0xd407,0xd000-0xd003,0xb800-0xb807,0xb400-0xb403,0xb000-0xb00f mem 0xed800000-0xed803fff irq 11 at device 13.0 on pci0
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
atapci1: <VIA 8233A UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xa400-0xa40f at device 17.1 on pci0
ata0: <ATA channel 0> on atapci1
ata1: <ATA channel 1> on atapci1
ad0: 38166MB <WDC WD400BB-75AUA1 18.20D18> at ata0-master UDMA100
ad1: 117246MB <Maxtor 6Y120L0 YAR41BW0> at ata0-slave UDMA133
ad2: 117246MB <Maxtor 4G120J6 GAK819K0> at ata1-master UDMA133
ad3: 117246MB <Maxtor 6Y120L0 YAR41BW0> at ata1-slave UDMA133
ad4: 194481MB <Maxtor 6Y200P0 YAR41BW0> at ata2-master UDMA133
ad5: 194481MB <Maxtor 6Y200P0 YAR41BW0> at ata2-slave UDMA133
ad6: 194481MB <Maxtor 6Y200P0 YAR41BW0> at ata3-master UDMA133
ad7: 239372MB <Maxtor 6Y250P0 YAR41BW0> at ata3-slave UDMA133

The array is on disks ad4, ad5, ad6 and ad7 (yes, I know that having only 1 disk per channel is more effective. Speed is not really an issue here).

gvinum setup (now reinitializing, as you can see):

V data                  State: down     Plexes:       1 Size:        565 GB

P data.p0            R5 State: down     Subdisks:     4 Size:        565 GB

S data.p0.s0            State: I 2%     D: pluto        Size:        188 GB
S data.p0.s1            State: I 2%     D: donald       Size:        188 GB
S data.p0.s2            State: I 2%     D: goofy        Size:        188 GB
S data.p0.s3            State: I 2%     D: mickey       Size:        188 GB

Thanks for any answers.

--Stijn

-- 
A "No" uttered from deepest conviction is better and greater than a "Yes"
merely uttered to please, or what is worse, to avoid trouble.
		-- Mahatma Gandhi