athlon-xp + fakeraid regression

From: Brian K. White <brian_at_aljex.com>
Date: Fri, 17 Mar 2006 16:53:51 -0500
I don't know who might be interested in this, or whether the generally 
accepted answer will be "that's a bug we should know about" or just "don't 
do that".
Sometime between 6.0-snap011 and 6.1beta4, something broke CPUTYPE=athlon-xp.
I would normally be satisfied with simply "don't use athlon-xp", but since 
it used to work fine, I see it as a regression, and any regression should be 
reported.
Or forget the regression and treat it instead as further evidence for 
deprecating -mcpu/-march=athlon-xp.

One desktop machine has an athlon-xp-2200 and a highpoint rocketraid133 pata 
raid with two identical 60 gig ata100 drives in a raid0 array.

6.0-snap011 and several preceding versions all install directly to this 
card without a hitch, even though it's not a real hardware raid and even 
though the entire drive is a raid0 striped array with no boot partition 
outside the array, and without having to know how to even spell "gmirror" 
or "vinum". It's beautiful: just select "ar0" in sysinstall, easy as pie.
For the record, Linux can't do this on these same exact cards.

6.1beta4 installs and runs just fine on it too.

The problem only comes when I put CPUTYPE=athlon-xp in make.conf and build a 
new kernel.
The build completes fine, the kernel boots fine, and the machine seems to be 
fine as long as it remains quiescent.
But type make anything, including the same "make buildkernel" that just 
succeeded a few minutes ago, and you get this:

# make buildworld
Interrupt storm detected on "irq10"; throttling interrupt source
ad4: TIMEOUT - READ_DMA retrying (1 retry left) LBA=88099699
ad4: TIMEOUT - READ_DMA retrying (0 retries left) LBA=88099699
ad4: FAILURE - READ_DMA timed out LBA=88099699
ar0: FAILURE - RAID0 array broken
g_vfs_done():ar0s1a[WRITE(offset=6144000, length=10240)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=65536, length=2048)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=65536, length=2048)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=6144000, length=10240)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=18883952640, length=16384)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=18884395008, length=16384)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=18885148672, length=16384)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=19848953856, length=16384)]error = 5
g_vfs_done():ar0s1a[WRITE(offset=60120072192, length=16384)]error = 5
ad4: WARNING - WRITE_DMA taskqueue timeout - completeing request directly
ad4: WARNING - WRITE_DMA freeing taskqueue zombie request

At this point the box is 99% locked.
The console reacts to the keyboard (screensaver comes up, spacebar makes it 
go away, ctrl-t, alt-fn etc. work) and it responds to pings.
But no programs work (no new shell or login prompts from pressing enter, 
getty doesn't see keystrokes, no network services work).
At the beginning, just after hitting enter on the make command, the ad4 
disk light goes on solid for several seconds.

Those messages look just like messages I've seen before and know a little 
about.
There is a well-known problem where these cheap pata fakeraid cards will try 
to do ata133 if the drive says it can, when really, even if the drives are 
new ata133 drives and the cables are new and short and shielded, you still 
shouldn't try to do ata133, since the spec is too tight and you'll just get 
bit errors or other failures. I have several similar boxes with both ITE and 
HighPoint pata fakeraid chips and have seen in almost every case that if I 
don't disable dma entirely in /boot/loader.conf(1), then either immediately 
or eventually you get WRITE_DMA errors from the ata driver and the box 
reboots. The fix is to use ata100 somehow: either disable dma entirely in 
loader.conf (since you have no more selective option there, and the raid 
card bios never has an option for controlling pio/dma mode like motherboard 
bioses have) and then use atacontrol in rc.early to set udma5, or use 
disks that can only do ata100 and only advertise ata100 to the controller.
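
A minimal sketch of that workaround, assuming the array members attach as 
ad4 and ad6 (hypothetical names; check dmesg for the real ones). 
hw.ata.ata_dma is the ata(4) loader tunable that switches ATA disks between 
DMA and PIO. The argument form of "atacontrol mode" has changed between 
releases, so check atacontrol(8) on the installed version before relying on 
the lines below:

```sh
# /boot/loader.conf
# Start every ATA disk in PIO so the controller never negotiates ata133.
hw.ata.ata_dma="0"

# /etc/rc.early
# Raise each array member to udma5 (ata100) once the kernel is up.
# ad4 and ad6 are assumed device names, not taken from a real box.
atacontrol mode ad4 UDMA5
atacontrol mode ad6 UDMA5
```

Running "atacontrol mode ad4" with no mode argument afterwards should print 
the mode the driver actually settled on.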

That's just to show I know about that issue and it's not that. Not 
simply/only that anyway :)

First off, this machine only has ata100 disks and already runs in udma5 
without needing to be forced.
Second, it ran 6.0-snap011 from the day it came out, with athlon-xp, and 
performed a lot of big makes and rsyncs etc. until yesterday without a 
problem.
So it's not a case of "well maybe this machine should be forced even slower 
to ata66?"

Reinstall 6.1beta4 on the above box and don't use CPUTYPE=athlon-xp, and 
everything is fine: reliable builds and other heavy disk activity.
I also just built a kernel with CPUTYPE=i686 and removed I486_CPU & 
I586_CPU, and it's doing a buildworld just fine under that kernel right now.
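
The working i686 setup described above would look roughly like this. It is 
a sketch that assumes a kernel config derived from the stock i386 GENERIC, 
which lists all three cpu lines; the commented-out lines are the ones 
replaced or removed:

```
# /etc/make.conf
#CPUTYPE=athlon-xp     # kernel built with this fails under disk load here
CPUTYPE=i686           # kernel built with this gets through buildworld

# kernel config file, cpu lines only
#cpu		I486_CPU
#cpu		I586_CPU
cpu		I686_CPU
```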

Brian K. White  --  brian_at_aljex.com  --  http://www.aljex.com/bkw/
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro  BBx    Linux  SCO  FreeBSD    #callahans  Satriani  Filk!
Received on Fri Mar 17 2006 - 20:54:20 UTC
