data corruption with current (maybe sis chipset related?)

From: Heiko Schaefer <hschaefer_at_fto.de>
Date: Tue, 6 May 2003 16:41:30 +0200 (CEST)

Hi List,

i already brought up my issue with data corruption when i suspected that
gbde might be the cause for it. it turns out gbde was not guilty.

then just now i thought that M. Warner Losh's mail (subject 'Precaution')
could explain what's going wrong, but i still have the problem with a new
world and kernel as of today.

i can reproduce the data corruption by doing the following:
(i have two disks in the box, one is 30GB, the other is 60GB)

the 30GB disk is already filled with data, which i then copy (in parallel)
into two directories on the 60GB disk. the result is that i (should) have
two times the same data on the 60gb disk as on the 30gb disk.

then i compute the checksums of the duplicated files - and more often than
not a few files are corrupted in the copied version.
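
The comparison step described above could be sketched roughly as follows. This is only an illustration, not the commands actually used for the test: the helper names (file_digest, tree_digests, mismatches) and the choice of SHA-256 are assumptions, and any checksum tool would serve the same purpose.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def mismatches(source_root, copy_root):
    """List relative paths whose copy is missing or differs from the source."""
    src = tree_digests(source_root)
    dst = tree_digests(copy_root)
    return sorted(rel for rel in src if dst.get(rel) != src[rel])
```

Calling mismatches() once per copy directory, with the 30GB disk's mount point as source_root, would list the files that need a closer look. Running it twice over the same copy also shows whether a mismatch is stable (bad data on disk) or changes between reads.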

more numbers: i typically see 1-2 blocks of corrupted data on that 60gb
disk, usually 32kb each. the corruptions are usually aligned within the
file, at least to a multiple of 512 bytes. often, the corrupted data
consists of lots of 0-bytes, but i also see data that looks random in
other places of the corrupted segments of the files. it seems that not
only the content of files gets corrupted: i also sometimes see errors
when i fsck that partition (for example: once i saw a file that had a
size in "ls -l" which clearly didn't match its actual content, as seen
by "wc").
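
Figures like the ones above (offset, size, alignment, share of 0-bytes) could be gathered with a small script along these lines. It is a sketch under assumptions: the function names diff_ranges and zero_fraction are made up for illustration, and the comparison works in 512-byte blocks, so it cannot show any finer alignment than that.

```python
def diff_ranges(path_a, path_b, block=512):
    """Return (offset, length) runs where the two files differ.

    Files are compared block by block; adjacent differing blocks are
    merged into a single run.
    """
    ranges = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block)
            b = fb.read(block)
            if not a and not b:
                break
            length = max(len(a), len(b))
            if a != b:
                if ranges and ranges[-1][0] + ranges[-1][1] == offset:
                    # Extend the previous run instead of starting a new one.
                    ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
                else:
                    ranges.append((offset, length))
            offset += length
    return ranges


def zero_fraction(path, offset, length):
    """Return the fraction of 0-bytes in the given region of a file."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return data.count(0) / len(data) if data else 0.0
```

For each file whose checksum differs, diff_ranges(original, copy) gives the damaged regions, and zero_fraction(copy, offset, length) shows how much of each region is zeroed. Whether the offsets are also multiples of a larger size (4096, 32768) can be checked with offset % size.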

by now i have ruled out a number of possible reasons:
- i am only using local disks (no networking as i did initially)
- first i used a 512mb ddr memory module, now i use a 256mb sdr one, which
  should (i believe) have different enough properties to rule out the
  original memory as the cause of the problem

as i see it, the issue can be hardware-related (mainboard/cpu seem to be
the only remaining possibilities) or software-related (maybe the driver
for the chipset, in particular the harddrive controller, is suboptimal?),
or maybe some other freebsd code that moves around data makes occasional
mistakes.

the board is an elitegroup k7-s5a lan (with sis 735 chipset), the cpu an
amd xp 1800+ (i specifically bought that hardware very recently to run a
gbde-based nfs server on it).

does anyone know of any (freebsd-current) issues that might be causing
this - or have any idea how i can further rule out anything of this
kind?

my best idea at this point is to go out and buy a p4 board and cpu,
and i don't really like that. it seems almost futile to go and get another
board/cpu of the same type before i have a good idea what is actually
wrong :(

regards, thanks for any thoughts,

Heiko

-- 
Free Software. Why put up with inferior code and antisocial corporations?
http://www.gnu.org/philosophy/why-free.html
Received on Tue May 06 2003 - 05:41:41 UTC
