Re: FreeBSD corruption problems on barcelona

From: Nick Piggin <nickpiggin_at_yahoo.com.au>
Date: Sun, 11 Nov 2007 08:41:00 +1100
Hi Daniel,

On Saturday 10 November 2007 22:45, Daniel Gerzo wrote:
> Hello Nick,
>
> Saturday, November 10, 2007, 4:21:05 AM, you wrote:
> > Here it is attached. Now there is a cdrom error there, however I
> > don't believe it is the cause of the problem (or at least, there
> > is a bigger problem with the sata disk). The install has run
> > perfectly every time I've run it, so it is pulling the data off
> > the CD OK.
> >
> > Now I have actually got as far as root login, I filled up a 1MB
> > file with /dev/urandom and took an md5. Then copied that to 50
> > files on the /tmp filesystem, unmounted and remounted it, and then
> > read back the md5 sums. Practically all of them are wrong, but they
> > seem to be wrong in the same ways (eg. many share the same
> > incorrect md5 sum). Reading the files back from disk consistently
> > gives the same information, so it seems like reads are OK.
>
> Did you by any chance try installing some other OS to check whether
> this is really a FreeBSD problem? You mentioned that you've got a new
> box, so I suppose that you have tried only FreeBSD on it so far.

I have got Linux on it as well, no sign of problems (that doesn't
completely rule out a hardware problem, of course...)


> In the past, I had a somewhat similar problem. The symptoms were that
> I checked some file's md5 hash, then copied it to some other location
> and checked the md5 hash of the copy, and it was different. The problem
> was resolved after we replaced the CPU (AFAIR).

The thing is, the data doesn't get corrupted in the pagecache. If
I copy the files then read them back from cache, everything is fine.
It's only after dumping the pagecache (via unmount and remount) and
reading the files back into pagecache that the corruption can be seen.
Subsequent unmounting and remounting shows exactly the same data.
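
A minimal sketch of the test described above, for anyone who wants to
repeat it. The file names, the helper names (write_copies,
verify_copies) and the split into two phases are assumptions made for
illustration, not the exact commands that were run. The unmount and
remount has to be done by hand between the two phases, so that the
verify phase reads from disk and not from pagecache.

```python
import hashlib
import os
import shutil


def md5_of(path):
    """Return the hex md5 digest of the file at `path`."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def write_copies(directory, count=50, size=1024 * 1024):
    """Phase 1: create one random file and `count` copies of it.

    Returns the md5 of the source, taken right after writing (so it is
    most likely served from cache). Record it before unmounting.
    """
    src = os.path.join(directory, "src.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(size))
    for i in range(count):
        shutil.copyfile(src, os.path.join(directory, "copy%02d.bin" % i))
    return md5_of(src)


def verify_copies(directory, expected_md5, count=50):
    """Phase 2 (after remount): return {name: md5} for every bad copy."""
    bad = {}
    for i in range(count):
        name = "copy%02d.bin" % i
        digest = md5_of(os.path.join(directory, name))
        if digest != expected_md5:
            bad[name] = digest
    return bad
```

Grouping the values returned by verify_copies would show whether many
copies share the same incorrect md5 sum.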

Also, the corruption isn't the usual kind of CPU corruption, like a
bitflip or cacheline corruption, but significant blocks of zeroes in
the files (which look like they're page or filesystem block size aligned).
So it seems to be getting corrupted going from pagecache to disk.
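
One way to check the alignment of the zeroed regions is to compare a
corrupted copy against a known-good copy one block at a time. The
sketch below is only an illustration: the function name zeroed_blocks
is made up, and the 4096-byte block size is an assumption that should
be changed to the actual page or filesystem block size.

```python
def zeroed_blocks(good_path, bad_path, block_size=4096):
    """Return indices of blocks that are all zeroes in `bad_path`
    but differ from the same block in `good_path`.

    block_size=4096 is an assumed value, not taken from the report.
    """
    zero = bytes(block_size)
    hits = []
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        index = 0
        while True:
            g = good.read(block_size)
            b = bad.read(block_size)
            if not g and not b:
                break
            # A non-empty block that differs from the original and
            # consists only of zero bytes counts as a zeroed block.
            if b and b != g and b == zero[:len(b)]:
                hits.append(index)
            index += 1
    return hits
```

If every damaged region shows up here as whole blocks, that supports
the aligned-zeroes picture; damage that differs from the original
without being all zeroes is not reported by this check.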

It would be pretty unusual if it were a CPU problem, but it could
be other hardware, sure.


> So the things you are describing in your email seem to me more like a
> hardware problem than a FreeBSD problem, could you please run some
> kind of hardware test and try to replace your controllers, sata cable,
> disk and so on?

It's tricky. The controller is built in. Cable and disk I'm reluctant
to replace, given that reads are going across them just fine.

But I can run any specific test that you suggest.

Thanks,
Nick
Received on Sun Nov 11 2007 - 05:31:43 UTC
