Re: still: Re: gbde data corruption?

From: Heiko Schaefer <hschaefer_at_fto.de> Date: Wed, 30 Apr 2003 16:14:20 +0200 (CEST)

Hello Poul,

> >the broken version of the file contains lots of 0-bytes (instead of high
> >entropy values in the original file). seems by the output of cmp that
> >every damaged value is replaced by 0.
>
> Zero bytes is the absolutely last thing I would expect...
>
> How long are the sequences of zero bytes, and do they start at
> sector boundaries ?

it seems that the (one and only) sequence is exactly 32k long and starts
nicely aligned (aligned to 1024*16, even).
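for anyone who wants to check a damaged file the same way, here is a
minimal sketch (not the procedure actually used above, which was cmp) that
lists long runs of zero bytes and tests their start offsets against a
sector size and against 1024*16. the names zero_runs and alignment, the
512-byte sector size and the min_len threshold are assumptions made for
illustration.

```python
import re


def zero_runs(data: bytes, min_len: int = 512):
    """Yield (offset, length) for every run of at least min_len zero bytes."""
    # min_len is an assumed threshold; short runs of zeros occur naturally
    # even in high-entropy data, so they are ignored.
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.end() - match.start()


def alignment(offset: int, boundaries=(512, 16 * 1024)):
    """Report whether offset is a multiple of each boundary (sizes assumed)."""
    return {b: offset % b == 0 for b in boundaries}
```

reading the damaged file with open(path, "rb").read() and passing the
result to zero_runs() is enough for files in the 1mb to 10mb range; much
larger files would need to be scanned in chunks instead.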

> Do you also see this on the client ?  (Ie: could it be that data is
> still cached on the client and not flushed ?)

i see the broken variant of the file both locally and via my nfs client.
which is to be expected: i'm moving rather large amounts of data, so that
file should have left any cache long ago.

the thing that i am doing (over and over again) is completely filling one
30gb and one 60gb filesystem.

> What is the approximate error-rate ?  1 file in 10 ? 1 file in 100 ?
> How long are the files ?

the last error i observed is one file on a 30gb filesystem that is filled
completely with files that are between 1mb and 10mb or so (most of them, at
least). so i'm talking about roughly 1 in 10000, in this case.

> >another thing i just notice: /var/log/messages contains lots of
> >
> >[...]
> >Apr 30 15:24:55 zoidberg kernel: ENOMEM 0xc4c62100 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:19 zoidberg kernel: ENOMEM 0xc3fa5000 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4b46100 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4364500 on 0xc45c6c80(ad2s1e.bde)
> >[...]
>
> This means that the kernel ran out of ram and the operation was retried,
> it should not result in data corruption but it may reorder bio requests
> significantly.  I must admit that I have not bashed NFS to see that it
> copes.

that sounds moderately suspicious to me. i could try to physically move
another disc with lots of unencrypted data into the fileserver and try
copying onto gbde without nfs - but only later today, when i get home.

> >if you have no other things i could report or try, i might just throw away
> >the gbde volumes and try the same copying with non-gbde partitions, just
> >to be sure.
>
> That would be a good first step, but we need to do it controlled to make
> sure we know what we prove, so please try it this way:
>
> add
> 	option          MALLOC_MAKE_FAILURES
> to your kernel.
>
> Build filesystem without GBDE, run test, check for corruption.

well, i think i'll just try copying (over nfs) onto unencrypted
filesystems without any further changes first. one of these copy- and
checksum cycles takes quite a few hours ... if that test results in
errors, then i will instantly throw myself into the dust before you and
apologize :) if not, i'll try to stress my box some more (including malloc
failures if nothing else helps/hurts).
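a copy- and checksum cycle of this kind can be sketched roughly as below.
this is an illustration under assumptions, not the script used for the
tests described here: the function names, the choice of sha-256 and the
1mb read size are all hypothetical.

```python
import hashlib
import os


def tree_checksums(root, chunk_size=1 << 20):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    digest.update(chunk)
            sums[os.path.relpath(path, root)] = digest.hexdigest()
    return sums


def damaged_files(source_root, copy_root):
    """Return relative paths that are missing or differ in copy_root."""
    src = tree_checksums(source_root)
    dst = tree_checksums(copy_root)
    return sorted(p for p in src if dst.get(p) != src[p])
```

running damaged_files() once against the gbde volume and once against an
unencrypted filesystem, with the same source tree, gives two lists that
can be compared directly.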

thanks, regards,

Heiko