Re: data corruption with current (maybe sis chipset related?)

From: Heiko Schaefer <hschaefer_at_fto.de>
Date: Sat, 10 May 2003 20:44:22 +0200 (CEST)

Hey Terry,

> > > walt wrote:
> > > > Do I recall from some months ago that this bug would not
> > > > affect machines with less than a gig of RAM?
> > >
> > > The amount of memory at which you see it depends on the processor
> > > features.  Now that autotuning is in, there's a stair-step for
> > > how much the system uses for each resource pool, based on how
> > > much RAM is in the system.  It's quite unpredictable where it will
> > > show up in -current, because of this (and the new memory allocator).
> > >
> > > Basically, the problem will show wherever the memory size vs.
> > > memory utilization tickles it (that's why upping maxfiles was
> > > enough to scare it off, before the tuning/allocator changes
> > > went in).

> > - i still have the issue with the system that made me start this
> > thread:
> >
> > originally, i bought a 512mb ddr ram for it (not the cheapest kind, but
> > also nothing fancy - the chips say infineon). with that ram i still
> > experience data corruption.
> >
> > when i reported that the problem had disappeared, i was running off
> > sdr pc133 ram, which is only 256mb.
> >
> > what i wonder now: is the physical 512mb ram possibly damaged (or not
> > interacting well with the board or bios), or could that yet again be a
> > general (software-solvable) issue (which i would likely experience
> > whenever i have 512mb of ram in that machine. regardless of make) ?
>
> It's possible that the RAM was damaged, but unlikely.
>
> If you revert to a DP2 kernel (or any kernel before Jeff's
> allocator changes AND Matt's autotuning changes), you should
> be able to trigger this problem fairly easily with anything
> that causes a lot of page thrashing right after system boot,
> as long as you pick the right amount of RAM to install for
> the CPU features of the CPU you are using.
>
> > if the problem is likely to go away with another 512mb ram, i will go to
> > get the ram changed on monday - otherwise, i'd like to spare myself and
> > the vendor the trouble :) ... especially myself *g*
>
> It might.  It might not.  When I first saw the problem, it
> didn't occur on 512M, and it didn't occur on 2G, but it did
> occur on 1G.  This was a SuperMicro running a PIII.  The
> behaviour's going to be different for different CPU features,
> unfortunately.

i'm sorry, my mail was probably a bit confusing.
since it has been pointed out to me, i am running -current kernels with

options               DISABLE_PSE
options               DISABLE_PG_G

enabled.

what i am asking myself:
is there any chance that i still get any data corruption because of the
issues that you write about in some configuration ?!

because with the 512mb (ddr) ram (which might or might not be defective) i
get data corruption, while with another 256mb (sdr) ram, i apparently
don't.

so far i had the impression that my test (copying >30gb of checksummed
data between disks) shows these problems rather reliably.
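
A minimal sketch of how such a copy-and-verify test could be scripted,
for illustration only. This is not the tool actually used in this
thread: the function names and the choice of SHA-256 are assumptions.
It hashes every file under a source tree and its copy under a
destination tree and lists the files that are missing or differ.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(src_root, dst_root):
    """Compare every file under src_root with its copy under dst_root.

    Returns a sorted list of relative paths whose copy is missing
    or whose contents differ from the source.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    bad = []
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            bad.append(str(rel))
    return sorted(bad)
```

Calling find_mismatches("/disk1/data", "/disk2/data") (hypothetical
paths) after a copy returns an empty list when every file matches.
The data set should be much larger than RAM, as it is here, so that
the read-back comes from disk rather than from the buffer cache.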

> Alternately, disable auto-tuning by setting MAXUSERS to some
> value (preferably equal to or larger than the pre-auto-tune
> value), and then set maxfiles to 50000 or more.  This should
> also mask the problem (though I don't know this for sure,
> given Jeff's allocator changes not preallocating the page
> maps for things which used to be allocated via zalloci()).

masking sounds scary to me - i don't just want to make the problem less
likely by a factor of, say, 10^3 or so :)
i would much rather not have any data corrupted at all.

> > does it make sense for me to try bosko's patch ?
>
> Yes.  It fixes the problem, according to his testing.  He
> posted the URL for it a while back, or you can contact him
> directly.

ok, i'll find it - what i wanted to ask is whether that patch is likely
to make _more_ problems go away than those two kernel options.

> > can i hope for any better results (i don't really care about
> > performance, only data integrity) with it than with those
> > two kernel options ?!
>
> Yes, if that's the source of your problems.  As you pointed
> out, there's a small but finite chance it's bad RAM, or a
> problem with the motherboard, etc..  The way to find out is
> to try the offending RAM again, with a kernel with those
> options, and see if it happens (this assumes that you were
> able to trigger it fairly reliably before; negative evidence
> is really only anecdotal, without a regression test case, so
> if it only happened once in a great while, it not happening in
> a week or a month would prove nothing).
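
The point about negative evidence can be made concrete with a small
calculation. This is a sketch under a simplifying assumption that is
not in the original mail: each test run triggers the corruption
independently with the same probability p_fail. The function names
are illustrative.

```python
import math


def chance_of_clean_streak(p_fail, runs):
    """Probability that `runs` independent runs all pass when each
    run fails with probability p_fail."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence=0.95):
    """Smallest number of consecutive clean runs after which a bug
    with per-run failure probability p_fail would have shown up
    with at least the given confidence."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    # Solve (1 - p_fail) ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

Under that assumption, a problem that shows up in half of all runs
needs only a handful of clean runs before its absence means
something, while a rare one needs hundreds.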

i guess i can manage to get another 256mb sdr ram into that box
temporarily by next week, if nothing better comes up - just to check.

thanks, regards,

Heiko

-- 
Free Software. Why put up with inferior code and antisocial corporations?
http://www.gnu.org/philosophy/why-free.html
Received on Sat May 10 2003 - 09:44:28 UTC