Re: data corruption with current (maybe sis chipset related?)

From: Terry Lambert <tlambert2_at_mindspring.com>
Date: Sat, 10 May 2003 11:02:24 -0700

Heiko Schaefer wrote:
> Hi Terry,
> > walt wrote:
> > > Do I recall from some months ago that this bug would not
> > > affect machines with less than a gig of RAM?
> >
> > The amount of memory at which you see it depends on the processor
> > features.  Now that autotuning is in, there's a stair-step for
> > how much the system uses for each resource pool, based on how
> > much RAM is in the system.  It's quite unpredictable where it will
> > show up in -current, because of this (and the new memory allocator).
> >
> > Basically, the problem will show wherever the memory size vs.
> > memory utilization tickles it (that's why upping maxfiles was
> > enough to scare it off, before the tuning/allocator changes
> > went in).
> 
> you seem to have a pretty good idea of how and when this bug shows itself

Yes.  I made machdep.c modifications at one point, and had a
particular usage model that caused the bug to trigger 100% of
the time, reliably, *after* the system was fully booted, as
a result of a particular operation that I triggered from user
space.  As a result, I had a perfect test bed for characterizing
it.  It took me about two weeks of looking for something other
than a CPU bug before I said "assume a CPU bug *here*", and was
able to work around it in a day.


> - i still have an issue with the system because of which i started this
> thread:
> 
> originally, i bought a 512mb ddr ram for it (not the cheapest kind, but
> also nothing fancy - the chips say infineon). with that ram i still
> experience data corruption.
> 
> while i reported that the problem disappeared, i was running off a sdr pc
> 133 ram which is only 256mb.
> 
> what i wonder now: is the physical 512mb ram possibly damaged (or not
> interacting well with the board or bios), or could that yet again be a
> general (software-solvable) issue (which i would likely experience
> whenever i have 512mb of ram in that machine. regardless of make) ?

It's possible that the RAM was damaged, but unlikely.

If you revert to a DP2 kernel (or any kernel before Jeff's
allocator changes AND Matt's autotuning changes), you should
be able to trigger this problem fairly easily with anything
that causes a lot of page thrashing right after system boot,
as long as you pick the right amount of RAM to install for
the CPU features of the CPU you are using.
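A rough way to check for corruption under that kind of load is to
write a large amount of known data, re-read it several times, and
compare checksums.  The following is only a sketch: the function
name, sizes, and pass count are illustrative assumptions, and it
is not the test case described in this message.  To force paging,
total_bytes would need to exceed the installed RAM.

```python
import hashlib
import os
import tempfile


def corruption_check(total_bytes=64 * 1024 * 1024, chunk=1024 * 1024, passes=3):
    """Write random data to a temp file, re-read it `passes` times,
    and return how many passes produced a different SHA-256 digest."""
    expected = hashlib.sha256()
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        written = 0
        while written < total_bytes:
            block = os.urandom(min(chunk, total_bytes - written))
            expected.update(block)  # digest of what we intended to write
            f.write(block)
            written += len(block)

    mismatches = 0
    try:
        for _ in range(passes):
            actual = hashlib.sha256()
            with open(path, "rb") as f:
                # Read the file back in fixed-size chunks.
                for block in iter(lambda: f.read(chunk), b""):
                    actual.update(block)
            if actual.digest() != expected.digest():
                mismatches += 1
    finally:
        os.remove(path)
    return mismatches


if __name__ == "__main__":
    print("mismatching passes:", corruption_check())
```

A nonzero result means the data read back differed from what was
written; a zero result does not prove the system is healthy.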


> if the problem is likely to go away with another 512mb ram, i will go to
> get the ram changed on monday - otherwise, i'd like to spare myself and
> the vendor the trouble :) ... especially myself *g*

It might.  It might not.  When I first saw the problem, it
didn't occur on 512M, and it didn't occur on 2G, but it did
occur on 1G.  This was a SuperMicro running a PIII.  The
behaviour's going to be different for different CPU features,
unfortunately.


> just to reiterate: the cpu in question is an amd xp 1800+, the board is a
> cheap sis-based elitegroup board (which is why i already initially
> suspected damaged hardware that i need to get exchanged - i typically have
> more faith in freebsd doing the right thing than cheap pc hardware).

If you can borrow the RAM, go ahead and borrow it and try it;
I wouldn't spend unrecoverable money on the bet.  If, in the worst
case, you can still enable the options and you'll have more
RAM, and you can live with that, well, that's different.

Alternately, disable auto-tuning by setting MAXUSERS to some
value (preferably equal to or larger than the pre-auto-tune
value), and then set maxfiles to 50000 or more.  This should
also mask the problem (though I don't know this for sure,
given Jeff's allocator changes not preallocating the page
maps for things which used to be allocated via zalloci()).
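To get a feel for the numbers involved, here is a small sketch of
how the derived limits scale with maxusers.  The formulas are an
assumption based on the classic FreeBSD param defaults (maxproc =
20 + 16 * maxusers, maxfiles = 2 * maxproc) and an auto-tune rule
of roughly one maxuser per 2MB of RAM clamped to 32..384; none of
this is stated in this message, so check it against your own
source tree before relying on it.

```python
def autotuned_maxusers(ram_mb):
    # ASSUMPTION: about one maxuser per 2 MB of RAM, clamped to 32..384.
    return max(32, min(384, ram_mb // 2))


def derived_limits(maxusers):
    # ASSUMPTION: classic defaults derived from maxusers.
    maxproc = 20 + 16 * maxusers
    maxfiles = 2 * maxproc
    return maxproc, maxfiles


if __name__ == "__main__":
    for ram_mb in (256, 512, 1024, 2048):
        mu = autotuned_maxusers(ram_mb)
        maxproc, maxfiles = derived_limits(mu)
        print(f"{ram_mb:5d} MB: maxusers={mu:3d} maxproc={maxproc:5d} "
              f"maxfiles={maxfiles:6d}")
```

Under these assumptions the default maxfiles stays far below
50000 at any of these RAM sizes, so setting it that high changes
the allocation pattern considerably.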


> does it make sense for me to try bosko's patch ?

Yes.  It fixes the problem, according to his testing.  He
posted the URL for it a while back, or you can contact him
directly.


> can i hope for any better results (i don't really care about
> performance, only data integrity) with it than with those
> two kernel options ?!

Yes, if that's the source of your problems.  As you pointed
out, there's a small but finite chance it's bad RAM, or a
problem with the motherboard, etc.  The way to find out is
to try the offending RAM again, with a kernel with those
options, and see if it happens (this assumes that you were
able to trigger it fairly reliably before; negative evidence
is really only anecdotal, without a regression test case, so
if it only happened once in a great while, it not happening in
a week or a month would prove nothing).
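The point about negative evidence can be made concrete with a
little arithmetic.  This sketch assumes, purely for illustration,
that the corruption shows up independently each day with a fixed
probability; the function names and the example probability are
hypothetical, not measured values.

```python
import math


def p_no_failure(p_per_day, days):
    """Chance of seeing no corruption in `days` days even though the
    bug is present, assuming independent days with probability p_per_day."""
    return (1.0 - p_per_day) ** days


def days_for_confidence(p_per_day, confidence=0.95):
    """Clean days needed before a clean run is `confidence` evidence
    that the bug is gone, under the same independence assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_day))


if __name__ == "__main__":
    p = 0.1  # hypothetical: bug shows up on about one day in ten
    print("P(clean week despite bug):", round(p_no_failure(p, 7), 3))
    print("clean days needed for 95%:", days_for_confidence(p))
```

With a rare trigger, a clean week is quite likely even when the
bug is still there, which is why a reliable test case matters.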

-- Terry
Received on Sat May 10 2003 - 09:03:42 UTC
