Re: 5.1-RELEASE TODO

From: Peter Wemm <peter_at_wemm.org>
Date: Tue, 13 May 2003 15:41:54 -0700
Peter Wemm wrote:
> Don Lewis wrote:
> > On 13 May, Robert Watson wrote:
> > > 
> > > On Tue, 13 May 2003, Heiko Schaefer wrote:
> > > 
> > >> > That said, we are actively discussing what, if any, workarounds are
> > >> > appropriate, including some alternative workarounds from the ones
> > >> > currently present.
> > >> 
> > >> bosko (who was mentioned here various times, regarding a patch to work
> > >> around this) has contacted me, and i am looking forward to trying his
> > >> patch.  assuming that the patch is correct (whatever that would mean in
> > >> this context), and there is some chance of accepting it anytime soon,
> > >> maybe it would be sensible to try to get that into the release - or
> > >> delay the release until this is sorted out ?! 
> > >> 
> > >> wouldn't a release that corrupts data in many, relevant, cases (i
> > >> consider the box i had the trouble with entirely mainstream) be worse
> > >> than no release at all? 
> > > 
> > > You don't need to argue to me that we need stability (I'm a fan of it
> > > myself): what I need is evidence that some set of changes is actually
> > > solving the problem, not masking it.  If there exists a patch that
> > > substantially improves stability on some set of systems (and not at the
> > > cost of another set), I think you can rest assured that we'll get it into
> > > the release.  As with you, we're very concerned by the recent spate of
> > > instability, especially in the beta cycle, and how to address that is
> > > very much on our minds.
> > 
> > Both my AMD system running -current and PII system running -stable are
> > afflicted with these data corruption problems.  The limited amount of
> > information that I've seen about these problems leads me to believe that
> > the only way to use the 4 MB page feature without danger to system
> > integrity is to relocate the kernel.  If this is the case, then it would seem to
> > make sense to disable the use of 4 MB pages by adding the DISABLE_PSE
> > option until the system is patched.
> 
> The thing is, we only use 4MB pages for two things.
> 1) The first 4MB of KVM is mapped as a 4MB page.
> 2) Large device mappings, eg: the X server mmap'ing /dev/mem for the frame
> buffer.  The thing is though, these 4MB pages are not mapped with PG_G.
> 
> Are you running X?  Are you using the broadcom ethernet driver?
> 
> Also of note:  I recently saw a brand new P4 system with a genuine intel
> motherboard, for a RELENG_4 system.  It had shocking data corruption
> problems. The memory was swapped - no change.  The motherboard and CPU were
> swapped (same motherboard model, much newer P4 cpu stepping) - no change.
> It was simply unreliable.  Backporting DISABLE_PG_G to RELENG_4 and turning
> on both it and DISABLE_PSE greatly reduced the problem, but it still happened.
> In the end, the Intel motherboard was replaced with a P4 Xeon system
> motherboard and the problem instantly went away.  The trouble appeared
> to be a generic problem with the Intel 845 chipset motherboard.
> 
> Remember, this was RELENG_4 as of a few months ago.  It isn't a 5.x-only
> problem.
> 
> The bge driver has been firmly implicated in one of the cases of data
> corruption.  Paul's recent if_bge fixes completely solved one person's
> long-standing problems.  There are DMA bugs in the earlier chipsets that
> we didn't have the prescribed workarounds for.  And even though the compiles
> were happening on local disks, all it took was running the build in an Xterm
> so that the make output was going over the network, or doing a tail -f etc.
> 
> > PG_G is probably different.  A better case can be made that using this
> > option is only masking software bugs that should be fixable.  The
> > problem is that these bugs are only rarely triggered, look a lot like
> > flakey hardware, and it's just about impossible for most FreeBSD users
> > to track the problem to its root cause.
> 
> For what it's worth, we have #ifdef'ed code in i386/pmap.c:
> #ifdef I686_CPU_not     /* Problem seems to have gone away */
>         /* Deal with un-resolved Pentium4 issues */
>         if (cpu_class == CPUCLASS_686 &&
>             strcmp(cpu_vendor, "GenuineIntel") == 0 &&
>             (cpu_id & 0xf00) == 0xf00) {
>                 printf("Warning: Pentium 4 cpu: PG_G disabled (global flag)\n");
>                 pgeflag = 0;
>         }
> #endif
> 
> I really do not want DISABLE_PSE and DISABLE_PG_G turned on for what appears
> to have a hardware component.  I'd much rather have the above #ifdef'ed code
> turned on.
> 
> For the folks having problems, here's what I'd like to know:
> 
> - Are you running X?  (standard XFree86 or do you have the agp and drm
> drivers enabled?)
> - What ethernet card?  (particularly if bge)
> - Is there any network traffic at the time?  ie: if you remove the network
> card entirely and do the compile tests on a /dev/ttyv0 console, does it still
> happen?
> - What hardware do you have?  (cpuid line showing the Id = 0xNNN number,
> memory size/type and whether it has ECC or not, motherboard chipset, etc)
> - Have you replaced any hardware?  If so, which parts?

Oh, and two more things:
  - Do DISABLE_PG_G and/or DISABLE_PSE actually affect the stability?
  - Are you seeing application faults (segfault etc) or kernel instability
    (fatal trap, panic etc)?
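
For anyone unsure how to read the Id = 0xNNN value asked for above, here is
a rough sketch of how it breaks down, and of the family test used in the
quoted pmap.c fragment.  The helper names (id_family, id_model, id_stepping,
looks_like_pentium4) are made up for illustration and are not in the tree;
only the vendor string comparison and the 0xf00 mask come from the quoted
code, and the cpu_class check is left out to keep it standalone.

```c
#include <string.h>

/* Hypothetical helpers: split the low 12 bits of a CPUID "Id" value. */
static unsigned id_family(unsigned id)   { return (id >> 8) & 0xf; }
static unsigned id_model(unsigned id)    { return (id >> 4) & 0xf; }
static unsigned id_stepping(unsigned id) { return id & 0xf; }

/*
 * Same vendor and family test as the quoted pmap.c fragment, minus the
 * cpu_class == CPUCLASS_686 check: family field 0xf on a GenuineIntel part.
 */
static int looks_like_pentium4(const char *vendor, unsigned id)
{
        return strcmp(vendor, "GenuineIntel") == 0 &&
            (id & 0xf00) == 0xf00;
}
```

With an example value of 0xf24 (picked for illustration, not taken from any
reported machine), id_family() gives 0xf, id_model() gives 2 and
id_stepping() gives 4, so looks_like_pentium4("GenuineIntel", 0xf24) is true.
An Id of 0x683 has family 6 and would not match.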

Cheers,
-Peter
--
Peter Wemm - peter_at_wemm.org; peter_at_FreeBSD.org; peter_at_yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5
Received on Tue May 13 2003 - 13:41:55 UTC