Re: PLEASE TEST: IPI deadlock avoidance patch

From: Craig Boston <craig_at_xfoil.gank.org>
Date: Thu, 26 Aug 2004 16:48:17 -0500

On Thu, Aug 26, 2004 at 11:18:34AM -0700, Doug White wrote:
> Okay, for those of you experiencing the data corruption issue, I need to
> know the following:

Sure, I'll do what I can to help track this down.

> . cvsup date & time for the affected kernel(s)

"Sometime yesterday" is the closest I can come up with from memory.
/usr/sup/src-all/checkouts.cvs:RELENG_5 seems to indicate 8/25/2004
05:39 UTC, which sounds right.

> . branch you're tracking

RELENG_5 + IPI patch
(manually reapplied after each cvsup)

> . revision of src/sys/kern/kern_lock.c - I'm checking for a specific set
>   of commits here

/usr/src/sys/kern/kern_lock.c:
     $FreeBSD: src/sys/kern/kern_lock.c,v 1.74 2004/08/16 15:01:22 kan Exp $

> . reproduction case - applications involved and detailed description of
>   the operation(s) involved.

Ok, here's the procedure I did just now to provoke it:

* Boot with known-good kernel (SMP disabled).

* Copy /usr/src to /usr/src2
  My reasoning is that it's usually a file in /usr/src that gets
  corrupted, so it would be good to have a copy to compare against.

* diff -r /usr/src /usr/src2
  Verify good copy.

* Reboot with SMP enabled.

* make buildworld
    (-j or not doesn't seem to make a difference here).

* Wait for buildworld to die or strange things to start happening.  This
  time, I noticed partway through that troff was getting signal 11s.

* Stop the build and reboot.  troff was still corrupt after a reboot,
  ruling out my hunch about buffer cache corruption.

* Reboot with good kernel (SMP disabled)

* diff -r /usr/src /usr/src2
  Aha! Some files have changed...  Oddly enough, files in src2 have been
  corrupted, and src2 wasn't being accessed in any way during the
  buildworld.  So it appears that random disk blocks are getting stomped
  on...
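A note on the comparison step: diff -r only shows that the two trees
disagree, not which copy changed.  A minimal sketch of one way to pin
that down, assuming checksums are recorded while running the known-good
kernel and checked again afterwards.  This is not what was done above;
the function names are made up for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root):
    """Map each regular file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents matter here.
            if os.path.isfile(full) and not os.path.islink(full):
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def changed_files(before, after):
    """Return sorted relative paths whose digest differs between two
    snapshots, including files present in only one of them."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

Taking snapshot("/usr/src2") before the SMP boot and again after the
reboot, then passing both to changed_files(), would list the files in
that tree whose contents changed.  The first snapshot should be saved
somewhere other than the suspect filesystem.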

One such file is contrib/groff/contrib/groffer/groffer.sh.  Part of the
file has been replaced by PostScript code, apparently from texinfo.tex.
The affected section starts at exactly 0xc000 and is exactly 0x10000
bytes (64 KB) long.

Additionally, contrib/libstdc++/config/abi/i386-freebsd4/baseline_symbols.txt
had 32 KB replaced with parts of several files: i4b_l3timer.c,
ia64/conf/SKI, ia32_sigtramp.c, several blocks of nulls, and what looks
like a directory entry.  The different 'chunks' within the overwritten
section are all aligned on 2 KB boundaries.
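For anyone wanting to measure this on their own files, here is a rough
sketch of how the offsets and alignment can be extracted from a good
and a bad copy.  The helper names are hypothetical, and the 2048-byte
block size is only an assumption taken from the 2 KB alignment noted
above.

```python
def diff_ranges(good, bad):
    """Return (offset, length) for each contiguous run of differing bytes.

    Bytes past the end of the shorter input count as differing.
    """
    ranges = []
    start = None
    common = min(len(good), len(bad))
    for i in range(common):
        if good[i] != bad[i]:
            if start is None:
                start = i              # a new differing run begins here
        elif start is not None:
            ranges.append((start, i - start))
            start = None
    longest = max(len(good), len(bad))
    if start is None and longest > common:
        start = common                 # only the extra tail differs
    if start is not None:
        ranges.append((start, longest - start))
    return ranges


def touched_blocks(ranges, block_size):
    """Return sorted indices of block_size-byte blocks holding any differing byte."""
    blocks = set()
    for offset, length in ranges:
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return sorted(blocks)


def is_aligned(offset, length, block_size):
    """True if the run starts and ends on block_size boundaries."""
    return offset % block_size == 0 and (offset + length) % block_size == 0
```

A byte-wise comparison can split one overwritten region into several
runs wherever the new data happens to match the old, so touched_blocks()
is usually the more useful view.  For the groffer.sh numbers, 0xc000 is
24 * 2048 and 0x10000 is 32 * 2048, so that region covers 32 whole
2 KB blocks.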

I'm still running a diff to see what other files were affected.  Most of
them are too big to post, but can be made available should anyone want
them.
 
> It would also be nice if you could set up a serial console and attempt to
> break into the debugger with an NMI, if your system is so equipped. You'll
> want to set these sysctls beforehand:

I'm not sure what this will accomplish -- the deadlocks are gone so I
can access ddb from the console anytime.

> My guess here is that there is another change that got masked by the IPI
> problems that are causing this, and getting SMP usable again has brought
> it into the light.

Possibly.  The only other recent change on that machine is the switch to
gvinum from regular vinum.  But there's no hard evidence either
way -- my stripe size is much bigger than the size of the corrupt
sections, and the frequency of the corrupt files appearing seems to be
about the same as the frequency of deadlocks I had prior to the IPI
patch.  Both facts are completely circumstantial and could easily be
coincidence.  Or it could be something else entirely.

Regards,
Craig
Received on Thu Aug 26 2004 - 19:48:20 UTC
