On Thu, Aug 26, 2004 at 11:18:34AM -0700, Doug White wrote:
> Okay, for those of you experiencing the data corruption issue, I need to
> know the following:

Sure, I'll do what I can to help track this down.

> . cvsup date & time for the affected kernel(s)

"Sometime yesterday" is the closest I can come up with by memory.
/usr/sup/src-all/checkouts.cvs:RELENG_5 seems to indicate 8/25/2004 05:39
UTC, which sounds right.

> . branch you're tracking

RELENG_5 + IPI patch (manually reapplied after each cvsup)

> . revision of src/sys/kern/kern_lock.c - I'm checking for a specific set
> of commits here

/usr/src/sys/kern/kern_lock.c:
     $FreeBSD: src/sys/kern/kern_lock.c,v 1.74 2004/08/16 15:01:22 kan Exp $

> . reproduction case - applications involved and detailed description of
> the operation(s) involved.

Ok, here's the procedure I did just now to provoke it:

* Boot with known-good kernel (SMP disabled).
* Copy /usr/src to /usr/src2
    My reasoning is that it's usually a file in /usr/src that gets
    corrupted, so it would be good to have a copy to compare against.
* diff -r /usr/src /usr/src2
    Verify good copy.
* Reboot with SMP enabled.
* make buildworld (-j or not doesn't seem to make a difference here).
* Wait for buildworld to die or strange things to start happening.
    This time, I noticed partway through that troff was getting
    signal 11s.
* Stop the build and reboot.
    troff was still corrupt after a reboot, ruling out my hunch about
    buffer cache corruption.
* Reboot with good kernel (SMP disabled)
* diff -r /usr/src /usr/src2
    Aha! Some files have changed...

Oddly enough, files in src2 have been corrupted, and src2 wasn't being
accessed in any way during the buildworld.  So it appears that random
disk blocks are getting stomped on...

One such file is contrib/groff/contrib/groffer/groffer.sh.  Part of the
file has been replaced by PostScript code, apparently from texinfo.tex.
The affected section starts at 0xc000 exactly and is exactly 0x10000
(64 KB) long.
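
For anyone who wants to locate damaged regions like the one described
above, here is a minimal sketch in Python.  It is not the method used to
get the numbers in this message; the names diff_ranges and report are
made up for illustration, and the 2048-byte default block size is an
assumption.  It compares two copies of a file block by block and prints
the offset and length of each run of differing blocks.

```python
def diff_ranges(good: bytes, bad: bytes, block: int = 2048):
    """Return a list of (offset, length) runs where the two buffers differ.

    The comparison is block by block, so adjacent differing blocks are
    merged into one run and every offset is a multiple of `block`.
    """
    runs = []
    start = None  # offset where the current differing run began, if any
    size = max(len(good), len(bad))
    for off in range(0, size, block):
        same = good[off:off + block] == bad[off:off + block]
        if not same and start is None:
            start = off
        elif same and start is not None:
            runs.append((start, off - start))
            start = None
    if start is not None:
        # The difference extends to the end of the longer buffer.
        runs.append((start, size - start))
    return runs


def report(path_a: str, path_b: str, block: int = 2048) -> None:
    """Print each differing run between two files in hex."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        for off, length in diff_ranges(a.read(), b.read(), block):
            print(f"offset {off:#x} length {length:#x}")
```

Calling report() on the two copies of a file that diff -r flags (one
under /usr/src, one under /usr/src2) would list the damaged ranges.
Because the comparison works on whole blocks, reported lengths are
rounded up to the block size, so a smaller block value gives finer
resolution.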
Additionally,
contrib/libstdc++/config/abi/i386-freebsd4/baseline_symbols.txt had
32 KB worth replaced with parts of several files: i4b_l3timer.c,
ia64/conf/SKI, ia32_sigtramp.c, several blocks of nulls, and what looks
like a directory entry.  The different 'chunks' within the overwritten
section are all aligned on 2 KB boundaries.

I'm still running a diff to see what other files were affected.  Most of
them are too big to post, but can be made available should anyone want
them.

> It would also be nice if you could set up a serial console and attempt to
> break into the debugger with an NMI, if your system is so equipped. You'll
> want to set these sysctls beforehand:

I'm not sure what this will accomplish -- the deadlocks are gone so I can
access ddb from the console anytime.

> My guess here is that there is another change that got masked by the IPI
> problems that are causing this, and getting SMP usable again has brought
> it into the light.

Possibly.  The only other thing recently changed on there is using gvinum
instead of regular vinum.  But there's no hard evidence either way -- my
stripe size is much bigger than the size of the corrupt sections, and the
frequency of the corrupt files appearing seems to be about the same as
the frequency of deadlocks I had prior to the IPI patch.  Both facts are
completely circumstantial and could easily be coincidence.  Or it could
be something else entirely.

Regards,
Craig

Received on Thu Aug 26 2004 - 19:48:20 UTC