8.0-RC1 amd64 - filesystem on scsi disk related memory corruption w/ gte 4GB ram

From: Mike Watters <mike_at_mwatters.net>
Date: Tue, 20 Oct 2009 16:51:53 -0700

Running 8.0-RC1 on an amd64 workstation, I have run into what appears to
be a memory corruption issue when doing (UFS2) filesystem I/O on an
attached SCSI disk when more than 4GB of RAM is installed, or when 4GB
is installed and "memory hole remapping" is enabled in the BIOS.

The memory modules all pass memtest86+ (e820 map) individually and in
combination.  A Linux rescue disc runs fine, and the memtester program
included thereon doesn't complain.  I can scan the disk (using dd) from
that rescue disc without errors.  SCSI verify commands run from the SCSI
card complete successfully.  The disk reports no grown defects and no
SMART failures.  I tried two different U320 cables/terminators and two 
(consumer-class) motherboards.  Underclocking the RAM doesn't resolve 
the issue.

The disk works fine with 2GB (1x2) or 3GB (3x1) of RAM installed, and
with 8GB (4x2) installed with hw.physmem="2GB" set in the loader (tested
that case with remapping off).  "Works" means the following command
completes successfully with root, /usr, and /var mounted r/o:

# find / -type f -exec md5 -q {} \; > /dev/null
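A variant of the command above that keeps the digests rather than
discarding them might make a working boot and a failing boot directly
comparable.  This is only a sketch: digest_of and manifest are
hypothetical helper names, and the md5sum fallback is an assumption so
that the same script can also run from a Linux rescue disc.  File names
containing newlines are not handled.

```shell
#!/bin/sh
# Print the MD5 digest of one file.  Uses FreeBSD's md5(1) when it is
# available, otherwise falls back to md5sum(1) (assumption: Linux rescue disc).
digest_of() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        out=$(md5sum < "$1") || return 1
        printf '%s\n' "${out%% *}"
    fi
}

# Walk a directory tree and print one "digest  path" line per regular file.
# Files that cannot be read are recorded as READ-ERROR instead of being skipped.
manifest() {
    find "$1" -type f -print | sort | while IFS= read -r f; do
        d=$(digest_of "$f") || d="READ-ERROR"
        printf '%s  %s\n' "$d" "$f"
    done
}
```

Usage would be something like "manifest / > /somewhere/writable/run1.txt"
under each memory configuration (the output path is hypothetical and
must not be on the r/o filesystems), followed by diff(1) on the two
manifests.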

When it doesn't work, the following cases tend to occur:

1.  (can't run command): "ROOT MOUNT ERROR" during boot (following
"GEOM: da0s1: invalid disklabel.").

2.  (can't run command): Hang during "Trying to mount root" (not
following a disklabel error).

3.  (can't run command): "invalid disklabel" error when booting from
the livefs cd.

4.  (command is running, both from livefs cd and after a successful da0
root mount): g_vfs_done() kernel error messages with high-magnitude
positive and negative offset values.  Input/output errors and invalid
file descriptor errors on specific files.  Eventual panic (most recent
was something equivalent to "GPF while in kernel mode; trap 9 while
interrupts disabled").


For cases 3 and 4, I checked the first 16k of the slice with the
following command:

# dd if=/dev/da0s1 count=32 | md5 -q

The same digest was produced in cases (3) and (4) and in the cases that
worked without error (2GB, and 4GB w/remapping off, running from livefs cd).
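Since a single read gave the expected digest even in the failing cases,
repeating the same read several times in one session and counting the
distinct digests might show whether the corruption is intermittent at
this level.  A minimal sketch, where sum_stdin and distinct_reads are
hypothetical helper names and the md5sum fallback is again an assumption
for non-FreeBSD systems:

```shell
#!/bin/sh
# Print the MD5 digest of standard input (md5(1) on FreeBSD, else md5sum(1)).
sum_stdin() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q
    else
        md5sum | cut -d' ' -f1
    fi
}

# distinct_reads DEVICE COUNT N
# Read the first COUNT 512-byte blocks of DEVICE N times and print how many
# distinct digests were seen.  "1" means every read returned the same data.
distinct_reads() {
    dev=$1
    count=$2
    n=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        dd if="$dev" bs=512 count="$count" 2>/dev/null | sum_stdin
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}
```

For the 16k region checked above this would be called as
"distinct_reads /dev/da0s1 32 20" (the repeat count of 20 is arbitrary).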

In some of the failing cases, "camcontrol inquiry" would
intermittently return an empty result, e.g. (retyped):

# camcontrol inquiry da0
pass0: <   > Fixed Direct Access SCSI-0 device
pass0: Serial Number
(can't remember 3rd output line)

("Intermittently" meaning that repeating the command during the same
session could yield the proper result after a number of attempts.)
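To put a number on how often the empty result shows up, the command
could be repeated in a loop and the blank inquiry lines counted.  This
is a rough sketch: count_empty is a hypothetical helper, and it assumes
that an empty result always appears as "<" and ">" with only whitespace
between them, as in the retyped output above.

```shell
#!/bin/sh
# count_empty N COMMAND [ARGS...]
# Run COMMAND N times and print how many runs produced an inquiry line
# whose vendor/product field is blank, i.e. "<   >".
count_empty() {
    n=$1
    shift
    empty=0
    i=0
    while [ "$i" -lt "$n" ]; do
        if "$@" 2>&1 | grep -q '<[[:space:]]*>'; then
            empty=$((empty + 1))
        fi
        i=$((i + 1))
    done
    echo "$empty"
}
```

For example, "count_empty 20 camcontrol inquiry da0" would print the
number of empty results out of 20 attempts.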

The SCSI card is an LSI20160 (sym(4), PCI U160).  The SCSI disk is a
Seagate ST373455LW (U320).  The current motherboard is an ASUS M3A76-CM
(AMI BIOS, AM2+ socket).  I have a boot -v dmesg (34KB) available from a
livefs cd boot with 8GB installed and memory hole remapping turned on
(case 4 result).  The source used to build the CD was cvsup'ed a week or
two ago.

Hardware common to failing cases:

1. Power supply.
2. CPU itself.
3. SCSI card.
4. SCSI disk.
5. RAM.

The system appears to be stable when not using the SCSI disk.  I would
appreciate any suggestions anyone might have, or confirmation that
someone has the same type of setup working under 8.0-RC1.  I haven't yet 
tried the setup under 7-stable-amd64.


Mike
Received on Tue Oct 20 2009 - 22:19:58 UTC