On Monday 15 June 2009 03:12:41 Freddie Cash wrote:
> On Sun, Jun 14, 2009 at 6:27 AM, ian j hart <ianjhart_at_ntlworld.com> wrote:
> > On Sunday 14 June 2009 09:27:22 Freddie Cash wrote:
> > > On Sat, Jun 13, 2009 at 3:11 PM, ian j hart <ianjhart_at_ntlworld.com> wrote:
> > > >
> > > > [long post with long lines, sorry]
> > > >
> > > > I have the following old hardware which I'm trying to make into a
> > > > storage server (back story elided).
> > > >
> > > > Tyan Thunder K8WE with dual Opteron 270
> > > > 8GB REG ECC RAM
> > > > 3ware/AMCC 9550SXU-16 SATA controller
> > > > Adaptec 29160 SCSI card -> Quantum LTO3 tape
> > > > ChenBro case and backplanes.
> > > > 'don't remember' PSU. I do remember paying £98 3 years ago, so not cheap!
> > > > floppy
> > > >
> > > > Some Seagate Barracuda drives. Two old 500GB for the O/S and 14 new
> > > > 1.5TB for data (plus some spares).
> > > >
> > > > Astute readers will know that the 1.5TB units have a chequered history.
> > > >
> > > > I went to considerable effort to avoid being stuck with a bricked
> > > > unit, so imagine my dismay when, just before I was about to post
> > > > this, I discovered there's a new issue with these drives where they
> > > > reallocate sectors, from new.
> > > >
> > > > I don't want to get sucked into a discussion about whether these
> > > > disks are faulty or not. I want to examine what seems to be a
> > > > regression between 7.2-RELEASE and 8-CURRENT. If you can't resist,
> > > > start a thread in chat and CC me.
> > > >
> > > > Anyway, here's the full story (from memory I'm afraid).
> > > >
> > > > All disks exported as single drives (no JBOD anymore).
> > > > Install current snapshot on da0 and gmirror with da1, both 500GB disks.
> > > > Create a pool with the 14 1.5TB disks. Raidz2.
> > >
> > > Are you using a single raidz2 vdev using all 14 drives? If so, that's
> > > probably (one of) the source of the issues. You really shouldn't use
> > > more than 8 or 9 drives in a single raidz vdev. Bad things happen,
> > > especially during resilvers and scrubs. We learned this the hard way,
> > > trying to replace a drive in a 24-drive raidz2 vdev.
> > >
> > > If possible, try to rebuild the pool using multiple, smaller raidz
> > > (1 or 2) vdevs.
> >
> > Did you post this issue to the list or open a PR?
>
> No, as it's a known issue with ZFS itself, and not just the FreeBSD port.
>
> > This is not listed in zfsknownproblems.
>
> It's listed in the OpenSolaris/Solaris documentation, best practises
> guides, blog posts, and wiki entries.

I have the Administration guide (June 2009). Page 64:
"...configuration with 14 disks is better split into a (sic) two 7-disk
groupings...single-digit groupings of disks should perform better."

This implies it works. Can you point to the small print? My GoogleFoo is weak.

> > Does opensolaris have this issue?
>
> Yes.

Anyway, I broke up the pool into two groups as you suggested.

As usual it scrubs cleanly on 7.2. It started throwing errors within a few
minutes under 8. Then it panicked, possibly due to scrub -s.

It's sat at the DB prompt if there's anything I can do. I'll need
idiot's-guide-level instructions. I have a screen dump if someone wants to
step up. Off list?

Highlight seems to be...

Memory modified after free 0xffffff0004da0c00(248) val=3000000 @ 0xffffff0004dc00
Panic: most recently used by none

Cheers

--
ian j hart
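For readers following the thread, a minimal sketch of the kind of two-vdev
split described above. The pool name and device names (tank, da2..da15) are
illustrative assumptions, not taken from the thread:

    # One pool built from two 7-disk raidz2 vdevs instead of a single
    # 14-disk raidz2 vdev (pool and device names are hypothetical).
    zpool create tank \
        raidz2 da2 da3 da4 da5 da6 da7 da8 \
        raidz2 da9 da10 da11 da12 da13 da14 da15

    # Verify the layout, then start a scrub.
    # ('zpool scrub -s tank' stops a scrub that is in progress.)
    zpool status tank
    zpool scrub tank

Each raidz2 vdev keeps its own two disks' worth of parity, so this layout
gives up some capacity in exchange for the smaller resilver/scrub domains
recommended earlier in the thread.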