Re: ZFS checksum errors on umass(4) insertion

From: Richard Todd <rmtodd_at_ichotolot.servalan.com>
Date: Thu, 16 Apr 2009 13:36:48 -0500

Damian Gerow <dgerow_at_afflictions.org> writes:
> 1) Reverting the extended attribute locking change (r189967) does not change
> the situation for me.  I still experience checksum issues and data loss.
> (Unsurprisingly.)
>
> 2) Without umass loaded, I have been completely unable to trigger the issue.
>
> 3) Once umass is loaded, and the symptoms start cropping up, unloading umass
> does not make them go away (again, unsurprisingly).  What I haven't yet
> tested, but am currently working towards, is whether removing umass stops
> further checksum errors from occurring.
>
> 4) r189967 does remove some LORs for me, even though I don't use (that I
> know of) extended attributes.
>
> 5) It seems that so long as umass is used at all, the symptoms will
> eventually show up.  I've been able to trigger the symptoms by inserting
> then removing a umass device immediately after boot, then ramping up the
> workload.
>
> 6) The only difference made by vfs.zfs.debug=1 is that zfs reclaims are
> logged.
>
> I'm at a bit of a loss as to what to test next, other than checking for an
> increased number of checksum errors after unloading umass.  However, I'm not
> convinced this is going to highlight the actual problem.  I'm all ears as to
> what to test for at this point, as I'm running out of ideas.

I have a question or two, and an idea.  

The questions: 

1) How much RAM do you have?  Is it 4G or more?  (I'm guessing the
answer is "yes".)

2) What does "sysctl -a | grep bounced" say?  Check this both before and after
loading umass and seeing the bug triggered.
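
A rough sketch of that before/after comparison, assuming a POSIX sh and awk
are available.  The helper name sum_bounced is made up for illustration; it
just adds up every total_bounced counter it sees on stdin:

```shell
# Sum all total_bounced counters from "name: value" lines read on stdin.
# sum_bounced is a hypothetical helper name, not an existing tool.
sum_bounced() {
    awk -F': ' '/total_bounced/ { sum += $2 } END { printf "%d\n", sum }'
}

# Intended use on the affected machine (left commented out here):
#   before=$(sysctl -a | sum_bounced)
#   kldload umass        # then insert the device and ramp up the workload
#   after=$(sysctl -a | sum_bounced)
#   echo "pages bounced since baseline: $((after - before))"
```

A non-zero difference would mean bounce buffers were in use while the
problem was being reproduced.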

My idea: I suspect a bug in the bounce-buffer code, which handles I/O to
memory beyond the range a given piece of hardware can reach directly through
DMA.  I've had similar checksum errors, and they seem to have gone away since
I lowered hw.physmem to 3400M in loader.conf, which keeps memory usage below
the point where anything needs to use bounce buffers.
You might try lowering hw.physmem and see if that helps.  Check with the
"sysctl -a | grep bounced" command; you should see something like

hw.busdma.zone0.total_bounced: 0
hw.busdma.zone1.total_bounced: 0
hw.busdma.zone2.total_bounced: 0

if no bounce-buffer usage is going on.  (The number of zones may be different
on your system.)
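
For reference, the hw.physmem workaround described above is a single loader
tunable, normally set in /boot/loader.conf and picked up at the next reboot.
The 3400M figure is only the value reported above; the right cutoff for other
hardware is an assumption that would need checking against the bounce
counters:

```
# /boot/loader.conf
hw.physmem="3400M"
```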

Received on Thu Apr 16 2009 - 16:46:29 UTC