Strange disk problem

From: David Ehrmann <ehrmann_at_gmail.com> Date: Mon, 19 Apr 2010 23:57:13 -0700

Initially, I noticed a problem where reading a file on this machine 
seemed to stop--something like a video would just stop playing.  At 
first, I thought it was the machine, but a new motherboard, CPU, and RAM 
later, the problem persists.  The network card uses a different chipset, 
too.

The files are on zfs, but scrubs are fine, and zpool status lists no 
errors of any kind.  Trying to reproduce the problem, I set up a script
that reads a random 1M block every 60 seconds off each drive backing
zfs.  That's when I noticed something: one disk seems to be causing the 
problems.  I logged the dd times, and some of them were huge--more than 
a minute.  The times on the other disk in the mirrored vdev were low.
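
For anyone who wants to run a similar test, here is a minimal Python sketch of that kind of probe. It is not the original script (which used dd); the function names and log format are made up for illustration, and it assumes that seeking to the end of the target reports its size, which may not hold for every device node. If it does not, pass the size in explicitly.

```python
import os
import random
import time

BLOCK_SIZE = 1 << 20  # 1 MiB, matching the "1M block" described above


def timed_random_read(path, size=None, block_size=BLOCK_SIZE):
    """Read one block at a random block-aligned offset.

    Returns (offset, bytes_read, elapsed_seconds).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if size is None:
            # Assumption: seeking to the end reports the file or device size.
            size = os.lseek(fd, 0, os.SEEK_END)
        blocks = size // block_size
        if blocks < 1:
            raise ValueError("target is smaller than one block")
        # Block-aligned offsets keep raw-device reads sector-aligned.
        offset = random.randrange(blocks) * block_size
        start = time.monotonic()
        data = os.pread(fd, block_size, offset)
        elapsed = time.monotonic() - start
        return offset, len(data), elapsed
    finally:
        os.close(fd)


def monitor(paths, interval=60.0, rounds=None):
    """Log one timed read per path, then sleep; repeat forever or `rounds` times."""
    done = 0
    while rounds is None or done < rounds:
        for path in paths:
            offset, nbytes, elapsed = timed_random_read(path)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {path} offset={offset} bytes={nbytes} "
                  f"{elapsed:.3f}s", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling something like monitor(["/dev/ad4", "/dev/ad6"]) (hypothetical device names) would log both halves of the mirror side by side, so a slow disk stands out against its partner.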

I've only seen the problem when I have a VM's disk image hosted on the
machine.  That said, the network interface is configured at 100 Mbps
(roughly 12 MB/s at best), so
there's no reason for that to saturate the disk's throughput.  Top 
reports that almost 20% of the CPU is going towards interrupts.  I can 
read a file off the zfs pool at over 50MB/s, so that shouldn't be a 
problem.  One thing I'm wondering is why the disk read doesn't time out
quickly.  At least that way zfs could try to use the other drive in the
mirrored vdev.

Any ideas?  One thing I should try is swapping the drives to see whether
the problem follows the disk or stays with the lowest /dev/adX device.
I'm using geli, but the read problems happen with both /dev/adX AND
/dev/adX.eli, so I don't think that's it.  I've seen the problem with
Samba, NFS, and dd.

Thanks in advance.