Uneven load on drives in ZFS RAIDZ1

From: Stefan Esser <se_at_freebsd.org>
Date: Mon, 19 Dec 2011 15:22:06 +0100

Hi ZFS users,

for quite some time I have observed an uneven distribution of load
between the drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt
from a longer log of 10-second averages captured with gstat:

dT: 10.001s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
    0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
    0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
    1     81     58   2007    4.6     22   1023    2.3   28.1| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    132    104   4036    4.2     27   1129    5.3   45.2| ada0
    0    129    103   3679    4.5     26   1115    6.8   47.6| ada1
    1     91     61   2133    4.6     30   1129    1.9   29.6| ada2
    0     81     56   1985    4.8     24   1102    6.0   29.4| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    148    108   4084    5.3     39   2511    7.2   55.5| ada0
    1    141    104   3693    5.1     36   2505   10.4   54.4| ada1
    1    102     62   2112    5.6     39   2508    5.5   35.4| ada2
    0     99     60   2064    6.0     39   2483    3.7   36.1| ada3

This goes on for minutes without a change of roles. (I had assumed that
other 10-minute samples might show relatively higher load on another
subset of the drives, but it is always the first two that receive some
50% more read requests than the other two.)

The test consisted of minidlna rebuilding its content database for a
media collection held on that pool. The unbalanced distribution of
requests does not depend on the particular application, and it does not
change when the most heavily loaded drives approach 100% busy.

This is a -CURRENT built from yesterday's sources, but the problem has
existed for quite some time (and should definitely be reproducible on
-STABLE, too).

The pool consists of a 4-drive raidz1 on an ICH10 (H67) without cache or
log devices and without much ZFS tuning (only the maximum ARC size is
limited, which should not be relevant at all in this context):

zpool status -v
  pool: raid1
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0

errors: No known data errors

Cached configuration:
        version: 28
        name: 'raid1'
        state: 0
        txg: 153899
        pool_guid: 10507751750437208608
        hostid: 3558706393
        hostname: 'se.local'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 10507751750437208608
            children[0]:
                type: 'raidz'
                id: 0
                guid: 7821125965293497372
                nparity: 1
                metaslab_array: 30
                metaslab_shift: 36
                ashift: 12
                asize: 7301425528832
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7487684108701568404
                    path: '/dev/ada0p2'
                    phys_path: '/dev/ada0p2'
                    whole_disk: 1
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12000329414109214882
                    path: '/dev/ada1p2'
                    phys_path: '/dev/ada1p2'
                    whole_disk: 1
                    create_txg: 4
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 2926246868795008014
                    path: '/dev/ada2p2'
                    phys_path: '/dev/ada2p2'
                    whole_disk: 1
                    create_txg: 4
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 5226543136138409733
                    path: '/dev/ada3p2'
                    phys_path: '/dev/ada3p2'
                    whole_disk: 1
                    create_txg: 4
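
For anyone who wants to compare against another pool: a cached
configuration like the one above can be printed with zdb (the exact
invocation below is meant only as an illustration; adjust the pool
name):

	zdb -C raid1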

I'd be interested to know whether this behavior can be reproduced on
other systems with raidz1 pools consisting of 4 or more drives. All it
takes is generating some disk load and running the command:

	gstat -I 10000000 -f '^a?da?.$'

to obtain 10-second averages.
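
If gstat is inconvenient for a longer unattended run, iostat can collect
a comparable series of per-drive samples. The sketch below is only an
illustration (the device names match my pool members, the sample count
is arbitrary):

	iostat -x -w 10 -c 60 ada0 ada1 ada2 ada3 > /var/tmp/raidz-load.log

This records 60 extended samples at 10-second intervals for the four
pool members.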

I have not even tried to look at the scheduling of requests in ZFS, but
I'm surprised to see higher-than-average load on just 2 of the 4 drives.
RAIDZ parity should be evenly spread over all drives, and for each file
system block a different subset of 3 out of 4 drives should be able to
deliver the data without reconstructing it from parity. That would lead
to an even distribution of load: each drive would then see roughly a
quarter of the reads, i.e. about 85 r/s in the first sample above.

I've got two theories about what might cause the observed behavior:

1) There is some metadata that is only kept on the first two drives.
Data is evenly spread, but metadata accesses lead to additional reads.

2) The read requests are distributed in such a way that 1/3 goes to
ada0, another 1/3 to ada1, while the remaining 1/3 is evenly distributed
between ada2 and ada3 (a quick arithmetic check against the first sample
above is sketched below).
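
As a rough check of theory 2 against the first sample above: the r/s
values sum to 106 + 111 + 66 + 58 = 341, giving shares of about 31%,
33%, 19% and 17%, which is quite close to the predicted 1/3, 1/3, 1/6,
1/6. (The values below are simply copied from that sample; the one-liner
is only illustrative arithmetic:)

	echo 106 111 66 58 | awk '{ t = $1+$2+$3+$4;
	    for (i = 1; i <= 4; i++) printf "ada%d: %.0f%%\n", i-1, 100*$i/t }'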


So: Can anybody reproduce this distribution of requests?

Any idea why this is happening, and whether something should be changed
in ZFS to better distribute the load (leading to higher file system
performance)?

Best regards, STefan