Re: Uneven load on drives in ZFS RAIDZ1

From: Stefan Esser <se_at_freebsd.org>
Date: Tue, 20 Dec 2011 12:45:48 +0100
On 19.12.2011 22:53, Dan Nelson wrote:
> In the last episode (Dec 19), Stefan Esser said:
>> pool        alloc   free   read  write   read  write
>> ----------  -----  -----  -----  -----  -----  -----
>> raid1       4.41T  2.21T    139     72  12.3M   818K
>>   raidz1    4.41T  2.21T    139     72  12.3M   818K
>>     ada0p2      -      -    114     17  4.24M   332K
>>     ada1p2      -      -    106     15  3.82M   305K
>>     ada2p2      -      -     65     20  2.09M   337K
>>     ada3p2      -      -     58     18  2.18M   329K
>>
>> The same difference of read operations per second as shown by gstat ...
> 
> I was under the impression that the parity blocks were scattered evenly
> across all disks, but from reading vdev_raidz.c, it looks like that isn't
> always the case.  See the comment at the bottom of the
> vdev_raidz_map_alloc() function; it looks like it will toggle parity between
> the first two disks in a stripe every 1MB.  It's not necessarily the first

Thanks, this is very interesting information indeed. I observed the
problem when minidlna rebuilt its index database, which scans all media
files, many of them gigabytes in size and written sequentially. This is
a typical scenario that should trigger the code you pointed at.

The comment explains that an attempt was made to spread the (read)
load more evenly when large files are written sequentially:

 * If all data stored spans all columns, there's a danger that parity
 * will always be on the same device and, since parity isn't read
 * during normal operation, that that device's I/O bandwidth won't be
 * used effectively. We therefore switch the parity every 1MB.

But they later found that they had failed to implement a good solution:

 * ... at least that was, ostensibly, the theory. As a practical
 * matter unless we juggle the parity between all devices evenly, we
 * won't see any benefit. Further, occasional writes that aren't a
 * multiple of the LCM of the number of children and the minimum
 * stripe width are sufficient to avoid pessimal behavior.

But I do not understand the reasoning behind:

 * Unfortunately, this decision created an implicit on-disk format
 * requirement that we need to support for all eternity, but only
 * for single-parity RAID-Z.

I see how the devidx and offset are swapped between col[0] and col[1],
and it appears that this swapping is not explicitly reflected in the
metadata. But there is no reason the algorithm could not be modified
to cover all drives if some flag is set (which would effectively lead
to a 2nd-generation raidz1 with an incompatible block layout).
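
To check my reading of the code, here is a small stand-alone sketch
(my interpretation of vdev_raidz_map_alloc(), not the actual ZFS
sources) that prints which column holds parity for each 1MB region,
both for the current toggle and for a hypothetical rotation across
all children:

/*
 * Toy model of RAID-Z1 parity placement -- an illustration of my
 * reading of vdev_raidz_map_alloc(), not the real code.
 */
#include <stdio.h>

#define NDISKS	4		/* children in the raidz1 vdev */

int
main(void)
{
	unsigned long long off;

	printf("region  parity column (current)  (rotated)\n");
	for (off = 0; off < 8ULL << 20; off += 1ULL << 20) {
		/*
		 * Current behavior: parity normally lives in column 0;
		 * when bit 20 of the block offset is set, the devidx
		 * and offset of columns 0 and 1 are swapped, so parity
		 * only ever lands in the first two columns.
		 */
		int current = (off & (1ULL << 20)) ? 1 : 0;

		/*
		 * Hypothetical flag-gated layout: rotate parity
		 * through all children, one step per 1MB region.
		 */
		int rotated = (int)((off >> 20) % NDISKS);

		printf("%4lluMB %12d %16d\n", off >> 20, current, rotated);
	}
	return (0);
}

Note that which physical drive a given column maps to itself depends
on the block offset, so the two "parity-heavy" columns need not
correspond to the same two drives for all data.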

Anyway, I do not think the current behavior is so bad that it needs
immediate fixing.

> two disks assigned to the zvol, since stripes don't have to span all disks
> as long as there's one parity block (a small sync write may just hit two
> disks, essentially being written mirrored).  The imbalance is only visible
> if you're writing full-width stripes in sequence, so if you write a 1TB file
> in one long stream, chances are that that file's parity blocks will be
> concentrated on just two disks, so those two disks will get less I/O on
> later reads.  I don't know why the code toggles parity between just the
> first two columns; rotating it between all columns would give you an even
> balance.

Yes, but as the comment indicates, this would require the introduction
of a different raidz1 layout (a higher ZFS version or a flag could
trigger that).
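
As an aside, your point that a small sync write "may just hit two
disks" is easy to see with a toy calculation (my simplification of the
allocation math, assuming 512-byte sectors and ignoring padding): a
block occupies one column per data sector plus one parity column,
capped at the number of children.

/*
 * Rough model of how many children a single RAID-Z1 block touches.
 * A simplification, not the real allocation code; assumes ashift=9.
 */
#include <stdio.h>

#define NDISKS	4
#define NPARITY	1
#define SECTOR	512

int
main(void)
{
	int sizes[] = { 512, 4096, 131072 };
	int i;

	for (i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); i++) {
		int cols = sizes[i] / SECTOR + NPARITY;

		if (cols > NDISKS)
			cols = NDISKS;
		printf("%6d-byte block -> %d of %d children%s\n",
		    sizes[i], cols, NDISKS,
		    cols == 2 ? " (data + parity, i.e. mirrored)" : "");
	}
	return (0);
}

A single-sector write thus ends up on just two children: the data
sector plus a parity "copy" of it.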

> Is it always the last two disks that have less load, or does it slowly
> rotate to different disks depending on the data that you are reading?  An
> interesting test would be to idle the system, run a "tar cvf /dev/null
> /raidz1" in one window, and watch iostat output on another window.  If the
> load moves from disk to disk as tar reads different files, then my parity
> guess is probably right.  If ada0 and ada1 are always busier, then you can
> ignore me :)

Yes, you are perfectly right! I tested the tar on a spool directory
holding DVB-C recordings (typical file length 2GB to 8GB). The gstat
output shows the busy pair moving around:

dT: 10.001s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    935    921  40216    0.4     13    139    0.5   32.8| ada0
    0    927    913  36530    0.3     13    108    1.5   31.8| ada1
    0    474    460  20110    0.7     14    141    0.9   32.4| ada2
    0    474    461  20102    0.7     13    141    0.7   31.6| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0   1046   1041  45503    0.3      5     35    0.9   31.5| ada0
    0   1039   1035  41353    0.3      4     23    0.4   31.6| ada1
    0    531    526  22827    0.6      5     38    0.4   33.4| ada2
    1    523    518  22772    0.6      5     38    0.6   30.8| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    384    377  16414    0.8      7     46    3.3   30.2| ada0
    0    380    373  15857    0.8      6     42    0.4   30.5| ada1
    0    553    547  23937    0.5      6     47    1.7   28.0| ada2
    1    551    545  22004    0.6      6     38    0.7   32.2| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    667    656  28633    0.4     11    123    0.6   29.6| ada0
    1    660    650  26010    0.5     10    109    0.6   33.4| ada1
    0    338    327  14328    0.8     11    126    0.9   25.7| ada2
    0    339    328  14303    1.0     11    120    1.0   32.7| ada3

$ iostat -d -n4 3
            ada0             ada1             ada2             ada3
  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s
  44.0 860 36.94   40.0 860 33.60   44.0 429 18.44   44.0 431 18.50
  43.9 814 34.86   39.9 813 31.67   43.8 408 17.45   43.7 408 17.44
  43.4 900 38.10   39.4 899 34.64   42.5 463 19.18   42.7 459 19.14
  44.0 904 38.86   40.0 904 35.33   44.0 453 19.44   44.0 452 19.42
! 43.1 571 24.01   41.5 571 23.17   43.4 799 33.85   40.0 801 31.27
! 44.0 461 19.79   44.0 460 19.74   44.0 920 39.52   40.0 920 35.93
! 43.9 435 18.65   43.9 435 18.68   44.0 868 37.29   40.0 868 33.91
! 42.8 390 16.29   42.8 390 16.28   43.4 765 32.42   39.4 767 29.48
! 44.0 331 14.22   44.0 329 14.12   44.0 659 28.32   40.0 659 25.75
! 41.8 332 13.55   42.1 326 13.38   42.9 640 26.84   39.0 640 24.38
  44.0 452 19.40   42.2 451 18.58   44.0 597 25.66   40.7 595 23.65
= 42.3 589 24.33   39.8 585 22.75   42.1 562 23.14   39.7 561 21.77
= 43.0 569 23.93   40.8 570 22.72   43.0 641 26.95   40.1 642 25.14
  44.0 709 30.48   40.9 710 28.41   44.0 607 26.10   41.8 606 24.73
  44.0 785 33.73   40.6 784 31.07   44.0 567 24.36   42.4 568 23.50
  44.0 899 38.62   40.0 899 35.11   44.0 449 19.30   44.0 450 19.32
  44.0 881 37.87   40.0 881 34.43   44.0 441 18.94   44.0 441 18.93
  43.4 841 35.61   39.4 841 32.37   42.7 428 17.87   42.7 428 17.84

Hmmm, looking back through hundreds of lines of iostat output, I see
that ada0 and ada1 always have similar request rates, as do ada2 and
ada3 (the samples marked "!" above are those where the hot pair
flipped to ada2/ada3, and "=" marks nearly balanced samples).
But I know that I also observed other combinations in earlier tests
(with different data?).

> Since it looks like the algorithm ends up creating two half-cold parity
> disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
> worse balancing, and a 5-disk set would be more even.

Yes, this sounds very reasonable. Some iostat results were posted for
a 6-disk raidz1, but they were for writes, not reads. I have kept the
3*1TB drives that formed the pool before I replaced them with 4*2TB
drives; I can create a 3-drive raidz1 on them and perform some tests ...
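
A quick sanity check of the expected ratio, assuming parity really
stays in the first two columns: on full-width reads, every stripe
skips exactly one of those two columns, so the drives currently
holding them are read only half as often as the rest. For four disks
that gives a 1 : 1 : 2 : 2 read distribution, i.e. the roughly 2:1
split between the disk pairs visible in the gstat and iostat numbers
above; for three disks it would be 1 : 1 : 2.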


BTW: Read throughput in the tar test was far lower than I had expected.
The CPU load was 3% user and some 0.2% system time (on an i7-2600K),
and the effective transfer speed of the RAID was only some 115MB/s,
even though a single drive delivers about 120MB/s (see the dd test
below), so three data disks should permit something closer to 350MB/s.
The pool has 1/3 of its space free, and the test files were written in
one go and should have been laid out optimally.

A dd of a large file (~10GB) gives similar results, independent of the
block size (128k vs. 1m).

Transfer sizes were only 43KB on average, which matches MAXPHYS=128KB
distributed over 3 data drives (plus parity in the case of writes).
This indicates that, in order to read MAXPHYS bytes from each drive,
the original request would have to cover 3*MAXPHYS.
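
Explicitly: 128KB / 3 ≈ 42.7KB, which matches the 43-44KB/t in the
iostat output above; conversely, a logical read of 3 * 128KB = 384KB
would be needed to issue full 128KB transfers to each data drive.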

But the small transfer length does not seem to be the cause of the low
transfer rate:

# dd if=/dev/ada2p2 of=/dev/null bs=10k count=10000
10000+0 records in
10000+0 records out
102400000 bytes transferred in 0.853374 secs (119994281 bytes/sec)

# dd if=/dev/ada1p2 of=/dev/null bs=2k count=50000
50000+0 records in
50000+0 records out
102400000 bytes transferred in 2.668089 secs (38379531 bytes/sec)

Even a block size of 2KB will result in 35-40MB/s read throughput ...

Any idea why the read performance is so much lower than what the
hardware could deliver?

Regards, STefan
Received on Tue Dec 20 2011 - 10:46:54 UTC
