Re: Uneven load on drives in ZFS RAIDZ1

From: Stefan Esser <se_at_freebsd.org>
Date: Mon, 19 Dec 2011 19:56:03 +0100
On 19.12.2011 16:42, Peter Maloney wrote:
> On 12/19/2011 03:22 PM, Stefan Esser wrote:
>> So: Can anybody reproduce this distribution of requests?
> I don't have a raidz1 machine, and no time to make you a special raidz1
> pool out of spare disks, but on my raidz2 I can only ever see unevenness
> when a disk is bad, or between different vdevs. But you only have one vdev.

Thanks for replying.

In my previous raidz1 pool consisting of 3*1TB, one of the drives had to
be replaced because it showed lots of recoverable errors when I
initially created the pool. The effects were much more drastic than
what I see now: Given identical request rates, the failed drive was 100%
busy while the other drives had busy percentages in the single-digit range.

But the observed differences seem to be caused by a different rate of
read requests issued towards the drives (the first two receive 30% of
the reads, each, while the last two receive 20% each). And this ratio
has been stable over months (I had already noticed this in summer, but
did not have time to start a thread at that time).
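
To make the ratio concrete, here is a minimal Python sketch of how the
per-drive share of read requests can be computed from cumulative read
counters (for example as reported by iostat or gstat). The counter values
are made-up placeholders chosen to mirror the 30/30/20/20 split described
above, not measurements from this pool, and read_share is a hypothetical
helper name.

```python
def read_share(read_counts):
    """Return each drive's percentage of the total read requests."""
    total = sum(read_counts.values())
    if total == 0:
        # No reads at all: report 0% everywhere instead of dividing by zero.
        return {dev: 0.0 for dev in read_counts}
    return {dev: 100.0 * n / total for dev, n in read_counts.items()}


# Placeholder counters (NOT measured values), chosen to mirror the ratio above.
sample = {"ada0": 3000, "ada1": 3000, "ada2": 2000, "ada3": 2000}

for dev, pct in sorted(read_share(sample).items()):
    print(f"{dev}: {pct:.1f}% of reads")
```

Feeding it counters sampled at two points in time (and subtracting) would
show whether the split is stable over an interval rather than only since
boot.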


> Check that your disks are identical (are they? we can only assume so
> since you didn't say so).

Yes, all 4 are identical.

> Show us output from:
> smartctl -i /dev/ada0

Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD1B116957
LU WWN Device Id: 5 0024e9 0049bee63
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Mon Dec 19 19:23:36 2011 CET

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10127
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       254
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2300
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       621067
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       28 (Min/Max 15/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       264

> smartctl -i /dev/ada1

Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD1B116947
LU WWN Device Id: 5 0024e9 0049bee49
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Mon Dec 19 19:23:22 2011 CET

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10096
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       255
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2316
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       231
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       2175909
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       26 (Min/Max 16/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       264

> smartctl -i /dev/ada2

Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD1B116956
LU WWN Device Id: 5 0024e9 0049bee60
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Mon Dec 19 19:24:24 2011 CET

  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   066   025    Pre-fail  Always       -       10254
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       246
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2300
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       227
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       105259
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   056   000    Old_age   Always       -       28 (Min/Max 16/45)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       256

> smartctl -i /dev/ada3

Model Family:     SAMSUNG SpinPoint F4 EG (AFT)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD1B116946
LU WWN Device Id: 5 0024e9 0049bee47
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Mon Dec 19 19:24:55 2011 CET

  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   066   066   025    Pre-fail  Always       -       10472
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       250
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2302
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       227
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       239254
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       27 (Min/Max 16/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       259

> Since your tests show read ms/r to be pretty even, I guess your disks
> are not broken. But the ms/w is slightly different. So I think it seems
> that the first 2 disks are slower for writing (someone once said that


My interpretation is that the first two have higher write latencies
because they receive more read requests.

> refurbished disks are like this, even if identical), or the hard disk
> controller ports they use are slower. For example, maybe your
> motherboard has 6 ports, and you plugged disks 1,2,3 into port 1,2,3 and
> disk 4 into port 5. Disk 3 and 4 would have their own channel, but disk
> 1 and 2 share one.

This is an ICH10 and the drives are connected to SATA II channels (the
SATA III channels are reserved for a planned SSD cache).

> So if the disks are identical, I would guess your hard disk controller
> is to blame. To test this, first back it up. Then *fix your setup by
> using labels*. ie. use gpt/somelabel0 or gptid/....... rather than
> ada0p2. Check "ls /dev/gpt*" output for options on what labels you have
> already. Then try swapping disks around to see if the load changes. Make
> sure to back up...

The drives are already labelled and I can easily modify the pool to
refer to GPT labels. But swapping drives should not cause any harm in
ZFS, whether labels or device names are used (the drives in the pool
are identified by their GUID).
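
The point about GUIDs can be illustrated with a toy model in Python. This
is only a sketch of the idea, not ZFS code: the GUID values, the dictionary
layout and the assemble_pool helper are all hypothetical. Each drive carries
a label holding its own GUID and the pool configuration lists the member
GUIDs, so the members are found again regardless of which device name a
drive ends up with.

```python
def assemble_pool(member_guids, devices):
    """Map each expected member GUID to the device node currently holding it.

    devices: dict of device name -> GUID read from that drive's on-disk label.
    Returns (found, missing), where found maps GUID -> device name.
    """
    by_guid = {guid: name for name, guid in devices.items()}
    found = {g: by_guid[g] for g in member_guids if g in by_guid}
    missing = [g for g in member_guids if g not in by_guid]
    return found, missing


# Hypothetical GUIDs for a 4-drive vdev (real GUIDs are 64-bit numbers).
members = [101, 102, 103, 104]

before = {"ada0": 101, "ada1": 102, "ada2": 103, "ada3": 104}
# Same drives after swapping cables: device names change, labels do not.
after = {"ada0": 103, "ada1": 101, "ada2": 104, "ada3": 102}

for layout in (before, after):
    found, missing = assemble_pool(members, layout)
    print(sorted(found.items()), "missing:", missing)
```

In both layouts all four members are located and nothing is reported
missing; only the GUID-to-name mapping differs.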

> Swapping disks (or even removing one depending on controller, etc. when
> it fails) without labels can be bad.

Yes, I know (having seen my first Unix system more than 30 years ago).
I'll re-import the drives with "zpool import -d /dev/gpt ..." but need
to boot from an alternate boot device first.

> eg.
> You have ada1 ada2 ada3 ada4.
> Someone spills coffee on ada2; it fries and cannot be detected anymore,
> and you reboot.
> Now you have ada1 ada2 ada3.
> Then things are usually still fine (even though ada3 is now ada2 and
> ada4 is now ada3, because there is some zfs superblock stuff to keep
> track of things), but if you also had an ada5 that was not part of the
> pool, or was a spare or a log or something other than another disk in
> the same vdev as ada1, etc., bad things happen when it becomes ada4.
> Unfortunately, I don't know exactly what people do to cause the "bad
> things" that happen. When this happened to me, it just said my pool was
> faulted or degraded or something, and set a disk or two to UNAVAIL or
> FAULTED. I don't remember it automatically resilvering them, but when I
> read about these problems, I think it seems like some disks were
> resilvered afterwards.

The recovery from partial pool failures and the collection of drives to
form a pool have been modified several times in the last two years and
should be quite robust by now. One thing to avoid: after copying a pool
to new disk drives (I went from 3*1TB to 4*2TB), do not later connect a
drive from the original pool that still has its ZFS metadata intact at
the end of the drive (I had cleared the first 1MB, but not the last
1MB). This causes confusion if the name of the pool has not changed.
But other than that, I do not see much risk in ZFS pools built from
/dev nodes.

> And last thing I can think of is to make sure your partitions are
> aligned, and identical. Show us output from:
> gpart show

They have all been created by a script that takes the device node name
as parameter and thus are identical.

=>        34  3907029101  ada0  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada1  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada2  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada3  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656        1792        - free -  (896k)
  3565160448   341868544     3  freebsd-swap  (163G)
  3907028992         143        - free -  (71k)
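
Regarding the alignment question: the listing above can also be checked
mechanically. The following Python sketch parses "gpart show" text and
reports partitions whose start offset is not a multiple of 4 KiB. The 4 KiB
figure is an assumption based on the "(AFT)" in the model family (the drives
themselves report 512 bytes logical/physical), and parse_gpart and
misaligned are hypothetical helper names. Only the ada0 listing is embedded
as sample input.

```python
SECTOR = 512   # logical sector size as reported by smartctl
ALIGN = 4096   # ASSUMED physical sector size of an Advanced Format drive


def parse_gpart(text):
    """Yield (disk, index, start_sector, size_sectors, type) per partition."""
    disk = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        if fields[0] == "=>":
            disk = fields[3]              # header row, e.g. "ada0"
        elif fields[2].isdigit():         # partition rows have a numeric index
            yield disk, int(fields[2]), int(fields[0]), int(fields[1]), fields[3]
        # "- free -" rows have "-" in the index column and are skipped


def misaligned(text):
    """Return (disk, index) for partitions not starting on an ALIGN boundary."""
    return [(d, i) for d, i, start, _size, _type in parse_gpart(text)
            if (start * SECTOR) % ALIGN != 0]


sample = """\
=>        34  3907029101  ada0  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)
"""

print(misaligned(sample))
```

An empty list means every partition start is a multiple of 8 sectors,
which is the case for the start offsets 64, 256 and 3565158656 shown here.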


There is an unused 10% at the end of each device, and I have recently
made ada3p3 a swap device, just to be able to collect kernel dumps (no
swap is actually used; this is an 8GB RAM machine with 6GB assigned to
ARC and mostly low load).

Best regards, Stefan
Received on Mon Dec 19 2011 - 18:09:03 UTC
