Re: 8.0-RELEASE: disk IO temporarily hangs up (ZFS or ATA related problem)

From: Steven Hartland <killing_at_multiplay.co.uk>
Date: Thu, 17 Dec 2009 13:42:24 -0000
We see a very similar pattern on straight UFS disks on an Areca controller. The
machine's console becomes totally unresponsive for 30 or so seconds at a time,
so it may be something more integral to device operation, the buffer cache or
something?

The IO load on the machine is generated by mrtg updating rrd files for a few
hundred switch ports and a few thousand servers.
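
The da0 trace below is iostat -x output for the relevant devices; something
along these lines reproduces it (the exact flags here are approximate):

  iostat -x -w 1 da0 pass0 pass1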


                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0   0.0     0.0     0.0 1584   0.0   0 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0 342.6     0.0  5577.1 1241 7325.5 241 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0   0.0     0.0     0.0 1241   0.0   0 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0   0.0     0.0     0.0 1241   0.0   0 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0 930.8     0.0 15165.2  309 10483.0 322 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0 267.7     0.0  4330.7   41 10952.8  41 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0 232.7     0.0  3739.4    0 2207.4 158 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0   0.0     0.0     0.0    0   0.0   0 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
                        extended device statistics  
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
da0        0.0   0.0     0.0     0.0    0   0.0   0 
pass0      0.0   0.0     0.0     0.0    0   0.0   0 
pass1      0.0   0.0     0.0     0.0    0   0.0   0 
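
If anyone wants to catch these stalls automatically, a one-liner along these
lines flags the samples where requests sit queued (wait) while nothing
completes; the device name obviously needs adjusting:

  iostat -x -w 1 da0 | awk '$1 == "da0" && $2 + $3 == 0 && $6 > 0'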

----- Original Message ----- 
From: "Alexander Zagrebin" <alexz_at_visp.ru>
To: <freebsd-current_at_freebsd.org>
Sent: Thursday, December 17, 2009 9:00 AM
Subject: 8.0-RELEASE: disk IO temporarily hangs up (ZFS or ATA related problem)


> Hi!
> 
> I use onboard ICH7 SATA controller with two disks attached:
> 
> atapci1: <Intel ICH7 SATA300 controller> port
> 0x30c8-0x30cf,0x30ec-0x30ef,0x30c0-0x30c7,0x30e8-0x30eb,0x30a0-0x30af irq 19
> at device 31.2 on pci0
> atapci1: [ITHREAD]
> ata2: <ATA channel 0> on atapci1
> ata2: [ITHREAD]
> ata3: <ATA channel 1> on atapci1
> ata3: [ITHREAD]
> ad4: 1430799MB <Seagate ST31500541AS CC34> at ata2-master SATA150
> ad6: 1430799MB <WDC WD15EADS-00P8B0 01.00A01> at ata3-master SATA150
> 
> The disks are used for a mirrored ZFS pool.
> I have noticed that the system periodically locks up on disk operations.
> After approx. 10 min of very slow disk i/o (several KB/s) the speed of disk
> operations returns to normal.
> gstat has shown that the problem is with ad6.
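> A per-vdev view is also available from ZFS itself; something like the
> following (the pool name is just a placeholder) breaks read/write operations
> and bandwidth out per disk:
> 
>   zpool iostat -v <pool> 1
> 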
> For example, here is filtered output of iostat -x 1:
> 
>                        extended device statistics
> device     r/s   w/s    kr/s    kw/s wait svc_t  %b
> ad6      818.6   0.0 10840.2     0.0    0   0.4  34
> ad6      300.6 642.0  3518.5 24830.3   50  24.8  72
> ad6        1.0 639.3    63.7 17118.3    0  62.1  98
> ad6      404.5   4.0  6837.7     4.0    0   0.5  18
> ad6      504.5   0.0 13667.2     0.0    1   0.7  32
> ad6      633.3   0.0 13190.3     0.0    1   0.7  38
> ad6      416.3 384.5  8134.7 24606.2    0  16.3  57
> ad6      538.9  76.7  9772.8  2982.2   55   2.9  40
> ad6       31.9 929.5   801.0 37498.6    0  27.2  82
> ad6      635.5   0.0 13087.1     0.0    1   0.6  35
> ad6      579.6   0.0 16669.8     0.0    0   0.8  43
> ad6      603.6   0.0 11697.4     0.0    1   0.7  40
> ad6      538.0   0.0 10438.7     0.0    0   0.9  47
> ad6       30.9 898.4   868.6 40585.4    0  36.6  78
> ad6      653.3  86.6  8566.6   202.7    1   0.8  40
> ad6      737.1   0.0  6429.4     0.0    1   0.6  42
> ad6      717.1   0.0  3958.7     0.0    0   0.5  36
> ad6     1179.5   0.0  2058.9     0.0    0   0.1  15
> ad6     1191.2   0.0  1079.6     0.0    1   0.1  15
> ad6      985.1   0.0  5093.9     0.0    0   0.2  23
> ad6      761.8   0.0  9801.3     0.0    1   0.4  31
> ad6      698.7   0.0  9215.1     0.0    0   0.4  30
> ad6      434.2 513.9  5903.1 13658.3   48  10.2  55
> ad6        3.0 762.8   191.2 28732.3    0  57.6  99
> ad6       10.0   4.0   163.9     4.0    1   1.6   2
> 
> Up to this line the operation is normal.
> Then the behaviour of ad6 changes (note the high average access times
> and the "busy" percentage significantly greater than 100):
> 
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6        1.0   0.0     0.5     0.0    1 1798.3 179
> ad6        1.0   0.0     1.5     0.0    1 1775.4 177
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6       10.0   0.0    75.2     0.0    1 180.3 180
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6       83.7   0.0   862.9     0.0    1  21.4 179
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6        1.0   0.0    63.7     0.0    1 1707.4 170
> ad6        1.0   0.0     9.0     0.0    0 1791.0 178
> ad6       10.9   0.0   172.2     0.0    2   0.2   0
> ad6       24.9   0.0   553.7     0.0    1 143.3 179
> ad6        0.0   0.0     0.0     0.0    7   0.0   0
> ad6        2.0  23.9    32.4  1529.9    1 336.3 177
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6       68.7   0.0  1322.8     0.0    1  26.3 181
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6       27.9   0.0   193.7     0.0    1  61.6 172
> ad6        1.0   0.0     2.5     0.0    1 1777.4 177
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6        1.0   0.0     2.0     0.0    1 1786.9 178
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6        2.0   0.0     6.5     0.0    1 899.4 179
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> ad6        1.0   0.0     2.0     0.0    1 1786.7 178
> ad6        0.0   0.0     0.0     0.0    1   0.0   0
> 
> And so on for about 10 minutes.
> Then the disk i/o returns to normal:
> 
> ad6      139.4   0.0  8860.5     0.0    1   4.4  61
> ad6      167.3   0.0 10528.7     0.0    1   3.3  55
> ad6       60.8 411.5  3707.6  8574.8    1  19.6  87
> ad6      163.4   0.0 10334.9     0.0    1   4.4  72
> ad6      157.4   0.0  9770.7     0.0    1   5.0  78
> ad6      108.5   0.0  6886.8     0.0    0   3.9  43
> ad6      101.6   0.0  6381.4     0.0    0   2.6  27
> ad6      109.6   0.0  7013.9     0.0    0   2.0  22
> ad6      121.4   0.0  7769.7     0.0    0   2.4  29
> ad6       92.5   0.0  5922.6     0.0    1   3.4  31
> ad6      122.4  19.9  7833.0  1273.7    0   3.9  54
> ad6       83.6   0.0  5349.5     0.0    0   3.9  33
> ad6        5.0   0.0   318.4     0.0    0   8.1   4
> 
> There are no ATA error messages either in the system log or on the
> console.
> ad6 passes the manufacturer's diagnostic test without any errors.
> ad6 also contains a swap partition.
> I have tried running several (10..20) instances of dd reading and writing
> data to and from the swap partition simultaneously, but that has not
> triggered the lockup.
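> Roughly, the test looked like this (the partition name, sizes and offsets
> are placeholders, and swap has to be disabled on that partition first,
> since the writers overwrite it):
> 
>   # 10 concurrent reader/writer pairs against the (disabled) swap partition
>   swapoff /dev/ad6s1b
>   i=0
>   while [ $i -lt 10 ]; do
>       dd if=/dev/ad6s1b of=/dev/null bs=1m count=1024 &
>       dd if=/dev/zero of=/dev/ad6s1b bs=1m count=1024 oseek=$((i * 1024)) &
>       i=$((i + 1))
>   done
>   wait
> 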
> So it is quite possible that this problem is ZFS related.
> 
> I have been forced to switch ad6 to the offline state... :(
> 
> Any suggestions on this problem?
> 
> -- 
> Alexander Zagrebin
> 
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
>
