On 18/12/2009, at 9:39 PM, Phil Murray wrote: > > > On 18/12/2009, at 9:15 PM, "Alexander Zagrebin" <alexz_at_visp.ru> wrote: > >> Big thanks for your reply! >> >>>> I use onboard ICH7 SATA controller with two disks attached: >>>> >>>> atapci1:<Intel ICH7 SATA300 controller> port >>>> >>> 0x30c8-0x30cf,0x30ec-0x30ef,0x30c0-0x30c7,0x30e8-0x30eb,0x30a0 >>> -0x30af irq 19 >>>> at device 31.2 on pci0 >>>> atapci1: [ITHREAD] >>>> ata2:<ATA channel 0> on atapci1 >>>> ata2: [ITHREAD] >>>> ata3:<ATA channel 1> on atapci1 >>>> ata3: [ITHREAD] >>>> ad4: 1430799MB<Seagate ST31500541AS CC34> at ata2-master SATA150 >>>> ad6: 1430799MB<WDC WD15EADS-00P8B0 01.00A01> at ata3-master SATA150 >>>> >>>> The disks are used for mirrored ZFS pool. >>>> I have noticed that the system periodically locks up on >>> disk operations. >>>> After approx. 10 min of very slow disk i/o (several KB/s) >>> the speed of disk >>>> operations restores to normal. >>>> gstat has shown that the problem is in ad6. >>>> For example, there is a filtered output of iostat -x 1: >>>> >>>> extended device statistics >>>> device r/s w/s kr/s kw/s wait svc_t %b >>>> ad6 985.1 0.0 5093.9 0.0 0 0.2 23 >>>> ad6 761.8 0.0 9801.3 0.0 1 0.4 31 >>>> ad6 698.7 0.0 9215.1 0.0 0 0.4 30 >>>> ad6 434.2 513.9 5903.1 13658.3 48 10.2 55 >>>> ad6 3.0 762.8 191.2 28732.3 0 57.6 99 >>>> ad6 10.0 4.0 163.9 4.0 1 1.6 2 >>>> >>>> Before this line we have a normal operations. >>>> Then the behaviour of ad6 changes (pay attention to high >>> average access time >>>> and percent of "busy" significantly greater than 100): >>>> >>>> ad6 0.0 0.0 0.0 0.0 1 0.0 0 >>>> ad6 1.0 0.0 0.5 0.0 1 1798.3 179 >>>> ad6 1.0 0.0 1.5 0.0 1 1775.4 177 >>>> ad6 0.0 0.0 0.0 0.0 1 0.0 0 >>>> ad6 10.0 0.0 75.2 0.0 1 180.3 180 >>>> ad6 0.0 0.0 0.0 0.0 1 0.0 0 >>>> ad6 1.0 0.0 2.0 0.0 1 1786.7 178 >>>> ad6 0.0 0.0 0.0 0.0 1 0.0 0 >>>> >>>> And so on for about 10 minutes. >>>> Then the disk i/o is reverted to normal: >>>> >>>> ad6 139.4 0.0 8860.5 0.0 1 4.4 61 >>>> ad6 167.3 0.0 10528.7 0.0 1 3.3 55 >>>> ad6 60.8 411.5 3707.6 8574.8 1 19.6 87 >>>> ad6 163.4 0.0 10334.9 0.0 1 4.4 72 >>>> ad6 157.4 0.0 9770.7 0.0 1 5.0 78 >>>> ad6 108.5 0.0 6886.8 0.0 0 3.9 43 >>>> >>>> There are no ata error messages neither in the system log, >>> nor on the >>>> console. >>>> The manufacture's diagnostic test is passed on ad6 without >>> any errors. >>>> The ad6 also contains swap partition. >>>> I have tried to run several (10..20) instances of dd, which >>> read and write >>>> data >>>> from and to the swap partition simultaneously, but it has >>> not called the >>>> lockup. >>>> So there is a probability that this problem is ZFS related. >>>> >>>> I have been forced to switch ad6 to the offline state... :( >>>> >>>> Any suggestions on this problem? >>>> >>> I also have been experiencing the same problem with a different >>> disk/controller (via mpt on a vmware machine). During the >>> same period I >>> notice that system cpu usage hits 80+% and top shows the >>> zfskern process >>> being the main culprit. At the same time I've discovered the >>> kstat.zfs.misc.arcstats.memory_throttle_count sysctl rising. >>> Arc is also >>> normally close to the arc_max limit. >> >> My case has differences. >> 1. CPU usage is near 0% >> 2. zfs's sysctls doesn't change significantly during >> "normal operation" -> "lockup" -> "normal" transition >> 3. ARC size is far from its limits, >> kstat.zfs.misc.arcstats.memory_throttle_count: 0 >> >> Here my actions, observations and conclusions: >> 1. I have tried to change placements of disks on sata channels. >> Nothing has changed - the problems still on WD15EADS, although it became >> ad4. >> So issue isn't in south bridge, sata cables and so on. >> 2. I have tried to detach ad6 from the pool, to zero system area, and to >> reattach it again. >> Of course, resilvering was started. During resilvering 250 GB was copied >> without lockups >> and delays. While resilvering, I have tried periodically to load drive >> with a read >> operations (dd if=/dev/ad6 of=/dev/null ...). >> But after resilvering and several minutes of normal mirror operation, >> lockups appeared again. >> So drive is seems to be ok and we have a software problem? >> 3. I have noticed that lockups often happens during postgresql activity. >> postgresql often uses sync. So I have tried to disable ZIL. >> No success. >> 4. "IDE LED" is constantly on during lockups. >> So it is really read/write delays. >> 5. I see two variants of zfskern's state: >> a) it is constantly in the vgeom:io >> b) it is in either zio->io_ state (when active), or in tx->tx_s (when >> idle). >> During lockups it is mostly in zio->io_. >> What the difference with vgeom:io and zio->io_/tx->tx_s? >> >> May be a problem is in ata? WD15EADS is a "green" series of drives. > > The WD green drives have a feature called Time Limited Error Recovery where the disk can spend several minutes trying to read a bad block etc. > > It plays havoc with RAID arrays which is why WD recommend you don't use the green drives in arrays. They have more info about the "feature" in the WD FAQ/knowledgebase > Sorry, TLER is the feature that 'fixes' the problem, see: http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1397&p_created=1131638613&p_sid=vfyE1KPj&p_accessibility=0&p_redirect=&p_srch=1&p_lva=&p_sp=cF9zcmNoPTEmcF9zb3J0X2J5PSZwX2dyaWRzb3J0PSZwX3Jvd19jbnQ9MTcsMTcmcF9wcm9kcz0yMjcsMjk0JnBfY2F0cz0mcF9wdj0yLjI5NCZwX2N2PSZwX3BhZ2U9MSZwX3NlYXJjaF90ZXh0PXJhaWQ!&p_li=&p_topview=1 Sounds like your drive is going into the recovery procedure... > >> May be i have a problem with its power management? >> Is there a method to completely reset sata channel and drive? >> atacontrol reinit will do it? > > > >> >> Any help is welcomed. >> >> -- >> Alexander Zagrebin >> >> _______________________________________________ >> freebsd-current_at_freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-current >> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org" > _______________________________________________ > freebsd-current_at_freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"Received on Fri Dec 18 2009 - 08:11:55 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:39:59 UTC