Re: ad WRITE_DMA timing out frequently

From: Reid Linnemann <lreid_at_cs.okstate.edu>
Date: Fri, 18 Feb 2005 13:09:52 -0600 (CST)

On 2/18/2005, "Paul Mather" <paul_at_gromit.dlib.vt.edu> wrote:

>On Fri, 18 Feb 2005 09:03:35 -0600 (CST), "Reid Linnemann"
><lreid_at_cs.okstate.edu> wrote:
>
>> I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It
>> usually just sits in the corner and runs services, but lately I've come
>> home from work or woken up to find that it is completely unresponsive,
>> and I have to hard reset the machine. It happens at least once a day,
>> and it's becoming more and more frequent. When I look at the console, I
>> always have the same 4 messages before the failure:
>>
>> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>> kernel: ad0: FAILURE - WRITE_DMA timed out
>> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5
>>
>> It seems to me that a sector on the disk might be dead in the ad0s1d
>> partition (/var), but I want to be certain before I take further steps
>> that the behavior I'm experiencing is positively unrelated to the
>> migration to 6-CURRENT.
>>
>> I started poking around /var to see if anything was amiss, and I found
>> that mail messages are being stacked up in /var/spool/clientmqueue, even
>> though nothing should be using the msp queue (I've redirected periodic
>> outputs to logfiles).  In the last daily run mailed to root in January,
>> I found records in the submit queue that looked like this:
>>
>> j0EDINHh049826     2489 Fri Jan 14 07:18 MAILER-DAEMON
>>                  (Deferred: Permission denied)
>>
>> There were nearly 500 of them.
>>
>> Even after redirecting periodic output to logs and clearing out the
>> client mail queue, this continues to happen, and I have a hunch that it
>> may be related to the WRITE_DMA timeouts, as it's the only weird
>> behavior I can see on /var. If anyone can help me shed some light on
>> this, I'd appreciate it. I've had 2 IDE drives die in this machine
>> already, and I'm going to be severely depressed if I've killed a third.
>
>The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me
>since somewhere in the 5.2.1--5.3 release range.  (It's been so long now
>that I don't remember whether it first started plaguing me in 5.2.1 or
>5.3.  I do know for definite I never got this problem in 5.1 and it only
>crept in during an "upgrade.")
>
>Like you, this has been happening more frequently with 6-CURRENT for me.
>As in your case, I come to find the machine completely unresponsive
>(though still pingable) and I have to hard reset the machine.  I'm
>finding this is now happening roughly every other day on average for the
>past week since my last system rebuild (FreeBSD 6.0-CURRENT #0: Fri Feb
>11 09:03:49 EST 2005).
>
>In my case, I'm using geom_mirror to mirror two drives.  The "TIMEOUT -
>WRITE_DMA" involves the geom_mirror metadata sector on one of the two
>drives, but not always the same one (sometimes it is ad0, sometimes
>ad2).  The net result is to cause the drive in question to be removed
>from the mirror.  Disappointingly, rather than carry on in degraded
>fashion, lately the system seems eventually to seize up as you describe.
>It doesn't seem to seize up immediately, because I notice an entry
>in /var/log/messages after the error but before the required hard reset
>reboot:
>
>Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
>Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
>Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out
>Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Cannot update metadata on disk ad2 (error=5).
>Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Device raid1: provider ad2 disconnected.
>Feb 18 09:46:35 zappa named[349]: zone ./IN: Transfer started.
>Feb 18 09:46:35 zappa named[349]: transfer of './IN' from 128.9.0.107#53: connected using 192.168.1.25#64153
>Feb 18 09:46:37 zappa named[349]: zone ./IN: transferred serial 2005021800
>Feb 18 09:46:37 zappa named[349]: transfer of './IN' from 128.9.0.107#53: end of transfer
>[[forced reboot]]
>Feb 18 11:48:46 zappa syslogd: kernel boot file is /boot/kernel/kernel
>Feb 18 11:48:46 zappa kernel: Copyright (c) 1992-2005 The FreeBSD Project.
>Feb 18 11:48:46 zappa kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>Feb 18 11:48:46 zappa kernel: The Regents of the University of California. All rights reserved.
>Feb 18 11:48:46 zappa kernel: FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005
>
>
>I get this problem on 6-CURRENT and also RELENG_5.  The RELENG_5 system
>has a geom_vinum mirrored setup, and when the "TIMEOUT - WRITE_DMA"
>occurs I lose the associated drive and plexes from the configuration.
>The problem does not happen as often on the RELENG_5 system as it does
>on HEAD, at least not nowadays.
>
>I run smartctl on the systems, and none of the drives report any errors,
>and the "WORST" values recorded are nowhere near close to their
>respective failure thresholds.
>
>In my case, I have one area of commonality.  Between the three different
>systems on which I've experienced this problem, all use the Intel PIIX4
>ATA controller and the same IBM-DJNA-352500/J51OA30K hard drives.  So,
>I'm wondering if there is something about this particular combination
>that gives rise to this annoying problem?
>
>I do use the same IBM-DJNA-352500/J51OA30K hard drives in another system
>and have never experienced this (or any other) problem.  However, it is
>running 4.11-STABLE and has a "VIA 82C686 ATA66 controller", so it's
>impossible to tell if it's 4.11-STABLE or the VIA ATA controller
>contributing to the stability in that case.
>
>I don't think I have a hardware problem.  The same setup ran fine under
>earlier 5.x releases.  But, somewhere, this issue crept in (I remember
>threads on freebsd-current about it), and recently it appears to be
>getting worse (at least for me).  Also, unfortunately for me,
>geom_mirror used to roll with the punches when I lost a drive through
>this "TIMEOUT - WRITE_DMA" problem, but now it doesn't. :-(
>
>Cheers,
>
>Paul.
>--
>e-mail: paul_at_gromit.dlib.vt.edu
>
>"Without music to decorate it, time is just a bunch of boring production
> deadlines or dates by which bills must be paid."
>        --- Frank Vincent Zappa
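
One quick way to tell a single bad sector apart from a controller or
driver problem is to tally the timeout messages quoted above by device
and LBA. What follows is only a minimal Python sketch. The regular
expression is an assumption based solely on the message format seen in
this thread, and tally_timeouts is a made-up helper name, not part of
any FreeBSD tool.

```python
import re
from collections import Counter

# Assumed format, taken only from the messages quoted in this thread:
#   adN: TIMEOUT - <COMMAND> retrying (<n> retries left) LBA=<number>
# An optional syslog prefix ("Feb 18 05:24:38 zappa kernel: ") is tolerated
# because search() does not anchor at the start of the line.
TIMEOUT_RE = re.compile(
    r"(?P<dev>ad\d+): TIMEOUT - (?P<cmd>\w+) retrying "
    r"\(\d+ retr\w+ left\) LBA=(?P<lba>\d+)"
)


def tally_timeouts(lines):
    """Count timeout messages per (device, command, LBA). Hypothetical helper."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m["dev"], m["cmd"], int(m["lba"]))] += 1
    return counts


# Lines copied from the two reports quoted in this thread.
sample = [
    "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599",
    "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599",
    "Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679",
    "Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679",
    "Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out",
]

counts = tally_timeouts(sample)
for (dev, cmd, lba), n in sorted(counts.items()):
    print(f"{dev} {cmd} LBA={lba}: {n} timeout(s)")
```

On a live system the same function could be fed the log directly, for
example tally_timeouts(open("/var/log/messages")). One LBA repeating on
one drive would hint at that sector; timeouts spread across drives, as
in the ad0/ad2 case above, would hint at something other than a single
failing disk.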

The disk I am using is an IBM as well:

smartctl -a output yields this info on the device:

Device Model:     IBM-DPTA-372050
Serial Number:    JMYJM131600
Firmware Version: P76OA30A

My ATA controller is a VIA 82C686A as well.  I have been running
everything from FreeBSD 4.3 up to 6-CURRENT with this controller without
issue until now. So I think we can assume that the problem was
introduced in 5 and
carried on through 6. I think I recall bumping into this with a Western
Digital 10 gig disk a while back on 5.3-STABLE. I was under school
pressure then and just dropped the drive out completely when I started
getting a hung system and ad0 messages. I'll plug it back in as a slave
this weekend and run smartctl on it.
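
For reference, the numbers in the console messages from my first post
can be cross-checked against each other. The sketch below is rough
Python arithmetic, not anything the kernel prints. It assumes 512-byte
sectors, and it assumes the LBA in the TIMEOUT line falls somewhere
inside the 16384-byte write that g_vfs_done() reports, which I have not
confirmed. The function name implied_partition_start is made up for
illustration.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors

# Values copied from the console messages in this thread.
OFFSET_BYTES = 52772864   # g_vfs_done(): ad0s1d[WRITE(offset=..., length=...)]
LENGTH_BYTES = 16384
FAILING_LBA = 2085599     # ad0: TIMEOUT - WRITE_DMA ... LBA=2085599


def implied_partition_start(offset, length, lba, sector_size=SECTOR_SIZE):
    """Return (first, last, lo, hi): the partition-relative sector range of
    the write, and the range of absolute start sectors for the partition
    that would be consistent with `lba` lying inside that write.
    Hypothetical helper for illustration only."""
    first, remainder = divmod(offset, sector_size)
    if remainder:
        raise ValueError("offset is not sector-aligned")
    count = length // sector_size
    last = first + count - 1
    # If the failing LBA is the first sector of the write, the partition
    # starts at lba - first; if it is the last, at lba - last.
    return first, last, lba - last, lba - first


first, last, lo, hi = implied_partition_start(
    OFFSET_BYTES, LENGTH_BYTES, FAILING_LBA
)
print(f"write covers partition sectors {first}..{last}")
print(f"ad0s1d would start between absolute sectors {lo} and {hi}")
```

If the start of ad0s1d according to the disk label (slice offset plus
partition offset) lands in that range, the timeout and the g_vfs_done()
error describe the same write.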