ad WRITE_DMA timing out frequently

From: Reid Linnemann <lreid_at_cs.okstate.edu> Date: Fri, 18 Feb 2005 09:03:35 -0600 (CST)

I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It
usually just sits in the corner and runs services, but lately I've come
home from work or woken up to find that it is completely unresponsive,
and I have to hard reset the machine. It happens at least once a day,
and it's becoming more and more frequent. When I look at the console, I
always have the same 4 messages before the failure:

ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
kernel: ad0: FAILURE - WRITE_DMA timed out
kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5

It seems to me that a sector on the disk might be dead in the ad0s1d
partition (/var), but I want to be certain before I take further steps that
the behavior I'm experiencing is positively unrelated to the migration
to 6-CURRENT.
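
As a sanity check on the numbers in the console messages above, here is a
minimal sketch that converts the g_vfs_done() byte offset into sectors and
relates it to the LBA from the timeout line. It assumes 512-byte sectors and
that the reported LBA is the first sector of the failed write; neither is
confirmed by the messages themselves, and the variable names are only for
illustration.

```python
import re

SECTOR_SIZE = 512  # assumption: 512-byte sectors, typical for IDE drives of this era

# Two of the console lines quoted above, copied verbatim.
timeout_line = "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599"
vfs_line = "kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5"

# Pull the numeric fields out of the messages.
lba = int(re.search(r"LBA=(\d+)", timeout_line).group(1))
m = re.search(r"offset=(\d+), length=(\d+)", vfs_line)
offset_bytes, length_bytes = int(m.group(1)), int(m.group(2))

# Byte offset and length of the failed write, expressed in sectors.
offset_sectors, offset_rem = divmod(offset_bytes, SECTOR_SIZE)
length_sectors = length_bytes // SECTOR_SIZE

# Assumption: the reported LBA is the first sector of the failed request.
# Under that assumption, ad0s1d would have to begin at this absolute LBA.
implied_partition_start = lba - offset_sectors

print(f"write starts {offset_sectors} sectors into ad0s1d (remainder {offset_rem})")
print(f"write spans {length_sectors} sectors")
print(f"implied start of ad0s1d: LBA {implied_partition_start}")
```

If the implied start agrees with the slice and partition offsets the system
reports for ad0, the two messages describe the same write, and the repeated
LBA points at one fixed spot on the disk rather than at scattered failures.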

I started poking around /var to see if anything was amiss, and I found
that mail messages are being stacked up in /var/spool/clientmqueue, even
though nothing should be using the msp queue (I've redirected periodic
outputs to logfiles).  In the last daily run mailed to root in January,
I found records in the submit queue that looked like this:

j0EDINHh049826     2489 Fri Jan 14 07:18 MAILER-DAEMON
                 (Deferred: Permission denied)

There were nearly 500 of them.

Even after redirecting periodic output to logs and clearing out the
client mail queue, this continues to happen, and I have a hunch that it
may be related to the WRITE_DMA timeouts, as it's the only weird
behavior I can see on /var. If anyone can help me shed some light on
this, I'd appreciate it. I've had 2 IDE drives die in this machine
already, and I'm going to be severely depressed if I've killed a third.

-Reid