Re: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=207594611

From: Scott Long <scottl_at_samsco.org> Date: Thu, 16 Sep 2004 14:17:56 -0600

Kevin Oberman wrote:
>>Date: Wed, 15 Sep 2004 15:05:34 -0600
>>From: Scott Long <scottl_at_samsco.org>
>>Sender: owner-freebsd-current_at_freebsd.org
>>
>>Søren Schmidt wrote:
>>
>>>Mike Jakubik wrote:
>>>
>>>
>>>>Søren Schmidt said:
>>>>
>>>>
>>>>
>>>>>You are having massive ICRC problems which are different and most likely
>>>>>due to bad cables/connectors or cables that are turned around (blue
>>>>>connector at controller, black/grey at devices), or it can be a
>>>>>weak/overloaded PSU.
>>>>>
>>>>
>>>>This is a different error message from what everyone else, including me,
>>>>is reporting. What about the errors we are getting?
>>>
>>>
>>>I have no idea, I can't reproduce the problem at all. However I suspect 
>>>something else is blocking interrupt delivery but it's just a hunch...
>>>
>>>-Søren
>>>
>>
>>I'm finding it hard to imagine a scenario where a timeout could fire but 
>>not a hardware interrupt.  Nothing usually shares the interrupt vectors
>>with ATA, so it's pretty unlikely that the ata ithread is being blocked
>>by anything but itself.
> 
> 
> This sounds reasonable, but I can make the problem start/stop by
> starting/stopping the network card. No problems in single-user. Then I
> 'ifconfig xl0 192.116.1.1' and immediately start getting the errors. I
> also get watchdog timeouts on xl0. 'ifconfig xl0 down' stops the errors.
> xl0 is on IRQ10, ata1 is on IRQ15. I have a K6 processor in an ASUS P5A
> with neither SMP nor APIC. (I am running ACPI, not that there is much to
> it on this system.)
> 
> While I don't entirely discount the possibility that this is in ata, it
> seems odd that I get no errors even doing a buildworld as long as the
> network is off. 
> 
> This started pretty recently, but changes have been made in the period
> of suspicion to the scheduler, ACPI, and ata, so it's still fuzzy. My
> system gets the errors consistently enough that I will try to narrow
> down what patch caused the problem. (Wish it was a bit faster to build
> kernels, though!) I have a feeling in the pit of my stomach that it's
> going to show up with a scheduler patch MT5, but I hope I'm wrong! I
> think I'd prefer an ATA problem to a scheduler issue. (Of course, Søren
> probably has a differing opinion on this.)

ATA commands are either completed in the bio_taskqueue or in a normal
taskqueue.  The bio_taskqueue runs in the g_up kthread while normal
taskqueues run in an swi kthread that multiplexes all of the registered
tasks.  Network drivers that are registered with IFF_NEEDSGIANT use a
taskqueue to help decouple the locking, and it could be that they are
stalling other tasks from running.  This doesn't seem to be the case
with xl(4).  However, the normal path for completing commands is with
the bio_taskqueue which should have no interaction at all with the
network side.  So either something else in the network stack is using a
taskqueue and using it pretty inefficiently, or preemption in general is
causing g_up to not run as often as it should.
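
To make the stalling scenario concrete, here is a toy model of a single
worker draining a FIFO of tasks.  It is only a sketch and not the real
taskqueue(9) code: queued_start_tick() and the tick costs are hypothetical
names and numbers made up for illustration.

```c
#include <stddef.h>

/*
 * Toy model, NOT FreeBSD's taskqueue(9): one worker thread drains a FIFO
 * of tasks.  costs[i] is how many ticks task i occupies the worker.
 * Returns the tick at which task idx gets to start running.
 */
int queued_start_tick(const int *costs, size_t idx)
{
    int tick = 0;

    for (size_t i = 0; i < idx; i++)
        tick += costs[i];   /* every earlier task must finish first */
    return tick;
}
```

With costs = { 40, 1 }, the cheap second task cannot start before tick 40,
which is the kind of delay a slow task sharing the queue would impose on a
command completion queued behind it.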

I think that the untimeout of each command should be done in the
interrupt handler and not in the taskqueue/bio_taskqueue.  Tasks don't
get lost out of either and will eventually run (or you'll have much
bigger problems if they don't) no matter what, so it's misleading to
say that a command timed out when really the hardware responded but the
system didn't get to the taskqueue fast enough.  In general I don't like
taskqueues anyways because they are non-deterministic; they really are
not good for anything that is time-critical.  Once we move to a scheme
where each device instance has its own ithread (i.e. no more sharing),
there won't be a need for taskqueues except to handle unusual/expensive
and non-time-critical events.
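
The difference between cancelling the timeout in the interrupt handler and
cancelling it in the deferred task can be sketched with a small model.
This is an assumption-laden illustration, not the ata(4) driver: struct
sim_cmd, enum cancel_point and timeout_reported() are hypothetical names,
and time is counted in abstract ticks.

```c
#include <stdbool.h>

/* Hypothetical model of one command; all times are abstract ticks. */
struct sim_cmd {
    int timeout_tick;   /* when the armed timeout would fire */
    int irq_tick;       /* when the hardware interrupt arrives */
    int task_tick;      /* when the deferred completion task finally runs */
};

/* Where the driver cancels (untimeouts) the pending timeout. */
enum cancel_point { CANCEL_IN_TASK, CANCEL_IN_IRQ };

/*
 * Returns true if the timeout fires before it is cancelled, i.e. if a
 * "TIMEOUT" would be reported for this command.
 */
bool timeout_reported(const struct sim_cmd *c, enum cancel_point where)
{
    int cancel_tick = (where == CANCEL_IN_IRQ) ? c->irq_tick : c->task_tick;

    return c->timeout_tick < cancel_tick;
}
```

For a command with timeout_tick = 50, irq_tick = 5 and task_tick = 80, the
hardware answered promptly but the task ran late: cancelling in the task
reports a timeout, while cancelling in the interrupt handler does not.  If
the interrupt itself never arrives before tick 50, both variants report the
timeout, which is the case the message is meant for.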

Scott