Re: memory corruption/panic solved ("FAILURE - ATAPI_IDENTIFY no interrupt")

From: Scott Long <scottl_at_samsco.org>
Date: Sun, 01 Aug 2004 22:04:05 -0600
Brian Fundakowski Feldman wrote:
> On Sun, Aug 01, 2004 at 05:05:19PM -0700, Nate Lawson wrote:
> 
>>Brian Fundakowski Feldman wrote:
>>
>>>On Fri, Jul 30, 2004 at 03:48:52PM -0700, Nate Lawson wrote:
>>>
>>>>I've tracked down the source of the memory corruption in -current that
>>>>results when booting with various CD and DVD drives (especially the ones
>>>>that come with Thinkpads including T23, R32, T41, etc.)  The panic is
>>>>obvious when running with INVARIANTS ("memory modified after free") but
>>>>not so obvious in other configurations.  For instance, without
>>>>INVARIANTS, part of the rt_info structure is corrupted on my wireless
>>>>card, resulting in a panic during ifconfig on boot.  This is likely the
>>>>source of other problems, including phk's ACPI panic (again, only
>>>>triggered when booting with the CD drive in the bay.)
>>>>
>>>>The root problem is that ata_timeout() fires and calls ata_pio_read()
>>>>which overwrites 512 bytes random memory.  There are actually two bugs 
>>>>here that overwrite memory.  The code path is as follows:
>>>
>>>Good job identifying it more exactly.  I decided it should just 
>>>fundamentally
>>>be using GEOM primitives everywhere to move the solutions to all these
>>>side cases into where they're already handled generically... still think
>>>that's probably the right solution, but I'm glad to see this specific
>>>problem fixed.
>>
>>I'm not sure if this is a troll or not but I'll answer it seriously.
>>GEOM and other upper layers are never the right place to handle error
>>recovery for transactions initiated at the lower layers (like this 
>>device scan).
>>
>>In every system I've seen, error recovery is the hardest part of storage
>>code to get right and is seldom well-tested.  It's a very difficult
>>problem that involves a lot of careful fault injection/testing.
>>Divergence in hardware fault handling behavior only complicates things.
> 
> 
> What would make it a troll?  If GEOM were used so that all transactions
> were centrallized, and there were one timeout mechanism used to run the
> request queues for ATA, it wouldn't be racing and crashing when a device
> reset occurs (and it would be a net reduction in code).
> 

Nate is absolutely correct.  Error recovery is something that belongs in
the device drivers, not in the common block layer.  Also, your proposal
is entirely ATA centric and ignores the myriad of other block drivers in
the tree, which again all have different recovery requirements.  GEOM
doesn't need to know the gory details of command timeouts, retries, etc.
That's why we have an ATA framework and a SCSI framework.  In fact if
you look at SCSI, CAM doesn't even track command timeouts because the
actual hardware driver still has better access to dealing with those
kinds of problems than the upper layer.

Error recovery belongs where it is.  If the ATA driver has bugs, then
the ATA driver needs to be dealt with.

Scott
Received on Mon Aug 02 2004 - 02:06:22 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:04 UTC