ATA DMA dump failures: take 2

From: Dmitry Pryanishnikov <dmitry_at_atlantis.dp.ua> Date: Sat, 25 Feb 2006 16:44:11 +0200 (EET)

Hello!

   I'm trying to find out why the new ATA DMA dump code in CURRENT fails under
some conditions. My conditions are IMHO very common: I issue

cd /usr/ports/editors/openoffice.org-2.0
NOCLEANDEPENDS=yes make extract clean

(just to create and then delete a LOT of files) on an ASUS M5A notebook
with "only" 256 MB of RAM. This reliably panics my system during the clean
pass, when the softupdates code runs into a shortage of kmem_map:

panic: kmem_malloc(4096): kmem_map too small: 82014208 total allocated

Before trying to understand how to tune my system better (alas, tuning(7)
doesn't mention kmem_map at all) I'm trying to obtain a crash dump, but
I'm just getting the infamous "FAILURE - out of memory in start" error in
ad_strategy. OK, it's very unwise to rely on the availability of kernel
memory in situations like mine. But we can easily guard against it by
preallocating a spare "struct ata_request". I've created a simple patch:

ftp://external.atlantis.dp.ua/FreeBSD/CURRENT/nodump/ata-disk.c.patch

which solves this allocation problem and instruments the code in order to
understand the code flow. Note that it's unclear to me _what_ guarantees
that ad_strategy() will always finish its job, so I've added a check
for BIO_DONE. This check has actually failed once: my system just kept
printing '.'s (the request was never finished).
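
To illustrate the two ideas above (a preallocated spare request and a
bounded wait for completion), here is a small user-space sketch. It is not
the contents of the patch and not real ata(4) code: struct dump_request,
dump_request_get(), dump_request_wait() and the poll callbacks are all
made-up names, and the "device" is simulated by a callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the driver's struct ata_request. */
struct dump_request {
    bool done;      /* set by the (simulated) completion path */
    int  polls;     /* how many times the device was polled */
};

/* Reserved ahead of time, so the dump path never depends on the allocator. */
static struct dump_request spare_request;
static bool spare_in_use;

/* Try the normal allocator first; fall back to the static spare. */
static struct dump_request *
dump_request_get(void *(*alloc)(size_t))
{
    struct dump_request *req = alloc(sizeof(*req));

    if (req == NULL) {
        if (spare_in_use)
            return (NULL);      /* spare already handed out */
        spare_in_use = true;
        req = &spare_request;
    }
    memset(req, 0, sizeof(*req));
    return (req);
}

static void
dump_request_put(struct dump_request *req)
{
    assert(req != NULL);
    if (req == &spare_request)
        spare_in_use = false;
    else
        free(req);
}

/* Simulates an allocator that always fails (e.g. kmem_map exhausted). */
static void *
failing_alloc(size_t size)
{
    (void)size;
    return (NULL);
}

/* Simulated device that never completes the request. */
static void
poll_never(struct dump_request *req)
{
    req->polls++;
}

/* Simulated device that completes the request on the third poll. */
static void
poll_third(struct dump_request *req)
{
    if (++req->polls >= 3)
        req->done = true;
}

/* Poll at most max_polls times; report whether the request finished. */
static bool
dump_request_wait(struct dump_request *req,
    void (*poll)(struct dump_request *), int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        poll(req);
        if (req->done)
            return (true);
    }
    return (false);     /* give up instead of printing '.' forever */
}
```

With failing_alloc() the first dump_request_get() returns the spare and a
second call returns NULL, so the caller still has to cope with a single
in-flight request. The bounded wait turns a request that never completes
into an error return rather than an endless loop.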

   But the most serious problem is that in more than 90% of cases I don't even
reach printf("}"); ! I just get another "panic: double fault" instead.
Look at the pictures DSCN1971-4 in the same folder as the patch. On the 1st
picture you can see that the panic happens during the execution of ad_strategy()
(there is a "{" w/o a matching "}"). On the 2nd you see the start of the 'bt'
output. I've no idea about trap 0x17 - is it a stack overflow or something else?
On the 3rd you can see what the main part of the stack is filled with:
a repetitive sequence of nested

ata_start()
ata_interrupt()
ata_finish()
ata_completed()

The 4th picture shows the point where the initial ad_dump() call takes place. My
theory is that the ata driver tries to finish off all queued I/O requests and
runs out of stack. And the question here is whether the driver should try to
complete those previously queued requests at all: the OS has just crashed, so
the data (and disk block numbers!) in those requests can be invalid.
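
A toy model of that theory, only to show why the nesting depth would grow
with the number of queued requests. This is an assumption about the
mechanism, not a trace of the real driver: toy_start() and toy_completed()
merely mimic the shape of the ata_start() ... ata_completed() chain, and
every name below is invented.

```c
#include <stddef.h>

/* Toy model only; none of these names exist in the ata driver. */
struct toy_queue {
    int pending;    /* requests still queued */
    int depth;      /* current call nesting */
    int max_depth;  /* deepest nesting seen so far */
};

static void toy_start(struct toy_queue *q);

/* Completion handler that immediately starts the next queued request. */
static void
toy_completed(struct toy_queue *q)
{
    toy_start(q);
}

/* Start one request; in polled mode its completion runs synchronously. */
static void
toy_start(struct toy_queue *q)
{
    if (q->pending == 0)
        return;
    q->pending--;
    q->depth++;
    if (q->depth > q->max_depth)
        q->max_depth = q->depth;
    toy_completed(q);   /* nests one level per queued request */
    q->depth--;
}

/* Nested drain: returns the deepest nesting reached for nreq requests. */
static int
toy_drain_nested(int nreq)
{
    struct toy_queue q = { nreq, 0, 0 };

    toy_start(&q);
    return (q.max_depth);
}

/* Loop drain: each completion returns to the loop before the next start. */
static int
toy_drain_loop(int nreq)
{
    struct toy_queue q = { nreq, 0, 0 };

    while (q.pending > 0) {
        q.pending--;
        q.depth++;
        if (q.depth > q.max_depth)
            q.max_depth = q.depth;
        q.depth--;      /* completion does not start the next request */
    }
    return (q.max_depth);
}
```

In this model the nested variant reaches a depth equal to the number of
queued requests, while the loop variant never nests deeper than one level.
Whether the real driver behaves like the nested variant during a dump is
exactly the open question.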

  My main question is whether the dump speed increase is worth the loss of dump
robustness. I think it's not. Alas, this new dump code has already been
committed to RELENG_6, so IMHO we should try to fix this issue before the
upcoming 6.1-RELEASE. Being unable to obtain a crash dump can make a developer's
life really difficult. IMHO we should try to make the new code robust
(so it won't fail in the case of OS resource shortages), but if we fail,
the good old (slow but always working) dump code should be restored.

Sincerely, Dmitry
-- 
Atlantis ISP, System Administrator
e-mail:  dmitry_at_atlantis.dp.ua
nic-hdl: LYNX-RIPE