Re: aac(4) handling of probe when no devices are there

From: Alexander Sack <pisymbol_at_gmail.com> Date: Wed, 16 Dec 2009 12:10:59 -0500

On Tue, Dec 15, 2009 at 4:54 AM, Scott Long <scottl_at_samsco.org> wrote:
> On Dec 14, 2009, at 2:47 PM, Alexander Sack wrote:
>>
>> Hello Again:
>>
>> I guess I have a technical question/concern that I was looking for
>> feedback on.  During the probe sequence, aac(4) conditionally responds
>> to INQUIRY commands depending on target LUN:
>>
>> aac_cam.c/aac_cam_complete():
>>     if (command == INQUIRY) {
>>         if (ccb->ccb_h.status == CAM_REQ_CMP) {
>>             device = ccb->csio.data_ptr[0] & 0x1f;
>>             /*
>>              * We want DASD and PROC devices to only be
>>              * visible through the pass device.
>>              */
>>             if ((device == T_DIRECT) ||
>>                 (device == T_PROCESSOR) ||
>>                 (sc->flags & AAC_FLAGS_CAM_PASSONLY))
>>                 ccb->csio.data_ptr[0] =
>>                     ((device & 0xe0) | T_NODEVICE);
>>         } else if (ccb->ccb_h.status == CAM_SEL_TIMEOUT &&
>>             ccb->ccb_h.target_lun != 0) {
>>             /* fix for INQUIRYs on Lun>0 */
>>             ccb->ccb_h.status = CAM_DEV_NOT_THERE;
>>         }
>>     }
>>
>> Why is CAM_DEV_NOT_THERE skipped on LUN 0?
>
> In the parallel scsi world, a selection timeout means that all LUNs within
> the entire target  do not (or no longer) exist.  So returning
> CAM_SEL_TIMEOUT for LUN 1 would tell CAM to invalidate LUN 0 as well.
>
> If you look higher up in this function, you'll see a note about the
> error/status codes from the AAC firmware coincidentally matching CAM's
> status codes.  My guess is that somewhere along the line, someone at Adaptec
> stopped reading the SCSI spec and started returning CAM_SEL_TIMEOUT for
> LUNs greater than 0, which is why this work-around is now in the driver.

Interesting.  Learn something every day.  I did not know that a
selection timeout on a non-zero LUN meant no other LUN was available.
As a colleague noted, "Has Adaptec ever read the SCSI spec?"  Just
kidding (somewhat)....
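
To check my own understanding of the work-around, here is a minimal
standalone sketch of the remap as I read it.  It is not driver code: the
enum and the function name are made up for illustration, and the values
are placeholders rather than the real CAM status values.

```c
/* Stand-in status codes; the names echo CAM's, the values are placeholders. */
enum sk_status {
    SK_REQ_CMP,
    SK_SEL_TIMEOUT,
    SK_DEV_NOT_THERE
};

/*
 * Model of the LUN > 0 fixup: a selection timeout reported for a
 * non-zero LUN is rewritten to "device not there", so that it describes
 * only that LUN instead of the whole target.  Every other combination
 * is passed through unchanged.
 */
static enum sk_status
sk_fixup_inquiry_status(enum sk_status status, unsigned int lun)
{
    if (status == SK_SEL_TIMEOUT && lun != 0)
        return (SK_DEV_NOT_THERE);
    return (status);
}
```

With this model, a timeout on LUN 0 still comes back as a selection
timeout (the whole target is absent), while a timeout on LUN 1 comes back
as "device not there" and leaves LUN 0 alone.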

>>  This is true on my target
>> 6.1-amd64 machine as well as CURRENT.  The reason I ask is that,
>> now that aac(4) is scanned sequentially, a lot of CAM interrupts
>> come in on my 6.x machine, where the threshold is only 500, and I
>> get the interrupt storm threshold warning for swi2 pretty
>> quickly:
>>
>> Interrupt storm detected on "swi2:"; throttling interrupt source
>>
>> Obviously it's contingent on the number of adapters you have on your
>> system.  On CURRENT I didn't see this because the threshold is double
>> (I think it's 1000 by default).
>>
>> The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
>> the scan.  The probe sequence in CURRENT as well as 6.1 handles
>> CAM_SEL_TIMEOUT a little differently depending on context.

Yeah, I spoke too soon.  I think that is a red herring, though, and a
misinterpretation of what that code was really doing (in this case, just
seeing the device as unconfigured and moving on).

But I STILL don't understand why it's treated as an AC_LOST_DEVICE event
at scan time.  It seems like more overhead than really necessary: why
create a path, then call xpt_async(), all just to set the flag as
unconfigured?  Perhaps I am not thinking of all the possibilities down
this code path, or perhaps it's more aligned with the model than
anything else and I'm reading too much into it.

> It's not at all clear to me what is going on here.  Can you instrument the
> code to record the status of everything that is being issued to the aac_cam
> module?

Yes, surely.  I think what might be happening is that after the
INQUIRY fails, xpt_release_ccb() is called, which I think also checks
whether any more CCBs should be sent to the device and sends them.
Basically, the boot -v output shows that I am getting a CAM_SEL_TIMEOUT
for each target and just hitting the default interrupt storm threshold
of 500 on 6.1.
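
For the instrumentation, something along these lines is what I have in
mind: a rough standalone sketch that tallies completions by status code.
The names (sk_record_status, sk_dump_status) and the 64-entry table are
my own assumptions for illustration; nothing like this exists in the
driver today, and the real hook point in aac_cam still has to be chosen.

```c
#include <stdio.h>

#define SK_NSTATUS 64   /* assumed upper bound on distinct status codes */

/* Per-status completion counters; purely illustrative, not driver state. */
static unsigned long sk_status_count[SK_NSTATUS];

/* Record one completion; out-of-range codes share the last bucket. */
static void
sk_record_status(unsigned int status)
{
    if (status >= SK_NSTATUS)
        status = SK_NSTATUS - 1;
    sk_status_count[status]++;
}

/* Print every non-zero counter, for example once the bus scan is done. */
static void
sk_dump_status(void)
{
    unsigned int i;

    for (i = 0; i < SK_NSTATUS; i++)
        if (sk_status_count[i] != 0)
            printf("status 0x%02x: %lu completions\n",
                i, sk_status_count[i]);
}
```

The idea would be to call the recording function from the completion
path for every CCB and dump the table after the scan, which should show
whether the selection timeouts really account for the bulk of the
completions.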

Let me investigate further... I'm on the right track, but I need to
instrument more... Scott, it's my first time playing with CAM (be
gentle).  :D

-aps