RE: Problem with twa in HEAD

From: Vinod Kashyap <vkashyap_at_amcc.com>
Date: Fri, 29 Apr 2005 10:15:09 -0700
> -----Original Message-----
> From: Scott Long [mailto:scottl_at_samsco.org]
> Sent: Thursday, April 28, 2005 11:57 PM
> To: Vinod Kashyap
> Cc: Bjoern A. Zeeb; freebsd-current_at_FreeBSD.org
> Subject: Re: Problem with twa in HEAD
> 
> 
> Vinod Kashyap wrote:
> >
> >>-----Original Message-----
> >>From: Bjoern A. Zeeb [mailto:bz_at_FreeBSD.org]
> >>Sent: Tuesday, April 26, 2005 3:26 AM
> >>To: Vinod Kashyap
> >>Subject: RE: Problem with twa in HEAD
> >>
> >>
> >>On Mon, 25 Apr 2005, Vinod Kashyap wrote:
> >>
> >>Hi,
> >>
> >>
> >>>>-----Original Message-----
> >>>>From: Bjoern A. Zeeb [mailto:bz_at_FreeBSD.org]
> >>>>Sent: Monday, April 25, 2005 6:45 AM
> >>>>To: Vinod Kashyap
> >>>>Subject: Re: Problem with twa in HEAD
> >>>>
> >>>>
> >>>>On Fri, 22 Apr 2005, Bjoern A. Zeeb wrote:
> >>>>
> >>>>Hi,
> >>>>
> >>>>
> >>>>>scottl redirected me to you.
> >>>>>
> >>>>>I am currently debugging "hangs" on reboot and shutdown on a
> >>>>>SMP machine with 12 discs at a
> >>>>>
> >>>>>3ware device driver for 9000 series storage controllers, version: 3.60.00.016
> >>>>>twa0: <3ware 9000 series Storage Controller> port 0x9800-0x98ff mem 0xfe8ffc00-0xfe8ffcff,0xfb800000-0xfbffffff irq 28 at device 6.0 on pci3
> >>>>>twa0: [FAST]
> >>>>>twa0: INFO: (0x15: 0x1300): Controller details:: 12 ports, Firmware FE9X 2.06.00.009, BIOS BE9X 2.03.01.051
> >>>>
> >>>>>
> >>>>>What I know so far is that Giant is held by sync.
> >>>>>
> >>>>>Things are "spinning" in cam/cam_xpt.c around:
> >>>>>
> >>>>>--- cam_xpt.c   31 Mar 2005 21:42:49 -0000      1.152
> >>>>>+++ cam_xpt.c   22 Apr 2005 18:42:43 -0000
> >>>>>@@ -3643,6 +3643,7 @@ xpt_polled_action(union ccb *start_ccb)
> >>>>>                            != CAM_REQ_INPROG)
> >>>>>                                break;
> >>>>>                        DELAY(1000);
> >>>>>                        printf("XXX status=%02x\n", start_ccb->ccb_h.status);
> >>>>
> >>>>>                }
> >>>>>                if (timeout == 0) {
> >>>>>                        /*
> >>>>>
> >>>>>
> >>>>>with status being 0x200.
> >>>>>
> >>>>>Seems the twa has a command stuck in it.
> >>>>>
> >>>>>I have seen the comment in dev/twa/tw_osl_cam.c ~ line 253 about
> >>>>>queuing and CAM_SIM_QUEUED but I don't know enough about cam.
> >>>>>It seems that not all paths out of this function clear that
> >>>>>from status?
> >>>>>
> >>>>>Any help appreciated ;) I can try patches, as long as I can break
> >>>>>to db> to reboot.
> >>>>
> >>>>Further debugging shows that it seems to be spinning in twa_poll;
> >>>>see the debug output from TWA_DEBUG 3. The problem is that at this
> >>>>point I am no longer able to break to the debugger.
> >>>>
> >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
> >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
> >>>>unmount of /dev failed (BUSY)
> >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
> >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
> >>>>Uptime: 2m57s
> >>>>twa0: tw_osli_execute_scsi: XPT_SCSI_IO: Single virtual address!
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>twa0: twa_poll: entering; sc = 0xc57bb200
> >>>>twa0: twa_poll: exiting; sc = 0xc57bb200
> >>>>...
> >>>>
> >>>
> >>>I am in the middle of an office move right now.
> >>>I will get back to you once I have some time to look into this.
> >>
> >>
> >>thanks for the information; I'll be able to test at least until
> >>end of this week and hopefully next week too.
> >>
> >
> >
> > I looked into this, and this is what is happening:
> > On reboot/halt, the following function calling sequence happens:
> > ... --> dashutdown --> xpt_polled_action --> twa_poll.
> > But, the interrupt handler in twa is still active at this time,
> > since twa_detach/twa_shutdown hasn't been called yet.  Before
> > twa_poll can fetch the response for the posted command, the ISR
> > gets called when the firmware posts the response.  The ISR clears
> > the interrupt bit on the controller, registers a taskqueue handler
> > like it always does, and exits.  Meanwhile, xpt_polled_action
> > continues to call twa_poll, which cannot determine that the command
> > has completed, since the interrupt bit on the controller is already
> > cleared.  So, we get into a (near) never-ending loop (the timeout
> > for scsi_synchronize_cache, which is what is being tried here, is,
> > for whatever reason, 60 minutes, and so, the system is as good as
> > hung).
> >
> > Now, does anyone know why xpt_polled_action is being called from
> > dashutdown, even before the ISR has been unregistered (via
> > twa_detach)?
> >
> > Bjoern, this patch should work around your problem, although it's
> > not the fix.  Also, it still leaves a window for the race condition
> > described above.
> >
> 
> xpt_polled_action() expects that it can simulate interrupts by calling
> the driver poll vector, and that by calling it enough times the driver
> will eventually complete all the outstanding I/O it has.  As you note,
> it'll repeat this for a very long time.  So the question is then why
> the twa driver isn't completing the outstanding I/O.  If I were you
> I'd remove the call to tw_cl_interrupt() in twa_poll() and just
> unconditionally call tw_cl_deferred_interrupt() and have it check
> everything.

The call to tw_cl_interrupt cannot be removed, as that's where the
interrupt is cleared.  However, the call to tw_cl_deferred_interrupt can
be made unconditional, so that it no longer depends on the return value
from tw_cl_interrupt.
Either way, my question is why polling is being resorted to when
interrupts are available.
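
For illustration only, here is a minimal userland sketch of the two
twa_poll variants being discussed.  It is not driver code: every model_*
name is hypothetical, model_interrupt() and model_deferred() merely stand
in for tw_cl_interrupt() and tw_cl_deferred_interrupt(), and the sketch
assumes that the taskqueue handler scheduled by the ISR never gets to run
while the shutdown path is polling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland model; none of these names are the real twa API. */
struct model_ctlr {
    bool intr_bit;     /* "response interrupt" bit on the controller */
    int  resp_queued;  /* responses posted by firmware, not yet fetched */
    int  completed;    /* commands completed back to the caller */
};

/* Firmware posts a response and raises the interrupt bit. */
static void model_fw_post(struct model_ctlr *c)
{
    c->resp_queued++;
    c->intr_bit = true;
}

/* Stand-in for tw_cl_interrupt(): clear the bit, report whether it was set. */
static bool model_interrupt(struct model_ctlr *c)
{
    bool was_set = c->intr_bit;
    c->intr_bit = false;
    return was_set;
}

/* Stand-in for tw_cl_deferred_interrupt(): fetch whatever has been posted. */
static void model_deferred(struct model_ctlr *c)
{
    assert(c->resp_queued >= 0);
    c->completed += c->resp_queued;
    c->resp_queued = 0;
}

/* Poll variant 1: deferred work only if the interrupt bit was still set. */
static void model_poll_conditional(struct model_ctlr *c)
{
    if (model_interrupt(c))
        model_deferred(c);
}

/* Poll variant 2: always clear the bit, then always check for responses. */
static void model_poll_unconditional(struct model_ctlr *c)
{
    (void)model_interrupt(c);
    model_deferred(c);
}

/*
 * Scenario from the thread: firmware posts a response, the ISR wins the
 * race and clears the bit, and then the shutdown path polls `polls` times.
 * Returns the number of commands completed by polling.
 */
static int model_run(bool unconditional, int polls)
{
    struct model_ctlr c = { false, 0, 0 };

    model_fw_post(&c);
    (void)model_interrupt(&c);  /* ISR ran first; its taskqueue work has not */
    for (int i = 0; i < polls; i++) {
        if (unconditional)
            model_poll_unconditional(&c);
        else
            model_poll_conditional(&c);
    }
    return c.completed;
}
```

In this model, model_run(false, n) completes nothing no matter how large
n is, while model_run(true, 1) completes the command on the first poll.
Whether the real deferred handler can safely be called this way depends on
details of the driver that the model does not capture.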

>  The locking here (and in twa_pci_intr()) is flawed anyway: you have
> a race between when tw_cl_interrupt() drops its lock right before
> return and when you check its return value.  I'd like to say that
> it's harmless, except that you expect to pass state from one function
> to the next, so the race is a real one.  It's likely why this case is

If the state passed to the second function is invalid when the second
function executes, it just does nothing.  So, this is pretty harmless.

> failing.  An ideal FAST handler should only clear the hardware
> interrupt register and launch the appropriate handlers; it shouldn't
> try to pass state to the handlers.  Look at aac for an example here,
> but also please recall that I've already discouraged you from using a
> fast handler plus taskqueue for this driver.  If your taskqueue
> handlers need state from when the interrupt was cleared, then they
> simply aren't a good candidate for this model.
> 

In twa, state will need to be passed to the taskqueue handlers since the
ISR clears the state on the controller.
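
As a rough illustration of what "passing state" from an ISR to a deferred
handler can look like, here is a userland sketch using C11 atomics.  It is
an assumption-laden model, not the twa implementation: the model_* names,
the two status bits, and the single hw_status word are all hypothetical.
The ISR side latches the bits it cleared from the hardware; the deferred
side atomically takes everything latched so far, so a late or duplicate
run of the handler simply finds nothing to do.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical status bits; not the real controller register layout. */
#define MODEL_RESP_INTR 0x1u
#define MODEL_ATTN_INTR 0x2u

struct model_softc {
    atomic_uint latched;    /* status bits saved by the ISR for later */
    unsigned    hw_status;  /* stands in for the controller status register */
};

/* ISR side: read and clear the hardware status, latch it for the handler. */
static unsigned model_isr(struct model_softc *sc)
{
    unsigned bits = sc->hw_status;

    sc->hw_status = 0;                        /* clearing loses it in hardware */
    if (bits != 0)
        atomic_fetch_or(&sc->latched, bits);  /* so keep a copy in software */
    return bits;
}

/* Deferred side: take ownership of everything latched so far. */
static unsigned model_deferred_take(struct model_softc *sc)
{
    return atomic_exchange(&sc->latched, 0u);
}

/* Two interrupts arrive before the handler runs; the handler then runs twice. */
static unsigned model_demo(void)
{
    struct model_softc sc;

    atomic_init(&sc.latched, 0u);
    sc.hw_status = MODEL_RESP_INTR;
    (void)model_isr(&sc);
    sc.hw_status = MODEL_ATTN_INTR;
    (void)model_isr(&sc);
    assert(sc.hw_status == 0);

    unsigned first = model_deferred_take(&sc);   /* both bits, accumulated */
    unsigned second = model_deferred_take(&sc);  /* nothing left to handle */
    return (first << 8) | second;
}
```

The point of the sketch is only that accumulated state survives until a
handler consumes it, and that a handler finding no state does nothing.  It
does not address the lock-drop window in tw_cl_interrupt() raised above.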
Received on Fri Apr 29 2005 - 15:15:14 UTC
