Re: Broadcom bge(4) panics while shutting down

From: Alexander Sack <pisymbol_at_gmail.com>
Date: Thu, 14 May 2009 16:01:33 -0400
On Thu, May 14, 2009 at 3:47 PM, Xin LI <delphij_at_delphij.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi, Alexander,
>
> Alexander Sack wrote:
>> Hello:
>>
>> Under heavy traffic (100% utilization GIGE on a 2 port BGE card)
>> running the bge driver from CURRENT, I see panics on shutdown.  The
>> reason is that bge_rxeof(), while processing its RX ring of BDs, drops
>> the softc lock when it hands a packet off to its input function.  If
>> bge_stop() is waiting for the lock, it will then acquire it and
>> quiesce the hardware (resetting the card, clearing out BDs, etc.).  Once
>> bge_stop() releases the softc lock, then bge_rxeof() under an
>> interrupt context (no polling here) will reacquire and continue to
>> process the ring which is a bad idea.  It should check to see if the
>> card is still running before continuing processing BDs (i.e. once
>> IF_DRV_RUNNING has been reset by bge_stop(), bge_rxeof() is done, bail
>> out).
>>
>> Here is my first go around with this patch:
>>
>>
>> --- if_bge.c.CURRENT   2009-05-14 14:39:39.000000000 -0400
>> +++ if_bge.c  2009-05-14 14:39:24.000000000 -0400
>> @@ -3081,6 +3081,10 @@
>>               uint16_t                vlan_tag = 0;
>>               int                     have_tag = 0;
>>
>> +             if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) {
>> +                     return;
>> +             }
>> +
>>  #ifdef DEVICE_POLLING
>>               if (ifp->if_capenable & IFCAP_POLLING) {
>>                       if (sc->rxcycles <= 0)
>>
>>
>> This prevents any panics during shutdown under heavy load and AS IT
>> TURNS out (I feel stupid for not looking) that em(4) already had this
>> check in its em_rxeof() function (right at the top of the loop).  I'm
>> more than happy changing it to the em style but above seems reasonable
>> to me though I have to verify there isn't anything missing off the
>> loop from a hardware standpoint (I don't think so because bge_stop()
>> did all the dirty work so I believe touching any registers after that
>> from bge_rxeof() is a bad idea).
>>
>> Preliminary testing shows no more panics when starting and stopping
>> ports under heavy load (panics were almost immediate otherwise).
>>
>> Thoughts?
>
> I think this would solve the problem but I'm not sure whether this would
> increase some overhead on the RX path.  It seems that there is a race
> between bge_release_resources() and bge_intr(), I mean, it might be a
> good idea to "drain" bge_intr() instead?

Are you talking about detach time?  bge_stop() gets called
before bge_release_resources() and stops host interrupts, so where is
the race?  At that point no more interrupts should be
delivered to bge_intr() (I can confirm this from the spec, since it has
been released in the wild).  So why would you "drain" it at this
point?  The hardware is down, including the firmware.

I agree it adds a little overhead to the standard bge_rxeof() path,
which is very sensitive to change.  However, I think the check
at the top is tolerable since the alternative is a crash, and it's
very easy to reproduce: flood a Broadcom card with traffic, then stop
the card and let the race begin.  It will go down in bge_rxeof() after
bge_stop() releases the lock.
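
To make the race concrete, here is a minimal single-threaded sketch in
user-space C.  It is not the driver code: fake_softc, fake_stop(),
fake_rxeof() and the other names are hypothetical stand-ins, the "lock" is
just a flag checked with assert(), and the stop is injected from the input
callback to mimic bge_stop() winning the lock while the RX loop has dropped
it.  It only illustrates why re-checking the running flag at the top of the
loop keeps the loop away from descriptors that were already cleared.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8

/* Hypothetical stand-in for the driver softc; not the real struct bge_softc. */
struct fake_softc {
	int locked;           /* models the softc mutex (single-threaded toy) */
	int running;          /* models IFF_DRV_RUNNING in if_drv_flags */
	int ring[RING_SIZE];  /* models RX BDs; 0 means cleared by stop */
	int processed;        /* packets handed to the input routine */
	int stale_touched;    /* BDs touched after the ring was cleared */
};

typedef void (*input_fn)(struct fake_softc *, int);

static void fake_lock(struct fake_softc *sc)   { assert(!sc->locked); sc->locked = 1; }
static void fake_unlock(struct fake_softc *sc) { assert(sc->locked);  sc->locked = 0; }

static void fake_init(struct fake_softc *sc)
{
	sc->locked = 0;
	sc->running = 1;
	sc->processed = 0;
	sc->stale_touched = 0;
	for (int i = 0; i < RING_SIZE; i++)
		sc->ring[i] = 1;
}

/* Models bge_stop(): take the lock, clear the ring, mark the interface down. */
static void fake_stop(struct fake_softc *sc)
{
	fake_lock(sc);
	for (int i = 0; i < RING_SIZE; i++)
		sc->ring[i] = 0;
	sc->running = 0;
	fake_unlock(sc);
}

/* Models the RX loop; must be called with the lock held. */
static void fake_rxeof(struct fake_softc *sc, input_fn input, int check_running)
{
	assert(sc->locked);
	for (int i = 0; i < RING_SIZE; i++) {
		/* The proposed shield: bail out once stop has run. */
		if (check_running && !sc->running)
			return;
		if (sc->ring[i] == 0)
			sc->stale_touched++;  /* a real driver would be on torn-down state here */
		fake_unlock(sc);          /* lock dropped around the input hand-off */
		input(sc, i);
		fake_lock(sc);
		sc->processed++;
	}
}

/* Input routine that lets the simulated stop run during the third packet. */
static void input_then_stop(struct fake_softc *sc, int idx)
{
	if (idx == 2)
		fake_stop(sc);
}

/* Returns how many cleared BDs the RX loop touched. */
static int run_scenario(int check_running)
{
	struct fake_softc sc;

	fake_init(&sc);
	fake_lock(&sc);
	fake_rxeof(&sc, input_then_stop, check_running);
	fake_unlock(&sc);
	return sc.stale_touched;
}
```

Calling run_scenario(0) models the loop without the check, and
run_scenario(1) models it with the check.  In this toy model the unchecked
loop keeps walking the ring after the simulated stop, while the checked loop
returns on the next iteration.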

I did not actually look at changing anything structurally to
improve the situation, but at a minimum there should be a
shield against this, no?

-aps
Received on Thu May 14 2009 - 18:01:34 UTC
