Re: nve locking fixes round 2

From: Matthew Dillon <dillon_at_apollo.backplane.com> Date: Thu, 24 Nov 2005 15:29:14 -0800 (PST)

:Ok, now that the first set of locking overhaul is in the tree, can folks with 
:working nve(4) adapters test the patch referenced below and make sure there 
:are no regressions.  Having the IFF_UP fiddling turned off may or may not 
:help folks getting the TX timeouts as well, btw, so if people are feeling 
:brave they can try this patch as well.  Note it is only applicable to recent 
:current.
:
:http://www.FreeBSD.org/~jhb/patches/nve_locking.patch
:
:-- 
:John Baldwin <jhb_at_FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
:"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org

    The reason I set sc->pending_txs to 0 in DFly after the reinit is
    because when a watchdog timeout occurs and you reset the device,
    *ALL* mbufs still sitting in the transmit ring are lost.  They will
    never be acknowledged, ever.  So pending_txs will never drop back to 0 on
    its own.  This is what led to continuous watchdog timeout reports
    when, in fact, only one timeout actually occurred.
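
    As a rough illustration of the effect described above, here is a
    minimal user-space sketch.  It is not the driver code: only the
    pending_txs field name comes from the driver, while struct toy_softc,
    toy_start(), toy_reinit() and toy_watchdog() are hypothetical names
    invented for this example, and the watchdog is assumed to report a
    timeout whenever packets are still counted as pending.

```c
#include <assert.h>

/* Hypothetical stand-in for the driver softc.  Only pending_txs is a
 * name taken from the driver; the rest is illustrative. */
struct toy_softc {
    int pending_txs;      /* packets handed to the ring, not yet acked */
    int watchdog_reports; /* number of timeouts reported so far */
};

/* Queue one packet for transmit. */
void toy_start(struct toy_softc *sc)
{
    sc->pending_txs++;
}

/* Model of a reinit after a timeout.  Whatever was in the ring is lost
 * and will never be acknowledged, so no completion will ever decrement
 * the counter.  clear_pending selects whether we zero it ourselves. */
void toy_reinit(struct toy_softc *sc, int clear_pending)
{
    if (clear_pending)
        sc->pending_txs = 0;
}

/* One watchdog tick.  Returns 1 if a timeout was reported. */
int toy_watchdog(struct toy_softc *sc, int clear_pending)
{
    assert(sc->pending_txs >= 0);
    if (sc->pending_txs == 0)
        return 0;               /* nothing outstanding, nothing to report */
    sc->watchdog_reports++;
    toy_reinit(sc, clear_pending);
    return 1;
}
```

    With clear_pending set to 0 every later tick reports another timeout,
    because the stale count never drains; with it set to 1 only the first
    tick reports.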

    The FreeBSD code does set pending_txs to 0 in nve_stop().  I'm not
    sure this is correct, however, unless the pfnStop() ABI call cleans
    out pending mbufs in the transmit ring (which seems unlikely).  The
    count would wind up going negative.

    Another problem that neither of us has dealt with yet is recovery of
    dead transmit mbufs.  Right now that only occurs in nve_ospackettx(),
    but nve_ospackettx() is only called by the Nvidia code during normal
    operation.  ABI calls to e.g. reset the Nvidia device will *NOT* 
    clean out the transmit ring and call nve_ospackettx(), so we lose track
    of all the mbufs that were sitting in there at the time of a reinit.
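
    One possible shape for such a recovery step, sketched in plain
    user-space C, is to walk the ring at reinit time and free whatever is
    still there.  Everything here is an assumption made for illustration:
    struct toy_ring, TOY_TX_RING, toy_enqueue() and toy_drain() are
    hypothetical, malloc()/free() stand in for mbuf allocation and
    release, and the real ring bookkeeping may look nothing like this.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_TX_RING 8           /* hypothetical ring size */

/* Hypothetical transmit ring: each slot holds a buffer we still own. */
struct toy_ring {
    void *slot[TOY_TX_RING];    /* stand-ins for mbuf pointers */
    int   pending_txs;          /* number of occupied slots */
};

/* Put a buffer in the first free slot.  Returns 0 on success, -1 if
 * buf is NULL or the ring is full (the caller keeps ownership then). */
int toy_enqueue(struct toy_ring *r, void *buf)
{
    if (buf == NULL)
        return -1;
    for (int i = 0; i < TOY_TX_RING; i++) {
        if (r->slot[i] == NULL) {
            r->slot[i] = buf;
            r->pending_txs++;
            return 0;
        }
    }
    return -1;
}

/* Called around a reinit: the device will never complete these
 * buffers, so free them here and reset the count.  Returns how many
 * buffers were reclaimed. */
int toy_drain(struct toy_ring *r)
{
    int freed = 0;

    for (int i = 0; i < TOY_TX_RING; i++) {
        if (r->slot[i] != NULL) {
            free(r->slot[i]);
            r->slot[i] = NULL;
            freed++;
        }
    }
    assert(freed == r->pending_txs);    /* count and ring must agree */
    r->pending_txs = 0;
    return freed;
}
```

    Draining and zeroing the counter in the same place keeps the two in
    step, so a second drain after a reinit finds nothing left to free.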

    But, of course, the biggest problem is simply the fact that the Nvidia
    ABI library seems to be rather broken.  On my nForce4-based boxes the
    DFly driver can recover from numerous watchdog timeouts (and they occur
    quite often, even when the network load is virtually nil), but after an
    hour or two of testing at GigE speeds the hardware itself stops working
    entirely, to the point where I have to physically unplug and replug
    the power cord for the machine for the hardware to start working again.

					-Matt
					Matthew Dillon 
					<dillon_at_backplane.com>