Re: network slowness/freez-up since update 10/11

From: Robert Watson <rwatson_at_FreeBSD.ORG>
Date: Thu, 14 Oct 2004 15:55:55 -0400 (EDT)

On Thu, 14 Oct 2004, Ian FREISLICH wrote:

> Andrey Chernov wrote:
> > > You mean, until rwatson changed the default to debug.mpsafenet=1? :-)
> > 
> > Your guess is precisely right! :-)
> > 
> > (IMHO making such commit without testing major drivers such as if_de was
> > wrong step)
> 
> I always thought the spin on debug.mpsafenet=1 with if_de was YMMV.
> There were many calls for the maintainers of the driver to fix it, but
> zero response IIRC.  Maybe making it on by default was a little hasty,
> but anyone that follows -CURRENT like they should if they run it would
> have been aware of this and set debug.mpsafenet=0 in their loader.conf
> when they saw that commit. 

(Kind comments on handling of mpsafenet work omitted in quote, but much
appreciated).

I was chatting with Max Laier this evening, and he suggested that he was
worried that the ALTQ changes might actually be the problem.  He has
created a small patch to back those changes out, as well as a change to
tweak the behavior.  You can find the patches here:

    http://people.freebsd.org/~mlaier/if_de.c.backout.diff

    http://people.freebsd.org/~mlaier/if_de.c.drvlen.diff

I looked at the queueing pieces yesterday but didn't see any obvious
problems with them.  I think it's worth trying each of these patches to
see if one of them has the desired effect, however.  The problem appears
to lie somehow in the hand-off between the network stack and driver, as
that's the primary difference between the debug.mpsafenet={0,1} cases.

FYI, here are some things we've tried looking at so far:

- We thought there might be a race in the handling of IFF_OACTIVE and its
  use in if_handoff(), since IFF_OACTIVE is used differently in if_de than
  in most drivers.  However, removing the IFF_OACTIVE test in if_handoff()
  did not resolve the problem in John's configuration.

- We were concerned there was a race in the task queue handoff used to
  schedule the interface start routine asynchronously from the queue
  insert.  We instrumented the task queue code with timing and didn't find
  anything abnormal (i.e., no waits long enough to explain the observed
  delays).
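
For readers who want to repeat the timing check, here is a minimal
user-space sketch of the idea: take a timestamp when work is queued, take
another when it starts running, and track the worst delay.  It is not the
instrumentation we added to the kernel task queue code, and every name in
it (latency_stats, latency_record, now_ns) is hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical names throughout; this is not the FreeBSD taskqueue API. */
struct latency_stats {
    uint64_t samples;       /* number of tasks measured */
    uint64_t max_ns;        /* worst enqueue-to-run delay seen so far */
    uint64_t over;          /* samples whose delay exceeded threshold_ns */
    uint64_t threshold_ns;  /* delay considered "abnormal" by the caller */
};

/* Monotonic clock in nanoseconds, via POSIX clock_gettime(). */
uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Record one sample.  enq_ns is taken when the task is queued and
 * run_ns when the task function begins executing.
 */
void latency_record(struct latency_stats *s, uint64_t enq_ns, uint64_t run_ns)
{
    uint64_t delay;

    assert(run_ns >= enq_ns);   /* a monotonic clock never runs backwards */
    delay = run_ns - enq_ns;
    s->samples++;
    if (delay > s->max_ns)
        s->max_ns = delay;
    if (delay > s->threshold_ns)
        s->over++;
}
```

In use, the enqueue path would store now_ns() alongside the queued work
and the task function would call latency_record() with that value and a
fresh now_ns().  If max_ns stays far below the stall seen on the wire, the
queueing step is probably not where the time is going.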

So it seems likely to be one of the two following sorts of things:

- A problem in the if_de driver, perhaps due to less Giant coverage in the
  rest of the stack, that causes it to improperly move data in and out of
  the interface queues, or monitor for entries in the queue, resulting in
  delays.

- A race introduced by Giant removal wherein the if_de driver behaves
  incorrectly when the interrupt handler finds a packet in the ifq before
  the tulip_start function has been run for that packet.
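
To make the second case concrete, the following is a toy user-space model
of the handoff, not the kernel code.  It assumes the handoff samples an
IFF_OACTIVE-like flag, enqueues the packet, and requests a deferred start
only if the flag was clear.  The struct and function names are
hypothetical stand-ins for the ifnet, the ifq, and the driver start
routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

/* Toy stand-in for an interface and its send queue (hypothetical). */
struct toy_if {
    int    q[QLEN];        /* packet ids waiting in the send queue */
    size_t head, count;    /* ring buffer position and fill level */
    bool   oactive;        /* models IFF_OACTIVE */
    bool   start_pending;  /* deferred start scheduled but not yet run */
    int    sent;           /* packets handed to the "hardware" */
};

/* Enqueue, then request a deferred start unless the driver looks busy. */
bool toy_handoff(struct toy_if *ifp, int pkt)
{
    bool active;

    assert(ifp->count <= QLEN);
    if (ifp->count == QLEN)
        return false;                   /* queue full: drop the packet */
    active = ifp->oactive;              /* flag sampled before the enqueue */
    ifp->q[(ifp->head + ifp->count) % QLEN] = pkt;
    ifp->count++;
    if (!active)
        ifp->start_pending = true;      /* start runs later, from a task */
    return true;
}

/* Drain the queue to the hardware; models the driver start routine. */
void toy_start(struct toy_if *ifp)
{
    ifp->start_pending = false;
    while (ifp->count > 0) {
        ifp->head = (ifp->head + 1) % QLEN;
        ifp->count--;
        ifp->sent++;
    }
}

/*
 * True while an interrupt arriving now would find queued packets that
 * the start routine has not yet been run for.
 */
bool toy_in_window(const struct toy_if *ifp)
{
    return ifp->count > 0 && ifp->start_pending;
}
```

Calling toy_handoff() and then checking toy_in_window() before toy_start()
shows the window in which an interrupt handler would see a packet the
start routine has not touched.  Setting oactive before toy_handoff() shows
the other case, where the packet is queued but no start is requested at
all.  Whether if_de actually mishandles either state is exactly what
remains to be determined.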

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org      Principal Research Scientist, McAfee Research
Received on Thu Oct 14 2004 - 17:57:39 UTC
