Re: [net] protecting interfaces from races between control and data ?

From: Luigi Rizzo <rizzo_at_iet.unipi.it>
Date: Mon, 5 Aug 2013 23:53:19 +0200
On Mon, Aug 05, 2013 at 11:04:44PM +0200, Andre Oppermann wrote:
> On 05.08.2013 19:36, Luigi Rizzo wrote:
...
> 
> [picking a post at random to reply in this thread]
> > tell whether or not we should bail out).
> 
> Ideally we don't want to have any locks in the RX and TX path at all.

Ok i have read your proposal below but there are a few things
I do not completely understand, below -- essentially I do not
understand whether the removal of IFF_DRV_RUNNING (or equivalent)
forces you to replace the ifp->new_tx_function() with something
else when you want to do an "ifconfig down"

Specifically, here are my doubts:

- Considering that the rxq lock is rarely contended
  (only when the control path is active, which in your approach
  below requires killing and restarting the ithread),
  and is acquired/released only once per interrupt/batch,
  i am under the impression that the per-queue RX lock is
  not a performance problem and makes it easier to reason
  on the code (and does not require different approach
  for MSI-x and other cases).

- in the tx datapath, do you want to acquire the txq lock
  before or after calling ifp->new_tx_function() ?
  (btw how it compares to ifp->if_start or ifp->if_transmit ?)
  If you do it within the tx handler then you lose the ability
  of replacing the handler when you do a reconfiguration,
  because grabbing the tx lock in the control path does not tell
  you whether anyone is in the handler.
  If you do it outside, then the driver should also expose the locks,
  or some locking function.

Overall, it seems to me that keeping the IFF_DRV_RUNNING
flag is a lot more practical, not to mention fewer modifications
to the code.

[description of Andre's proposal below, for reference]

	cheers
	luigi

> Every queue has its own separate RX interrupt vector which is serviced by
> a single dedicated ithread (or taskqueue) which does while(1) for work on
> the RX ring only going to sleep when it reaches the end of work on the ring.
> It is then woken up by an interrupt to resume work.  To prevent a live-lock
> on the CPU it would periodically yield to allow other kernel and user-space
> threads to run in between.  Each queue's ithread (taskqueue) can be pinned
> to a particular core or float with a certain stickiness.  To reconfigure the
> card (when the RX ring is affected) the ithread (taskqueue) is terminated
> and after the changes restarted again.  This is done by the control path
> and allows us to run RX completely lock free because there is only ever one
> ithread accessing the RX queue doing away with the need for further locking
> synchronization.

ok I admit i did not think of killing and restarting the ithread,
but i wonder how 
> RX with MSI or legacy interrupts:
> 
> Here multi-queue is impeded because of some synchronization issues.  Depending
> on the hardware synchronization model the ithreads again do while(1) but have
> to be made runnable by the interrupt handler when new work has been indicated.
> 
> TX in general:
> 
> This is a bit more tricky as we have the hard requirement of packet ordering
> for individual sessions (tcp and others).  That means we must use the same
> queue for all packets of the same session.  This makes running TX non-locked
> a bit tricky because when we pin a TX queue to a core we must switch to that
> core first before adding anything to the queue.  If the packet was generated
> on that core there is no overhead, OTOH if not there is the scheduler over-
> head and some cache losses.  Ensuring that a the entire TX path, possibly
> including user-space (write et al) happens on the same core is difficult and
> may have its own inefficiencies.  The number of TX queue vs. number of cores
> is another point of contention.  To make the 1:1 scheme work well, one must
> have as many queues as cores.
> 
> A more scalable and generic approach doesn't bind TX queues to any particular
> core and is covers each by its own lock to protect the TX ring enqueue.  To
> prevent false lock cache line sharing each TX structure should be on its own
> cache line.  As each session has an affinity hash they become distributed
> across the TX queues significantly reducing contention.
> 
> The TX complete handling freeing the mbuf(s) is done the same way as for the
> RX ring with a dedicated ithread (taskqueue) for each.  Also bpf should hook
> in here (the packet has been sent) instead of at the TX enqueue time.
> 
> The whole concept of IFF_DRV_RUNNING will go away along with the IFQ macros.
> Each driver then provides a direct TX function pointer which is put into a
> decoupled ifnet TX field for use by the stack.  Under normal operation all
> interface TX goes directly into TX DMA ring.  The particular multi-queue
> and locking model is decided by the driver.  The kernel provides a couple
> of common optimized abstractions for use by all drivers that want/need it
> to avoid code and functionality duplication.  When things like ALTQ are
> activated on an interface the ifnet TX function pointer is replaced with
> the appropriate intermediate function pointer which eventually will call
> the drivers TX function.  The intermediate TX packet handler (ALTQ) can
> do its own additional locking on top of the drivers TX locking.
> 
> Control path:
> 
> The control path is separate from the high speed TX and RX data paths.
> It has its own overall lock.  Multiple locks typically don't make sense
> here.  If it needs to modify, reconfigure, or newly set up the TX or RX
> rings then it has to stop the RX ithreads (taskqueues) and to lock the
> TX queue locks before it can do that.  After the changes are made and
> stable packet processing can continue.
> 
> I've adjusted / heavily modified fxp, bge and igb drivers in my tcp_workqueue
> branch to test these concepts.  Not all of it is committed or fully up to date.
> 
> -- 
> Andre
> 
Received on Mon Aug 05 2013 - 19:48:57 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:40 UTC