Re: [net] protecting interfaces from races between control and data ?

From: Andre Oppermann <andre_at_freebsd.org> Date: Tue, 06 Aug 2013 18:43:47 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:40 UTC

On 05.08.2013 23:53, Luigi Rizzo wrote:
> On Mon, Aug 05, 2013 at 11:04:44PM +0200, Andre Oppermann wrote:
>> On 05.08.2013 19:36, Luigi Rizzo wrote:
> ...
>>
>> [picking a post at random to reply in this thread]
>>> tell whether or not we should bail out).
>>
>> Ideally we don't want to have any locks in the RX and TX path at all.
>
> Ok i have read your proposal below but there are a few things
> I do not completely understand, below -- essentially I do not
> understand whether the removal of IFF_DRV_RUNNING (or equivalent)
> forces you to replace the ifp->new_tx_function() with something
> else when you want to do an "ifconfig down"

Sorry, it's IFF_DRV_OACTIVE that'll go away, not IFF_DRV_RUNNING.
It's of no use with multi-queue anyways.  Though we may get more
differentiated states for interface availability.

> Specifically, here are my doubts:
>
> - Considering that the rxq lock is rarely contended
>    (only when the control path is active, which in your approach
>    below requires killing and restarting the ithread),
>    and is acquired/released only once per interrupt/batch,
>    i am under the impression that the per-queue RX lock is
>    not a performance problem and makes it easier to reason
>    on the code (and does not require different approach
>    for MSI-x and other cases).

There are two important things here: 1) there must only be one
(i)thread per RX queue to prevent lock contention; 2) even with
a non-contended lock there are effects felt by the other cores
or CPUs like bus lock cycles which are considerable.  Stopping
and restarting the ithread/taskqueue in those cases that require
more involved (hardware) reconfiguration isn't much of an overhead
compared to per-packet or per-packet-batch locking in the hot path.

> - in the tx datapath, do you want to acquire the txq lock
>    before or after calling ifp->new_tx_function() ?
>    (btw how it compares to ifp->if_start or ifp->if_transmit ?)

Struct ifnet is going to be split into two parts, the stack owned
part and the driver owned part.  The stack will never fumble the
driver owned parts and the driver must only change the stack owned
parts through accessor functions.  (This split has also been proposed
by Juniper quite some time ago but for different reasons).

The driver supplies a TX frame transmit function (mostly like if_transmit
today) which does all locking and multi-queue handling internally (driver
owned.  This gives driver writers the freedom to better adjust to different
hardware communication methods as they become available, like we witnessed
a couple of times in the past.

If the driver is multi-queue capable it takes the rss-hash from the packet header
to select an appropriate queue which may have its own dedicated lock.  If there
is only one queue then it will obtain that lock which may see some contention
when multiple cores try to send at the same time.  The driver may do more
extensive locking, however that likely comes at the expense of performance.

Typically on modern NICs the TX function will be a thin locked wrapper around
the DMA enqueue function.  For or older or other kinds of interfaces it may
implement full internal soft queuing (as the IFQ did).  An efficient generic
implementation of those will be provided for driver to make use of.

 >    If you do it within the tx handler then you lose the ability
 >    of replacing the handler when you do a reconfiguration,
 >    because grabbing the tx lock in the control path does not tell
 >    you whether anyone is in the handler.
 >    If you do it outside, then the driver should also expose the locks,
 >    or some locking function.

The stack calls the transmit function without any driver-specific locks held.
It'll make sure though that the stack-ifnet doesn't go away in between probably
by using cheap rmlocks.

The drivers control path gets a function to ensure that the stack has drained
all writers and it is safe to reconfigure (as in callout_drain()).  Not all
hardware and control path changes necessarily require a reinitialization.

The stack-ifnet shadows some information, like interface state, and may do its
own independent SMP optimizations to avoid contention.

> Overall, it seems to me that keeping the IFF_DRV_RUNNING
> flag is a lot more practical, not to mention fewer modifications
> to the code.

Generally we want to optimize for the fast packet path, reduce any friction we
can, and take a hit on reconfig being more "expensive" as it is much less
frequent.

Mind you not all of this is worked out in detail yet and may be subject to
further changes and refinements as more benchmarking of different approaches
is performed.

-- 
Andre

[PS: I'm currently on summer vacation and write this while having access]

> [description of Andre's proposal below, for reference]
>
> 	cheers
> 	luigi
>
>> Every queue has its own separate RX interrupt vector which is serviced by
>> a single dedicated ithread (or taskqueue) which does while(1) for work on
>> the RX ring only going to sleep when it reaches the end of work on the ring.
>> It is then woken up by an interrupt to resume work.  To prevent a live-lock
>> on the CPU it would periodically yield to allow other kernel and user-space
>> threads to run in between.  Each queue's ithread (taskqueue) can be pinned
>> to a particular core or float with a certain stickiness.  To reconfigure the
>> card (when the RX ring is affected) the ithread (taskqueue) is terminated
>> and after the changes restarted again.  This is done by the control path
>> and allows us to run RX completely lock free because there is only ever one
>> ithread accessing the RX queue doing away with the need for further locking
>> synchronization.
>
> ok I admit i did not think of killing and restarting the ithread,
> but i wonder how
>> RX with MSI or legacy interrupts:
>>
>> Here multi-queue is impeded because of some synchronization issues.  Depending
>> on the hardware synchronization model the ithreads again do while(1) but have
>> to be made runnable by the interrupt handler when new work has been indicated.
>>
>> TX in general:
>>
>> This is a bit more tricky as we have the hard requirement of packet ordering
>> for individual sessions (tcp and others).  That means we must use the same
>> queue for all packets of the same session.  This makes running TX non-locked
>> a bit tricky because when we pin a TX queue to a core we must switch to that
>> core first before adding anything to the queue.  If the packet was generated
>> on that core there is no overhead, OTOH if not there is the scheduler over-
>> head and some cache losses.  Ensuring that a the entire TX path, possibly
>> including user-space (write et al) happens on the same core is difficult and
>> may have its own inefficiencies.  The number of TX queue vs. number of cores
>> is another point of contention.  To make the 1:1 scheme work well, one must
>> have as many queues as cores.
>>
>> A more scalable and generic approach doesn't bind TX queues to any particular
>> core and is covers each by its own lock to protect the TX ring enqueue.  To
>> prevent false lock cache line sharing each TX structure should be on its own
>> cache line.  As each session has an affinity hash they become distributed
>> across the TX queues significantly reducing contention.
>>
>> The TX complete handling freeing the mbuf(s) is done the same way as for the
>> RX ring with a dedicated ithread (taskqueue) for each.  Also bpf should hook
>> in here (the packet has been sent) instead of at the TX enqueue time.
>>
>> The whole concept of IFF_DRV_RUNNING will go away along with the IFQ macros.
>> Each driver then provides a direct TX function pointer which is put into a
>> decoupled ifnet TX field for use by the stack.  Under normal operation all
>> interface TX goes directly into TX DMA ring.  The particular multi-queue
>> and locking model is decided by the driver.  The kernel provides a couple
>> of common optimized abstractions for use by all drivers that want/need it
>> to avoid code and functionality duplication.  When things like ALTQ are
>> activated on an interface the ifnet TX function pointer is replaced with
>> the appropriate intermediate function pointer which eventually will call
>> the drivers TX function.  The intermediate TX packet handler (ALTQ) can
>> do its own additional locking on top of the drivers TX locking.
>>
>> Control path:
>>
>> The control path is separate from the high speed TX and RX data paths.
>> It has its own overall lock.  Multiple locks typically don't make sense
>> here.  If it needs to modify, reconfigure, or newly set up the TX or RX
>> rings then it has to stop the RX ithreads (taskqueues) and to lock the
>> TX queue locks before it can do that.  After the changes are made and
>> stable packet processing can continue.
>>
>> I've adjusted / heavily modified fxp, bge and igb drivers in my tcp_workqueue
>> branch to test these concepts.  Not all of it is committed or fully up to date.
>>
>> --
>> Andre
>>
>
>