Re: [net] protecting interfaces from races between control and data ?

From: Andre Oppermann <andre_at_freebsd.org> Date: Mon, 05 Aug 2013 23:04:44 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:40 UTC

On 05.08.2013 19:36, Luigi Rizzo wrote:
> On Mon, Aug 5, 2013 at 7:17 PM, Adrian Chadd <adrian_at_freebsd.org> wrote:
>
>> I'm travelling back to San Jose today; poke me tomorrow and I'll brain
>> dump what I did in ath(4) and the lessons learnt.
>>
>> The TL;DR version - you don't want to grab an extra lock in the
>> read/write paths as that slows things down. Reuse the same per-queue
>> TX/RX lock and have:
>>
>> * a reset flag that is set when something is resetting; that says to
>> the queue "don't bother processing anything, just dive out";
>> * 'i am doing Tx / Rx' flags per queue that is set at the start of
>> TX/RX servicing and finishes at the end; that way the reset code knows
>> if there's something pending;
>> * have the reset path grab each lock, set the 'reset' flag on each,
>> then walk each queue again and make sure they're all marked as 'not
>> doing TX/RX'. At that point the reset can occur, then the flag cna be
>> cleared, then TX/RX can resume.
>>

[picking a post at random to reply in this thread]

> so this is slightly different from what Bryan suggested (and you endorsed)
> before, as in that case there was a single 'reset' flag IFF_DRV_RUNNING
> protected by the 'core' lock, then a nested round on all tx and rx locks
> to make sure that all customers have seen it.
> In both cases the tx and rx paths only need the per-queue lock.
>
> As i see it, having a per-queue reset flag removes the need for nesting
> core + queue locks, but since this is only in the control path perhaps
> it is not a big deal (and is better to have a single place to look at to
> tell whether or not we should bail out).

Ideally we don't want to have any locks in the RX and TX path at all.

For the TX path this is not easy to achieve but for RX it isn't difficult
at all provided the correct model is used.

My upcoming stack/driver proposal is along these lines:

RX with MSI-X:

Every queue has its own separate RX interrupt vector which is serviced by
a single dedicated ithread (or taskqueue) which does while(1) for work on
the RX ring only going to sleep when it reaches the end of work on the ring.
It is then woken up by an interrupt to resume work.  To prevent a live-lock
on the CPU it would periodically yield to allow other kernel and user-space
threads to run in between.  Each queue's ithread (taskqueue) can be pinned
to a particular core or float with a certain stickiness.  To reconfigure the
card (when the RX ring is affected) the ithread (taskqueue) is terminated
and after the changes restarted again.  This is done by the control path
and allows us to run RX completely lock free because there is only ever one
ithread accessing the RX queue doing away with the need for further locking
synchronization.

RX with MSI or legacy interrupts:

Here multi-queue is impeded because of some synchronization issues.  Depending
on the hardware synchronization model the ithreads again do while(1) but have
to be made runnable by the interrupt handler when new work has been indicated.

TX in general:

This is a bit more tricky as we have the hard requirement of packet ordering
for individual sessions (tcp and others).  That means we must use the same
queue for all packets of the same session.  This makes running TX non-locked
a bit tricky because when we pin a TX queue to a core we must switch to that
core first before adding anything to the queue.  If the packet was generated
on that core there is no overhead, OTOH if not there is the scheduler over-
head and some cache losses.  Ensuring that a the entire TX path, possibly
including user-space (write et al) happens on the same core is difficult and
may have its own inefficiencies.  The number of TX queue vs. number of cores
is another point of contention.  To make the 1:1 scheme work well, one must
have as many queues as cores.

A more scalable and generic approach doesn't bind TX queues to any particular
core and is covers each by its own lock to protect the TX ring enqueue.  To
prevent false lock cache line sharing each TX structure should be on its own
cache line.  As each session has an affinity hash they become distributed
across the TX queues significantly reducing contention.

The TX complete handling freeing the mbuf(s) is done the same way as for the
RX ring with a dedicated ithread (taskqueue) for each.  Also bpf should hook
in here (the packet has been sent) instead of at the TX enqueue time.

The whole concept of IFF_DRV_RUNNING will go away along with the IFQ macros.
Each driver then provides a direct TX function pointer which is put into a
decoupled ifnet TX field for use by the stack.  Under normal operation all
interface TX goes directly into TX DMA ring.  The particular multi-queue
and locking model is decided by the driver.  The kernel provides a couple
of common optimized abstractions for use by all drivers that want/need it
to avoid code and functionality duplication.  When things like ALTQ are
activated on an interface the ifnet TX function pointer is replaced with
the appropriate intermediate function pointer which eventually will call
the drivers TX function.  The intermediate TX packet handler (ALTQ) can
do its own additional locking on top of the drivers TX locking.

Control path:

The control path is separate from the high speed TX and RX data paths.
It has its own overall lock.  Multiple locks typically don't make sense
here.  If it needs to modify, reconfigure, or newly set up the TX or RX
rings then it has to stop the RX ithreads (taskqueues) and to lock the
TX queue locks before it can do that.  After the changes are made and
stable packet processing can continue.

I've adjusted / heavily modified fxp, bge and igb drivers in my tcp_workqueue
branch to test these concepts.  Not all of it is committed or fully up to date.

-- 
Andre