Thanks for the explanations and for experimenting with the various
alternatives. I started this thread just to understand whether something
was already in place, and to make sure that what I do with netmap is not
worse than the situation we have now. I guess that, until the best
solution comes out, a quick and non-intrusive workaround is to at least
follow the approach that Bryan suggested.

cheers
luigi

On Tue, Aug 6, 2013 at 6:43 PM, Andre Oppermann <andre_at_freebsd.org> wrote:
> On 05.08.2013 23:53, Luigi Rizzo wrote:
>> On Mon, Aug 05, 2013 at 11:04:44PM +0200, Andre Oppermann wrote:
>>> On 05.08.2013 19:36, Luigi Rizzo wrote:
>> ...
>>> [picking a post at random to reply in this thread]
>>>
>>>> tell whether or not we should bail out).
>>>
>>> Ideally we don't want to have any locks in the RX and TX path at all.
>>
>> Ok, I have read your proposal below, but there are a few things
>> I do not completely understand -- essentially I do not
>> understand whether the removal of IFF_DRV_RUNNING (or equivalent)
>> forces you to replace the ifp->new_tx_function() with something
>> else when you want to do an "ifconfig down".
>
> Sorry, it's IFF_DRV_OACTIVE that'll go away, not IFF_DRV_RUNNING.
> It's of no use with multi-queue anyways. Though we may get more
> differentiated states for interface availability.
>
>> Specifically, here are my doubts:
>>
>> - Considering that the rxq lock is rarely contended
>>   (only when the control path is active, which in your approach
>>   below requires killing and restarting the ithread),
>>   and is acquired/released only once per interrupt/batch,
>>   I am under the impression that the per-queue RX lock is
>>   not a performance problem and makes it easier to reason
>>   about the code (and does not require a different approach
>>   for MSI-X and other cases).
>
> There are two important things here: 1) there must only be one
> (i)thread per RX queue to prevent lock contention; 2) even with
> a non-contended lock there are effects felt by the other cores
> or CPUs, like bus lock cycles, which are considerable. Stopping
> and restarting the ithread/taskqueue in those cases that require
> more involved (hardware) reconfiguration isn't much of an overhead
> compared to per-packet or per-packet-batch locking in the hot path.
>
>> - In the TX datapath, do you want to acquire the txq lock
>>   before or after calling ifp->new_tx_function()?
>>   (BTW, how does it compare to ifp->if_start or ifp->if_transmit?)
>
> Struct ifnet is going to be split into two parts, the stack-owned
> part and the driver-owned part. The stack will never fumble with the
> driver-owned parts, and the driver must only change the stack-owned
> parts through accessor functions. (This split has also been proposed
> by Juniper quite some time ago, but for different reasons.)
>
> The driver supplies a TX frame transmit function (mostly like if_transmit
> today) which does all locking and multi-queue handling internally (driver
> owned). This gives driver writers the freedom to better adjust to
> different hardware communication methods as they become available, as we
> witnessed a couple of times in the past.
>
> If the driver is multi-queue capable it takes the RSS hash from the
> packet header to select an appropriate queue, which may have its own
> dedicated lock. If there is only one queue then it will obtain that lock,
> which may see some contention when multiple cores try to send at the same
> time. The driver may do more extensive locking, however that likely comes
> at the expense of performance.
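A minimal user-space sketch of the "flow hash selects the queue, each queue
has its own lock" scheme described above might look like the following. All
names (struct tx_queue, dev_transmit, ...) are invented for illustration,
and pthread mutexes stand in for the kernel's per-queue locks; this is not
the proposed ifnet KPI.

/*
 * Illustration only: per-queue TX dispatch keyed on the packet's flow
 * hash.  Each queue is padded to a cache line so that distinct queues
 * do not share lines, and only the selected queue's lock is taken on
 * transmit.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTXQ        4
#define CACHE_LINE  64

struct pkt {
        uint32_t rss_hash;              /* flow hash from the header */
        const char *payload;
};

struct tx_queue {
        pthread_mutex_t lock;           /* protects this ring only */
        unsigned enqueued;              /* stand-in for the DMA ring tail */
} __attribute__((aligned(CACHE_LINE)));

static struct tx_queue txq[NTXQ];

/* Driver-owned transmit: hash -> queue -> that queue's lock only. */
static int
dev_transmit(struct pkt *p)
{
        struct tx_queue *q = &txq[p->rss_hash % NTXQ];

        pthread_mutex_lock(&q->lock);
        q->enqueued++;                  /* real driver: enqueue on the DMA ring */
        pthread_mutex_unlock(&q->lock);
        return (0);
}

int
main(void)
{
        struct pkt p = { .rss_hash = 0x9e3779b9, .payload = "test frame" };

        for (int i = 0; i < NTXQ; i++)
                pthread_mutex_init(&txq[i].lock, NULL);
        dev_transmit(&p);
        printf("queue %u holds %u packet(s)\n",
            p.rss_hash % NTXQ, txq[p.rss_hash % NTXQ].enqueued);
        return (0);
}

Build with cc -pthread; a real driver would write a TX descriptor and ring
the doorbell where the sketch only bumps a counter.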
> Typically on modern NICs the TX function will be a thin locked wrapper
> around the DMA enqueue function. For older or other kinds of interfaces
> it may implement full internal soft queuing (as the IFQ did). An
> efficient generic implementation of those will be provided for drivers
> to make use of.
>
>> If you do it within the tx handler then you lose the ability
>> of replacing the handler when you do a reconfiguration,
>> because grabbing the tx lock in the control path does not tell
>> you whether anyone is in the handler.
>> If you do it outside, then the driver should also expose the locks,
>> or some locking function.
>
> The stack calls the transmit function without any driver-specific locks
> held. It'll make sure though that the stack-ifnet doesn't go away in
> between, probably by using cheap rmlocks.
>
> The driver's control path gets a function to ensure that the stack has
> drained all writers and that it is safe to reconfigure (as in
> callout_drain()). Not all hardware and control path changes necessarily
> require a reinitialization.
>
> The stack-ifnet shadows some information, like interface state, and may
> do its own independent SMP optimizations to avoid contention.
>
>> Overall, it seems to me that keeping the IFF_DRV_RUNNING
>> flag is a lot more practical, not to mention fewer modifications
>> to the code.
>
> Generally we want to optimize for the fast packet path, reduce any
> friction we can, and take a hit on reconfig being more "expensive" as it
> is much less frequent.
>
> Mind you, not all of this is worked out in detail yet and may be subject
> to further changes and refinements as more benchmarking of different
> approaches is performed.
>
> --
> Andre
>
> [PS: I'm currently on summer vacation and write this while having access]
>
>> [description of Andre's proposal below, for reference]
>>
>> cheers
>> luigi
>>
>>> Every queue has its own separate RX interrupt vector which is serviced
>>> by a single dedicated ithread (or taskqueue) which does while(1) for
>>> work on the RX ring, only going to sleep when it reaches the end of
>>> work on the ring. It is then woken up by an interrupt to resume work.
>>> To prevent a live-lock on the CPU it would periodically yield to allow
>>> other kernel and user-space threads to run in between. Each queue's
>>> ithread (taskqueue) can be pinned to a particular core or float with a
>>> certain stickiness. To reconfigure the card (when the RX ring is
>>> affected) the ithread (taskqueue) is terminated and restarted again
>>> after the changes. This is done by the control path and allows us to
>>> run RX completely lock-free, because there is only ever one ithread
>>> accessing the RX queue, doing away with the need for further locking
>>> synchronization.
>>
>> ok, I admit I did not think of killing and restarting the ithread,
>> but I wonder how
>
>>> RX with MSI or legacy interrupts:
>>>
>>> Here multi-queue is impeded because of some synchronization issues.
>>> Depending on the hardware synchronization model the ithreads again do
>>> while(1) but have to be made runnable by the interrupt handler when
>>> new work has been indicated.
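For illustration, here is a small user-space model (pthreads, invented
names, nothing FreeBSD-specific) of the "one dedicated RX thread per queue"
scheme: the thread is the only consumer of the ring, sleeps when there is
no work, is woken by the (simulated) interrupt, and is torn down and
restarted by the control path around a reconfiguration.

/* Illustration only: single-consumer RX thread, woken by an "interrupt",
 * stopped and restarted by the control path for reconfiguration. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct rx_queue {
        pthread_mutex_t wake_lock;      /* guards only the wakeup flags */
        pthread_cond_t  wake_cv;
        bool            work;           /* "descriptors pending" */
        bool            exiting;        /* set by the control path */
        unsigned        processed;
        pthread_t       thread;
};

static void *
rx_ithread(void *arg)
{
        struct rx_queue *q = arg;

        for (;;) {
                pthread_mutex_lock(&q->wake_lock);
                while (!q->work && !q->exiting)
                        pthread_cond_wait(&q->wake_cv, &q->wake_lock);
                if (q->exiting) {
                        pthread_mutex_unlock(&q->wake_lock);
                        return (NULL);
                }
                q->work = false;
                pthread_mutex_unlock(&q->wake_lock);
                /* Sole consumer: the ring itself needs no lock here. */
                q->processed++;
        }
}

/* "Interrupt handler": just make the RX thread runnable. */
static void
rx_intr(struct rx_queue *q)
{
        pthread_mutex_lock(&q->wake_lock);
        q->work = true;
        pthread_cond_signal(&q->wake_cv);
        pthread_mutex_unlock(&q->wake_lock);
}

/* Control path: stop the thread so the ring can be rebuilt safely. */
static void
rx_stop(struct rx_queue *q)
{
        pthread_mutex_lock(&q->wake_lock);
        q->exiting = true;
        pthread_cond_signal(&q->wake_cv);
        pthread_mutex_unlock(&q->wake_lock);
        pthread_join(q->thread, NULL);
        q->exiting = false;
}

int
main(void)
{
        struct rx_queue q = { .processed = 0 };

        pthread_mutex_init(&q.wake_lock, NULL);
        pthread_cond_init(&q.wake_cv, NULL);

        pthread_create(&q.thread, NULL, rx_ithread, &q);
        rx_intr(&q);
        usleep(10000);

        rx_stop(&q);                    /* reconfigure the "hardware" here */
        pthread_create(&q.thread, NULL, rx_ithread, &q);

        rx_intr(&q);
        usleep(10000);
        rx_stop(&q);
        printf("processed %u batch(es)\n", q.processed);
        return (0);
}

The kernel version would of course use an ithread or taskqueue and the real
wakeup primitives, but the structural point is the same: because only one
thread ever touches the ring, the data path needs no ring lock, and
reconfiguration pays the (rare) cost of the stop/start.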
>>> TX in general:
>>>
>>> This is a bit more tricky as we have the hard requirement of packet
>>> ordering for individual sessions (tcp and others). That means we must
>>> use the same queue for all packets of the same session. This makes
>>> running TX non-locked a bit tricky, because when we pin a TX queue to
>>> a core we must switch to that core first before adding anything to the
>>> queue. If the packet was generated on that core there is no overhead;
>>> OTOH if not, there is the scheduler overhead and some cache losses.
>>> Ensuring that the entire TX path, possibly including user space
>>> (write et al.), happens on the same core is difficult and may have its
>>> own inefficiencies. The number of TX queues vs. the number of cores is
>>> another point of contention. To make the 1:1 scheme work well, one
>>> must have as many queues as cores.
>>>
>>> A more scalable and generic approach doesn't bind TX queues to any
>>> particular core and covers each with its own lock to protect the TX
>>> ring enqueue. To prevent false lock cache-line sharing, each TX
>>> structure should be on its own cache line. As each session has an
>>> affinity hash, they become distributed across the TX queues,
>>> significantly reducing contention.
>>>
>>> The TX-complete handling, freeing the mbuf(s), is done the same way as
>>> for the RX ring, with a dedicated ithread (taskqueue) for each. Also,
>>> bpf should hook in here (the packet has been sent) instead of at TX
>>> enqueue time.
>>>
>>> The whole concept of IFF_DRV_RUNNING will go away along with the IFQ
>>> macros. Each driver then provides a direct TX function pointer which
>>> is put into a decoupled ifnet TX field for use by the stack. Under
>>> normal operation all interface TX goes directly into the TX DMA ring.
>>> The particular multi-queue and locking model is decided by the driver.
>>> The kernel provides a couple of common optimized abstractions for use
>>> by all drivers that want/need it, to avoid code and functionality
>>> duplication. When things like ALTQ are activated on an interface, the
>>> ifnet TX function pointer is replaced with the appropriate
>>> intermediate function pointer which eventually will call the driver's
>>> TX function. The intermediate TX packet handler (ALTQ) can do its own
>>> additional locking on top of the driver's TX locking.
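As a rough sketch of what that decoupled TX function pointer could look
like (names such as struct xifnet, if_xmit and altq_transmit are made up
here and do not reflect the actual ifnet changes): under normal operation
the stack-visible pointer is the driver's transmit function, and activating
an ALTQ-like layer simply replaces the pointer with an intermediate handler
that eventually calls the driver.

/* Illustration only: the stack calls whatever transmit pointer is
 * installed; an intermediate queueing layer is enabled by swapping
 * that pointer for a wrapper around the driver's function. */
#include <stdio.h>

struct mbuf { const char *data; };

struct xifnet {
        /* stack-visible: what the stack calls for every packet */
        int (*if_xmit)(struct xifnet *, struct mbuf *);
        /* driver-owned: the real hardware enqueue path */
        int (*drv_xmit)(struct xifnet *, struct mbuf *);
};

/* Driver transmit: a real driver would take the per-queue lock and
 * write a DMA descriptor here. */
static int
drv_transmit(struct xifnet *ifp, struct mbuf *m)
{
        (void)ifp;
        printf("driver TX: %s\n", m->data);
        return (0);
}

/* Intermediate handler: may do its own queueing/locking on top and
 * then hands the packet to the driver. */
static int
altq_transmit(struct xifnet *ifp, struct mbuf *m)
{
        printf("altq-like layer saw: %s\n", m->data);
        return (ifp->drv_xmit(ifp, m));
}

int
main(void)
{
        struct xifnet ifp = { .if_xmit = drv_transmit, .drv_xmit = drv_transmit };
        struct mbuf m1 = { "packet 1" };
        struct mbuf m2 = { "packet 2" };

        ifp.if_xmit(&ifp, &m1);         /* fast path: straight to the driver */

        ifp.if_xmit = altq_transmit;    /* "activate ALTQ": swap the pointer */
        ifp.if_xmit(&ifp, &m2);         /* now goes through the intermediate layer */
        return (0);
}

The pointer swap itself would naturally need the same drain/synchronization
the control path already provides, which the sketch glosses over.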
>>> Control path:
>>>
>>> The control path is separate from the high-speed TX and RX data paths.
>>> It has its own overall lock. Multiple locks typically don't make sense
>>> here. If it needs to modify, reconfigure, or newly set up the TX or RX
>>> rings, then it has to stop the RX ithreads (taskqueues) and take the
>>> TX queue locks before it can do that. After the changes are made and
>>> stable, packet processing can continue.
>>>
>>> I've adjusted / heavily modified the fxp, bge and igb drivers in my
>>> tcp_workqueue branch to test these concepts. Not all of it is
>>> committed or fully up to date.
>>>
>>> --
>>> Andre

-- 
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, rizzo_at_iet.unipi.it . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/         . Universita` di Pisa
 TEL      +39-050-2211611                . via Diotisalvi 2
 Mobile   +39-338-6809875                . 56122 PISA (Italy)
-----------------------------------------+-------------------------------

Received on Tue Aug 06 2013 - 20:55:43 UTC