On 08/20/14 00:34, Hans Petter Selasky wrote:
> Hi,
>
> A month has passed since the last e-mail on this topic, and in the
> meanwhile some new patches have been created and tested:
>
> Basically the approach has been changed a little bit:
>
> - The creation of hardware transmit rings has been made independent of
> the TCP stack. This allows firewall applications to forward traffic into
> hardware transmit rings as well, and not only native TCP applications.
> This should be one more reason to get the feature into the kernel.
>
> - A hardware transmit ring basically can have two modes: FIXED-RATE or
> AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed
> bytes per second rate. In the automatic mode you can configure a time
> after which the TX queue must be empty. The hardware driver uses this to
> configure the actual rate. In automatic mode you can also set an upper
> and lower transmit rate limit.
>
> - The MBUF has got a new field in the packet header: "txringid".
>
> - IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of
> the "txringid" field in the mbuf.
>
> The current patch [see attachment] should be much simpler and less
> intrusive than the previous one.
>
> Any comments?

Here are some thoughts. The first two bullets cover relatively minor
issues; the rest are more important.

- All of the mbuf pkthdr fields today have the same meaning no matter
  what the context. It is not clear what txringid's global meaning is.
  Is it even possible for driver foo to interpret it the same way as
  driver bar? What if the number of rings is different, or if the ring
  at a particular index for foo is set up differently than the ring at
  the same index for bar? You are attempting to influence the driver's
  txq selection, and traditionally the mbuf's flowid has been used for
  this purpose. Have you considered allowing the user to set the flowid
  directly, and marking it as such via a new rsstype so the kernel will
  leave it alone? (A minimal sketch of this follows at the end of this
  message.)

- uint32_t -> m_flowid_t is plain gratuitous. Now we need to include
  mbuf.h in more places just to get this definition. What's the
  advantage of this? style(9) isn't too fond of typedefs either. Also,
  drivers *do* need to know the width of the flowid. At least lagg(4)
  looks at the high bits of the flowid (see flowid_shift in lagg), and
  how high that shift can go depends on the width of the flowid.

- Interfaces can come and go, routes can change, and so the relationship
  between an inpcb and a txringid is not stable at all. What happens
  when the outbound route for an inpcb changes?

- The in_ratectlreq structure that you propose is inadequate in its
  current form. For example, cxgbe's hardware can do rate limiting on a
  per-ring as well as a per-connection basis, and it allows for pps,
  bandwidth, or min-max limits. I think this is the critical piece that
  we NIC maintainers must agree on before any code hits the core kernel:
  how to express a rate-limit policy in a standard way and allow for
  hardware assistance opportunistically. (A strawman follows at the end
  of this message.) ipfw(4)'s dummynet is probably interested in this
  part too, so it's great that Luigi is paying attention to this thread.

- The RATECTL ioctls deal with in_ratectlreq, so we need to standardize
  the ratectlreq structure before these ioctls can be considered generic
  ifnet ioctls. This is the reason cxgbetool (and not ifconfig) has a
  private ioctl to frob cxgbe's per-queue rate limiters: I did not want
  to add ifnet ioctls that in reality were cxgbe-only. Ditto for the i2c
  ioctls.
  Now we have multiple drivers with i2c, and melifaro@ is doing the
  right thing by promoting these private ioctls to a standard ifnet
  ioctl. Have you considered a private mlxtool as a stopgap measure?

To summarize my take on all of this: we need a standard ratectlreq
structure, a standard way to associate an inpcb with one, and a
standard way to pass this information on to if_transmit. After all of
this is in place we could even have a dummynet-ish software layer that
implements rate limiters when the underlying hardware offers no
assistance.

Regards,
Navdeep
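
To make the flowid idea concrete, here is a minimal sketch. It assumes
a new, made-up rsstype (M_HASHTYPE_USER) meaning "chosen by the user,
do not recompute"; the flowid field, the rsstype, and the
M_HASHTYPE_SET() accessor are the existing mbuf(9) pieces referred to
above, and the helper name is invented for illustration.

/*
 * Sketch: stamp a user-chosen flowid on an outgoing mbuf and mark it
 * so that neither the stack nor the driver recomputes it.
 */
#include <sys/param.h>
#include <sys/mbuf.h>

#define M_HASHTYPE_USER  12     /* hypothetical rsstype, not in sys/mbuf.h */

static void
mbuf_set_user_flowid(struct mbuf *m, uint32_t flowid)
{

        m->m_pkthdr.flowid = flowid;            /* plain uint32_t, no m_flowid_t */
        M_HASHTYPE_SET(m, M_HASHTYPE_USER);     /* user-chosen, leave it alone */
}

/*
 * Consumers keep doing what they do today, e.g. a driver or lagg(4)
 * mapping the flowid to a tx queue:
 *
 *      idx = (m->m_pkthdr.flowid >> flowid_shift) % ntxq;
 */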
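
And here is a strawman for a standard rate-limit request, only to make
the discussion concrete: every name below is made up, and this is not
the in_ratectlreq from the patch. It tries to capture the knobs
mentioned above: per-queue vs. per-connection scope, and pps,
bytes-per-second, or min/max limits, with hardware assistance applied
opportunistically.

#include <sys/types.h>

enum ratectl_scope {
        RATECTL_SCOPE_QUEUE = 1,        /* a whole hardware tx ring */
        RATECTL_SCOPE_CONN,             /* a single connection (inpcb) */
};

enum ratectl_type {
        RATECTL_TYPE_PPS = 1,           /* packets per second */
        RATECTL_TYPE_BPS,               /* bytes per second */
        RATECTL_TYPE_MINMAX,            /* bounds; hw or sw picks the rate */
};

struct ratectl_req {
        enum ratectl_scope      rc_scope;
        enum ratectl_type       rc_type;
        uint32_t                rc_queue;       /* queue index, SCOPE_QUEUE only */
        uint64_t                rc_limit;       /* pps or bytes/s, per rc_type */
        uint64_t                rc_min;         /* TYPE_MINMAX only */
        uint64_t                rc_max;         /* TYPE_MINMAX only */
        uint32_t                rc_flags;       /* e.g. allow software fallback */
};

Something along these lines could back both the per-queue RATECTL
ioctls and a per-connection association with an inpcb, and a
dummynet-ish layer could service whatever the hardware cannot.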