Re: pending changes for TOE support

From: Kip Macy <kip.macy@gmail.com>
Date: Sat, 15 Dec 2007 10:40:39 -0800
> My initial feeling is that, even if an interface supports TOE, we shouldn't
> enable the capability in the enabled vector by default, as TOE bypasses
> firewall behavior, etc, and would certainly be a surprise if an admin swapped
> a chelsio card for a non-TOE supporting card.  What's your feeling on this?


The current implementation bypasses the firewall. However, this and
likely other hardware has extensive filtering support, so the bypass
isn't necessarily intrinsic to TOE.

The usage model at this moment is that the customer makes a conscious
decision to load the TOE driver and understands the implications. I
think this is quite adequate for 10GigE cards currently. However, this
will need to be revisited when these chips start showing up on
mainstream motherboards.


> + * The TOE API assumes that the TCP offload engine can offload the
> + * entire connection from setup to teardown, with some provision
> + * being made to allow the software stack to handle time wait. If
> + * the device does not meet these criteria, it is the driver's responsibility
> + * to overload the functions that it needs to in tcp_usrreqs and make
> + * its own calls to tcp_output if it needs to do so.
>
> While I'm familiar with TCP, I'm less familiar with the scope of what cards
> support for TOE.  Do we know of any cards that are less capable than the
> chelsio card in this respect, or are they all sort of on-par on that front?
> I.e., do we think the above eventuality is likely?

I don't have any way of knowing. I think it is probably safe to say
that any vendors that don't meet those criteria now will in the
future, as transistor density increases.

>
> If we don't, then one of the things I'd like to see us do is fairly carefully
> assert, at least for a few months, that TCP never "slips" into any
> transmission-related paths that could lead to truly odd and hard-to-diagnose
> behavior when running with TOE.  I.e., tcp_output, etc.


I'm happy to do that. However, I see problems introduced by offloading
connections as driver bugs, much the same as problems caused by a
driver's TCP segmentation offload or checksum offload. The problems
will be isolated to connections using a specific interface.


>
> If we do think it's likely, we don't need to address this immediately, but we
> should make sure that before we ship TOE in a release, we've thought somewhat
> more thoroughly about that case.  As long as TOE remains un-MFC'd, we don't
> find ourselves with an obligation to maintain guarantees about the interfaces,
> and that includes dealing with incompatibility :-).  Do we know if any of the
> current 10gbps vendors other than chelsio are actively looking at TOE on
> FreeBSD, and could be engaged in discussion?


As with most vendors and FreeBSD, I suspect that the two that I know
of will have zero interest until they have a prospective customer, at
which point they'll want it done yesterday.


> I think it might be useful to add a couple of paragraphs here on three topics:
>
> (1) Clarify the way in which windows are updated between the device driver and
>      the socket code, both for sending/receiving.  You talk a bit about
>      "credit", but introducing it up-front would be useful.

I didn't realize a definition was necessary. To the best of my
knowledge this is the common term used when discussing flow control;
I've seen it used for both Fibre Channel and InfiniBand. The one
ambiguity that arises is whether a credit refers to bytes or to
segments.



> (2) One of the issues I've run into in the TCP and socket code generally is
>      that there was significant lack of clarity on the "life cycle" of the set
>      of related data structures.  Could you write a bit of text about when
>      drivers will allocate state and when they will free it?  I.e., tu_attach
>      allocates state, tu_{abort,detach} free it, and TCP promises not to call
>      anything before attach or anything after abort/detach.
>
> (3) Could you talk at a high level about the ways in which TOE drivers will
>      interact with TCP?  You do it a bit in each of the sections, but if
>      there's a principle, pulling it out would be useful.  Also, you should
>      indicate whether the driver is allowed to drop the inpcb lock or not.

I've done my best to minimize changes to TCP. It is safe to assume
that the invariants are the same as those for tcp_output. I think we
should ask the author of tcp_output to document the interface,
expected state transitions, and its invariants (joke).

>
> Doing this would address a few of the comments I have below also.
>
> + * + tu_send
> + *   - tells the driver that new data may have been added to the
> + *     socket's send buffer - the driver should not fail if the
> + *     buffer is in fact unchanged
>
> I'm a bit confused by the description of the error condition here.  Could you
> clarify when a driver should return an error, and what the impact of an error
> returned will be on the connection state?  In fact, it probably makes sense to
> have an up-front comment on conventions for error-handling -- if TOE returns
> an error will that generally lead to a TCP tear-down?

The offload routines are substituted for tcp_output and thus should
interact with the stack in the same way. By extension they should have
the same failure modes and invariants.
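
To illustrate, the shape is roughly this (a sketch only, not the
actual patch; TF_TOE and t_tu are my assumed names for the offload
flag and the per-connection hook vector):

    /*
     * The stack calls the offload hook exactly where it would
     * otherwise call tcp_output(), so tu_send inherits tcp_output()'s
     * error contract: a hard error leads to the same teardown a
     * tcp_output() failure would.
     */
    static __inline int
    tcp_gen_send(struct tcpcb *tp)
    {
            if (tp->t_flags & TF_TOE)
                    return (tp->t_tu->tu_send(tp));
            return (tcp_output(tp));
    }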

>
> + *   - The driver expects the inpcb lock to be held and
>
> This comment is truncated -- is there an and?
>
> We should specify that drivers are not allowed to drop the inpcb lock if that
> is the case, FYI.
>
> + * + tu_rcvd
> + *   - returns credits to the driver and triggers window updates
> + *     to the peer (a credit is a byte in the user's receive window)
>
> Might begin with a sentence defining the notion of credit.  Is it possible to
> use tu_rcvd to reduce credit to the card if the socket buffer size is changed,
> or just increase it?
>
> + *   - the driver is expected to determine how many bytes have been
> + *     consumed and credit that back to the card so that it can grow
> + *     the window again
>
> Could you provide an example of how it is to do that -- i.e., is it just going
> to inspect so_rcv in the same way native TCP does?

Correct. It is up to the driver to maintain any ancillary state needed
to determine that.
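
As a sketch of the bookkeeping (t_toe, tp_rx_granted, and
toe_post_rx_credits are made-up names, just to illustrate):

    /*
     * tu_rcvd: look at the free space in so_rcv the same way
     * tcp_output() does when computing the advertised window, and
     * return any newly freed bytes to the card as credit.  Repeated
     * calls with no bytes consumed fall through as a no-op.
     */
    static void
    toe_tu_rcvd(struct tcpcb *tp)
    {
            struct socket *so = tp->t_inpcb->inp_socket;
            struct toepcb *toep = tp->t_toe;        /* driver-private state */
            long space, credits;

            INP_LOCK_ASSERT(tp->t_inpcb);           /* caller holds the inpcb lock */
            space = sbspace(&so->so_rcv);           /* bytes the peer may now send */
            credits = space - toep->tp_rx_granted;  /* freed since the last grant */
            if (credits > 0) {
                    toep->tp_rx_granted = space;
                    toe_post_rx_credits(toep, credits);  /* hypothetical: credit the card */
            }
    }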


>
> + *   - this function needs to correctly handle being called any number of
> + *     times without any bytes being consumed from the receive buffer.
> + *   - the driver expects the inpcb lock to be held
> + *
> + * + tu_disconnect
> + *   - tells the driver to send FIN to peer
> + *   - driver is expected to send the remaining data and then do a
> + *     clean half close
> + *   - disconnect implies at least half-close so only send, abort, and detach
> + *     are legal
>
> Could you clarify this a bit?  Do you mean that TCP guarantees that only
> tu_send, tu_abort, and tu_detach will be delivered to the driver in the
> future?

Those are the only calls that make sense, but the driver is not
expected to break if TCP makes others.


>
> + *   - the driver is expected to handle transition through the shutdown
> + *     state machine and allow the stack to support SO_LINGER.
>
> Probably worth commenting that the device driver won't detach the toe state.

>
> + *
> + * + tu_abort
> + *   - closes the connection and sends a RST to peer
> + *   - driver is expected to trigger an RST and detach the toepcb
>
> In regular TCP, the pru_abort method is only called on pending connections
> while still in the listen queues of a listen socket.  Is this true of
> tu_abort, or is tu_abort a more general method to be used to cancel
> connections?  If so, probably worth commenting on that.


tu_abort is called in place of tcp_output in pru_abort.


> + *   - no further calls are legal after abort
> + *   - the driver expects the inpcb lock to be held
> + *
> + * + tu_detach
> + *   - tells driver that the socket is going away so disconnect
> + *     the toepcb and free appropriate resources
> + *   - allows the driver to cleanly handle the case of connection state
> + *     outliving the socket
> + *   - no further calls are legal after detach
> + *   - the driver acquires the tcbinfo lock
>
> For this call, you haven't specified whether the inpcb lock is held.  If it
> is, the driver acquiring the tcbinfo lock without first dropping the inpcb
> lock would be a lock order reversal.  Should the caller instead acquire/hold
> it?

The inpcb lock no longer exists at this point.



> For the above calls, what guarantees does the TCP stack make about the
> presence of the socket, if any?

The assumptions are the same as those of tcp_output, except for
syncache_event and detach: at those points the socket does not yet
exist or no longer exists, respectively.


> These interfaces all pass the tcpcb, but in our regular TCP stack, the
> invariant is the existence of the inpcb, not the tcpcb, which may be replaced
> with a tcptw (or in one edge case, inp_ppcb may be NULL).  If there will be
> drivers in the future that implement timewait, perhaps we should be passing in
> the inpcb?

The interface is intended to be a drop-in replacement for tcp_output.

>
> + * + tu_syncache_event
> + *   - even if it is not actually needed, the driver is expected to
> + *     call syncache_add for the initial SYN and then syncache_expand
> + *     for the SYN,ACK
> + *   - tells driver that a connection either has not been added or has
> + *     been dropped from the syncache
> + *   - the driver is expected to maintain state that lives outside the
> + *     software stack so the syncache needs to be able to notify the
> + *     toe driver that the software stack is not going to create a connection
> + *     for a received SYN
> + *   - the driver is responsible for any synchronization required
>
> Presumably tu_syncache_event is called from the syncache and locks will be
> held when that happens...?

The driver doesn't care what locks the syncache holds. The syncache
event handler is responsible for acquiring any locks necessary to
synchronize with the rest of the driver for the transition from
SYN_RCVD -> ESTABLISHED.

>
> How will the race between the syncache deciding to drop a connection of its
> own accord and the hardware/driver deciding to accept be addressed, generally
> speaking?

That is a driver implementation issue. The one case to avoid is a
deadlock between the driver calling syncache_expand and the syncache
calling tu_syncache_event.
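
Something along these lines avoids it (all names hypothetical):

    /*
     * Don't hold the driver's connection lock across the call into
     * the syncache; the syncache may hold its own locks while calling
     * tu_syncache_event() back into the driver.
     */
    mtx_lock(&sc->lock);
    toe_stash_embryonic_state(sc, toep);    /* record hw state for the SYN */
    mtx_unlock(&sc->lock);                  /* drop before the upcall */
    syncache_expand(...);                   /* may recurse into the driver */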

>
> +extern struct toe_usrreqs tcp_offload_usrreqs;
>
> What is the purpose of this global?  Presumably we can have two drivers that
> both implement offload at once?

I think your follow-on reading of tcp_offload.c answers that question.


-Kip