Re: I/OAT ... Coming Soon ?

From: Andre Oppermann <andre_at_freebsd.org> Date: Fri, 16 Nov 2007 01:43:14 +0100

Jack Vogel wrote:
> On Nov 15, 2007 4:22 PM, Andre Oppermann <andre_at_freebsd.org> wrote:
>> Scott Ullrich wrote:
>>> On 11/15/07, Doug Ambrisko <ambrisko_at_ambrisko.com> wrote:
>>>> Hmm, I forgot about the 2970 which are AMD based.  Can you check the
>>>> BIOS to see if there is an option to turn it on?  I think this is an
>>>> Intel feature.  AMD might have something close?  We have one 2970
>>>> that we've played with a little but not much.  I can't say for sure
>>>> if it has it.
>>> Right you are.  As of BIOS 1.2.2 I do not see an I/OAT option.  Guess
>>> I will need to pick up a different server as we are interested in what
>>> kind of packet forwarding rate increase that this feature might bring
>>> on a heavily loaded firewall.
>> Not much.  Unless your firewall is in usermode.  Otherwise the data
>> stays in the kernel and I/OAT is of no help as no copying happens.
>> Your CPU is probably spending half of its clock cycles waiting on
>> cache misses from newly arrived packets.  Some Intel chipset integrated
>> gige ports have a cache prefetch feature (dunno whether our driver
>> supports it) that would help quite a bit for your case.
> 
> What might help this is multiqueue support on the receive AND send,
> and stack support for the same. Not sure what the stack changes
> would look like, but I know there's interest in this sort of thing, so
> naturally I'd be into it :)

Dunno if multiqueue is a big win here.  You have to make sure that
packet order is maintained, which kinda implies a single queue.  Of
course one could spread some of the load by hashing on fixed header
fields, so that all packets of a flow stay in the same queue.
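
A minimal sketch of the hashing idea, just to make it concrete.  The
struct flow_key layout, the choice of FNV-1a and the names flow_hash()
and pick_queue() are all assumptions made for illustration; none of
this is taken from an existing driver or from the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key (not a FreeBSD structure): the classic 5-tuple. */
struct flow_key {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t  proto;
};

/* Fold n bytes into a running 32-bit FNV-1a hash. */
static uint32_t
fnv1a(uint32_t h, const void *buf, size_t n)
{
	const uint8_t *p = buf;

	while (n-- > 0) {
		h ^= *p++;
		h *= 16777619u;		/* FNV prime */
	}
	return (h);
}

/* Hash field by field so struct padding never leaks into the result. */
static uint32_t
flow_hash(const struct flow_key *k)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	h = fnv1a(h, &k->src_ip, sizeof(k->src_ip));
	h = fnv1a(h, &k->dst_ip, sizeof(k->dst_ip));
	h = fnv1a(h, &k->src_port, sizeof(k->src_port));
	h = fnv1a(h, &k->dst_port, sizeof(k->dst_port));
	h = fnv1a(h, &k->proto, sizeof(k->proto));
	return (h);
}

/* Same flow, same queue: per-flow packet order is kept within a queue. */
static unsigned
pick_queue(const struct flow_key *k, unsigned nqueues)
{
	assert(nqueues > 0);
	return (flow_hash(k) % nqueues);
}
```

Every packet of a flow lands in the same queue, so ordering only has
to hold within each queue.  Note that this sketch hashes the two
directions of a connection independently, and a single heavy flow
still ends up on one queue.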

The reason a small 1GHz embedded MIPS CPU with integrated GigE ports
can do more than 1Mpps is the cache prefetching feature.  These chips
generally move the first 128 bytes of every received packet into the
L2 cache.  That is enough to cover the headers and to perform a lookup
in the routing table or the TCP/UDP control block table without much
delay.  The normal PC architecture is quite broken in that regard,
as everything that comes in through DMA sits in cold main memory.
Once the CPU wants to look at it, it has to wait an insane amount
of time, and it pays that cost for every single packet.  In pure
forwarding applications (routing) it wastes half of all CPU cycles
waiting on main memory.
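
A rough back-of-the-envelope version of that argument, as a sketch.
The function names are made up, and the miss count and miss latency
used in the usage note are illustrative assumptions, not measurements
of any particular machine.

```c
/* Cycles the CPU can spend on each packet at a given packet rate. */
static double
cycles_per_packet(double cpu_hz, double pps)
{
	return (cpu_hz / pps);
}

/*
 * Fraction of the per-packet cycle budget spent stalled on main memory,
 * assuming each miss fully stalls the CPU for miss_cycles cycles.
 * Clamped to 1.0 when the stalls alone exceed the budget.
 */
static double
stall_fraction(double cpu_hz, double pps, unsigned misses_per_pkt,
    unsigned miss_cycles)
{
	double budget = cycles_per_packet(cpu_hz, pps);
	double stalled = (double)misses_per_pkt * (double)miss_cycles;

	return (stalled >= budget ? 1.0 : stalled / budget);
}
```

For example, cycles_per_packet(1e9, 1e6) gives a budget of 1000 cycles
per packet at 1GHz and 1Mpps.  If one assumes two cold misses per
packet at 250 cycles each, stall_fraction(1e9, 1e6, 2, 250) comes out
at 0.5, i.e. half the budget is gone before any real work is done.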

-- 
Andre