Re: Can In-Kernel TLS (kTLS) work with any OpenSSL Application?

From: Neel Chauhan <nc_at_freebsd.org>
Date: Thu, 28 Jan 2021 14:33:14 -0800
Hi Mark,

Thank you so much for your response describing how QAT encryption works.

I learned that my server (HPE ProLiant ML110 Gen10) does not have QAT, 
mainly because the chipset (Intel C621) doesn't enable it.

For reference, my firewall box (Intel D-1518-based HPE ProLiant EC200a) 
probably does, but I'm not going to use it for Tor.

Tor uses 512-byte packets (a.k.a. "cells"), so even if I had QAT it 
might not help much, not to mention that Tor is single-threaded.

I think I'll stick with kTLS with AESNI when 13.0-RELEASE is out. Worst 
case scenario, I'll buy an AMD Ryzen-based PC and offload my Tor servers 
to it (assuming the latest Ryzen beats a Skylake Xeon Scalable in 
single-thread performance).

-Neel

On 2021-01-27 11:04, Mark Johnston wrote:
> On Sat, Jan 23, 2021 at 03:25:59PM +0000, Rick Macklem wrote:
>> Ronald Klop wrote:
>> >On Wed, 20 Jan 2021 21:21:15 +0100, Neel Chauhan <nc_at_freebsd.org> wrote:
>> >
>> >> Hi freebsd-current_at_,
>> >>
>> >> I know that In-Kernel TLS was merged into the FreeBSD HEAD tree a while
>> >> back.
>> >>
>> >> With 13.0-RELEASE around the corner, I'm thinking about upgrading my
>> >> home server, well if I can accelerate any SSL application.
>> >>
>> >> I'm asking because I have a home server on a symmetrical Gigabit
>> >> connection (Google Fiber/Webpass), and that server runs a Tor relay. If
>> >> you're interested in how Tor works, the EFF has a writeup:
>> >> https://www.eff.org/pages/what-tor-relay
>> >>
>> >> But the main point for you all is: Tor relays more or less deal with
>> >> thousands of TLS connections going into and out of the server.
>> >>
>> >> Would In-Kernel TLS help with an application like Tor (or even load
>> >> balancers/TLS termination), or is it more for things like web servers
>> >> sending static files via sendfile() (e.g. the CDN used by Netflix)?
>> >>
>> >> My server could also work with Intel's QuickAssist (since it has an
>> >> Intel Xeon "Scalable" CPU). Would QuickAssist SSL be more helpful here?
>> There is now qat(4), which KTLS should be able to use, but I do not
>> think it has been tested for this.  I also have no idea whether it can
>> be used effectively for userland encryption.
> 
> KTLS requires support for separate output buffers and AAD buffers,
> which I hadn't implemented in the committed driver.  I have a working
> patch which adds that, so once that's committed qat(4) could in
> principle be used with KTLS.  So far I have only tested with
> /dev/crypto and a couple of debug sysctls used to toggle between the
> different cryptop buffer layouts, not with KTLS proper.
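
For anyone else trying to follow the "separate output buffer / separate
AAD" part: as I understand the 13.0 OpenCrypto API, a KTLS-style AES-GCM
request using both layouts looks roughly like the sketch below.  This is
only an illustration written from my reading of
sys/opencrypto/cryptodev.h (not Mark's patch, and untested), so treat
the details as assumptions.

/*
 * Rough sketch: an AES-GCM request with a separate output buffer and a
 * separate AAD buffer, the two cryptop layouts KTLS needs drivers to
 * support.  Illustrative only; error handling and session teardown are
 * omitted.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <opencrypto/cryptodev.h>

static int
example_done(struct cryptop *crp)
{
	/* Runs in the driver's completion context (e.g. the qat ithread). */
	if (crp->crp_etype != 0)
		printf("crypto request failed: %d\n", crp->crp_etype);
	crypto_freereq(crp);
	return (0);
}

static int
example_encrypt(const void *key, int klen, const char *iv,
    void *aad, int aadlen, void *src, void *dst, int len)
{
	struct crypto_session_params csp;
	crypto_session_t cses;
	struct cryptop *crp;
	int error;

	memset(&csp, 0, sizeof(csp));
	csp.csp_mode = CSP_MODE_AEAD;
	/*
	 * Ask for both "separate" layouts; only drivers that advertise
	 * support for these flags should be picked for the session.
	 */
	csp.csp_flags = CSP_F_SEPARATE_OUTPUT | CSP_F_SEPARATE_AAD;
	csp.csp_cipher_alg = CRYPTO_AES_NIST_GCM_16;
	csp.csp_cipher_key = key;
	csp.csp_cipher_klen = klen;
	csp.csp_ivlen = 12;			/* 96-bit GCM nonce */

	error = crypto_newsession(&cses, &csp,
	    CRYPTOCAP_F_HARDWARE | CRYPTOCAP_F_SOFTWARE);
	if (error != 0)
		return (error);

	crp = crypto_getreq(cses, M_WAITOK);
	crp->crp_op = CRYPTO_OP_ENCRYPT;
	crp->crp_flags = CRYPTO_F_IV_SEPARATE;
	memcpy(crp->crp_iv, iv, csp.csp_ivlen);

	crypto_use_buf(crp, src, len);			/* plaintext in */
	crypto_use_output_buf(crp, dst, len + 16);	/* ciphertext + tag out */
	crp->crp_payload_start = 0;
	crp->crp_payload_length = len;
	crp->crp_payload_output_start = 0;
	crp->crp_digest_start = len;	/* tag lands after the ciphertext */
	crp->crp_aad = aad;		/* e.g. the TLS record header */
	crp->crp_aad_length = aadlen;

	crp->crp_callback = example_done;
	return (crypto_dispatch(crp));	/* completes asynchronously */
}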
> 
> qat(4) can be used by userspace via cryptodev(4).  This comes with a
> fair bit of overhead since it involves a round-trip through the kernel
> and some extra copying.  AFAIK we don't have any framework for exposing
> crypto devices directly to userspace, akin to DPDK's polling mode
> drivers or netmap.
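
To make that overhead a bit more concrete, here is roughly what a single
userspace request through cryptodev(4) looks like: one ioctl() per
operation, with the kernel copying the buffer in and out.  This is just
the standard cryptodev(4) usage pattern with placeholder key/IV values,
not something from this thread:

/*
 * Minimal cryptodev(4) round trip: open /dev/crypto, create a session,
 * and push one AES-256-CBC encryption through the kernel.  The request
 * lands on whichever driver the kernel picks (qat, aesni, cryptosoft).
 */
#include <sys/ioctl.h>
#include <crypto/cryptodev.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>

int
main(void)
{
	unsigned char key[32] = { 0 };		/* placeholder key */
	unsigned char iv[16] = { 0 };		/* placeholder IV */
	unsigned char buf[4096] = { 0 };	/* plaintext in, ciphertext out */
	struct session_op sess;
	struct crypt_op cop;
	int fd, cfd;

	fd = open("/dev/crypto", O_RDWR);
	if (fd < 0)
		err(1, "open(/dev/crypto)");
	/*
	 * Traditionally a per-process fd is cloned first; fall back to
	 * the device fd if CRIOGET is not needed on this kernel.
	 */
	if (ioctl(fd, CRIOGET, &cfd) == -1)
		cfd = fd;

	memset(&sess, 0, sizeof(sess));
	sess.cipher = CRYPTO_AES_CBC;
	sess.keylen = sizeof(key);
	sess.key = (void *)key;
	if (ioctl(cfd, CIOCGSESSION, &sess) == -1)
		err(1, "CIOCGSESSION");

	memset(&cop, 0, sizeof(cop));
	cop.ses = sess.ses;
	cop.op = COP_ENCRYPT;
	cop.len = sizeof(buf);			/* must be a multiple of 16 */
	cop.src = (void *)buf;
	cop.dst = (void *)buf;			/* in place; still copied twice */
	cop.iv = (void *)iv;
	if (ioctl(cfd, CIOCCRYPT, &cop) == -1)	/* one syscall per request */
		err(1, "CIOCCRYPT");

	return (0);
}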
> 
> I've seen a few questions about the comparative (dis)advantages of QAT
> and AES-NI so I'll sidetrack a bit and try to characterize qat(4)'s
> performance here based on some microbenchmarking I did this week.  This
> was all done in the kernel and so might need some qualification if
> you're interested in using qat(4) from userspace.  Numbers below are
> gleaned from an Atom C3558 at 2.2GHz with an integrated QAT device.  I
> mostly tested AES-CBC-256 and AES-GCM-256.
> 
> The high-level tradeoffs are:
> - qat(4) introduces a lot of latency.  For a single synchronous
>   operation it can take between 2x and 100x more time than aesni(4) to
>   complete.  aesni takes 1000-2000 cycles to handle a request plus
>   3-5 cycles per byte depending on the algorithm.  qat takes at least
>   ~150,000 cycles between calling crypto_dispatch() and the cryptop
>   completion callback, plus 5-8 cycles per byte.  qat dispatch itself
>   is quite cheap, typically 1000-2000 cycles depending on the size of
>   the buffer.  Handling a completion interrupt involves a context
>   switch to the driver ithread, but this is also a small cost relative
>   to the entire operation.  So, for anything where latency is crucial
>   QAT is probably not a great bet (some back-of-the-envelope numbers
>   follow below the table).
> - qat can save a correspondingly large number of CPU cycles.  It takes
>   qat roughly twice as long as aesni to complete encryption of a 32KB
>   buffer using AES-CBC-256 (more with GCM), but with qat the CPU is
>   idle much of the time.  Dispatching the request to firmware takes
>   less than 1% of the total time elapsed between request dispatch and
>   completion, even with small buffers.  OTOH, with really small
>   buffers aesni can complete a request in the time that it takes qat
>   just to dispatch the request to the device, so at best qat will give
>   comparable throughput and CPU usage, and worse latency.
> - qat can handle multiple requests in parallel.  This can improve
>   throughput dramatically if the producer can keep qat busy.
>   Empirically, the maximum throughput improvement is a function of the
>   request size.  For example, counting the number of cycles required to
>   encrypt 100,000 buffers using AES-GCM-256:
> 
>   max # in flight       1        16       64        128
> 
>   aesni, 16B           206M     n/a      n/a        n/a
>   aesni, 4KB          1.52B     n/a      n/a        n/a
>   aesni, 32KB         10.8B     n/a      n/a        n/a
>   qat,   16B          17.1B   1.11B     219M       184M
>   qat,   4KB          20.9B   1.68B     710M       694M
>   qat,   32KB         38.2B   8.37B    4.25B      4.23B
> 
>   As a side note, OpenCrypto supports async dispatch for software
>   crypto drivers, in which crypto_dispatch() hands work off to other
>   threads.  This is enabled by net.inet.ipsec.async_crypto, for
>   example.  Of course, the maximum parallelism is limited by the
>   number of CPUs in the system, but this can improve throughput
>   significantly as well if you're willing to spend the corresponding
>   CPU cycles.
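
Plugging Mark's rough per-request figures into a couple of buffer sizes
(including Tor's 512-byte cells) shows why the fixed cost dominates for
small requests.  The constants below are just the upper ends of the
ranges quoted above (aesni ~2,000 cycles + ~5 cycles/byte, qat ~150,000
cycles + ~8 cycles/byte, 2.2GHz clock), so treat the output as
back-of-the-envelope only:

/* Back-of-the-envelope single-request cost from the figures above. */
#include <stdio.h>

int
main(void)
{
	const double hz = 2.2e9;			/* Atom C3558 clock */
	const int sizes[] = { 512, 4096, 32768 };	/* Tor cell, 4KB, 32KB */
	int i;

	for (i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); i++) {
		double aesni = 2000.0 + 5.0 * sizes[i];	/* cycles */
		double qat = 150000.0 + 8.0 * sizes[i];	/* cycles */

		printf("%6d bytes: aesni ~%.0f cycles (%.1f us), "
		    "qat ~%.0f cycles (%.1f us)\n",
		    sizes[i], aesni, aesni / hz * 1e6, qat, qat / hz * 1e6);
	}
	return (0);
}

For a 512-byte cell that works out to roughly 2 us of aesni work versus
~70 us of qat latency, which lines up with the point above that QAT is a
poor fit for Tor's small cells.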
> 
> To summarize, QAT can be beneficial when some or all of the following
> apply:
> 1. You have large requests.  qat can give comparable throughput for
>    small requests if the producer can exploit parallelism in qat,
>    though OpenCrypto's backpressure mechanism is really primitive
>    (arguably non-existent) and performance will tank if things get to
>    a point where qat can't keep up.
> 2. You're able to dispatch requests in parallel.  But see point 1.
> 3. CPU cycles are precious and the extra latency is tolerable.
> 3b. aesni doesn't implement some transform that you care about, but qat
>     does.  Some (most?) Xeons don't implement the SHA extensions for
>     instance.  I don't have a sense for how the plain cryptosoft driver
>     performs relative to aesni though.

