Hi Mark,

Thank you so much for your response describing how QAT encryption works.

I learned that my server (HPE ProLiant ML110 Gen10) does not have QAT, mainly because its chipset (Intel C621) doesn't enable it. For reference, my firewall box (an Intel D-1518-based HPE ProLiant EC200a) probably does, but I'm not going to use it for Tor.

Tor uses 512-byte packets (a.k.a. "cells"), so even if I had QAT it may not work well, not to mention that Tor is single-threaded. I think I'll stick with kTLS with AES-NI when 13.0-RELEASE is out. Worst-case scenario, I'll buy an AMD Ryzen-based PC and offload my Tor servers to it (assuming the latest Ryzen beats a Skylake Xeon Scalable in single-thread performance).

-Neel

On 2021-01-27 11:04, Mark Johnston wrote:
> On Sat, Jan 23, 2021 at 03:25:59PM +0000, Rick Macklem wrote:
>> Ronald Klop wrote:
>> >On Wed, 20 Jan 2021 21:21:15 +0100, Neel Chauhan <nc_at_freebsd.org> wrote:
>> >
>> >> Hi freebsd-current_at_,
>> >>
>> >> I know that In-Kernel TLS was merged into the FreeBSD HEAD tree a
>> >> while back.
>> >>
>> >> With 13.0-RELEASE around the corner, I'm thinking about upgrading my
>> >> home server, well, if I can accelerate any SSL application.
>> >>
>> >> I'm asking because I have a home server on a symmetrical Gigabit
>> >> connection (Google Fiber/Webpass), and that server runs a Tor relay.
>> >> If you're interested in how Tor works, the EFF has a writeup:
>> >> https://www.eff.org/pages/what-tor-relay
>> >>
>> >> But the main point for you all is: more or less, Tor relays deal with
>> >> thousands of TLS connections going into and out of the server.
>> >>
>> >> Would In-Kernel TLS help with an application like Tor (or even load
>> >> balancers/TLS termination), or is it more for things like web servers
>> >> sending static files via sendfile() (e.g. the CDN used by Netflix)?
>> >>
>> >> My server could also work with Intel's QuickAssist (since it has an
>> >> Intel Xeon "Scalable" CPU). Would QuickAssist SSL be more helpful here?
>> There is now qat(4), which KTLS should be able to use, but I do
>> not think it has been tested for this. I also have no idea
>> whether it can be used effectively for userland encryption.
>
> KTLS requires support for separate output buffers and AAD buffers, which
> I hadn't implemented in the committed driver. I have a working patch
> which adds that, so once that's committed qat(4) could in principle be
> used with KTLS. So far I have only tested with /dev/crypto and a couple
> of debug sysctls used to toggle between the different cryptop buffer
> layouts, not with KTLS proper.
>
> qat(4) can be used by userspace via cryptodev(4). This comes with a
> fair bit of overhead, since it involves a round trip through the kernel
> and some extra copying. AFAIK we don't have any framework for exposing
> crypto devices directly to userspace, akin to DPDK's polling-mode
> drivers or netmap.
>
> I've seen a few questions about the comparative (dis)advantages of QAT
> and AES-NI, so I'll sidetrack a bit and try to characterize qat(4)'s
> performance here based on some microbenchmarking I did this week. This
> was all done in the kernel, and so might need some qualification if
> you're interested in using qat(4) from userspace. The numbers below were
> gleaned from an Atom C3558 at 2.2GHz with an integrated QAT device. I
> mostly tested AES-CBC-256 and AES-GCM-256.
>
> The high-level tradeoffs are:
> - qat(4) introduces a lot of latency. For a single synchronous
>   operation it can take between 2x and 100x more time than aesni(4) to
>   complete. aesni takes 1000-2000 cycles to handle a request, plus
>   3-5 cycles per byte depending on the algorithm. qat takes at least
>   ~150,000 cycles between calling crypto_dispatch() and the cryptop
>   completion callback, plus 5-8 cycles per byte. qat dispatch itself is
>   quite cheap, typically 1000-2000 cycles depending on the size of the
>   buffer. Handling a completion interrupt involves a context switch to
>   the driver ithread, but this is also a small cost relative to the
>   entire operation. So, for anything where latency is crucial, QAT is
>   probably not a great bet.
> - qat can save a correspondingly large number of CPU cycles. It takes
>   qat roughly twice as long as aesni to complete encryption of a 32KB
>   buffer using AES-CBC-256 (more with GCM), but with qat the CPU is idle
>   much of the time. Dispatching the request to firmware takes less than
>   1% of the total time elapsed between request dispatch and completion,
>   even with small buffers. OTOH, with really small buffers aesni can
>   complete a request in the time that it takes qat just to dispatch the
>   request to the device, so at best qat will give comparable throughput
>   and CPU usage, and worse latency.
> - qat can handle multiple requests in parallel. This can improve
>   throughput dramatically if the producer can keep qat busy.
>   Empirically, the maximum throughput improvement is a function of the
>   request size. For example, counting the number of cycles required to
>   encrypt 100,000 buffers using AES-GCM-256:
>
>   max # in flight      1      16     64     128
>
>   aesni, 16B         206M    n/a    n/a    n/a
>   aesni, 4KB         1.52B   n/a    n/a    n/a
>   aesni, 32KB        10.8B   n/a    n/a    n/a
>   qat, 16B           17.1B   1.11B  219M   184M
>   qat, 4KB           20.9B   1.68B  710M   694M
>   qat, 32KB          38.2B   8.37B  4.25B  4.23B
>
> As a side note, OpenCrypto supports async dispatch for software crypto
> drivers, in which crypto_dispatch() hands work off to other threads.
> This is enabled by net.inet.ipsec.async_crypto, for example. Of
> course, the maximum parallelism is limited by the number of CPUs in
> the system, but this can improve throughput significantly as well if
> you're willing to spend the corresponding CPU cycles.
>
> To summarize, QAT can be beneficial when some or all of the following
> apply:
> 1. You have large requests. qat can give comparable throughput for
>    small requests if the producer can exploit parallelism in qat, though
>    OpenCrypto's backpressure mechanism is really primitive (arguably
>    non-existent) and performance will tank if things get to a point
>    where qat can't keep up.
> 2. You're able to dispatch requests in parallel. But see point 1.
> 3. CPU cycles are precious and the extra latency is tolerable.
> 3b. aesni doesn't implement some transform that you care about, but qat
>     does. Some (most?) Xeons don't implement the SHA extensions, for
>     instance. I don't have a sense of how the plain cryptosoft driver
>     performs relative to aesni, though.
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscribe_at_freebsd.org"
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:27 UTC