On Sat, Jan 23, 2021 at 03:25:59PM +0000, Rick Macklem wrote:
> Ronald Klop wrote:
> >On Wed, 20 Jan 2021 21:21:15 +0100, Neel Chauhan <nc_at_freebsd.org> wrote:
> >
> >> Hi freebsd-current_at_,
> >>
> >> I know that In-Kernel TLS was merged into the FreeBSD HEAD tree a while
> >> back.
> >>
> >> With 13.0-RELEASE around the corner, I'm thinking about upgrading my
> >> home server, well if I can accelerate any SSL application.
> >>
> >> I'm asking because I have a home server on a symmetrical Gigabit
> >> connection (Google Fiber/Webpass), and that server runs a Tor relay. If
> >> you're interested in how Tor works, the EFF has a writeup:
> >> https://www.eff.org/pages/what-tor-relay
> >>
> >> But the main point for you all is: more-or-less Tor relays deal with
> >> 1000s of TLS connections going into and out of the server.
> >>
> >> Would In-Kernel TLS help with an application like Tor (or even load
> >> balancers/TLS termination), or is it more for things like web servers
> >> sending static files via sendfile() (e.g. the CDN used by Netflix)?
> >>
> >> My server could also work with Intel's QuickAssist (since it has an
> >> Intel Xeon "Scalable" CPU). Would QuickAssist SSL be more helpful here?
> There is now qat(4), which KTLS should be able to use, but I do
> not think it has been tested for this. I also have no idea
> if it can be used effectively for userland encryption?

KTLS requires support for separate output buffers and AAD buffers, which I
hadn't implemented in the committed driver. I have a working patch which adds
that, so once that's committed qat(4) could in principle be used with KTLS.
So far I have only tested with /dev/crypto and a couple of debug sysctls used
to toggle between the different cryptop buffer layouts, not with KTLS proper.

qat(4) can be used by userspace via cryptodev(4). This comes with a fair bit
of overhead since it involves a round-trip through the kernel and some extra
copying.
AFAIK we don't have any framework for exposing crypto devices directly to
userspace, akin to DPDK's polling-mode drivers or netmap.

I've seen a few questions about the comparative (dis)advantages of QAT and
AES-NI, so I'll sidetrack a bit and try to characterize qat(4)'s performance
here based on some microbenchmarking I did this week. This was all done in the
kernel and so might need some qualification if you're interested in using
qat(4) from userspace. Numbers below are gleaned from an Atom C3558 at 2.2GHz
with an integrated QAT device. I mostly tested AES-CBC-256 and AES-GCM-256.

The high-level tradeoffs are:

- qat(4) introduces a lot of latency. For a single synchronous operation it
  can take between 2x and 100x more time than aesni(4) to complete. aesni
  takes 1000-2000 cycles to handle a request, plus 3-5 cycles per byte
  depending on the algorithm. qat takes at least ~150,000 cycles between
  calling crypto_dispatch() and the cryptop completion callback, plus 5-8
  cycles per byte. qat dispatch itself is quite cheap, typically 1000-2000
  cycles depending on the size of the buffer. Handling a completion interrupt
  involves a context switch to the driver ithread, but this is also a small
  cost relative to the entire operation. So for anything where latency is
  crucial, QAT is probably not a great bet.

- qat can save a correspondingly large number of CPU cycles. It takes qat
  roughly twice as long as aesni to complete encryption of a 32KB buffer
  using AES-CBC-256 (more with GCM), but with qat the CPU is idle much of
  that time. Dispatching the request to firmware takes less than 1% of the
  total time elapsed between request dispatch and completion, even with small
  buffers. OTOH with really small buffers aesni can complete a request in the
  time that it takes qat just to dispatch the request to the device, so at
  best qat will give comparable throughput and CPU usage, with worse latency.

- qat can handle multiple requests in parallel. This can improve throughput
  dramatically if the producer can keep qat busy. Empirically, the maximum
  throughput improvement is a function of the request size. For example,
  counting the number of cycles required to encrypt 100,000 buffers using
  AES-GCM-256:

                    max # in flight
                    1       16      64      128
    aesni, 16B      206M    n/a     n/a     n/a
    aesni, 4KB      1.52B   n/a     n/a     n/a
    aesni, 32KB     10.8B   n/a     n/a     n/a
    qat, 16B        17.1B   1.11B   219M    184M
    qat, 4KB        20.9B   1.68B   710M    694M
    qat, 32KB       38.2B   8.37B   4.25B   4.23B

As a side note, OpenCrypto supports async dispatch for software crypto
drivers, in which crypto_dispatch() hands work off to other threads. This is
enabled by net.inet.ipsec.async_crypto, for example. Of course, the maximum
parallelism is limited by the number of CPUs in the system, but this can also
improve throughput significantly if you're willing to spend the corresponding
CPU cycles.

To summarize, QAT can be beneficial when some or all of the following apply:

1. You have large requests. qat can give comparable throughput for small
   requests if the producer can exploit parallelism in qat, though
   OpenCrypto's backpressure mechanism is really primitive (arguably
   non-existent) and performance will tank if things get to a point where
   qat can't keep up.
2. You're able to dispatch requests in parallel. But see point 1.
3. CPU cycles are precious and the extra latency is tolerable.
3b. aesni doesn't implement some transform that you care about, but qat does.
    Some (most?) Xeons don't implement the SHA extensions, for instance.

I don't have a sense for how the plain cryptosoft driver performs relative to
aesni, though.

Received on Wed Jan 27 2021 - 18:04:41 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:27 UTC