Re: RFC: ktls and krpc using M_EXTPG mbufs

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Mon, 27 Jul 2020 23:51:22 +0000 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:24 UTC

Andrew Gallatin wrote:
>On 2020-07-19 19:34, Rick Macklem wrote:
>> I spent a little time chasing a problem in the nfs-over-tls code, where it
>> would sometimes end up with corrupted data in the file(s) of a mirrored
>> pNFS configuration.
>>
>> I think the problem was that the code filled the data to be written into
>> anonymous page M_EXTPG mbufs, then did a m_copym() { copy by
>> reference } and used the copies for the mirrored writes.
>> --> In ktls_encrypt(), the encryption was done to the same pages and,
>>         sometimes, the encrypted data got encrypted again during the
>>         sosend() of the other copy.
>>
>> Although I haven't reproduced it, a regular kernel write RPC could suffer the
>> same consequences if the RPC is retried (it keeps an m_copym() copy
>> of the request in the krpc for an RPC retry).
>>
>> At this time, the code in projects/nfs-over-tls works correctly, since it
>> always fills the data to be written into mbuf clusters, m_copym()s those
>> and then copies those { real copying using memcpy() } via
>> mb_mapped_to_unmapped() just before calling sosend().
>> --> This works, but it would be nice to avoid the mb_mapped_to_unmapped()
>>        copying for all the data being written via an NFS over TLS connection.
>>
>> For the TCP_TLS_MODE_SW case:
>> --> The NFS code can fill the written data into anonymous pages on M_EXTPG
>>         mbufs.
>> Then, the ktls_encrypt() could be modified to
>> allocate a new set of anonymous pages for the destination side of
>> the encryption (it already does this for the sendfile case) and put those
>> in a new mbuf list.
>> --> This would result in new anonymous pages and mbufs being allocated,
>>         but would not do memcpy()s.
>> After encryption, it would just do a m_freem() on the unencrypted list.
>> --> For the krpc client case, this call would only decrement the reference
>>        count on the unencrypted list and it could be used for a retry by the krpc
>>        and then be free'd { m_freem() call } after a reply is received.
>>
>> If doing this for all the sosend()s of anonymous page M_EXTPG mbufs seems
>> like unnecessary overhead, the above could be enabled via a setsockopt()
>> on the socket.
>>
>> What do others think of this?
>
>Several comments:
>
>mb_mapped_to_unmapped() is surprisingly inexpensive.  It was less than
>5% before I converted iflib to M_NOMAP aware.
Hmm. Just wondering what the 5% refers to?
5% difference in throughput for a data stream
5% increase in CPU overheads
or ???

I do agree that, with multiple cores these days, avoiding the memcpy()s in
the client isn't that big a deal.
--> This issue is client side only. The NFS server can generate read and readdir
      replies (the only big ones) in anonymous ext_pgs mbufs now.

>It seems like NFS should be constructing mbufs like sendfile does, and
>pointing mbufs at its pages.  This would cause the crypto code to
>allocate a new set of pages upon encryption.
I suppose the ideal would be to use the pages that already hold the data
in the buffer cache, but I haven't even looked at what it might take to
do that? (The buffer cache block would have to remain busied until the
mbuf is free'd or something like that.)
I kinda plan on looking at this someday...

I suppose I could "pretend" they aren't anonymous pages by not
setting the EPG_ANON_FLAG, but that still wouldn't be enough to
fix this problem.
--> Not only does ktls_encrypt() need to use different pages, it needs
      to allocate new mbuf(s) for them, so that the unencrypted pages
      will still be associated with the mbuf list passed in.
(I don't really see "pretending" the pages aren't anonymous makes much
 difference?)

>> For the hardware offload case:
>> - Can I assume that the anonymous pages in M_EXTPG mbufs will remain
>>     unchanged?
>> --> If so, and it won't change to TCP_TLS_MODE_SW, the NFS code could
>>         fill the data to be written into M_EXTPG mbufs safely.
>>
>> - And, if so, can I safely use the ktls_session mode field to decide if offload
>>    is happening?
>>    I see the TCP_TXTLS_MODE socket opt which seems to
>>    switch the mode to TCP_TLS_MODE_SW.
>>    When does this happen? Or, can this happen to a session once in use?
>
>Yes.  The intent is to allow something (TCP stack, smart user daemon) to
>look at a connection & move it from hardware to software, if it has a
>lot of TCP re-transmits.
Ok, so I don't think the NFS code should assume the pages will remain
unencrypted, even if it appears hardware assist is being used, unless the
software case is changed.

As you note, just using mb_mapped_to_unmapped() works pretty well,
so I don't think this is something critical to do. (I have a non-working
patch. If I happen to get it working, I'll try and see what performance
difference I get.)

>Drew
Thanks for the comments, rick