Re: can buffer cache pages be used in ext_pgs mbufs?

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Tue, 11 Aug 2020 03:10:39 +0000

Konstantin Belousov wrote:
>On Mon, Aug 10, 2020 at 12:46:00AM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Fri, Aug 07, 2020 at 09:43:14PM -0700, Kirk McKusick wrote:
>> >> I do not have the answer to your question, but I am copying Kostik
>> >> as if anyone knows the answer, it is probably him.
>> >>
>> >>       ~Kirk
>> >>
>> >> =-=-=
>> >I do not know the exact answer, this is why I did not follow up on the
>> >original question on current_at_.  In particular, I have no idea about the
>> >ext_pgs mechanism.
>> >
>> >Still I can point out one semi-obvious aspect of your proposal.
>> >
>> >When the buffer is written (with bwrite()), its pages are sbusied and
>> >the write mappings of them are invalidated. The end effect is that no
>> >modifications to the pages are possible until they are unbusied. This,
>> >together with the lock of the buffer that holds the pages, effectively
>> >stops all writes either through write(2) or by mmaped regions.
>> >
>> >In other words, any access for write to the range of file designated by
>> >the buffer, causes the thread to block until the pages are unbusied and
>> >the buffer is unlocked.  In the described case, that would mean until the NFS
>> >server responds.
>> >
>> >If this is fine, then ok.
>> For what I am thinking of, I would say that is fine, since the ktls code reads
>> the pages to encrypt/send them, but can use other allocated pages for
>> the encrypted data.
>>
>> >Rick, do you know anything about the vm page lifecycle as mb_ext_pgs ?
>> Well, the anonymous pages (the only ones I've been using so far) are
>> allocated with:
>>         vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ |
>>                VM_ALLOC_NODUMP | VM_ALLOC_WIRED);
>>
>> and then the m_ext_ext_free function (mb_free_mext_pgs()) does:
>>         vm_page_unwire_noq(pg);
>>         vm_page_free(pg);
>> on each of them.
>>
>> m->m_ext_ext_free() is called in ktls_encrypt() when it no longer wants the
>> pages, but is normally called via m_free(m), which calls mb_free_extpg(m),
>> although there are a few other places.
>>
>> Since m_ext_ext_free is whatever function you want to make it, I suppose the
>> answer is "until your m_ext.ext_free function is called".
>>
>> At this time, for ktls, if you are using software encryption, the call to ktls_encrypt(),
>> which is done before passing the mbufs down to TCP is when it is done with the
>> unencrypted data pages. (I suppose there is no absolute guarantee that this
>> happens before the kernel RPC layer times out waiting for an RPC reply, but it
>> is almost inconceivable, since this happens before the RPC request is passed
>> down to TCP.)
>>
>> The case I now think is more problematic is the "hardware assist" case. Although
>> no hardware/driver yet does this afaik, I suspect that the unencrypted data page
>> mbufs could end up stuck in TCP for a long time, in case a retransmit is needed.
>>
>> So, I now think I might need to delay the bufdone() call until the m_ext_ext_free()
>> call has been done for the pages, if they are buffer cache pages?
>> --> Usually I would expect the m_ext_ext_free() call for the mbuf(s) that
>>        hold the data to be written to the server to be done long before
>>        bufdone() would be called for the buffer that is being written,
>>        but there is no guarantee.
>>
>> Am I correct in assuming that the pages for the buffer will remain valid and
>> readable through the direct map until bufdone() is called?
>> If I am correct w.r.t. this, it should work so long as the m_ext_ext_free() calls
>> for the pages happen before the bufdone() call on the bp, I think?
>
>I think there is further complication with non-anonymous pages.
>You want (or perhaps need) the page content to be immutable and not
>changed while you pass pages around and give them for ktls sw or hw
>processing.  Otherwise it could not pass the TLS authentication if
>the page was changed in the process.
>
>Similar issue exists when normal buffer writes are scheduled through
>the strategy(), and you can see that bufwrite() does vfs_busy_pages()
>with clear_modify=1, which does two things:
>- sbusy the pages (sbusy pages can get new read-only mappings, but cannot
>  be mapped rw)
>- pmap_remove_write() on the pages to invalidate all current writeable
>  mappings.
>
>This state should be kept until ktls is completely done with the pages.
I am now thinking that this is done exactly as you describe above and
doesn't require any changes.

The change I am planning is below the strategy routine in the function
that does the write RPC.
It currently copies the data from the buffer into mbuf clusters.
After this change, it would put the physical page #s for the buffer in the
mbuf(s) and then wait for them all to be m_ext_ext_free()d before calling
bufdone().
--> The only difference is the wait before the bufdone() call in the RPC layer
       below the strategy routine. (bufdone() is the only call the NFS client
       seems to do below the strategy routine, so I assume it ends the state
       you describe above?)
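
To make the ordering above concrete, here is a minimal userland sketch of
the bookkeeping I have in mind. It is only a model, not kernel code: every
name in it (model_buf, model_attach_pages(), model_ext_free(),
model_rpc_reply()) is made up for illustration, the "done" flag merely
stands in for bufdone() having been called, and the wait is modelled as a
deferred completion in a single thread, with no sleeping or locking.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userland model of "delay bufdone() until every page has
 * been ext_free()d". None of these names are FreeBSD kernel interfaces.
 */
struct model_buf {
	int	pages_outstanding;	/* buffer pages still referenced by mbufs */
	bool	rpc_replied;		/* the write RPC reply has arrived */
	bool	done;			/* stands in for bufdone() having run */
};

/* Complete the buffer only when both conditions hold. */
static void
model_maybe_bufdone(struct model_buf *bp)
{
	if (bp->rpc_replied && bp->pages_outstanding == 0)
		bp->done = true;
}

/* The write RPC puts npages of the buffer's pages into mbufs. */
static void
model_attach_pages(struct model_buf *bp, int npages)
{
	bp->pages_outstanding += npages;
}

/* Stand-in for the per-page ext_free callback. */
static void
model_ext_free(struct model_buf *bp)
{
	assert(bp->pages_outstanding > 0);
	bp->pages_outstanding--;
	model_maybe_bufdone(bp);
}

/* Stand-in for the RPC layer receiving the server's reply. */
static void
model_rpc_reply(struct model_buf *bp)
{
	bp->rpc_replied = true;
	model_maybe_bufdone(bp);
}
```

In this model the buffer is completed by whichever event happens last, so
a reply that arrives while TCP still holds a page for a possible retransmit
does not end the busy state early.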

rick