On Tue, Aug 11, 2020 at 03:10:39AM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Mon, Aug 10, 2020 at 12:46:00AM +0000, Rick Macklem wrote:
> >> Konstantin Belousov wrote:
> >> >On Fri, Aug 07, 2020 at 09:43:14PM -0700, Kirk McKusick wrote:
> >> >> I do not have the answer to your question, but I am copying Kostik
> >> >> as if anyone knows the answer, it is probably him.
> >> >>
> >> >> ~Kirk
> >> >>
> >> >> =-=-=
> >> >I do not know the exact answer, which is why I did not follow up on the
> >> >original question on current_at_. In particular, I have no idea about the
> >> >ext_pgs mechanism.
> >> >
> >> >Still, I can point out one semi-obvious aspect of your proposal.
> >> >
> >> >When the buffer is written (with bwrite()), its pages are sbusied and
> >> >the write mappings of them are invalidated. The end effect is that no
> >> >modifications to the pages are possible until they are unbusied. This,
> >> >together with the lock of the buffer that holds the pages, effectively
> >> >stops all writes either through write(2) or by mmaped regions.
> >> >
> >> >In other words, any access for write to the range of the file designated by
> >> >the buffer causes the thread to block until the pages are unbusied and
> >> >the buffer is unlocked. Which, in the described case, would mean until the
> >> >NFS server responds.
> >> >
> >> >If this is fine, then ok.
> >> For what I am thinking of, I would say that is fine, since the ktls code reads
> >> the pages to encrypt/send them, but can use other allocated pages for
> >> the encrypted data.
> >>
> >> >Rick, do you know anything about the vm page lifecycle when used as mb_ext_pgs?
> >> Well, the anonymous pages (the only ones I've been using so far) are
> >> allocated with:
> >>     vm_page_alloc(NULL, 0, VM_ALLOC_NORMAL | VM_ALLOC_NOOBJ |
> >>         VM_ALLOC_NODUMP | VM_ALLOC_WIRED);
> >>
> >> and then the m_ext.ext_free function (mb_free_mext_pgs()) does:
> >>     vm_page_unwire_noq(pg);
> >>     vm_page_free(pg);
> >> on each of them.
> >>
> >> m->m_ext.ext_free() is called in ktls_encrypt() when it no longer wants the
> >> pages, but is normally called via m_free(m), which calls mb_free_extpg(m),
> >> although there are a few other places.
> >>
> >> Since m_ext.ext_free is whatever function you want to make it, I suppose the
> >> answer is "until your m_ext.ext_free function is called".
> >>
> >> At this time, for ktls, if you are using software encryption, the call to ktls_encrypt(),
> >> which is done before passing the mbufs down to TCP, is when it is done with the
> >> unencrypted data pages. (I suppose there is no absolute guarantee that this
> >> happens before the kernel RPC layer times out waiting for an RPC reply, but it
> >> is almost inconceivable, since this happens before the RPC request is passed
> >> down to TCP.)
> >>
> >> The case I now think is more problematic is the "hardware assist" case. Although
> >> no hardware/driver yet does this afaik, I suspect that the unencrypted data page
> >> mbufs could end up stuck in TCP for a long time, in case a retransmit is needed.
> >>
> >> So, I now think I might need to delay the bufdone() call until the m_ext.ext_free()
> >> call has been done for the pages, if they are buffer cache pages?
> >> --> Usually I would expect the m_ext.ext_free() call for the mbuf(s) that
> >>     hold the data to be written to the server to be done long before
> >>     bufdone() would be called for the buffer that is being written,
> >>     but there is no guarantee.
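A rough sketch of what such a "signal completion instead of freeing" m_ext free
routine might look like for buffer cache pages follows; it is not code from the
thread or the tree. The struct nfs_extpg_done bookkeeping structure and the
function name are made up for illustration, it is assumed that m_ext.ext_arg1 is
available to carry the tracking pointer for an M_EXTPG mbuf, and the
vm_page_unwire_noq()/vm_page_free() calls quoted above are deliberately not made
here because the pages still belong to the buffer.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mbuf.h>
#include <sys/mutex.h>

/* Hypothetical bookkeeping shared between the write RPC and its mbufs. */
struct nfs_extpg_done {
    u_int       npending;   /* M_EXTPG mbufs still referencing bp's pages */
    struct mtx  mtx;
};

/*
 * Hypothetical m_ext.ext_free routine for M_EXTPG mbufs whose pages belong
 * to a buffer cache buffer.  Unlike mb_free_mext_pgs(), it must not free
 * the pages; it only records that the network/ktls side is finished with
 * this mbuf.
 */
static void
nfs_free_bcache_extpgs(struct mbuf *m)
{
    struct nfs_extpg_done *nd;

    nd = m->m_ext.ext_arg1;     /* assumed set when the mbuf was built */
    mtx_lock(&nd->mtx);
    if (--nd->npending == 0)
        wakeup(nd);             /* last mbuf; the RPC code may bufdone() */
    mtx_unlock(&nd->mtx);
}

The point of the sketch is only the division of labour being discussed: for
buffer cache pages, ext_free signals, and releasing the pages remains the job of
the buffer cache once bufdone() is eventually called.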
> >>
> >> Am I correct in assuming that the pages for the buffer will remain valid and
> >> readable through the direct map until bufdone() is called?
> >> If I am correct w.r.t. this, it should work so long as the m_ext.ext_free() calls
> >> for the pages happen before the bufdone() call on the bp, I think?
> >
> >I think there is a further complication with non-anonymous pages.
> >You want (or perhaps need) the page content to be immutable and not
> >changed while you pass pages around and give them to ktls sw or hw
> >processing. Otherwise it could not pass TLS authentication if a
> >page was changed in the process.
> >
> >A similar issue exists when normal buffer writes are scheduled through
> >the strategy(), and you can see that bufwrite() does vfs_busy_pages()
> >with clear_modify=1, which does two things:
> >- sbusy the pages (sbusy pages can get new read-only mappings, but cannot
> >  be mapped rw)
> >- pmap_remove_write() on the pages to invalidate all current writeable
> >  mappings.
> >
> >This state should be kept until ktls is completely done with the pages.
> I am now thinking that this is done exactly as you describe above and
> doesn't require any changes.
>
> The change I am planning is below the strategy routine, in the function
> that does the write RPC.
> It currently copies the data from the buffer into mbuf clusters.
> After this change, it would put the physical page #s for the buffer in the
> mbuf(s) and then wait for them all to be m_ext.ext_free()d before calling
> bufdone().
> --> The only difference is the wait before the bufdone() call in the RPC layer
>     below the strategy routine. (bufdone() is the only call the NFS client
>     seems to do below the strategy routine, so I assume it ends the state
>     you describe above?)
>
As long as the pages are put into mbuf clusters only after the bwrite() that did
vfs_busy_pages(), and bufdone() is called no earlier than when the network has
finished with the mbufs, it should be ok.
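To make that ordering concrete, here is a rough sketch of how the wait before
bufdone() might look in the function below the strategy routine that does the
write RPC. All of the nfs*/ncl* names are hypothetical (nfsm_build_extpg_mbufs()
and nfs_send_writerpc() stand in for the real mbuf-building and RPC code, and
struct nfs_extpg_done is repeated from the earlier sketch); bufdone(), msleep(),
and the mutex primitives are the existing kernel interfaces, and vfs_busy_pages()
is assumed to have already been done by bufwrite() before the strategy routine
was entered.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/lock.h>
#include <sys/mbuf.h>
#include <sys/mutex.h>
#include <sys/priority.h>

/* Repeated from the earlier sketch. */
struct nfs_extpg_done {
    u_int       npending;   /* M_EXTPG mbufs still referencing bp's pages */
    struct mtx  mtx;
};

/* Hypothetical helpers; stand-ins for the real NFS client code. */
static struct mbuf *nfsm_build_extpg_mbufs(struct buf *bp,
    struct nfs_extpg_done *nd);
static int nfs_send_writerpc(struct mbuf *mreq);

static int
ncl_writerpc_extpg(struct buf *bp)
{
    struct nfs_extpg_done nd;
    struct mbuf *mreq;
    int error;

    mtx_init(&nd.mtx, "nfsepg", NULL, MTX_DEF);
    nd.npending = 0;

    /*
     * Build M_EXTPG mbufs whose page array refers to bp's pages
     * (already sbusied and write-protected by vfs_busy_pages()),
     * installing nfs_free_bcache_extpgs() as ext_free, &nd as
     * ext_arg1, and bumping nd.npending once per mbuf.
     */
    mreq = nfsm_build_extpg_mbufs(bp, &nd);

    /* Send the write RPC and wait for the reply (details elided). */
    error = nfs_send_writerpc(mreq);

    /*
     * Delay bufdone() until every ext_free callback has run, i.e.
     * until ktls/TCP is completely finished with the buffer's pages.
     */
    mtx_lock(&nd.mtx);
    while (nd.npending > 0)
        msleep(&nd, &nd.mtx, PVFS, "nfsepg", 0);
    mtx_unlock(&nd.mtx);
    mtx_destroy(&nd.mtx);

    bufdone(bp);
    return (error);
}

How this wait should interact with RPC retransmits that re-reference the same
pages is exactly the open question in the thread; the sketch only shows the
"bufdone() last" ordering that both sides agree is required.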