Re: posix_fallocate on ZFS

From: Patrick Kelsey <pkelsey_at_freebsd.org>
Date: Tue, 13 Feb 2018 21:11:34 -0500

On Mon, Feb 12, 2018 at 12:04 PM, John Baldwin <jhb_at_freebsd.org> wrote:

> On Saturday, February 10, 2018 01:46:33 PM Garrett Wollman wrote:
> > In article
> > <CAOtMX2jZr_kvJgOZWeiB-AZ3-7-uUu+UQ3P0nKhGZ0eNRzwMOQ_at_mail.gmail.com>,
> > asomers_at_freebsd.org writes:
> >
> > >On Sat, Feb 10, 2018 at 10:28 AM, Willem Jan Withagen <wjw_at_digiware.nl>
> > >wrote:
> >
> > >> Is there any expectation that this is going to be fixed in the near
> > >> future?
> >
> > >No.  It's fundamentally impossible to support posix_fallocate on a COW
> > >filesystem like ZFS.  Ceph should be taught to ignore an EINVAL result,
> > >since the system call is merely advisory.
> >
> > I don't think it's true that this is _fundamentally_ impossible.  What
> > the standard requires would in essence be a per-object refreservation.
> > ZFS supports refreservation, obviously, but not on a per-object basis.
> > Furthermore, there are mechanisms to preallocate blocks for things
> > like dumps.  So it *could* be done (as in, the concept is there), but
> > it may not be practical.  (And ultimately, there are ways in which the
> > administrator might manage the system that would defeat the desired
> > effect, but that's out of the standard's scope.)  Given the semantic
> > mismatch, though, I suspect it's unreasonable to expect anyone to
> > prioritize implementation of such a feature.
>
> I don't think posix_fallocate() can be compatible with COW.  Suppose you
> do reserve a fixed set of blocks.  That ensures the first write has a
> place to write, but not if you overwrite one of those blocks.  You'd have
> to reserve another block to maintain the reservation each time you wrote
> to a block, or you'd have to have a way to mark a file as not COW.  The
> first case isn't really any better than not using posix_fallocate() in the
> first place as you are still requiring writes to allocate blocks, and the
> second seems a bit fraught with peril as well if the application is
> expecting the non-COW'd file to be in sync with other files in the system
> since presumably non-COW'd files couldn't be snapshotted, etc.
>
>
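
For reference, the application-side suggestion quoted above (treat EINVAL
from posix_fallocate() as "not supported here" and carry on) might look
roughly like the following Python sketch.  The function name
preallocate_best_effort and its injectable fallocate parameter are
hypothetical names chosen for illustration; os.posix_fallocate is the
standard-library wrapper around the system call.  Whether skipping the
preallocation is acceptable depends on the application.

```python
import errno
import os


def preallocate_best_effort(fd, offset, length, fallocate=os.posix_fallocate):
    """Try to preallocate space; report False if the file system declines.

    Hypothetical helper: `fallocate` is injectable only so the behaviour
    can be exercised without a file system that rejects the call.
    """
    try:
        fallocate(fd, offset, length)
        return True
    except OSError as exc:
        # EINVAL is what the thread reports for ZFS; treat it as
        # "preallocation unavailable" rather than a fatal error.
        if exc.errno == errno.EINVAL:
            return False
        # Anything else (ENOSPC, EBADF, ...) is a real failure.
        raise
```

A caller would log the False case and continue, accepting that later
writes may fail with ENOSPC.
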
I think Garrett's assessment is correct: it is not fundamentally
impossible, but any given file system may decide it is not worth
implementing for practical reasons.  I say this having designed and
implemented a COW file system where customer pressure drove features that
at first pass looked like architectural contradictions, but that on
further reflection were entirely possible given sufficient willingness to
invest the effort and accept the accompanying trade-offs, additional knobs
to turn, etc.

In this case (posix_fallocate() + COW + snapshots), it could be implemented
with a per-object allocator that normally keeps at least one extra block on
hand beyond the reservation requirement, plus a snapshot operation that
succeeds only if it can provision the local allocators of all fallocated
objects with enough additional blocks to maintain the no-fail write
guarantee post-snapshot.
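
As a rough illustration only, here is a toy Python model of that scheme.
Nothing in it comes from ZFS or any real file system: Pool,
FallocatedObject, and the three block states are made-up names, and the
model counts blocks rather than tracking real extents.  It assumes a
single fallocated object and that an overwritten block not referenced by
a snapshot returns to the object's local allocator.

```python
import errno


class Pool:
    """Shared free space (stand-in for the file system's global allocator)."""

    def __init__(self, free_blocks):
        self.free = free_blocks

    def take(self, n):
        if n > self.free:
            raise OSError(errno.ENOSPC, "pool exhausted")
        self.free -= n


class FallocatedObject:
    UNWRITTEN, LIVE, PINNED = range(3)  # PINNED: shared with a snapshot

    def __init__(self, pool, nblocks):
        # Reservation plus one spare, so a COW overwrite always has a target.
        pool.take(nblocks + 1)
        self.pool = pool
        self.local_free = nblocks + 1
        self.state = [self.UNWRITTEN] * nblocks

    def write(self, i):
        # Invariant: local_free >= unwritten + pinned + 1, so this never fires.
        if self.local_free < 1:
            raise OSError(errno.ENOSPC, "reservation invariant broken")
        self.local_free -= 1            # the new copy of the block
        if self.state[i] == self.LIVE:
            self.local_free += 1        # old copy is unreferenced; recycle it
        self.state[i] = self.LIVE       # PINNED old copies stay with the snapshot

    def snapshot(self):
        live = self.state.count(self.LIVE)
        # Every live block is about to be pinned; provision one replacement
        # for each up front.  If the pool cannot, the snapshot fails and
        # the object's state is left untouched.
        self.pool.take(live)
        self.local_free += live
        self.state = [self.PINNED if s == self.LIVE else s for s in self.state]
```

With a pool of 10 blocks and a 4-block object, the first snapshot after
writing every block takes 4 more blocks from the pool; a second one then
fails with ENOSPC while writes to the object keep succeeding, which is
the trade-off described above.
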

-Patrick
Received on Wed Feb 14 2018 - 01:11:37 UTC