Re: should a copy_file_range(2) syscall be interrupted via a signal

From: Konstantin Belousov <kostikbel_at_gmail.com> Date: Sat, 6 Jul 2019 00:13:09 +0300

On Fri, Jul 05, 2019 at 08:59:23PM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Fri, Jul 05, 2019 at 07:30:54PM +0200, Jilles Tjoelker wrote:
> >> On Fri, Jul 05, 2019 at 12:28:51AM +0000, Rick Macklem wrote:
> >> > I have been working on a Linux compatible copy_file_range(2) syscall
> >> > (the current code can be found at https://reviews.freebsd.org/D20584).
> >>
> >> > One outstanding issue is how it should deal with signals. Right now, I
> >> > have vn_start_write() without PCATCH, so that it won't be interrupted
> >> > by a signal, but I notice that vn_write() (i.e. the write syscall) does
> >> > have PCATCH on vn_start_write() and so does vn_rdwr() when it is
> >> > called without IO_NODELOCKED.
> >>
> >> A regular write() is only interruptible when writing to a terminal,
> >> pseudo-terminal master, pipe, socket, or, under certain conditions, a
> >> file on an NFS intr mount. Therefore, applications may not have the code
> >> to resume interrupted writes to regular files gracefully.
> Yes, agreed. Since this syscall only works on VREG vnodes, the only weird cases
> are NFS (and maybe fuse). I'll let asomers_at_ address the fuse situation.
> 
> >>
> >> > I am thinking that copy_file_range(2) should do this also.
> >> > However, if it returns an error, it is impossible for the caller to
> >> > know how much of the data range got copied.
> >>
> >> A regular write() returns partial success if interrupted by a signal
> >> when it has already written something. Therefore, the application can
> >> resume the operation by adjusting pointers and counts.
> >>
> >> Something similar applies to "deterministic" errors like [EFBIG] where
> >> the first call will write as far as possible (if this is not nothing)
> >> successfully and the next attempt will return the error.
> >>
> >> > What do you think the copy_file_range(2) code should do?
> >>
> >> I'm not sure it should actually be done, but the need for adjusting
> >> pointers and counts could be avoided with a little extra kernel and libc
> >> code. The system call would receive an additional argument pointing to
> >> an off_t that indicates how many bytes previous calls have already
> >> written. A libc wrapper would initialize this to 0. With this, the
> >> system call can be restarted automatically after a signal.
> >>
> >> In any case, [EINTR] and the internal ERESTART must not be returned
> >> unless it is safe to repeat the call with the same (direct) arguments.
> Well, since the copy_file_range(2) syscall is allowed to return fewer bytes copied
> than requested and this doesn't mean EOF, it seems that doing that would
> achieve the result of allowing an application to call it again.
> (Basically, it must be used in a loop until the bytes of the range have been copied,
>  since returning fewer bytes copied than requested is a normal outcome.)
> 
> >BTW, if the syscall is made interruptible, it should be made cancellable ?
> Not sure what you mean by "cancellable"? If you mean "terminated by a signal
> where there has been no change to the output file", then that could only easily be
> done by returning EINTR before any data has been copied.
> If you mean something else, then I'd need to know what that is?
See pthread_setcancelstate(3) for a start, but the POSIX 1003.1-2017
2.9.5 Thread Cancellation is the definitive spec, including the quite
readable overview.
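
For a rough idea of what cancellation state means in practice, here is a
minimal sketch, assuming a POSIX threads environment. write(2) is a
cancellation point, and a cancellable copy_file_range(2) would presumably
behave like one too. The helper name write_uncancellable() is made up for
this illustration and is not part of any library.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_uncancellable() is a hypothetical helper, not a libc function.
 * It performs one write(2) with thread cancellation disabled, so a
 * pending pthread_cancel(3) cannot act inside the write, and then
 * restores the caller's previous cancellation state.
 *
 * Returns 0 or a pthread error number.  The write(2) result is stored
 * in *written, and errno is left as write(2) set it.
 */
static int
write_uncancellable(int fd, const void *buf, size_t len, ssize_t *written)
{
	int error, oldstate, unused, saved_errno;

	error = pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
	if (error != 0)
		return (error);

	*written = write(fd, buf, len);
	saved_errno = errno;

	/* Put back whatever state the caller had before. */
	error = pthread_setcancelstate(oldstate, &unused);
	errno = saved_errno;
	return (error);
}
```

A caller that wants a cancel request to take effect right after the
protected region can call pthread_testcancel(3) once the helper returns.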

> 
> >I think that PCATCH commonly used for vn_start_write(9) is not the best
> >decision.  It is safe in the sense explained by Jilles, since its interruption
> >only happens at the very beginning of the syscall, but it contradicts the
> >tradition of write(2) to the local fs being not interruptible.
> >
> >I suggest not making the syscall interruptible by default, and perhaps
> >only allow it with a flag.  Then you would need to explain that the
> >syscall is only interruptible between VOPs, it is up to fs to decide if
> >the VOP_READ/VOP_WRITE is interruptible (e.g. devfs and nfs).
> This is how it is coded now. The one thing I have noticed is that a
> copy_file_range() can take a long time (about 2min for 2Gbytes on the old hardware
> I test on). This seems like a long delay for <ctrl>C when you do that to an application
> copying a large file. ("cp" and "dd" also take 2min for 2Gbytes, so it isn't a bug
> in copy_file_range(2). It just introduces a long delay in response to <ctrl>C.)
That long delay is an inconvenience, but not something that we should spend
too much time trying to fix. We cause the same delay if a program does a
write(2) of several GB, or when a very large process like firefox dumps
core.
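
The loop Rick describes in the quoted text above, where a short copy is a
normal outcome and the caller simply calls again, might look roughly like
the sketch below. It assumes the Linux-style prototype with NULL offset
pointers, so that the descriptors' own file offsets advance and a repeated
call resumes where the previous one stopped. The helper name
copy_range_fully() and the choice to retry on EINTR are assumptions made
for this illustration, not something taken from D20584.

```c
#define _GNU_SOURCE	/* needed for copy_file_range() on glibc */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * copy_range_fully() is a hypothetical helper, not part of any libc.
 * It copies up to len bytes from infd to outfd, starting at the current
 * file offsets of both descriptors, and keeps calling
 * copy_file_range(2) until the range is done, the input reaches EOF,
 * or an error occurs.
 *
 * Returns the number of bytes copied.  Like write(2), it reports
 * partial success if an error happens after some data was copied;
 * it returns -1 with errno set only if nothing was copied.
 */
static off_t
copy_range_fully(int infd, int outfd, off_t len)
{
	off_t done = 0;

	while (done < len) {
		ssize_t n = copy_file_range(infd, NULL, outfd, NULL,
		    (size_t)(len - done), 0);

		if (n > 0) {
			done += n;	/* short copy is normal; go again */
			continue;
		}
		if (n == 0)
			break;		/* end of the input file */
		if (errno == EINTR)
			continue;	/* offsets unchanged; safe to repeat */
		return (done > 0 ? done : -1);
	}
	return (done);
}
```

An application that wants <ctrl>C to stop the copy would check a flag set
by its signal handler instead of unconditionally retrying on EINTR.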