Re: should a copy_file_range(2) syscall be interrupted via a signal

From: Rick Macklem <>
Date: Fri, 5 Jul 2019 20:59:23 +0000
Konstantin Belousov wrote:
>On Fri, Jul 05, 2019 at 07:30:54PM +0200, Jilles Tjoelker wrote:
>> On Fri, Jul 05, 2019 at 12:28:51AM +0000, Rick Macklem wrote:
>> > I have been working on a Linux compatible copy_file_range(2) syscall
>> > (the current code can be found at
>> > One outstanding issue is how it should deal with signals. Right now, I
>> > have vn_start_write() without PCATCH, so that it won't be interrupted
>> > by a signal, but I notice that vn_write() (i.e. the write syscall) does
>> > have PCATCH on vn_start_write() and so does vn_rdwr() when it is
>> > called without IO_NODELOCKED.
>> A regular write() is only interruptible when writing to a terminal,
>> pseudo-terminal master, pipe, socket, or, under certain conditions, a
>> file on an NFS intr mount. Therefore, applications may not have the code
>> to resume interrupted writes to regular files gracefully.
Yes, agreed. Since this syscall only works on VREG vnodes, the only weird cases
are NFS (and maybe fuse). I'll let asomers_at_ address the fuse situation.

>> > I am thinking that copy_file_range(2) should do this also.
>> > However, if it returns an error, it is impossible for the caller to
>> > know how much of the data range got copied.
>> A regular write() returns partial success if interrupted by a signal
>> when it has already written something. Therefore, the application can
>> resume the operation by adjusting pointers and counts.
>> Something similar applies to "deterministic" errors like [EFBIG] where
>> the first call will write as far as possible (if this is not nothing)
>> successfully and the next attempt will return the error.
>> > What do you think the copy_file_range(2) code should do?
>> I'm not sure it should actually be done, but the need for adjusting
>> pointers and counts could be avoided with a little extra kernel and libc
>> code. The system call would receive an additional argument pointing to
>> an off_t that indicates how many bytes previous calls have already
>> written. A libc wrapper would initialize this to 0. With this, the
>> system call can be restarted automatically after a signal.
>> In any case, [EINTR] and the internal ERESTART must not be returned
>> unless it is safe to repeat the call with the same (direct) arguments.
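Jilles's proposal can be illustrated in user space. In the sketch below, hyp_copy_step() and hyp_copy_range() are hypothetical names standing in for the proposed system call and its libc wrapper (no such interface exists); pread()/pwrite() do the copying, and an "interruption" is simulated after every 4 KiB chunk so that the restart path is exercised. Because progress lives in *done, repeating the call with the same direct arguments is always safe:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

/*
 * HYPOTHETICAL stand-in for the proposed system call (no such call
 * exists).  *done holds the bytes copied by earlier attempts, so the
 * direct arguments never change between attempts.  To mimic a signal
 * arriving between chunks, each attempt copies at most one chunk and
 * then fails with EINTR if work remains.
 */
static ssize_t hyp_copy_step(int infd, off_t inoff, int outfd, off_t outoff,
                             size_t len, off_t *done)
{
    char buf[CHUNK];
    size_t left = len - (size_t)*done;
    size_t want = left < CHUNK ? left : CHUNK;

    if (want > 0) {
        ssize_t n = pread(infd, buf, want, inoff + *done);
        if (n < 0)
            return -1;
        if (n == 0)             /* EOF: report what was copied */
            return (ssize_t)*done;
        ssize_t w = pwrite(outfd, buf, (size_t)n, outoff + *done);
        if (w < 0)
            return -1;
        *done += w;             /* record progress before "interrupting" */
        if ((size_t)*done < len) {
            errno = EINTR;
            return -1;
        }
    }
    return (ssize_t)*done;
}

/*
 * HYPOTHETICAL libc wrapper: starts the counter at 0 and restarts the
 * identical call after EINTR, as an automatic restart would.
 */
static ssize_t hyp_copy_range(int infd, off_t inoff, int outfd, off_t outoff,
                              size_t len)
{
    off_t done = 0;
    ssize_t r;

    do {
        r = hyp_copy_step(infd, inoff, outfd, outoff, len, &done);
    } while (r < 0 && errno == EINTR);
    assert(r < 0 || (size_t)r <= len);
    return r;
}
```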
Well, since the copy_file_range(2) syscall is allowed to return fewer bytes copied
than requested and this doesn't mean EOF, it seems that simply returning a short count
when a signal arrives would already allow an application to call it again.
(Basically, it must be used in a loop until the bytes of the range have been copied,
 since returning fewer bytes copied than requested is a normal outcome.)
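A sketch of the calling loop described above, assuming copy_file_range(2) as provided on Linux (glibc 2.27 or later) and FreeBSD 13.0 or later. copy_range_all() and copy_file() are hypothetical helper names. Passing NULL offset pointers makes the call use and advance the descriptors' own file offsets:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len bytes from infd to outfd with copy_file_range(2),
 * using and advancing both file offsets (NULL offset pointers).
 * A short count is a normal outcome, not EOF, so keep calling until
 * the range is done; only a return of 0 means end of input.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_range_all(int infd, int outfd, size_t len)
{
    size_t total = 0;

    while (total < len) {
        ssize_t n = copy_file_range(infd, NULL, outfd, NULL, len - total, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* nothing copied by this call */
            return -1;
        }
        if (n == 0)             /* EOF on infd */
            break;
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/* Copy the regular file "from" to "to" (created or truncated). */
static int copy_file(const char *from, const char *to)
{
    struct stat st;
    int in, out, ret = -1;

    if ((in = open(from, O_RDONLY)) < 0)
        return -1;
    /* As noted in the thread, the call only handles regular (VREG) files. */
    if (fstat(in, &st) == 0 && S_ISREG(st.st_mode) &&
        (out = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0644)) >= 0) {
        if (copy_range_all(in, out, (size_t)st.st_size) == st.st_size)
            ret = 0;
        close(out);
    }
    close(in);
    return ret;
}
```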

>BTW, if the syscall is made interruptible, should it be made cancellable?
I'm not sure what you mean by "cancellable". If you mean "terminated by a signal
where there has been no change to the output file", then that could only easily be
done by returning EINTR before any data has been copied.
If you mean something else, then I'd need to know what that is.

>I think that PCATCH commonly used for vn_start_write(9) is not the best
>decision.  It is safe in the sense explained by Jilles, since its interruption
>only happens at the very beginning of the syscall, but it contradicts the
>tradition of write(2) to a local fs not being interruptible.
>I suggest not making the syscall interruptible by default, and perhaps
>only allowing it with a flag.  Then you would need to explain that the
>syscall is only interruptible between VOPs, and that it is up to the fs to
>decide whether VOP_READ/VOP_WRITE is interruptible (e.g. devfs and nfs).
This is how it is coded now. The one thing I have noticed is that a
copy_file_range() can take a long time (about 2 min for 2 Gbytes on the old hardware
I test on), which means a long delay before <ctrl>C takes effect on an application
copying a large file. ("cp" and "dd" also take 2 min for 2 Gbytes, so it isn't a bug
in copy_file_range(2); it just introduces a long delay in response to <ctrl>C.)
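One way an application could bound the <ctrl>C delay on its own is to copy in fixed-size chunks and check for a signal between them. This is a user-space analogue of being "interruptible between VOPs", and it follows the rule from the thread: EINTR only when nothing has been copied, otherwise a short count. A sketch, assuming a 1 MiB chunk (an arbitrary choice) and a SIGINT handler that only sets a flag; copy_interruptible() and on_sigint() are hypothetical names:

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK ((size_t)1 << 20)    /* assumed 1 MiB per step */

static volatile sig_atomic_t interrupted;

/* Install with sigaction(SIGINT, ...); the handler only sets a flag. */
static void on_sigint(int sig)
{
    (void)sig;
    interrupted = 1;
}

/*
 * Copy up to len bytes, checking for a signal only between chunks.
 * Returns -1/EINTR only if nothing was copied yet; otherwise returns
 * the (possibly short) count, and *inoff/*outoff say where to resume.
 */
static ssize_t copy_interruptible(int infd, off_t *inoff, int outfd,
                                  off_t *outoff, size_t len)
{
    size_t total = 0;

    while (total < len) {
        if (interrupted) {
            if (total == 0) {
                errno = EINTR;
                return -1;
            }
            break;              /* partial success */
        }
        size_t want = len - total < CHUNK ? len - total : CHUNK;
        ssize_t n = copy_file_range(infd, inoff, outfd, outoff, want, 0);
        if (n < 0)              /* report progress first; a later call
                                   will return the error */
            return total > 0 ? (ssize_t)total : -1;
        if (n == 0)             /* EOF on infd */
            break;
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

A smaller chunk shortens the response to a signal at the cost of more system calls per copy.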

Received on Fri Jul 05 2019 - 18:59:25 UTC
