On Fri, Jan 04, 2019 at 07:56:42AM +0100, Michal Meloun wrote:
> On 29.12.2018 18:47, Dennis Clarke wrote:
> > On 12/28/18 9:56 PM, Mark Millard via freebsd-arm wrote:
> >>
> >> On 2018-Dec-28, at 12:12, Mark Millard <marklmi at yahoo.com> wrote:
> >>
> >>> On 2018-Dec-28, at 05:13, Michal Meloun <melounmichal at gmail.com>
> >>> wrote:
> >>>
> >>>> Mark,
> >>>> this is a known problem with qemu-user-static.
> >>>> Emulation of every single interruptible syscall is broken by design
> >>>> (it has signal-related races). These races cannot be solved without
> >>>> a major rewrite of the syscall emulation code.
> >>>> Unfortunately, nobody actively works on this, I think.
> >>>>
> > Following along here quietly and I had to blink at this a few times.
> > Is there a bug report somewhere within the qemu world related to this
> > 'broken by design' qemu feature?
> Firstly, I apologize for the late answer. Writing a technically accurate
> but still comprehensible report is extremely difficult for me.
> The major design issue with qemu-user is the fact that guest (blocking /
> interruptible) syscalls must be emulated atomically, including the
> delivery of asynchronous signals (including signals originating from
> another thread).
> This is something that cannot be emulated precisely by a user mode
> program without specific kernel support. Let me explain this in a
> little more detail.
> [snip]
> This looks much better. The code blocks all signals first, then checks
> if any signal is pending. If yes, it does a non-blocking select()
> (because the timeout is zero) and correctly returns EINTR immediately.
> Otherwise, it uses the other variant of select(), pselect(), which
> installs the right signal mask itself.
> That means the syscall is called with signal delivery blocked, but the
> kernel installs the right sigmask before it waits for the event. While
> this looks like a perfect solution, and this code closes all the races
> from the first version, it still doesn't work: pselect() has different
> semantics than select(); it doesn't update the timeout argument. So this
> solution is also inappropriate.

FreeBSD select() never updates the passed timeout. When emulating Linux
syscalls, this will have to be done manually.

> Moreover, I think we don't have p<foo> equivalents for all blocking
> syscalls.

We definitely do not. For example, open() has no equivalent with a signal
mask.

> Mark, I hope that this is also the answer to your question posted to
> hackers_at_ and also the explanation of why you see the hang.
> Linux uses a different approach to overcome this issue, safe_syscall ->
> https://gitlab.collabora.com/tomeu/qemu/commit/4d330cee37a21aabfc619a1948953559e66951a4
> It looks like a workable workaround, but I'm not sure about ERESTART
> versus EINTR return values. Imho, this can be a problem.

This looks like a reasonable solution. Musl libc uses the same approach to
implement pthread cancellation (where, with the default "deferred"
cancellation type, cancellation takes effect at cancellation points only,
which include most blocking system calls; if a cancellation request comes
in at the same time as a blocking cancellation point system call starts,
the same race condition needs to be avoided).

As for ERESTART vs EINTR, EINTR can be treated like any other error. On
the other hand, ERESTART (or variants like ERESTARTSYS) is never returned
by the kernel to userland, but instead causes the kernel to rewind the
program counter (so the system call instruction will be executed again)
just before invoking the signal handler.
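To make that concrete, here is a minimal stand-alone sketch (plain POSIX C,
not qemu/bsd-user code) of the behaviour the emulator has to reproduce: a
blocking read() is interrupted by SIGALRM, and whether the kernel
transparently restarts it or fails it with EINTR depends on whether the
handler was installed with SA_RESTART.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig)
{
    (void)sig;              /* only here to interrupt the blocking read() */
}

int main(void)
{
    struct sigaction sa;
    int fds[2];
    char buf[1];
    ssize_t n;

    if (pipe(fds) == -1)
        return 1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;        /* change to SA_RESTART to see the restart case */
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return 1;

    if (fork() == 0) {      /* child: supply one byte after two seconds */
        sleep(2);
        (void)write(fds[1], "x", 1);
        _exit(0);
    }

    alarm(1);               /* SIGALRM arrives while read() is blocked */
    n = read(fds[0], buf, sizeof(buf));
    if (n == -1)
        printf("read: %s\n", strerror(errno));    /* EINTR: not restarted */
    else
        printf("read returned %zd byte(s)\n", n); /* restarted by the kernel */
    return 0;
}

Without SA_RESTART the read() fails with EINTR after about one second; with
SA_RESTART the kernel rewinds and re-executes the system call, so read()
returns the child's byte after about two seconds and the interruption is
never visible to the program.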
Therefore, when the host kernel rewinds qemu's program counter like this,
qemu must do the same to the guest. If a signal is delivered just before
qemu makes a system call on behalf of the guest, this may look like
ERESTART. This is fine, since it looks the same as if the signal had been
delivered just before the guest's system call instruction.

The approach used by FreeBSD libc to implement pthread cancellation
(thr_wake(2) on self in the signal handler) will not let you distinguish
between ERESTART and EINTR, so you would have to replicate that
determination (which typically, but not always, depends on the signal's
SA_RESTART flag and on which system call it is). Therefore, I would not
recommend that approach.

> I have a list of other qemu-user problems (I mean mainly the bsd-user
> part of the qemu code here), not counting normal coding bugs:
> - the code is not thread safe but is used in a threaded environment
> (rw locks, for example),
> - emulating some sysctls and resource limits / usage behavior is very
> hard (mainly if we emulate a 32-bit guest on a 64-bit host)

In many such cases, the proper behaviour can be found in the kernel code
(when a 64-bit kernel needs to handle a system call from a 32-bit process).
I expect problems with getdirentries() and struct dirent.d_off on
filesystems that return hashed filenames as positions.

> - if the host syscall returns ERESTART, we should do a full unroll and
> pass it to the guest.

Yes (with the above-mentioned caveats about how ERESTART is reported).

> - the syscall emulation should not use the libc functions, but the
> syscall instruction directly. Libc shims can have side effects, so we
> should not execute them twice: once in the guest, a second time in
> the host.

If you accept that your code is going to be more tightly coupled to libc
and the kernel than most applications, calling system calls directly
should be fine. This will also allow you to install your own handler for
SIGTHR if you do not want to remap it. Do not expect pthread cancellation
and suspension to work properly in such a configuration, though.

> - and the last major one: at this time, all guest structures are
> maintained by hand. Due to the huge number of these structures, this is
> an extremely error-prone approach. We should convert this to
> script-generated code, including the guest syscall definitions.

Definitions of system calls are in syscalls.master and should be
automatically processable; definitions of types are in header files and
cannot really be processed other than by a C compiler.

> Again, my apologies for the slightly (or very) chaotic report, but this
> is the best I am capable of.

It was clear enough to me.

-- 
Jilles Tjoelker