On 3/9/2017 3:47 PM, Bryan Drewery wrote: > On 3/9/2017 3:11 PM, Jilles Tjoelker wrote: >> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote: >>> Yes, there is a race, apparently, with the child zombie still not finishing >>> sending the SIGCHLD to the parent and parent exiting. The following should >>> fix the issue, but I do not think that reproducing the problem is easy. >> >>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c >>> index c524fe5df37..ba5ff84e9de 100644 >>> --- a/sys/kern/kern_exit.c >>> +++ b/sys/kern/kern_exit.c >>> _at__at_ -189,6 +189,7 _at__at_ exit1(struct thread *td, int rval, int signo) >>> { >>> struct proc *p, *nq, *q, *t; >>> struct thread *tdt; >>> + ksiginfo_t ksi; >>> >>> mtx_assert(&Giant, MA_NOTOWNED); >>> KASSERT(rval == 0 || signo == 0, ("exit1 rv %d sig %d", rval, signo)); >>> _at__at_ -456,7 +457,12 _at__at_ exit1(struct thread *td, int rval, int signo) >>> proc_reparent(q, q->p_reaper); >>> if (q->p_state == PRS_ZOMBIE) { >>> PROC_LOCK(q->p_reaper); >>> - pksignal(q->p_reaper, SIGCHLD, q->p_ksi); >>> + if (q->p_ksi != NULL) { >>> + ksiginfo_init(&ksi); >>> + ksiginfo_copy(q->p_ksi, &ksi); >>> + } >>> + pksignal(q->p_reaper, SIGCHLD, q->p_ksi != >>> + NULL ? &ksi : NULL); >>> PROC_UNLOCK(q->p_reaper); >>> } >>> } else { > > I just got something weird with this patch that wasn't happening before: > > /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j > exp-10amd64 -p commit -z test devel/ccache > [poudriere runs and completes with exit status 0] >> time: command terminated abnormally >> 28.08 real 9.92 user 10.38 sys >> 23464 maximum resident set size >> 4996 average shared memory size >> 88 average unshared data size >> 127 average unshared stack size >> 282705 page reclaims >> 5623 page faults >> 0 swaps >> 2673 block input operations >> 4836 block output operations >> 33 messages sent >> 0 messages received >> 37 signals received >> 11226 voluntary context switches >> 780 involuntary context switches >> zsh: alarm /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j exp-10amd64 > exit status: 142 (SIGALRM). > > I don't see time(1) using SIGALRM or proc reaper at all. > > Rerunning it, and trying other simpler test cases, does not produce the > same result. It may be some race unrelated to this patch, dunno. > I'm consistently getting foreground processes getting the wrong signals now. I'm removing this patch for now. > >> >> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginfo >> should always be the zombie's p_ksi; otherwise, the siginfo may be lost >> if there are too many signals pending for the target process or in the >> system. If the siginfo is lost and the reaper normally passes si_pid to >> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie >> will remain until the reaper terminates. >> >> Conceptually the siginfo is sent to one process at a time only, so the >> bug is an artifact of the implementation. Perhaps the piece of code >> added in r309886 can be moved or the ksiginfo can be removed from the >> parent's queue. >> >> If such a fix is not possible, it may be better to send a bare SIGCHLD >> (si_code is SI_KERNEL or 0, depending on how many signals are pending) >> in this situation and document that reapers must use WAIT_ANY or P_ALL. >> (However, compared to the pre-r309886 situation they can still use >> SIGCHLD to get notified when to call waitpid() or similar.) >> > > -- Regards, Bryan Drewery
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:10 UTC