Re: r314708: panic: tdsendsignal: ksi on queue

From: Bryan Drewery <bdrewery_at_FreeBSD.org> Date: Thu, 9 Mar 2017 15:47:32 -0800 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:10 UTC

On 3/9/2017 3:11 PM, Jilles Tjoelker wrote:
> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote:
>> Yes, there is a race, apparently, with the child zombie still not finishing
>> sending the SIGCHLD to the parent and parent exiting.  The following should
>> fix the issue, but I do not think that reproducing the problem is easy.
> 
>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
>> index c524fe5df37..ba5ff84e9de 100644
>> --- a/sys/kern/kern_exit.c
>> +++ b/sys/kern/kern_exit.c
>> _at__at_ -189,6 +189,7 _at__at_ exit1(struct thread *td, int rval, int signo)
>>  {
>>  	struct proc *p, *nq, *q, *t;
>>  	struct thread *tdt;
>> +	ksiginfo_t ksi;
>>  
>>  	mtx_assert(&Giant, MA_NOTOWNED);
>>  	KASSERT(rval == 0 || signo == 0, ("exit1 rv %d sig %d", rval, signo));
>> _at__at_ -456,7 +457,12 _at__at_ exit1(struct thread *td, int rval, int signo)
>>  			proc_reparent(q, q->p_reaper);
>>  			if (q->p_state == PRS_ZOMBIE) {
>>  				PROC_LOCK(q->p_reaper);
>> -				pksignal(q->p_reaper, SIGCHLD, q->p_ksi);
>> +				if (q->p_ksi != NULL) {
>> +					ksiginfo_init(&ksi);
>> +					ksiginfo_copy(q->p_ksi, &ksi);
>> +				}
>> +				pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=
>> +				    NULL ? &ksi : NULL);
>>  				PROC_UNLOCK(q->p_reaper);
>>  			}
>>  		} else {

I just got something weird with this patch that wasn't happening before:

/usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j
exp-10amd64 -p commit -z test devel/ccache
[poudriere runs and completes with exit status 0]
> time: command terminated abnormally                                                         
>        28.08 real         9.92 user        10.38 sys                                        
>      23464  maximum resident set size                                                       
>       4996  average shared memory size                                                      
>         88  average unshared data size                                                      
>        127  average unshared stack size                                                     
>     282705  page reclaims                                                                   
>       5623  page faults                                                                     
>          0  swaps                                                                           
>       2673  block input operations                                                          
>       4836  block output operations                                                         
>         33  messages sent                                                                   
>          0  messages received                                                               
>         37  signals received                                                                
>      11226  voluntary context switches                                                      
>        780  involuntary context switches                                                    
> zsh: alarm      /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j exp-10amd64
exit status: 142 (SIGALRM).

I don't see time(1) using SIGALRM or proc reaper at all.

Rerunning it, and trying other simpler test cases, does not produce the
same result.  It may be some race unrelated to this patch, dunno.

> 
> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginfo
> should always be the zombie's p_ksi; otherwise, the siginfo may be lost
> if there are too many signals pending for the target process or in the
> system. If the siginfo is lost and the reaper normally passes si_pid to
> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie
> will remain until the reaper terminates.
> 
> Conceptually the siginfo is sent to one process at a time only, so the
> bug is an artifact of the implementation. Perhaps the piece of code
> added in r309886 can be moved or the ksiginfo can be removed from the
> parent's queue.
> 
> If such a fix is not possible, it may be better to send a bare SIGCHLD
> (si_code is SI_KERNEL or 0, depending on how many signals are pending)
> in this situation and document that reapers must use WAIT_ANY or P_ALL.
> (However, compared to the pre-r309886 situation they can still use
> SIGCHLD to get notified when to call waitpid() or similar.)
> 

-- 
Regards,
Bryan Drewery