Re: r314708: panic: tdsendsignal: ksi on queue

From: Bryan Drewery <bdrewery_at_FreeBSD.org>
Date: Thu, 9 Mar 2017 16:59:06 -0800
On 3/9/2017 3:57 PM, Bryan Drewery wrote:
> On 3/9/2017 3:47 PM, Bryan Drewery wrote:
>> On 3/9/2017 3:11 PM, Jilles Tjoelker wrote:
>>> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote:
>>>> Yes, there is a race, apparently, with the child zombie still not finishing
>>>> sending the SIGCHLD to the parent and parent exiting.  The following should
>>>> fix the issue, but I do not think that reproducing the problem is easy.
>>>
>>>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
>>>> index c524fe5df37..ba5ff84e9de 100644
>>>> --- a/sys/kern/kern_exit.c
>>>> +++ b/sys/kern/kern_exit.c
>>>> _at__at_ -189,6 +189,7 _at__at_ exit1(struct thread *td, int rval, int signo)
>>>>  {
>>>>  	struct proc *p, *nq, *q, *t;
>>>>  	struct thread *tdt;
>>>> +	ksiginfo_t ksi;
>>>>  
>>>>  	mtx_assert(&Giant, MA_NOTOWNED);
>>>>  	KASSERT(rval == 0 || signo == 0, ("exit1 rv %d sig %d", rval, signo));
>>>> _at__at_ -456,7 +457,12 _at__at_ exit1(struct thread *td, int rval, int signo)
>>>>  			proc_reparent(q, q->p_reaper);
>>>>  			if (q->p_state == PRS_ZOMBIE) {
>>>>  				PROC_LOCK(q->p_reaper);
>>>> -				pksignal(q->p_reaper, SIGCHLD, q->p_ksi);
>>>> +				if (q->p_ksi != NULL) {
>>>> +					ksiginfo_init(&ksi);
>>>> +					ksiginfo_copy(q->p_ksi, &ksi);
>>>> +				}
>>>> +				pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=
>>>> +				    NULL ? &ksi : NULL);
>>>>  				PROC_UNLOCK(q->p_reaper);
>>>>  			}
>>>>  		} else {
>>
>> I just got something weird with this patch that wasn't happening before:
>>
>> /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j
>> exp-10amd64 -p commit -z test devel/ccache
>> [poudriere runs and completes with exit status 0]
>>> time: command terminated abnormally                                                         
>>>        28.08 real         9.92 user        10.38 sys                                        
>>>      23464  maximum resident set size                                                       
>>>       4996  average shared memory size                                                      
>>>         88  average unshared data size                                                      
>>>        127  average unshared stack size                                                     
>>>     282705  page reclaims                                                                   
>>>       5623  page faults                                                                     
>>>          0  swaps                                                                           
>>>       2673  block input operations                                                          
>>>       4836  block output operations                                                         
>>>         33  messages sent                                                                   
>>>          0  messages received                                                               
>>>         37  signals received                                                                
>>>      11226  voluntary context switches                                                      
>>>        780  involuntary context switches                                                    
>>> zsh: alarm      /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j exp-10amd64
>> exit status: 142 (SIGALRM).
>>
>> I don't see time(1) using SIGALRM or proc reaper at all.
>>
>> Rerunning it, and trying other simpler test cases, does not produce the
>> same result.  It may be some race unrelated to this patch, dunno.
>>
> 
> I'm consistently getting foreground processes getting the wrong signals
> now. I'm removing this patch for now.

It wasn't this patch doing this. Something else is very wrong with
signal handling right now.

> 
>>
>>>
>>> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginfo
>>> should always be the zombie's p_ksi; otherwise, the siginfo may be lost
>>> if there are too many signals pending for the target process or in the
>>> system. If the siginfo is lost and the reaper normally passes si_pid to
>>> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie
>>> will remain until the reaper terminates.
>>>
>>> Conceptually the siginfo is sent to one process at a time only, so the
>>> bug is an artifact of the implementation. Perhaps the piece of code
>>> added in r309886 can be moved or the ksiginfo can be removed from the
>>> parent's queue.
>>>
>>> If such a fix is not possible, it may be better to send a bare SIGCHLD
>>> (si_code is SI_KERNEL or 0, depending on how many signals are pending)
>>> in this situation and document that reapers must use WAIT_ANY or P_ALL.
>>> (However, compared to the pre-r309886 situation they can still use
>>> SIGCHLD to get notified when to call waitpid() or similar.)
>>>
>>
>>
> 
> 


-- 
Regards,
Bryan Drewery


Received on Thu Mar 09 2017 - 23:59:40 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:10 UTC