Re: Race condition in debugger?

From: David Xu <davidxu_at_freebsd.org> Date: Mon, 18 Apr 2005 13:26:44 +0800

Peter Edwards wrote:

>[Very late response: I just experienced the same problem and
>remembered the issue had been brought up before]
>
>On 2/14/05, Greg 'groggy' Lehey <grog_at_freebsd.org> wrote:
>  
>
>>I'm having some problems with userland gdb on recent -CURRENT builds:
>>at some point it hangs.
>>
>>Specifically, I'm setting a conditional breakpoint like this:
>>
>>  b Minsert_blockletpointer if I->inode_num == 0x1f0bb
>>
>>inode_num increments by 1, so I hit this breakpoint about 100,000
>>times.  Or I should.  What happens is that the debugger hangs at some
>>point on the way.  ktrace shows multiple copies of:
>>
>> 12325 gdb      CALL  ptrace(12,0x3026,0xbfbfd5e0,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  ptrace(PT_STEP,0x3026,0x1,0)
>> 12325 gdb      RET   ptrace 0
>> 12325 gdb      CALL  wait4(0xffffffff,0xbfbfd808,0,0)  <-- stops here
>> 12325 gdb      RET   wait4 12326/0x3026
>> 12325 gdb      CALL  kill(0x3026,0)
>> 12325 gdb      RET   kill 0
>> 12325 gdb      CALL  ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
>>
>>When it hangs, it's at the call to wait4, as shown.  It looks like the
>>completion of the ptrace request isn't being reported back.
>>    
>>
>
>I think I know what's going on with this, and I have a feeling that
>there's a couple of other wait()-related issues that were left open on
>the lists that might be explained by the issue.
>
>Here's my hypothesis: kern_wait() checks each child of the current
>process to see if it has exited, or should otherwise report status
>to wait/wait3/wait4/waitpid. If it finds that all candidate children
>have nothing to report, it goes to sleep, waiting to be woken by a
>child reporting status, and then repeats the process. It looks a bit
>like this:
>
>kern_wait()
>{
>loop:
>    foreach child of self {
>        if (child has status to report)
>            return status;
>    }
>    /* race window: a child can report status and issue its wakeup here */
>    lock self
>    msleep(on "self")
>    unlock self
>    goto loop;
>}
>
>The problem is that no lock guarantees that the conditions checked in
>the inner loop still hold by the time the current process locks its own
>"struct proc" and invokes msleep(). (The race is probably most likely
>on an SMP machine or with PREEMPTION, but acquiring curproc's lock
>could also cause the issue if it needed to sleep.) That is, you can
>miss the wakeup generated by a particular child between checking that
>child in the inner loop and going to sleep.
>
>I can at least reproduce this for the ptrace/gdb case, but AFAICT, it
>could happen for the standard wait()/exit() path, too. I worked up a
>patch to fix the problem: those parts of the kernel that wake the
>parent up record the fact in the parent's flags and do the wakeup
>while holding the parent process lock, and kern_wait() checks whether
>the flag has been set before sleeping. (A simpler solution would be to
>hold the parent lock across the bulk of kern_wait, but from what I can
>gather this will lead to at least one LOR.)
>
>I've been unable to reproduce the problem with a kernel with this
>patch, and a nice sprinkling of printfs shows that when GDB hangs,
>the race has just occurred.
>
>Anyone got opinions on this?
>Cheers,
>Peadar.
>  
>
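The flag-and-recheck approach described in the quoted message can be
sketched in userland with POSIX threads. This is only an analogy under
stated assumptions, not the actual kernel patch: struct parent,
child_event, child_report(), parent_wait() and run_demo() are all
hypothetical names, the mutex stands in for the parent's process lock,
and pthread_cond_wait() stands in for msleep().

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for the parent's "struct proc". */
struct parent {
    pthread_mutex_t lock;         /* plays the role of the parent process lock */
    pthread_cond_t  cv;           /* plays the role of the sleep/wakeup channel */
    bool            child_event;  /* hypothetical "a child reported status" flag */
    int             child_status; /* status value posted by the child */
};

/* Child side: set the flag and issue the wakeup while holding the parent's lock. */
static void child_report(struct parent *p, int status)
{
    pthread_mutex_lock(&p->lock);
    p->child_status = status;
    p->child_event = true;
    pthread_cond_signal(&p->cv);
    pthread_mutex_unlock(&p->lock);
}

/* Parent side: test the flag under the lock, and sleep only if it is clear. */
static int parent_wait(struct parent *p)
{
    int status;

    pthread_mutex_lock(&p->lock);
    while (!p->child_event)
        pthread_cond_wait(&p->cv, &p->lock); /* drops the lock while asleep */
    p->child_event = false;
    status = p->child_status;
    pthread_mutex_unlock(&p->lock);
    return status;
}

/* Thread body for the simulated child: report an arbitrary status of 42. */
static void *child_main(void *arg)
{
    child_report((struct parent *)arg, 42);
    return NULL;
}

/* Run one child against one waiting parent; returns the status or -1 on error. */
int run_demo(void)
{
    struct parent p;
    pthread_t child;
    int status;

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.cv, NULL);
    p.child_event = false;
    p.child_status = 0;

    if (pthread_create(&child, NULL, child_main, &p) != 0)
        return -1;
    status = parent_wait(&p);
    pthread_join(child, NULL);

    pthread_cond_destroy(&p.cv);
    pthread_mutex_destroy(&p.lock);
    return status;
}
```

Because the flag is only set and tested with the lock held, the parent
in this sketch sees the child's report whether it arrives before or
after the parent starts waiting; it never sleeps through a wakeup that
has already happened.
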
If the parent has PS_NOCLDSTOP set, no SIGCHLD will be sent to the
parent, so there is a race in that case. But if PS_NOCLDSTOP is not
set, the signal will be sent to the parent, and the parent should
resume from msleep() immediately.
I don't know why it still has a race; I am looking at the code. I think
stop() should be merged into thread_stopped(), since it has no other
caller.

David Xu