Re: Race condition in debugger?

From: Peter Edwards <peadar.edwards_at_gmail.com>
Date: Mon, 18 Apr 2005 01:34:19 +0100

[Very late response: I just experienced the same problem and
remembered the issue had been brought up before]

On 2/14/05, Greg 'groggy' Lehey <grog_at_freebsd.org> wrote:
> I'm having some problems with userland gdb on recent -CURRENT builds:
> at some point it hangs.
> 
> Specifically, I'm setting a conditional breakpoint like this:
> 
>   b Minsert_blockletpointer if I->inode_num == 0x1f0bb
> 
> inode_num increments for 1, so I hit this breakpoint about 100,000
> times.  Or I should.  What happens is that the debugger hangs at some
> point on the way.  ktrace shows multiple copies of:
> 
>  12325 gdb      CALL  ptrace(12,0x3026,0xbfbfd5e0,0)
>  12325 gdb      RET   ptrace 0
>  12325 gdb      CALL  ptrace(PT_STEP,0x3026,0x1,0)
>  12325 gdb      RET   ptrace 0
>  12325 gdb      CALL  wait4(0xffffffff,0xbfbfd808,0,0)  <-- stops here
>  12325 gdb      RET   wait4 12326/0x3026
>  12325 gdb      CALL  kill(0x3026,0)
>  12325 gdb      RET   kill 0
>  12325 gdb      CALL  ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
> 
> When it hangs, it's at the call to wait4, as shown.  It looks like the
> completion of the ptrace request isn't being reported back.

I think I know what's going on with this, and I have a feeling that
a couple of other wait()-related issues that were left open on the
lists might be explained by the same problem.

Here's my hypothesis: kern_wait() checks each child of the current
process to see if it has exited, or should otherwise report status
to wait/wait3/wait4/waitpid. If it finds that no candidate child
has anything to report, it goes to sleep, waiting to be awoken by a
child reporting status, and then repeats the process. It looks a bit
like this:

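kern_wait()
{
loop:
    /* the scan runs without holding self's proc lock */
    foreach child of self {
        if (child has status to report)
            return status;
    }
    /* race window: a child can report status and issue its wakeup here */
    lock self
    msleep(on "self")    /* sleeps even if that wakeup has already fired */
    unlock self
    goto loop;
}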

The problem is that no lock guarantees that the conditions checked in
the inner loop still hold by the time the current process locks its own
"struct proc" and invokes msleep(), i.e., you can miss the wakeup
generated by a particular child between checking that child in the
inner loop and going to sleep. (The race is probably most likely on
an SMP machine or with PREEMPTION, but acquiring curproc's lock could
possibly trigger it too, if that acquisition needed to sleep.)
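To make the window concrete, here is a small single-threaded replay of
the three possible orderings. It is only a sketch of the hypothesis
above, not kernel code: struct sim, child_reports() and parent_hangs()
are names made up for the illustration, and the one assumption baked in
is that a wakeup does nothing unless the parent is already asleep.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved in the race. */
struct sim {
    bool child_has_status;  /* what the inner loop of kern_wait() looks at */
    bool parent_asleep;     /* true once the parent is inside msleep() */
};

/* When the child reports status, relative to the parent's progress. */
enum when { BEFORE_SCAN, IN_WINDOW, AFTER_SLEEP };

/* The child posts its status and issues a wakeup. The wakeup only has
 * an effect if the parent is already asleep; otherwise it is lost. */
static void child_reports(struct sim *s)
{
    s->child_has_status = true;
    if (s->parent_asleep)
        s->parent_asleep = false;
}

/* Replays one ordering; returns true if the parent is left sleeping. */
static bool parent_hangs(enum when w)
{
    struct sim s = { false, false };

    if (w == BEFORE_SCAN)
        child_reports(&s);
    if (s.child_has_status)     /* the unlocked scan over the children */
        return false;           /* status found, wait returns */
    if (w == IN_WINDOW)
        child_reports(&s);      /* wakeup fires with nobody asleep */
    s.parent_asleep = true;     /* lock self; msleep() */
    if (w == AFTER_SLEEP)
        child_reports(&s);      /* wakeup finds the sleeper */
    return s.parent_asleep;
}
```

Only parent_hangs(IN_WINDOW) leaves the parent asleep with status
pending; the other two orderings complete normally.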

I can at least reproduce this for the ptrace/gdb case, but AFAICT, it
could happen for the standard wait()/exit() path, too. I worked up a
patch to fix the problem: those parts of the kernel that wake the
parent up set a flag in the parent's flags and do the wakeup while
holding the parent process lock, and kern_wait() checks whether this
flag has been set before sleeping. (A simpler solution would be to
hold the parent lock across the bulk of kern_wait, but from what I can
gather this will lead to at least one LOR, i.e. lock order reversal.)
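For anyone who wants to play with the idea outside the kernel, here is
a userland analogue of the flag scheme using pthreads. To be clear,
this is not the patch: struct parent, its woken field, child_report(),
parent_wait() and run_demo() are all invented for the sketch, a mutex
and condition variable stand in for the proc lock and msleep()/wakeup(),
and an atomic int stands in for child state that is scanned without the
parent lock held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parent's struct proc. */
struct parent {
    pthread_mutex_t lock;          /* plays the role of the proc lock */
    pthread_cond_t  cv;            /* plays the role of the sleep channel */
    bool            woken;         /* "a child woke us" flag, guarded by lock */
    atomic_int      child_status;  /* 0 means nothing to report */
};

/* Child side: post status, then flag and wake the parent under its lock. */
static void child_report(struct parent *p, int status)
{
    assert(status != 0);
    atomic_store(&p->child_status, status);
    pthread_mutex_lock(&p->lock);
    p->woken = true;
    pthread_cond_broadcast(&p->cv);
    pthread_mutex_unlock(&p->lock);
}

/* Parent side: unlocked scan, then sleep only if no wakeup was flagged. */
static int parent_wait(struct parent *p)
{
    for (;;) {
        int st = atomic_exchange(&p->child_status, 0);
        if (st != 0)
            return st;
        /* A report landing here sets woken, so the sleep below is skipped. */
        pthread_mutex_lock(&p->lock);
        while (!p->woken)
            pthread_cond_wait(&p->cv, &p->lock);
        p->woken = false;          /* consume the flag and rescan */
        pthread_mutex_unlock(&p->lock);
    }
}

static void *child_main(void *arg)
{
    child_report(arg, 42);
    return NULL;
}

/* Runs one parent/child pair; returns the reported status, or -1. */
static int run_demo(void)
{
    struct parent p;
    pthread_t child;

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.cv, NULL);
    p.woken = false;
    atomic_init(&p.child_status, 0);
    if (pthread_create(&child, NULL, child_main, &p) != 0)
        return -1;
    int st = parent_wait(&p);
    pthread_join(child, NULL);
    pthread_cond_destroy(&p.cv);
    pthread_mutex_destroy(&p.lock);
    return st;
}
```

Calling run_demo() from a main() should return the child's status
whichever side gets there first. A flag left set by a report that the
scan already picked up only costs one extra pass through the loop.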

I've been unable to reproduce the problem on a kernel with this
patch, and a nice sprinkling of printfs shows that when GDB hangs,
the race has just occurred.

Anyone got opinions on this?
Cheers,
Peadar.