ULE can leak TDQ_LOCK() if statclock() called outside of critical_enter()

From: Ryan Stone <rysto32_at_gmail.com>
Date: Fri, 18 Jan 2013 11:56:27 -0500
I have been experiencing occasional deadlocks on FreeBSD 8.2 systems using
the ULE scheduler.  The root cause in every case has been that ULE's
TDQ_LOCK for cpu 0 is owned by a thread that is not running.  I have been
investigating the issue, and I believe that I see the issue.  The problem
occurs if the interrupt that drives statclock does not call
critical_enter() upon calling into statclock().  The lapic timer does use
critical_enter(), so default configurations would not see this.  I have
local patches to use the RTC to drive statclock, and from a quick reading
of the eventtimer code in -CURRENT the same thing is possible there.  The
RTC code does not call statclock within a critical section.  So here's the
bug:

1) A thread with interrupts enabled, running on CPU 0, with td_owepreempt=1
and td_critnest=1 calls critical_exit():

void
critical_exit(void)
{
   // ...
    if (td->td_critnest == 1) {
        td->td_critnest = 0;
        if (td->td_owepreempt && !kdb_active) {
// Irrelevant bits snipped

2) td_critnest is set to 0, and then the RTC interrupt fires.

3) rtcintr calls into statclock (8.2) or statclock_cnt(head) with
td_critnest still 0 (on head it goes through the eventtimer code, but it
ends up in statclock eventually).

4) statclock takes the thread_lock on curthread, then calls sched_clock().
sched_clock calls sched_balance();

static void
sched_balance(void)
{
    // snip...
    tdq = TDQ_SELF();
    TDQ_UNLOCK(tdq);
    sched_balance_group(cpu_top);
    TDQ_LOCK(tdq);
}

TDQ_UNLOCK does a spinlock_exit which does a critical_exit.  td_critnest
will be decremented back to 0 and td_owepreempt is still 1, so this
triggers a preemption.  Note that this TDQ_UNLOCK is (intentionally)
unlocking the thread_lock done by statclock.

5) thread migrates to any other CPU, call it CPU n.  tdq is now stale.
TDQ_LOCK takes the lock for CPU 0 (but really it's intending to re-take the
thread_lock, although a thread_lock() here would be equally incorrect --
sched_balance's caller is going to be mucking around with the TDQ when
sched_balance returns).

6) The thread returns to statclock.  statclock does a thread_unlock(). The
td_lock is TDQ_LOCK(n), which we don't hold.  We mangle the stat of
TDQ_LOCK(n) and leave TDQ_LOCK(0) held.


The simplest solution would be to do a critical_enter() in sched_balance,
although that would be superfluous in the normal case where the lapic timer
is driving statclock.  I'm not sure if there's other code in the
eventtimers infrastructure that's assuming that a preemption or migration
is impossible while handling an event.  A quick look at kern_clocksource.c
turns up worrying comments like "Handle all events for specified time on
this CPU" and uses of curcpu, so there may well be other issues lurking
here.

It looks to me that the safest thing to do would be to push the
critical_enter() into the eventtimer code or even all the way back to the
interrupt handlers (mirroring what the lapic code already does).
Received on Fri Jan 18 2013 - 15:56:33 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:34 UTC