Re: ULE can leak TDQ_LOCK() if statclock() called outside of critical_enter()

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Sun, 20 Jan 2013 12:29:08 +0200
On Fri, Jan 18, 2013 at 11:56:27AM -0500, Ryan Stone wrote:
> I have been experiencing occasional deadlocks on FreeBSD 8.2 systems using
> the ULE scheduler.  The root cause in every case has been that ULE's
> TDQ_LOCK for cpu 0 is owned by a thread that is not running.  I have been
> investigating the issue, and I believe that I see the issue.  The problem
> occurs if the interrupt that drives statclock does not call
> critical_enter() upon calling into statclock().  The lapic timer does use
> critical_enter(), so default configurations would not see this.  I have
> local patches to use the RTC to drive statclock, and from a quick reading
> of the eventtimer code in -CURRENT the same thing is possible there.  The
> RTC code does not call statclock within a critical section.  So here's the
> bug:
> 
> 1) A thread with interrupts enabled, running on CPU 0, with td_owepreempt=1
> and td_critnest=1 calls critical_exit():
> 
> void
> critical_exit(void)
> {
>    // ...
>     if (td->td_critnest == 1) {
>         td->td_critnest = 0;
>         if (td->td_owepreempt && !kdb_active) {
> // Irrelevant bits snipped
> 
> 2) td_critnest is set to 0, and then the RTC interrupt fires.
> 
> 3) rtcintr calls into statclock (8.2) or statclock_cnt(head) with
> td_critnest still 0 (on head it goes through the eventtimer code, but it
> ends up in statclock eventually).
> 
> 4) statclock takes the thread_lock on curthread, then calls sched_clock().
> sched_clock calls sched_balance();
> 
> static void
> sched_balance(void)
> {
>     // snip...
>     tdq = TDQ_SELF();
>     TDQ_UNLOCK(tdq);
>     sched_balance_group(cpu_top);
>     TDQ_LOCK(tdq);
> }
I think that current code deserves at least CRITICAL_ASSERT() on
the entry to sched_balance(). The same assert should be added to
kern_clocksource.c::timercb().

> 
> TDQ_UNLOCK does a spinlock_exit which does a critical_exit.  td_critnest
> will be decremented back to 0 and td_owepreempt is still 1, so this
> triggers a preemption.  Note that this TDQ_UNLOCK is (intentionally)
> unlocking the thread_lock done by statclock.
> 
> 5) thread migrates to any other CPU, call it CPU n.  tdq is now stale.
> TDQ_LOCK takes the lock for CPU 0 (but really it's intending to re-take the
> thread_lock, although a thread_lock() here would be equally incorrect --
> sched_balance's caller is going to be mucking around with the TDQ when
> sched_balance returns).
> 
> 6) The thread returns to statclock.  statclock does a thread_unlock(). The
> td_lock is TDQ_LOCK(n), which we don't hold.  We mangle the stat of
> TDQ_LOCK(n) and leave TDQ_LOCK(0) held.
> 
> 
> The simplest solution would be to do a critical_enter() in sched_balance,
> although that would be superfluous in the normal case where the lapic timer
> is driving statclock.  I'm not sure if there's other code in the
> eventtimers infrastructure that's assuming that a preemption or migration
> is impossible while handling an event.  A quick look at kern_clocksource.c
> turns up worrying comments like "Handle all events for specified time on
> this CPU" and uses of curcpu, so there may well be other issues lurking
> here.
> 
> It looks to me that the safest thing to do would be to push the
> critical_enter() into the eventtimer code or even all the way back to the
> interrupt handlers (mirroring what the lapic code already does).

Both atrtc and hpet register the interrupt handler as the filter.
The filters call loop enters critical section around handlers, see
kern_intr.c:intr_event_handle(). At least on HEAD it is so, and I see
the same code in the 8.

Received on Sun Jan 20 2013 - 09:29:15 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:34 UTC