Re: sched_pin() bug in SCHED_ULE

From: John Baldwin <jhb_at_freebsd.org>
Date: Thu, 26 Aug 2010 16:49:20 -0400
On Thursday, August 26, 2010 4:03:38 pm mdf_at_freebsd.org wrote:
> Back at the beginning of August I posted about issues with sched_pin()
> and witness_warn():
> 
> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-August/032553.html
> 
> After a lot of debugging I think I've basically found the issue.  I
> found this bug on stable/7, but looking at the commit log for CURRENT
> I don't see any reason the bug doesn't exist there.  This analysis is
> specific to amd64/i386 but the problem is likely to exist on most
> arches.
> 
> The basic problem is that in both tdq_move() and sched_setcpu() can
> manipulate the ts->ts_cpu variable of another process, which may be
> running.  This means that a process running on CPU N may be set to run
> on CPU M the next context switch.
> 
> Then, that process may call sched_pin(), then grab a PCPU variable.
> An IPI_PREEMPT can come in, causing the thread to call mi_switch() in
> ipi_bitmap_handler().  sched_switch() will then notice that ts->ts_cpu
> is not PCPU_GET(cpuid), and call sched_switch_migrate(), migrating the
> thread to the intended CPU.  Thus after sched_pin() and potentially
> before using any PCPU variable the process can get migrated to another
> CPU, and now it is using a PCPU variable for the wrong processor.
> 
> Given that ts_cpu can be set by other threads, it doesn't seem worth
> checking at sched_pin time whether ts_cpu is not the same as td_oncpu,
> because to make the check would require taking the thread_lock.
> Instead, I propose adding a check for !THREAD_CAN_MIGRATE() before
> calling sched_switch_migrate().  However, sched_pin() is also used by
> sched_bind(9) to force the thread to stay on the intended cpu.  I
> would handle this by adding a TSF_BINDING state that is different from
> TSF_BOUND.
> 
> The thing that has me wondering whether this is all correct is that I
> don't see any checks in tdq_move() or sched_setcpu() for either
> TSF_BOUND or THREAD_CAN_MIGRATE().  Perhaps that logic is managed in
> the various calling functions; in that case I would propose adding
> asserts that the thread is THREAD_CAN_MIGRATE() unless it's marked
> TSF_BINDING.
> 
> Does this analysis seem correct?

Calling code does check THREAD_CAN_MIGRATE(), but the problem is that it is 
not safe to call that on anything but curthread as td_pinned is not locked.  
It is assumed that only curthread would ever check td_pinned, not other 
threads.  I suspect what is happening is that another CPU is seeing a stale 
value of td_pinned.  You could fix this by grabbing the thread lock in 
sched_pin()/unpin() for ULE, but that is a bit expensive perhaps.

You could test your theory by disabling thread stealing btw.  There are a few 
sysctl knobs to turn it off.

-- 
John Baldwin
Received on Thu Aug 26 2010 - 18:49:31 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC