On Thursday, August 26, 2010 4:03:38 pm mdf_at_freebsd.org wrote: > Back at the beginning of August I posted about issues with sched_pin() > and witness_warn(): > > http://lists.freebsd.org/pipermail/freebsd-hackers/2010-August/032553.html > > After a lot of debugging I think I've basically found the issue. I > found this bug on stable/7, but looking at the commit log for CURRENT > I don't see any reason the bug doesn't exist there. This analysis is > specific to amd64/i386 but the problem is likely to exist on most > arches. > > The basic problem is that in both tdq_move() and sched_setcpu() can > manipulate the ts->ts_cpu variable of another process, which may be > running. This means that a process running on CPU N may be set to run > on CPU M the next context switch. > > Then, that process may call sched_pin(), then grab a PCPU variable. > An IPI_PREEMPT can come in, causing the thread to call mi_switch() in > ipi_bitmap_handler(). sched_switch() will then notice that ts->ts_cpu > is not PCPU_GET(cpuid), and call sched_switch_migrate(), migrating the > thread to the intended CPU. Thus after sched_pin() and potentially > before using any PCPU variable the process can get migrated to another > CPU, and now it is using a PCPU variable for the wrong processor. > > Given that ts_cpu can be set by other threads, it doesn't seem worth > checking at sched_pin time whether ts_cpu is not the same as td_oncpu, > because to make the check would require taking the thread_lock. > Instead, I propose adding a check for !THREAD_CAN_MIGRATE() before > calling sched_switch_migrate(). However, sched_pin() is also used by > sched_bind(9) to force the thread to stay on the intended cpu. I > would handle this by adding a TSF_BINDING state that is different from > TSF_BOUND. > > The thing that has me wondering whether this is all correct is that I > don't see any checks in tdq_move() or sched_setcpu() for either > TSF_BOUND or THREAD_CAN_MIGRATE(). Perhaps that logic is managed in > the various calling functions; in that case I would propose adding > asserts that the thread is THREAD_CAN_MIGRATE() unless it's marked > TSF_BINDING. > > Does this analysis seem correct? Calling code does check THREAD_CAN_MIGRATE(), but the problem is that it is not safe to call that on anything but curthread as td_pinned is not locked. It is assumed that only curthread would ever check td_pinned, not other threads. I suspect what is happening is that another CPU is seeing a stale value of td_pinned. You could fix this by grabbing the thread lock in sched_pin()/unpin() for ULE, but that is a bit expensive perhaps. You could test your theory by disabling thread stealing btw. There are a few sysctl knobs to turn it off. -- John BaldwinReceived on Thu Aug 26 2010 - 18:49:31 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC