On Wed, 2 Jan 2013, Alexander Motin wrote:

> On 02.01.2013 19:09, Konstantin Belousov wrote:
>> On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:
>>> Probably one way to close this discussion would be to provide
>>> a sysctl so the sysadmin can decide which point in the interval
>>> to pick when there is no suitable callout already scheduled.
>>
>> Isn't trying to synchronize to external events in this way unsafe?
>> I remember, but cannot find the reference right now, a scheduler
>> exploit which completely hides a malicious thread from the time
>> accounting, by making it voluntarily yield right before statclock
>> should fire.  If statistics gathering could be piggy-backed on an
>> external interrupt, and an attacker can control the source of the
>> external events, wouldn't this give her a handle?

Fine-grained timeouts complete the opening of this security hole.
Synchronization without fine-grained timeouts might allow the same
attack, but is harder to exploit since you can't control the yielding
points directly.

With fine-grained timeouts, you just have to predict the statclock
firing points.  Use one timeout to arrange to yield just before
statclock fires and another to regain control just after it has fired
(a code sketch of this loop follows below).  If the timeout resolution
is, say, 50 usec, then this can hope to run for all except 100 usec
out of every 1/stathz seconds.  With stathz = 128, 1/stathz is 7812
usec, so this gives 7712/7812 of the CPU with 0 statclock ticks.
Since the scheduler never sees you running, your priority remains
minimal, so the scheduler should prefer to run you whenever a timeout
expires, with only round-robin against other minimal-priority threads
preventing you from getting 7712/7812 of the (user non-rtprio) CPU.

The previous stage of opening this security hole was changing the
default HZ from 100 to 1000.  HZ must not be much larger than stathz,
else the security hole is almost fully open.  With HZ = 100 being less
than stathz, and timeout granularity limiting the fine control to
2/HZ = 20 msec (except you can use a periodic itimer to get 1/HZ
granularity at the minor cost of getting more SIGALRMs), it is
impossible to get near 100% of the CPU with 0 statclock ticks.  After
yielding, you can't regain control for another 10 or 20 msec.  Since
this exceeds 1/stathz = 7812 usec, you can only hide from statclock
ticks by not running very often or for very long.

Limited hiding is still possible by wasting even more CPU to determine
when to hide: since the timeout granularity is large, it is also
ineffective for determining when to yield, so while running you must
poll the current time a lot to determine when to yield.  Yield just
before statclock fires, as above.  (Do it 50 usec early, as above, to
avoid most races involving polling the time.)  This actually has good
chances of not limiting the hiding too much, depending on the details
of the scheduling.  It yields just before a statclock tick.  After
this tick fires, if the scheduler reschedules for any reason, then the
hiding process would most likely be run again, since its priority is
minimal.  But at least the old 4BSD scheduler doesn't reschedule after
_every_ statclock tick.  This depends on the bugfeature that the
priority is not checked on _every_ return to user mode (sched_clock()
does change the priority, but this is not acted on until much later).
Without this bugfeature, there would be excessive context switches.
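To make the shape of the attack concrete, here is a minimal userland
sketch of the hiding loop.  It assumes fine-grained timeouts
(clock_nanosleep() honoured at ~50 usec resolution) and a purely
periodic statclock with known period and zero phase -- both of those
are assumptions for illustration, not real kernel parameters:

    /*
     * Hypothetical hiding loop: burn CPU between predicted statclock
     * firings, sleeping across each one so it never samples us.
     */
    #include <time.h>

    #define STATPERIOD_NS   7812500LL   /* 1/stathz for stathz = 128 */
    #define GUARD_NS        50000LL     /* yield this early/late */

    static long long
    now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (ts.tv_sec * 1000000000LL + ts.tv_nsec);
    }

    int
    main(void)
    {
        struct timespec ts;
        long long next;

        for (;;) {
            /* Next predicted statclock firing (phase 0 assumed). */
            next = (now_ns() / STATPERIOD_NS + 1) * STATPERIOD_NS;
            /* Run while "safe", until just before the tick. */
            while (now_ns() < next - GUARD_NS)
                ;
            /* Yield across the tick; wake just after it fired. */
            ts.tv_sec = (next + GUARD_NS) / 1000000000LL;
            ts.tv_nsec = (next + GUARD_NS) % 1000000000LL;
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
        }
    }

In the coarse-timeout case (HZ = 100 above), the clock_nanosleep()
call is useless for this purpose and only the polling half of the loop
survives, which is why the hiding is then so limited.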
OTOH, with timeouts, at least old non-fine-grained ones, you can force
a rescheduling that is acted on soon enough simply by using timeouts
(since timeouts give a context switch to the softclock thread, the
scheduler has no option to skip checking the priority on return to
user mode).  After the previous stage of changing HZ to 1000, the
granularity is fine enough to use timeouts to hide from the scheduler.
Using a periodic itimer to get a granularity of 1000 usec, start
hiding 50-1000 usec before each statclock tick and regain control 1000
usec later.  With stathz = 128, this gives 6812/7812 of the CPU with 0
statclock ticks.  Not much worse (for the hider) than 7712/7812.

Statclock was supposed to be aperiodic to prevent hiding (see
statclk-usenix93.ps), but this was never implemented in FreeBSD.  With
fine-grained timeouts, it would have to be very aperiodic, to the
point of giving large inaccuracies, to limit the hiding very much (a
sketch of such an aperiodic rearm follows below).  For example,
suppose it has an average period of 7812 usec with +-50% jitter.  You
would try to hide from it most of the time by running for a bit less
than 7812/2 usec before yielding in most cases.  If too much
scheduling is done on each statclock tick, then you are likely to
regain control after each one (as above) and then know that there is
almost a full minimal period until the next one.  Otherwise, it seems
to be necessary to determine when the previous statclock tick
occurred, so as to determine the minimum time until the next one.

> There are many different kinds of accounting with different
> characteristics.  Run time for each thread is calculated using high
> resolution per-CPU clocks on each context switch.  It is impossible
> to hide from it.

But this (td_runtime) is not used at all for scheduling, since the
high-resolution per-CPU clocks didn't exist originally, and now
because reading them is expensive.  However, the read is done on every
context switch.  This pessimizes context switches, with the
pessimization reduced a little by using inaccurate cpu_ticks() timers
instead of timecounters, so the expense of using td_runtime for
scheduling is mostly already paid (the second sketch below shows the
scheme).  The main case where it doesn't work is when there are few
context switches, but there are usually more than a few, due to
interrupts to ithreads for network and disk i/o.  Updating td_runtime
in statclock() (where it is not normally updated, since there is no
context switch) would probably make td_runtime usable for scheduling
without increasing the pessimization much.

> System load average is updated using a callout and aligned with
> hardclock().  Hiding from it was easy before, but it can be made
> more complicated (asynchronous) now.

I think you mean it was easy because the timeout period is so long.
Any aperiodicity in a timer of course makes it harder to predict.  The
load average timeout was already randomized (I think it is 5 +- 2
seconds with 1/hz granularity).  That is a large variance, but you can
still hide from it about 3/5 of the time.  loadav is not very
important, and is not even used by SCHED_ULE.  SCHED_ULE uses more
fine-grained statistics based on statclock.

> Per-CPU SYSTEM/INTERRUPT/USER/NICE/IDLE counters are updated by
> statclock(), which is asynchronous to hardclock(), so hiding from it
> is supposed to be more complicated.  But it was possible even
> before, since hardclock()'s frequency is 8 times higher than
> statclock()'s.

These are not used for scheduling.  In SCHED_4BSD, little more than
the event of a statclock tick is used for scheduling.
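For the aperiodic statclock, the rearm could look something like the
following minimal sketch.  It assumes a one-shot event timer hook;
et_arm_oneshot() and statclock_fire() are hypothetical names, not
FreeBSD's real eventtimer API, and arc4random_uniform() is as on
FreeBSD:

    /*
     * Hypothetical aperiodic statclock: after each firing, rearm a
     * one-shot timer for a uniformly random period in
     * [PERIOD/2, 3*PERIOD/2], i.e. an average of PERIOD with +-50%
     * jitter, as in the example above.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PERIOD_US   7812U   /* average 1/stathz for stathz = 128 */

    /* Stand-in for rearming a one-shot event timer (assumed hook). */
    static void
    et_arm_oneshot(uint32_t usec)
    {
        printf("next statclock in %u usec\n", usec);
    }

    static void
    statclock_fire(void)
    {
        /* ... gather the usual statclock statistics here ... */

        /* Uniform in [PERIOD/2, 3*PERIOD/2]. */
        et_arm_oneshot(PERIOD_US / 2 + arc4random_uniform(PERIOD_US + 1));
    }

    int
    main(void)
    {
        for (int i = 0; i < 5; i++)
            statclock_fire();
        return (0);
    }

Note that even with this much jitter, a hider that runs for a bit less
than PERIOD/2 after each observed tick is always safe for that long,
and so still gets about half the CPU unsampled, which is the point
made above.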
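And for the td_runtime point: the scheme is just to read a cheap raw
per-CPU clock at every context switch and accumulate the difference.
A simplified sketch, with made-up names rather than the real struct
thread / cpu_ticks() internals, and the x86 TSC assumed as the cheap
clock:

    #include <stdint.h>
    #include <stdio.h>

    struct thread {
        uint64_t td_runtime;    /* accumulated raw ticks */
        uint64_t td_switchtime; /* counter value when switched in */
    };

    /* Cheap raw per-CPU clock; the TSC here, cpu_ticks() in-kernel. */
    static inline uint64_t
    cpu_ticks(void)
    {
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32 | lo);
    }

    /* Charge the outgoing thread and stamp the incoming one. */
    static void
    switch_account(struct thread *from, struct thread *to)
    {
        uint64_t now = cpu_ticks();

        from->td_runtime += now - from->td_switchtime;
        to->td_switchtime = now;
    }

    int
    main(void)
    {
        struct thread a = { 0, cpu_ticks() }, b = { 0, 0 };

        switch_account(&a, &b);     /* a ran until now; b starts */
        switch_account(&b, &a);     /* and back */
        printf("a: %llu ticks, b: %llu ticks\n",
            (unsigned long long)a.td_runtime,
            (unsigned long long)b.td_runtime);
        return (0);
    }

Updating td_runtime from statclock(), as suggested above, would only
need the same charge-and-restamp pair run against curthread on each
tick.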
In SCHED_4BSD, sched_clock() uses the statclock tick event to update
td_estcpu and then the priority, but not much more.  In SCHED_ULE,
sched_clock() updates many more private variables.

There are also statistics for memory use and similar things maintained
by statclock().  These are not very important, like loadav for
SCHED_ULE (purely informational, for userland).  I don't see a better
way than _periodic_ statclock ticks for maintaining these.
Aperiodicity just makes them less accurate in the usual case where
there is no hiding (unless that is not the usual case, due to
accidental synchronization causing non-malicious hiding; if that is a
problem, then it can be fixed using only a small amount of
aperiodicity).

> More important for scheduling fairness, a thread's CPU percentage is
> also based on hardclock(), and hiding from it was trivial before,
> since all sleep primitives were strictly aligned to hardclock().
> Now it is slightly less trivial, since this alignment was removed
> and user-level APIs provide no easy way to enforce it.

%cpu is actually based on statclock(), and is not even used for
scheduling.  The alignment of sleep primitives mainly increased the
chances of accidental synchronization.  (This was apparently a problem
when all clocks were derived from the lapic timer.  I didn't like the
fixes for that.)  For malicious hiding, hardclock ticks being periodic
and somewhat aligned with statclock ticks only makes the hiding less
trivial.  The phase will drift, so that there is no long-term
alignment, unless the clocks are derived from the same one (and not
randomized).  So the statclock firings will sometimes be mispredicted,
or the hiding would have to be conservative, or the prediction updated
a lot.  It is simplest and probably most malicious to accept
occasional mispredictions and recalibrate the prediction after each
miss.  This is an easy case of the recalibrate-often strategy needed
for highly aperiodic statclocks.

> The only way to get really safe accounting is to forget about
> sampling and use potentially more expensive other ways.  It was
> always stopped by the lack of cheap and reliable clocks.  But since
> the TSC is P-state invariant on most CPUs present now, it could
> probably be reconsidered.

And even if the clock is expensive and unreliable, it is already used
for td_runtime, as described above.  Large unreliability is actually
less of a problem for scheduling than for td_runtime -- an error of
50% would barely affect scheduling, since scheduling is only
heuristic, but if td_runtime is off by 50% (as it often is without
P-state invariance and with longstanding bugs in cpu_ticks()
calibration), then it gives nonsense like user times exceeding real
times by 50%.

When I last thought about using reliable clocks for statistics, I was
most interested in simplifying things by removing all the tick
counters, and not in making schedulers fairer.  I still can't see how
to do the former.  To do the user/sys/interrupt accounting accurately,
the clock would have to be read on every syscall entry and exit, and
that is too much (it is bad enough that it is read on almost every
interrupt, without even making interrupt accounting actually work
right).  But for scheduling decisions, a fuzzy copy of the full
td_runtime is enough, at least for SCHED_4BSD.

Bruce