On Wed, 2 Jan 2013, Alexander Motin wrote:

> On 02.01.2013 19:09, Konstantin Belousov wrote:
>> On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:
>>> Probably one way to close this discussion would be to provide
>>> a sysctl so the sysadmin can decide which point in the interval
>>> to pick when there is no suitable callout already scheduled.
>>
>> Isn't trying to synchronize to external events in this way unsafe?
>> I remember, but cannot find the reference right now, a scheduler
>> exploit which completely hides a malicious thread from the time
>> accounting, by making it voluntarily yield right before statclock
>> should fire.  If statistics gathering could be piggy-backed on an
>> external interrupt, and an attacker can control the source of the
>> external events, wouldn't this give her a handle?

Fine-grained timeouts complete the opening of this security hole.
Synchronization without fine-grained timeouts might allow the same
attack, but is harder to exploit since you can't control the yielding
points directly.

With fine-grained timeouts, you just have to predict the statclock
firing points.  Use one timeout to arrange to yield just before
statclock fires and another to regain control just after it has fired
(a code sketch of this loop follows below).  If the timeout resolution
is, say, 50 usec, then this can hope to run for all except 100 usec
out of every 1/stathz seconds.  With stathz = 128, 1/stathz is 7812
usec, so this gives 7712/7812 of the CPU with 0 statclock ticks.
Since the scheduler never sees you running, your priority remains
minimal, so the scheduler should prefer to run you whenever a timeout
expires, with only round-robin against other minimal-priority threads
preventing you from getting 7712/7812 of the (user non-rtprio) CPU.

The previous stage of opening this security hole was changing the
default HZ from 100 to 1000.  HZ must not be much larger than stathz,
else the security hole is almost fully open.  With HZ = 100 being less
than stathz, and timeout granularity limiting the fine control to
2/HZ = 20 msec (except you can use a periodic itimer to get 1/HZ
granularity at the minor cost of getting more SIGALRMs), it is
impossible to get near 100% of the CPU with 0 statclock ticks.  After
yielding, you can't regain control for another 10 or 20 msec.  Since
this exceeds 1/stathz = 7812 usec, you can only hide from statclock
ticks by not running very often or for very long.

Limited hiding is still possible by wasting even more CPU to determine
when to hide: since the timeout granularity is large, it is also
ineffective for determining when to yield, so while running you must
poll the current time a lot to determine when to yield.  Yield just
before statclock fires, as above.  (Do it 50 usec early, as above, to
avoid most races involving polling the time.)  This actually has good
chances of not limiting the hiding too much, depending on the details
of the scheduling.  It yields just before a statclock tick.  After
this tick fires, if the scheduler reschedules for any reason, then the
hiding process would most likely be run again, since its priority is
minimal.  But at least the old 4BSD scheduler doesn't reschedule after
_every_ statclock tick.  This depends on the bugfeature that the
priority is not checked on _every_ return to user mode (sched_clock()
does change the priority, but this is not acted on until much later).
Without this bugfeature, there would be excessive context switches.
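To make the shape of the attack concrete, here is a minimal userland
sketch of the hiding loop.  It assumes fine-grained timeouts
(clock_nanosleep() honoured at ~50 usec resolution) and a purely
periodic statclock with known period and zero phase -- both of those
are assumptions for illustration, not real kernel parameters:

    /*
     * Hypothetical hiding loop: burn CPU between predicted statclock
     * firings, sleeping across each one so it never samples us.
     */
    #include <time.h>

    #define STATPERIOD_NS   7812500LL   /* 1/stathz for stathz = 128 */
    #define GUARD_NS        50000LL     /* yield this early/late */

    static long long
    now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (ts.tv_sec * 1000000000LL + ts.tv_nsec);
    }

    int
    main(void)
    {
        struct timespec ts;
        long long next;

        for (;;) {
            /* Next predicted statclock firing (phase 0 assumed). */
            next = (now_ns() / STATPERIOD_NS + 1) * STATPERIOD_NS;
            /* Run while "safe", until just before the tick. */
            while (now_ns() < next - GUARD_NS)
                ;
            /* Yield across the tick; wake just after it fired. */
            ts.tv_sec = (next + GUARD_NS) / 1000000000LL;
            ts.tv_nsec = (next + GUARD_NS) % 1000000000LL;
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
        }
    }

In the coarse-timeout case (HZ = 100 above), the clock_nanosleep()
call is useless for this purpose and only the polling half of the loop
survives, which is why the hiding is then so limited.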
OTOH, with timeouts, at least old non-fine-grained ones, you can force
a rescheduling that is acted on soon enough simply by using timeouts
(since timeouts give a context switch to the softclock thread, the
scheduler has no option to skip checking the priority on return to
user mode).  After the previous stage of changing HZ to 1000, the
granularity is fine enough to use timeouts to hide from the scheduler.
Using a periodic itimer to get a granularity of 1000 usec, start
hiding 50-1000 usec before each statclock tick and regain control 1000
usec later.  With stathz = 128, this gives 6812/7812 of the CPU with 0
statclock ticks.  Not much worse (for the hider) than 7712/7812.

Statclock was supposed to be aperiodic to prevent hiding (see
statclk-usenix93.ps), but this was never implemented in FreeBSD.  With
fine-grained timeouts, it would have to be very aperiodic, to the
point of giving large inaccuracies, to limit the hiding very much (a
sketch of such an aperiodic rearm follows below).  For example,
suppose it has an average period of 7812 usec with +-50% jitter.  You
would try to hide from it most of the time by running for a bit less
than 7812/2 usec before yielding in most cases.  If too much
scheduling is done on each statclock tick, then you are likely to
regain control after each one (as above) and then know that there is
almost a full minimal period until the next one.  Otherwise, it seems
to be necessary to determine when the previous statclock tick
occurred, so as to determine the minimum time until the next one.

> There are many different kinds of accounting with different
> characteristics.  Run time for each thread is calculated using high
> resolution per-CPU clocks on each context switch.  It is impossible
> to hide from it.

But this (td_runtime) is not used at all for scheduling, since the
high-resolution per-CPU clocks didn't exist originally, and now
because reading them is expensive.  However, the read is done on every
context switch.  This pessimizes context switches, with the
pessimization reduced a little by using inaccurate cpu_ticks() timers
instead of timecounters, so the expense of using td_runtime for
scheduling is mostly already paid (the second sketch below shows the
scheme).  The main case where it doesn't work is when there are few
context switches, but there are usually more than a few, due to
interrupts to ithreads for network and disk i/o.  Updating td_runtime
in statclock() (where it is not normally updated, since there is no
context switch) would probably make td_runtime usable for scheduling
without increasing the pessimization much.

> System load average is updated using a callout and aligned with
> hardclock().  Hiding from it was easy before, but it can be made
> more complicated (asynchronous) now.

I think you mean it was easy because the timeout period is so long.
Any aperiodicity in a timer of course makes it harder to predict.  The
load average timeout was already randomized (I think it is 5 +- 2
seconds with 1/hz granularity).  That is a large variance, but you can
still hide from it about 3/5 of the time.  loadav is not very
important, and is not even used by SCHED_ULE.  SCHED_ULE uses more
fine-grained statistics based on statclock.

> Per-CPU SYSTEM/INTERRUPT/USER/NICE/IDLE counters are updated by
> statclock(), which is asynchronous to hardclock(), so hiding from it
> is supposed to be more complicated.  But it was possible even
> before, since hardclock()'s frequency is 8 times higher than
> statclock()'s.

These are not used for scheduling.  In SCHED_4BSD, little more than
the event of a statclock tick is used for scheduling.
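For the aperiodic statclock, the rearm could look something like the
following minimal sketch.  It assumes a one-shot event timer hook;
et_arm_oneshot() and statclock_fire() are hypothetical names, not
FreeBSD's real eventtimer API, and arc4random_uniform() is as on
FreeBSD:

    /*
     * Hypothetical aperiodic statclock: after each firing, rearm a
     * one-shot timer for a uniformly random period in
     * [PERIOD/2, 3*PERIOD/2], i.e. an average of PERIOD with +-50%
     * jitter, as in the example above.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PERIOD_US   7812U   /* average 1/stathz for stathz = 128 */

    /* Stand-in for rearming a one-shot event timer (assumed hook). */
    static void
    et_arm_oneshot(uint32_t usec)
    {
        printf("next statclock in %u usec\n", usec);
    }

    static void
    statclock_fire(void)
    {
        /* ... gather the usual statclock statistics here ... */

        /* Uniform in [PERIOD/2, 3*PERIOD/2]. */
        et_arm_oneshot(PERIOD_US / 2 + arc4random_uniform(PERIOD_US + 1));
    }

    int
    main(void)
    {
        for (int i = 0; i < 5; i++)
            statclock_fire();
        return (0);
    }

Note that even with this much jitter, a hider that runs for a bit less
than PERIOD/2 after each observed tick is always safe for that long,
and so still gets about half the CPU unsampled, which is the point
made above.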
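And for the td_runtime point: the scheme is just to read a cheap raw
per-CPU clock at every context switch and accumulate the difference.
A simplified sketch, with made-up names rather than the real struct
thread / cpu_ticks() internals, and the x86 TSC assumed as the cheap
clock:

    #include <stdint.h>
    #include <stdio.h>

    struct thread {
        uint64_t td_runtime;    /* accumulated raw ticks */
        uint64_t td_switchtime; /* counter value when switched in */
    };

    /* Cheap raw per-CPU clock; the TSC here, cpu_ticks() in-kernel. */
    static inline uint64_t
    cpu_ticks(void)
    {
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32 | lo);
    }

    /* Charge the outgoing thread and stamp the incoming one. */
    static void
    switch_account(struct thread *from, struct thread *to)
    {
        uint64_t now = cpu_ticks();

        from->td_runtime += now - from->td_switchtime;
        to->td_switchtime = now;
    }

    int
    main(void)
    {
        struct thread a = { 0, cpu_ticks() }, b = { 0, 0 };

        switch_account(&a, &b);     /* a ran until now; b starts */
        switch_account(&b, &a);     /* and back */
        printf("a: %llu ticks, b: %llu ticks\n",
            (unsigned long long)a.td_runtime,
            (unsigned long long)b.td_runtime);
        return (0);
    }

Updating td_runtime from statclock(), as suggested above, would only
need the same charge-and-restamp pair run against curthread on each
tick.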
In SCHED_4BSD, sched_clock() uses the statclock tick event to update
td_estcpu and then the priority, but not much more.  In SCHED_ULE,
sched_clock() updates many more private variables.

There are also statistics for memory use and similar things maintained
by statclock().  These are not very important, like loadav for
SCHED_ULE (purely informational, for userland).  I don't see a better
way than _periodic_ statclock ticks for maintaining these.
Aperiodicity just makes them less accurate in the usual case where
there is no hiding (unless that is not the usual case, due to
accidental synchronization causing non-malicious hiding; if that is a
problem, then it can be fixed using only a small amount of
aperiodicity).

> More important for scheduling fairness, a thread's CPU percentage is
> also based on hardclock(), and hiding from it was trivial before,
> since all sleep primitives were strictly aligned to hardclock().
> Now it is slightly less trivial, since this alignment was removed
> and user-level APIs provide no easy way to enforce it.

%cpu is actually based on statclock(), and is not even used for
scheduling.  The alignment of sleep primitives mainly increased the
chances of accidental synchronization.  (This was apparently a problem
when all clocks were derived from the lapic timer.  I didn't like the
fixes for that.)  For malicious hiding, hardclock ticks being periodic
and somewhat aligned with statclock ticks only makes the hiding less
trivial.  The phase will drift, so that there is no long-term
alignment, unless the clocks are derived from the same one (and not
randomized).  So the statclock firings will sometimes be mispredicted,
or the hiding would have to be conservative, or the prediction updated
a lot.  It is simplest and probably most malicious to accept
occasional mispredictions and recalibrate the prediction after each
miss.  This is an easy case of the recalibrate-often strategy needed
for highly aperiodic statclocks.

> The only way to get really safe accounting is to forget about
> sampling and use potentially more expensive other ways.  It was
> always stopped by the lack of cheap and reliable clocks.  But since
> the TSC is P-state invariant on most CPUs present now, it could
> probably be reconsidered.

And even if the clock is expensive and unreliable, it is already used
for td_runtime, as described above.  Large unreliability is actually
less of a problem for scheduling than for td_runtime -- an error of
50% would barely affect scheduling, since scheduling is only
heuristic, but if td_runtime is off by 50% (as it often is without
P-state invariance and with longstanding bugs in cpu_ticks()
calibration), then it gives nonsense like user times exceeding real
times by 50%.

When I last thought about using reliable clocks for statistics, I was
most interested in simplifying things by removing all the tick
counters, and not in making schedulers fairer.  I still can't see how
to do the former.  To do the user/sys/interrupt accounting accurately,
the clock would have to be read on every syscall entry and exit, and
that is too much (it is bad enough that it is read on almost every
interrupt, without even making interrupt accounting actually work
right).  But for scheduling decisions, a fuzzy copy of the full
td_runtime is enough, at least for SCHED_4BSD.

Bruce