Re: Is kern.sched.preempt_thresh=0 a sensible default?

From: Stefan Esser <se_at_freebsd.org>
Date: Sat, 9 Jun 2018 13:53:48 +0200
On 07.06.18 at 19:14, Andriy Gapon wrote:
> On 03/05/2018 12:41, Andriy Gapon wrote:
>> I think that we need preemption policies that might not be expressible as one or
>> two numbers.  A policy could be something like this:
>> - interrupt threads can preempt only threads from "lower" classes: real-time,
>> kernel, timeshare, idle;
>> - interrupt threads cannot preempt other interrupt threads
>> - real-time threads can preempt other real-time threads and threads from "lower"
>> classes: kernel, timeshare, idle
>> - kernel threads can preempt only threads from lower classes: timeshare, idle
>> - interactive timeshare threads can only preempt batch and idle threads
>> - batch threads can only preempt idle threads
> 
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693

Hi Andriy,

I highly appreciate your effort to improve the scheduling in SCHED_ULE.

But I'm afraid that your scheme will not fix the problem. As you may
know, there are a number of problems with SCHED_ULE that lead quite a
few users to prefer SCHED_4BSD even on multi-core systems.

The problems I'm aware of:

1) On UP systems, I/O-intensive applications may be starved by
   compute-intensive processes that are allowed to consume their full
   quantum of time. With a quantum on the order of 100 ms, an I/O-bound
   process that has to wait for such a full quantum after each completed
   read is limited to some 10 reads per second in the worst case.

2) Similarly, on SMP systems with a load higher than the number of cores
   (virtual cores in case of HT), compute-bound threads keeping all
   cores busy can slow down a cp of a large file from hundreds of MB/s
   to hundreds of KB/s, under certain circumstances.

3) Programs that split their load evenly over all available cores have
   been suffering from sub-optimal assignment of threads to cores. E.g.
   on a CPU with 8 (virtual) cores, this resulted in 6 cores running
   their share of the load in nominal time, 1 core taking twice as long
   because 2 threads were scheduled onto it, while 1 core was mostly
   idle. Even if the load was initially evenly distributed, a woken-up
   process that ran on one core destroyed the symmetry, and it was not
   recovered. (This was a problem e.g. for parallel programs using MPI
   or the like.)

4) The real-time behavior of SCHED_ULE is weak, because interactive
   processes (e.g. the X server) are put into the "time-share" class
   and then suffer from the problems described as 1) and 2) above.
   (You distinguish time-share and batch processes, but both are
    allowed to consume their full quanta even if a higher-priority
    process in their class becomes runnable. I think this will not
    give the required responsiveness, e.g. for an X server.)
   Such processes should be considered I/O intensive if they often do
   not use their full quantum, without taking into account the
   significant amount of CPU time they may use at times. (I.e. the
   criterion for time-sharing should not be the CPU time consumed, but
   rather the fraction of quanta that are not fully used because the
   thread voluntarily gives up the CPU; a rough sketch of such a metric
   follows after this list.) With many real-time threads it may be hard
   to identify interactive threads, since they are involuntarily
   disrupted too often - this must be taken into account when sampling
   voluntary vs. involuntary context switches.

5) The nice value has hardly any effect on the scheduling. Processes
   started with nice 19 get nearly the same share of the CPU as
   processes at nice 0, whereas traditionally they would only have run
   when a core would otherwise have been idle. Nice values between 0
   and 19 have even less effect (hardly any at all).
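
To illustrate the criterion mentioned in 4): the decision could be based
on how often a thread ends its quantum with a voluntary yield, while
quanta that are cut short by preemption are left out of the statistics.
The following is only a rough sketch with made-up names and an arbitrary
threshold, not the existing ULE code:

    /*
     * Hypothetical per-thread counters and an interactivity test that
     * looks at voluntary yields instead of consumed CPU time.
     */
    struct sched_stats {
        unsigned int    quanta_total;   /* quanta assigned to the thread */
        unsigned int    yields_vol;     /* ended by voluntary sleep/yield */
        unsigned int    yields_invol;   /* cut short by preemption */
    };

    /*
     * A thread counts as interactive if it gave up the CPU voluntarily
     * in at least vol_pct percent of the quanta it was able to finish.
     * Quanta lost to preemption (e.g. by real-time threads) are not
     * held against it.
     */
    static int
    sched_is_interactive(const struct sched_stats *st, unsigned int vol_pct)
    {
        unsigned int decided;

        decided = st->quanta_total - st->yields_invol;
        if (decided == 0)
            return (1);     /* no history yet, assume interactive */
        return (st->yields_vol * 100 >= decided * vol_pct);
    }

The threshold and the treatment of threads without any history are of
course up for discussion; the point is only that the metric is based on
voluntary yields, not on CPU time consumed.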

I have not had time to try the patch in that review, but I think that
the cause of these scheduling problems is not localized in that one
function.

A solution should be based on typical use cases or sample scenarios
being applied to a scheduling policy. There are some easy cases (e.g. a
"random" load of independent processes, like a parallel make run) where
only cache effects are relevant: try to keep a thread on its CPU as
long as possible and, if it is interrupted, continue it on that CPU if
you can assume there is still significant cached state.
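
What I mean by that could look roughly like the following sketch (the
names are made up and it only illustrates the policy, it is not actual
scheduler code):

    /*
     * Keep a woken thread on the CPU it last ran on unless that CPU is
     * noticeably busier than the least loaded one.
     */
    static int
    sched_pick_cpu(int last_cpu, const int *cpu_load, int ncpus)
    {
        int cpu, best = 0;

        for (cpu = 1; cpu < ncpus; cpu++)
            if (cpu_load[cpu] < cpu_load[best])
                best = cpu;
        /*
         * Prefer the previous CPU while its extra load is small,
         * assuming its caches still hold useful state.
         */
        if (cpu_load[last_cpu] - cpu_load[best] <= 1)
            return (last_cpu);
        return (best);
    }

The allowed load difference of 1 is arbitrary; the point is that the
previous CPU should win ties, because of the cached state.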

There have been extensive KTR traces that showed the scheduler behavior
under specific loads, especially MPI, and there have been attempts to
fix the uneven distribution of processes for that case (but AFAIR not
with much success).

Your patches may be part of the solution, with at least 3 other parts
remaining:

1) The classification of interactive and time-share processes should be
   kept separate: interactive means that the process does not use its
   full quantum in a non-negligible fraction of cases. The X server or
   a DBMS server should not be considered compute-intensive, or request
   rates will be as low as 10 per second (if the time-share quantum is
   on the order of 100 ms).

2) The scheduler should guarantee a symmetric distribution of the load
   for scenarios such as parallel programs using MPI. Since OpenMP and
   other mechanisms have similar requirements, this will become more
   relevant over time.

3) The niceness of a process should actually have an effect, to give
   the user or admin a way to adjust priorities.
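
For 3), one simple way to make nice effective again would be to map
nice values to CPU-share weights, so that a nice 19 process only gets a
small fraction of the CPU when competing with a nice 0 process. A rough
sketch (the names and numbers are made up, not a proposal for concrete
values):

    /*
     * Derive a CPU-share weight from the nice value, halving the share
     * for every 4 nice levels.  With these numbers, nice 0 vs. nice 19
     * would compete with weights 1024:32, i.e. 32:1, instead of nearly
     * 1:1 as observed today.
     */
    static unsigned int
    sched_nice_weight(int nice)     /* nice in the range -20..19 */
    {
        unsigned int weight = 1024;

        if (nice > 0)
            weight >>= (nice + 3) / 4;
        else
            weight <<= (-nice + 3) / 4;
        return (weight);
    }

How such weights would then enter the run-queue selection is a separate
question, of course.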

Each of those points will require changes in different parts of the
scheduler, but I think those changes should not be considered in
isolation.

Best regards, STefan