Is kern.sched.preempt_thresh=0 a sensible default?

From: Andriy Gapon <avg_at_FreeBSD.org>
Date: Thu, 3 May 2018 12:41:01 +0300

On 05/04/2018 15:31, Stefan Esser wrote:
> After looking at sched_ule.c and top/machine.c it appears, that the value
> of preempt_thresh corresponds to the PRI value as shown by top (or ps -l)
> plus PZERO which is calculated as (PRI_MIN_KERN=80) + 20.

The kernel defines priorities from 0 to 255.
top shows the same priorities with 100 subtracted.
At least that's how I look at it.
I think we are saying the same thing in different words.
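
To make the correspondence concrete, here is a small Python model of the
arithmetic.  It assumes the offset is PZERO = PRI_MIN_KERN + 20 = 100, as
quoted above; the function names are made up for illustration and do not
exist in the kernel or in top.

```python
# Illustrative model only; names are hypothetical, not kernel or top(1) code.
PRI_MIN_KERN = 80           # value quoted in this thread
PZERO = PRI_MIN_KERN + 20   # 100, the offset that top subtracts


def top_pri(kernel_pri):
    """Kernel priority (0..255) -> PRI column as shown by top."""
    return kernel_pri - PZERO


def kernel_pri(top_value):
    """PRI column from top -> kernel priority, the scale preempt_thresh uses."""
    return top_value + PZERO


# PRI values mentioned in this thread, converted to the kernel scale.
for shown in (20, 52, 80, 100):
    print(shown, "->", kernel_pri(shown))
```

On this scale the PRI values 20 and 52 reported for interactive processes
are kernel priorities 120 and 152, and a threshold of 124 would show up as
PRI 24 in top.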

> What I do not understand, though, is that the decision about a preemption
> is only based on the calculated new priority of the thread, but not at all
> on the priority of other running threads (except the idle thread).

I don't understand this statement.  The next thread to run is picked based on
the priorities of all runnable threads.  The preemption decision does take into
account the priority of the currently running thread as well as that of the new
thread.
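
A simplified Python model of that decision may help.  It is a paraphrase of
sched_shouldpreempt() in sched_ule.c, not a copy, and should be checked
against the source tree; the numeric constants are assumed values from
sys/priority.h and ULE of this era.

```python
# Simplified model of the preemption check.  The constants are assumptions,
# and the logic paraphrases sched_shouldpreempt() rather than reproducing it.
PRI_MIN_IDLE = 224       # assumed: start of the idle class
PRI_MAX_INTERACT = 151   # assumed: worst interactive timeshare priority


def should_preempt(pri, cpri, preempt_thresh, remote=False):
    """Decide whether a newly runnable thread preempts the running one.

    pri:    priority of the newly runnable thread
    cpri:   priority of the thread currently running on the target CPU
    remote: True if the target CPU is not the one making the decision
    Lower numbers mean better priority.
    """
    if pri >= cpri:
        return False   # new thread is not better: never preempt
    if cpri >= PRI_MIN_IDLE:
        return True    # an idle-class thread is always preempted
    if preempt_thresh == 0:
        return False   # preemption of non-idle threads is disabled
    if pri <= preempt_thresh:
        return True    # new thread is within the threshold
    # Interactive-or-better thread vs. batch-or-worse thread on another CPU.
    if remote and pri <= PRI_MAX_INTERACT and cpri > PRI_MAX_INTERACT:
        return True
    return False
```

In this model the current thread's priority matters in two places: the
initial comparison and the idle-class check.  The threshold itself is
compared only against the new thread's priority.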

> On my system, a "real" batch job (i.e. one that does not voluntarily give
> up the CPU due to I/O) seems to have a PRI value of 80 to 100 (growing
> over time), while an interactive process has a PRI of 20, a maximally
> "niced" interactive process has 52.
> 
> So, I'd expect a reasonable default value of preempt_thresh to be slightly
> above 120 (e.g. 124) to prevent I/O heavy threads from stealing each other
> the CPU too often, and to prevent "niced" processes from doing the same ...
> 
> The two values configured into the kernel (80 for PREEMPTION and 255 for
> FULL_PREEMPTION) seem to be extremes, but something in between (e.g. 124)
> is not offered (can only be configured via sysctl without any information
> for the correspondence between the threshold value and the PRI value in
> any document I've found, besides the kernel sources ...).
> 
> 
> Is PRI_MIN_KERN=80 really a good default value for the preemption threshold?

Yeah, a good question...
I am not really sure about this.  In my opinion it would be better to set
preempt_thresh to at least PRI_MAX_KERN, so that all threads running in the
kernel are allowed to preempt userland threads.  But that would also allow
kernel threads (with priorities between PRI_MIN_KERN and PRI_MAX_KERN) to
preempt other kernel threads as well, and I am not sure that's always okay.
The same argument applies to higher values of preempt_thresh as well.
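
As a rough illustration of the difference between the two settings, the
sketch below counts how many kernel-range priorities pass the threshold test
on its own.  PRI_MAX_KERN = 119 is an assumed value, and the helper names are
hypothetical.

```python
# Hypothetical helpers; only PRI_MIN_KERN = 80 is taken from this thread.
PRI_MIN_KERN = 80
PRI_MAX_KERN = 119   # assumed: one below the start of the timeshare range


def passes_threshold(pri, preempt_thresh):
    """True if a thread at 'pri' may preempt on the threshold test alone."""
    return preempt_thresh != 0 and pri <= preempt_thresh


def kernel_prios_allowed(preempt_thresh):
    """Kernel-range priorities that pass the threshold test."""
    return [p for p in range(PRI_MIN_KERN, PRI_MAX_KERN + 1)
            if passes_threshold(p, preempt_thresh)]


print(len(kernel_prios_allowed(PRI_MIN_KERN)))   # 1: only priority 80 itself
print(len(kernel_prios_allowed(PRI_MAX_KERN)))   # 40: the whole kernel range
```

The same check says nothing about which thread gets preempted, which is why
raising the threshold also lets kernel threads preempt each other.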

Perhaps a single preempt_thresh is not expressive enough?
Just a thought... maybe we need two thresholds: one says that threads with a
better priority are potentially allowed to preempt other threads, and the
other says that threads with a worse priority can be preempted.
For example:
- may_preempt_prio=PRI_MAX_INTERACT
- may_be_preempted_prio=PRI_MIN_BATCH
This says that realtime, kernel and interactive threads are allowed to preempt
other threads if other conditions are met.
And only batch and idle threads can actually be preempted.
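
A minimal sketch of how such a two-threshold check could look.  Both
tunables are hypothetical (they do not exist in the kernel), and the numeric
values of PRI_MAX_INTERACT and PRI_MIN_BATCH are assumptions.

```python
# Sketch of the two-threshold idea.  Nothing here is an existing sysctl.
PRI_MAX_INTERACT = 151   # assumed: worst interactive timeshare priority
PRI_MIN_BATCH = 152      # assumed: best batch timeshare priority

may_preempt_prio = PRI_MAX_INTERACT      # hypothetical tunable
may_be_preempted_prio = PRI_MIN_BATCH    # hypothetical tunable


def should_preempt(pri, cpri):
    """pri: new thread; cpri: running thread (lower number = better)."""
    if pri >= cpri:
        return False                       # not better: never preempt
    if pri > may_preempt_prio:
        return False                       # new thread is too weak to preempt
    return cpri >= may_be_preempted_prio   # only batch and idle are preempted


print(should_preempt(100, 180))   # kernel thread vs. batch thread
print(should_preempt(100, 130))   # kernel thread vs. interactive thread
```

With these example values a kernel thread preempts a batch thread but not an
interactive one, and a batch thread never preempts anything.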

Probably even the above is not flexible enough.
I think that we need preemption policies that might not be expressible as one or
two numbers.  A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
  kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads;
- real-time threads can preempt other real-time threads and threads from
  "lower" classes: kernel, timeshare, idle;
- kernel threads can preempt only threads from "lower" classes: timeshare, idle;
- interactive timeshare threads can preempt only batch and idle threads;
- batch threads can preempt only idle threads.
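
One possible way to encode such a policy is a table keyed by thread class.
The sketch below is hypothetical: the class names, the table layout, and the
extra requirement that the new thread must still have a better priority are
all assumptions, not an existing kernel interface.

```python
# Hypothetical encoding of the policy listed above.
INTR, REALTIME, KERNEL, INTERACT, BATCH, IDLE = (
    "interrupt", "real-time", "kernel", "interactive", "batch", "idle")

# For each class of the new thread: the classes it may preempt.
MAY_PREEMPT = {
    INTR:     {REALTIME, KERNEL, INTERACT, BATCH, IDLE},
    REALTIME: {REALTIME, KERNEL, INTERACT, BATCH, IDLE},
    KERNEL:   {INTERACT, BATCH, IDLE},
    INTERACT: {BATCH, IDLE},
    BATCH:    {IDLE},
    IDLE:     set(),   # assumption: idle threads preempt nothing
}


def should_preempt(new_class, new_pri, cur_class, cur_pri):
    """Lower priority numbers are better."""
    # Assumption: a better priority is still required in every case.
    if new_pri >= cur_pri:
        return False
    return cur_class in MAY_PREEMPT[new_class]


print(should_preempt(INTR, 10, INTR, 20))           # interrupt vs. interrupt
print(should_preempt(REALTIME, 50, REALTIME, 60))   # real-time vs. real-time
```

A table like this cannot be reduced to one or two numbers because real-time
threads may preempt their own class while interrupt and kernel threads may
not.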

-- 
Andriy Gapon