Re: SCHED_ULE should not be the default

From: Steve Kargl <sgk_at_troutmask.apl.washington.edu> Date: Mon, 12 Dec 2011 09:06:04 -0800

On Mon, Dec 12, 2011 at 04:18:35PM +0000, Bruce Cran wrote:
> On 12/12/2011 15:51, Steve Kargl wrote:
> >This comes up every 9 months or so, and must be approaching FAQ 
> >status. In a HPC environment, I recommend 4BSD. Depending on the 
> >workload, ULE can cause a severe increase in turn around time when 
> >doing already long computations. If you have an MPI application, 
> >simply launching greater than ncpu+1 jobs can show the problem. PS: 
> >search the list archives for "kargl and ULE". 
> 
> This isn't something that can be fixed by tuning ULE? For example for 
> desktop applications kern.sched.preempt_thresh should be set to 224 from 
> its default. I'm wondering if the installer should ask people what the 
> typical use will be, and tune the scheduler appropriately.
> 

Tuning kern.sched.preempt_thresh did not seem to help for
my workload.  My code is a classic master-slave OpenMPI
application where the master runs on one node and all
cpu-bound slaves are sent to a second node.  If I send
ncpu+1 jobs to the 2nd node, which has ncpu cpus, then
ncpu-1 jobs are assigned to the first ncpu-1 cpus.  The
last two jobs are assigned to the ncpu'th cpu, and
these ping-pong on this cpu.  AFAICT, it is a cpu
affinity issue, where ULE is trying to keep each job
associated with its initially assigned cpu.
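
To put rough numbers on the placement described above, here is a
back-of-the-envelope sketch in Python.  It is only a model, not a
measurement: the function names, the equal-sized jobs, and the two
idealised policies (jobs never migrate vs. perfectly even sharing)
are assumptions made for illustration and say nothing about how
either scheduler is actually implemented.

```python
def turnaround_pinned(ncpu, njobs, work=1.0):
    """Finish time of each job if jobs never leave their first cpu.

    Assumed placement (hypothetical model of the behaviour described):
    the first ncpu-1 cpus get one job each, and every remaining job
    time-slices the last cpu in equal shares.
    """
    if ncpu < 1 or njobs < 1:
        raise ValueError("ncpu and njobs must be positive")
    alone = min(njobs, ncpu - 1)      # jobs with a cpu to themselves
    sharing = njobs - alone           # jobs stuck together on the last cpu
    # Equal jobs sharing one cpu all finish together at sharing * work.
    return [work] * alone + [sharing * work] * sharing


def turnaround_balanced(ncpu, njobs, work=1.0):
    """Finish time of each job under idealised, perfectly even sharing.

    Assumes jobs can migrate freely at no cost, so the total work is
    spread evenly over all cpus.
    """
    if ncpu < 1 or njobs < 1:
        raise ValueError("ncpu and njobs must be positive")
    return [max(1.0, njobs / ncpu) * work] * njobs


if __name__ == "__main__":
    ncpu = 8                          # example node size, an assumption
    pinned = turnaround_pinned(ncpu, ncpu + 1)
    balanced = turnaround_balanced(ncpu, ncpu + 1)
    print("pinned   worst case:", max(pinned))
    print("balanced worst case:", max(balanced))
```

With ncpu = 8 and 9 equal jobs, the pinned model has seven jobs
finish in 1x the single-job time and the last two in 2x, whereas
ideal balancing would finish all nine at 1.125x.  For an MPI run
that waits on its slowest rank, the 2x figure is the one that
matters.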

While one might suggest that starting ncpu+1 jobs
is not prudent, my example is just that.  It is an
example showing that ULE has performance issues.
So, I can either start only ncpu jobs on each node
in the cluster and send emails to all other users
asking them not to use those nodes, or use 4BSD and
not worry about loading issues.

-- 
Steve