Re: Heavy I/O blocks FreeBSD box for several seconds

From: Steve Kargl <sgk_at_troutmask.apl.washington.edu>
Date: Thu, 7 Jul 2011 13:08:45 -0700

On Thu, Jul 07, 2011 at 10:42:39PM +0300, Andriy Gapon wrote:
> on 07/07/2011 18:14 Steve Kargl said the following:
> >
> > I'm using OpenMPI.  These are N > Ncpu processes, not threads,
>
> I used 'thread' in a sense of a kernel thread.  It shouldn't
> actually matter if it's a process or a thread in userland
> in this context.
> 
> > and without
> > loss of generality let N = Ncpu + 1.  It is a classic master-slave
> > situation where 1 process initializes all others.  The n-1 slave processes
> > are then independent of each other.  After 20 minutes or so of number
> > crunching, each slave sends a few 10s of KB of data to the master.  The
> > master collects all the data, writes it to disk, and then sends the
> > slaves the next set of computations to do.  The computations are nearly 
> > identical, so each slave finishes its task in the same amount of time.  The
> > problem appears to be that 2 slaves are bound to the same cpu and the
> > remaining N - 3 slaves are each bound to their own cpu.  The N - 3 slaves
> > finish their task, send data to the master, and then spin (chewing up
> > nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
> > This causes a stall in the computation.  When a complete computation
> > takes days, these stalls become problematic.  So, yes, I
> > want the processes to get a more uniform access to cpus via migration
> > to other cpus.  This is what 4BSD appears to do.
> 
> I would imagine that periodic rebalancing would take care of this,
> but probably the ULE rebalancing algorithm is not perfect.

:-)
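
To put rough numbers on the stall described above, here is a minimal
back-of-the-envelope sketch in Python.  It is an idealized model, not a
measurement: it assumes every slave needs the same amount of CPU time
(20 minutes, taken from the description), ignores scheduler overhead and
cache effects, and treats the fifth process as competing for CPU like the
others.  The function names are made up for illustration.

```python
def pinned_round_time(work_minutes, procs_per_cpu):
    """Wall time of one round when processes never migrate.

    procs_per_cpu lists how many CPU-bound processes sit on each CPU.
    Processes sharing a CPU split it evenly, and the round only ends
    when the most crowded CPU is done.
    """
    return work_minutes * max(procs_per_cpu)


def fair_share_round_time(work_minutes, n_procs, n_cpus):
    """Wall time of one round when migration gives every process an
    equal share of all CPUs (the idealized outcome of rebalancing)."""
    return work_minutes * max(1.0, n_procs / n_cpus)


if __name__ == "__main__":
    work = 20  # minutes of CPU time per slave (assumption from the text)

    # Two processes stuck on one CPU, one process on each of the others.
    print("pinned:    ", pinned_round_time(work, [2, 1, 1, 1]), "minutes")

    # Five CPU-bound processes spread evenly over four CPUs.
    print("fair share:", fair_share_round_time(work, 5, 4), "minutes")
```

Under these assumptions a pinned round takes twice the nominal time,
while an even share costs only the factor N / Ncpu = 5/4.  The matching
per-process share of 4/5 = 80% is close to the roughly 78-79% per process
that the 4BSD listing for N = Ncpu + 1 reports below.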

> There was a suggestion on performance_at_ to try to use a lower value for
> kern.sched.steal_thresh, a value of 1 was recommended:
> http://article.gmane.org/gmane.os.freebsd.performance/3459

node16:kargl[215] uname -a
FreeBSD node16.cimu.org 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r223824M:
Thu Jul  7 11:12:15 PDT 2011 

node16:kargl[216] sysctl -a | grep smp.cpu
kern.smp.cpus: 4

The 4BSD kernel gives the following for N = Ncpu:

33 processes:  5 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1387 kargl       1  67    0   370M   293M CPU1    1   1:31 98.34% sasmp
 1384 kargl       1  67    0   370M   293M CPU2    2   1:31 98.34% sasmp
 1386 kargl       1  67    0   370M   294M CPU3    3   1:30 98.34% sasmp
 1385 kargl       1  67    0   370M   294M RUN     0   1:31 98.29% sasmp

The 4BSD kernel gives the following for N = Ncpu + 1:

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1417 kargl       1  71    0   370M   294M RUN     0   1:30 79.39% sasmp
 1416 kargl       1  71    0   370M   294M RUN     0   1:30 79.20% sasmp
 1418 kargl       1  71    0   370M   294M CPU2    0   1:29 78.81% sasmp
 1420 kargl       1  71    0   370M   294M CPU1    2   1:30 78.27% sasmp
 1419 kargl       1  70    0   370M   294M CPU3    0   1:30 77.59% sasmp

I then recompiled the kernel to use ULE instead of 4BSD, with the exact same
hardware and kernel configuration.

The ULE kernel gives the following for N = Ncpu:

33 processes:  5 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1294 kargl       1 103    0   370M   294M CPU3    3   1:30 100.00% sasmp
 1292 kargl       1 103    0   370M   294M RUN     2   1:30 100.00% sasmp
 1295 kargl       1 103    0   370M   293M CPU0    0   1:30 100.00% sasmp
 1293 kargl       1 103    0   370M   294M CPU1    1   1:28 100.00% sasmp

The ULE kernel gives the following for N = Ncpu + 1:

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1318 kargl       1 103    0   370M   294M CPU0    0   1:31 100.00% sasmp
 1319 kargl       1 103    0   370M   294M RUN     1   1:29 100.00% sasmp
 1322 kargl       1  99    0   370M   294M CPU2    2   1:03 87.26% sasmp
 1320 kargl       1  91    0   370M   294M RUN     3   1:07 60.79% sasmp
 1321 kargl       1  89    0   370M   294M CPU3    3   1:06 55.18% sasmp

node16:root[165] sysctl -w kern.sched.steal_thresh=1
kern.sched.steal_thresh: 2 -> 1

With steal_thresh = 1, the ULE kernel gives the following for N = Ncpu + 1:

34 processes:  6 running, 28 sleeping

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1396 kargl       1 103    0   366M   291M CPU3    3   1:30 100.00% sasmp
 1397 kargl       1 103    0   366M   291M CPU2    2   1:30 99.17% sasmp
 1400 kargl       1  97    0   366M   291M CPU0    0   1:05 83.25% sasmp
 1399 kargl       1  94    0   366M   291M RUN     1   1:04 73.97% sasmp
 1398 kargl       1  98    0   366M   291M RUN     0   1:01 54.05% sasmp
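
One way to compare the three N = Ncpu + 1 listings is to look at the gap
between the busiest and the most starved process.  The following is a
minimal Python sketch that pulls the CPU/WCPU column out of the top(1)
rows pasted above.  The parsing is an assumption tailored to this exact
layout (percentage in the second-to-last column), not a general top(1)
parser, and the helper names are made up for illustration.

```python
def cpu_percentages(top_text):
    """Return the CPU/WCPU column of each process row in a top(1) listing."""
    values = []
    for line in top_text.splitlines():
        fields = line.split()
        # Process rows end in "<percent>% <command>"; the header row does not.
        if len(fields) >= 2 and fields[-2].endswith("%"):
            values.append(float(fields[-2].rstrip("%")))
    return values


def spread(top_text):
    """Gap in percentage points between the busiest and idlest process."""
    values = cpu_percentages(top_text)
    return round(max(values) - min(values), 2)


# Rows copied from the listings in this message.
BSD4 = """\
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1417 kargl       1  71    0   370M   294M RUN     0   1:30 79.39% sasmp
 1416 kargl       1  71    0   370M   294M RUN     0   1:30 79.20% sasmp
 1418 kargl       1  71    0   370M   294M CPU2    0   1:29 78.81% sasmp
 1420 kargl       1  71    0   370M   294M CPU1    2   1:30 78.27% sasmp
 1419 kargl       1  70    0   370M   294M CPU3    0   1:30 77.59% sasmp
"""

ULE = """\
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
 1318 kargl       1 103    0   370M   294M CPU0    0   1:31 100.00% sasmp
 1319 kargl       1 103    0   370M   294M RUN     1   1:29 100.00% sasmp
 1322 kargl       1  99    0   370M   294M CPU2    2   1:03 87.26% sasmp
 1320 kargl       1  91    0   370M   294M RUN     3   1:07 60.79% sasmp
 1321 kargl       1  89    0   370M   294M CPU3    3   1:06 55.18% sasmp
"""

ULE_STEAL_1 = """\
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1396 kargl       1 103    0   366M   291M CPU3    3   1:30 100.00% sasmp
 1397 kargl       1 103    0   366M   291M CPU2    2   1:30 99.17% sasmp
 1400 kargl       1  97    0   366M   291M CPU0    0   1:05 83.25% sasmp
 1399 kargl       1  94    0   366M   291M RUN     1   1:04 73.97% sasmp
 1398 kargl       1  98    0   366M   291M RUN     0   1:01 54.05% sasmp
"""

if __name__ == "__main__":
    for name, text in [("4BSD", BSD4), ("ULE", ULE), ("ULE steal_thresh=1", ULE_STEAL_1)]:
        print(name, spread(text))
```

On these snapshots the spread is under 2 percentage points with 4BSD and
about 45 points with ULE, with or without steal_thresh=1.  These are
single samples, so the figures are only indicative.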

-- 
Steve