Re: Heavy I/O blocks FreeBSD box for several seconds

From: Steve Kargl <sgk_at_troutmask.apl.washington.edu>
Date: Thu, 7 Jul 2011 08:14:40 -0700

On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
> on 06/07/2011 21:11 Nathan Whitehorn said the following:
> > On 07/06/11 13:00, Steve Kargl wrote:
> >> AFAICT, it is a cpu affinity issue.  If I launch n+1 MPI images
> >> on a system with n cpus/cores, then 2 (and sometimes 3) images
> >> are stuck on a cpu and those 2 (or 3) images ping-pong on that
> >> cpu.  I recall trying to use renice(8) to force some load
> >> balancing, but vaguely remember that it did not help.
> > 
> > I've seen exactly this problem with multi-threaded math libraries, as well.
> 
> Exactly the same?  Let's see.
> 
> > Using parallel GotoBLAS on FreeBSD gives terrible performance because the
> > threads keep migrating between CPUs, causing frequent cache misses.
> 
> So Steve reports that if he has Nthr > Ncpu, then some threads are "over-glued"
> to a particular CPU, which results in sub-optimal scheduling for those threads.
>  I have to guess that Steve would want to see the threads being shuffled between
> CPUs to produce more even CPU load.

I'm using OpenMPI.  These are N > Ncpu processes, not threads, and without
loss of generality let N = Ncpu + 1.  It is a classic master-slave
situation where 1 process initializes all others.  The N - 1 slave processes
are then independent of each other.  After 20 minutes or so of number
crunching, each slave sends a few 10s of KB of data to the master.  The
master collects all the data, writes it to disk, and then sends the
slaves the next set of computations to do.  The computations are nearly
identical, so each slave finishes its task in the same amount of time.  The
problem appears to be that 2 slaves are bound to the same cpu and the
remaining N - 3 slaves are each bound to their own cpu.  The N - 3 slaves
finish their tasks, send data to the master, and then spin (chewing up
nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
This causes a stall in the computation.  When a complete computation
takes days, these stalls become problematic.  So, yes, I
want the processes to get more uniform access to cpus via migration
to other cpus.  This is what 4BSD appears to do.
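
A rough back-of-the-envelope sketch of what the stall costs per round,
under assumptions that are not measured here: the master busy-polls and
so occupies a cpu of its own, two slaves sharing a cpu each get half of
it, and context-switch and cache costs are ignored.  The cpu count (8)
and the function names are hypothetical, chosen only for illustration;
the 20-minute task length is the figure quoted above.

```python
def pinned_round(task_min, sharers=2):
    """Round time when `sharers` slaves are stuck on one cpu.

    The master cannot hand out new work until the slowest slave is done,
    and a slave that gets 1/sharers of a cpu needs sharers times as long.
    """
    return task_min * sharers


def fair_round(task_min, ncpu):
    """Round time if all N = ncpu + 1 processes get an equal cpu share.

    Assumes ideal migration: every process receives ncpu / N of a cpu,
    so every slave finishes at the same, slightly later, moment.
    """
    nproc = ncpu + 1
    return task_min * nproc / ncpu


if __name__ == "__main__":
    task = 20.0   # minutes of number crunching per round (from the text)
    ncpu = 8      # hypothetical machine size
    print(f"pinned: {pinned_round(task):.1f} min per round")
    print(f"fair:   {fair_round(task, ncpu):.1f} min per round")
```

Under these assumptions the pinned case roughly doubles each round,
while evenly shared cpus would add only a fraction 1/Ncpu to it.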

> On the other hand, you report that your threads keep being shuffled between CPUs
> (I presume for Nthr == Ncpu case, where Nthr is a count of the number-crunching
> threads).  And I guess that you want them to stay glued to particular CPUs.
> 
> So how is this the same problem?  In fact, it sounds like somewhat opposite.
> The only thing in common is that you both don't like how ULE works.

Well, it may be similar in that N - 2 threads are bound to N - 2
cpus, and the remaining 2 threads are ping-ponging on the last
remaining cpu.  I suspect that GotoBLAS has a large amount of
communication between threads, and once again the computation
stalls waiting for the 2 threads to finish battling for the
1 cpu, or perhaps the process uses pthread_yield() in some clever
way to try to get load balancing.

-- 
Steve