On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
> on 06/07/2011 21:11 Nathan Whitehorn said the following:
> > On 07/06/11 13:00, Steve Kargl wrote:
> >> AFAICT, it is a cpu affinity issue. If I launch n+1 MPI images
> >> on a system with n cpus/cores, then 2 (and sometimes 3) images
> >> are stuck on a cpu and those 2 (or 3) images ping-pong on that
> >> cpu. I recall trying to use renice(8) to force some load
> >> balancing, but vaguely remember that it did not help.
> >
> > I've seen exactly this problem with multi-threaded math libraries, as well.
>
> Exactly the same? Let's see.
>
> > Using parallel GotoBLAS on FreeBSD gives terrible performance because the
> > threads keep migrating between CPUs, causing frequent cache misses.
>
> So Steve reports that if he has Nthr > Ncpu, then some threads are "over-glued"
> to a particular CPU, which results in sub-optimal scheduling for those threads.
> I have to guess that Steve would want to see the threads being shuffled between
> CPUs to produce more even CPU load.

I'm using OpenMPI. These are N > Ncpu processes, not threads, and
without loss of generality let N = Ncpu + 1. It is a classic
master-slave situation where 1 process initializes all others.
The N - 1 slave processes are then independent of each other.
After 20 minutes or so of number crunching, each slave sends
a few 10s of KB of data to the master. The master collects
all the data, writes it to disk, and then sends the slaves the
next set of computations to do. The computations are nearly
identical, so each slave finishes its task in the same amount
of time.

The problem appears to be that 2 slaves are bound to the same
cpu and the remaining N - 3 slaves are each bound to their own
cpu. The N - 3 slaves finish their task, send data to the
master, and then spin (chewing up nearly 100% cpu) waiting for
the 2 ping-ponging slaves to finish. This causes a stall in
the computation.
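To put rough numbers on the stall described above, here is a toy model in Python. It is only a sketch, not a measurement. It assumes a hypothetical 4-cpu machine, an even time-slice split among slaves sharing a cpu, and, for the comparison case, a scheduler that migrates perfectly with every process (master included) always runnable. The function names are made up for illustration.

```python
# Toy model only. Assumptions (not from the original message):
#   - a cpu is time-sliced evenly among the slaves placed on it
#   - the master waits for every slave before starting the next round
#   - "fair share" means perfect migration with all processes runnable

def pinned_round_time(slaves_per_cpu, task_minutes):
    """Wall-clock minutes until the slowest pinned slave finishes a round."""
    return max(slaves_per_cpu) * task_minutes

def fair_share_round_time(nproc, ncpu, task_minutes):
    """Wall-clock minutes per round if nproc processes share ncpu cpus evenly."""
    return task_minutes * max(nproc, ncpu) / ncpu

ncpu = 4    # hypothetical machine size
task = 20   # minutes per task ("20 minutes or so" in the message)

# N = ncpu + 1 = 5 processes: 1 master + 4 slaves.
# 2 slaves share one cpu, N - 3 = 2 slaves have a cpu each, and the
# last cpu carries no slave (assumed to be where the master sits).
stuck = [2] + [1] * (ncpu - 2) + [0]

print(pinned_round_time(stuck, task))               # 40
print(fair_share_round_time(ncpu + 1, ncpu, task))  # 25.0
```

Under these assumptions the two ping-ponging slaves double the length of every round, while an even share of the cpus would cost only about 25% extra.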
When a complete computation takes days to complete, these stalls
become problematic. So, yes, I want the processes to get more
uniform access to cpus via migration to other cpus. This is what
4BSD appears to do.

> On the other hand, you report that your threads keep being shuffled between CPUs
> (I presume for Nthr == Ncpu case, where Nthr is a count of the number-crunching
> threads). And I guess that you want them to stay glued to particular CPUs.
>
> So how is this the same problem? In fact, it sounds like somewhat opposite.
> The only thing in common is that you both don't like how ULE works.

Well, it may be similar in that N - 2 threads are bound to N - 2
cpus, and the remaining 2 threads are ping-ponging on the last
remaining cpu. I suspect that GotoBLAS has a large amount of
communication between threads, and once again the computation
stalls waiting for the 2 threads to finish battling for the
1 cpu, or perhaps the process uses pthread_yield() in some
clever way to try to get load balancing.

-- 
Steve

Received on Thu Jul 07 2011 - 13:14:41 UTC