On Thu, Jul 7, 2011 at 5:14 PM, Steve Kargl <sgk_at_troutmask.apl.washington.edu> wrote:

> On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
> > on 06/07/2011 21:11 Nathan Whitehorn said the following:
> > > On 07/06/11 13:00, Steve Kargl wrote:
> > >> AFAICT, it is a cpu affinity issue. If I launch n+1 MPI images
> > >> on a system with n cpus/cores, then 2 (and sometimes 3) images
> > >> are stuck on a cpu and those 2 (or 3) images ping-pong on that
> > >> cpu. I recall trying to use renice(8) to force some load
> > >> balancing, but vaguely remember that it did not help.
> > >
> > > I've seen exactly this problem with multi-threaded math libraries, as well.
> >
> > Exactly the same? Let's see.
> >
> > > Using parallel GotoBLAS on FreeBSD gives terrible performance because the
> > > threads keep migrating between CPUs, causing frequent cache misses.
> >
> > So Steve reports that if he has Nthr > Ncpu, then some threads are "over-glued"
> > to a particular CPU, which results in sub-optimal scheduling for those threads.
> > I have to guess that Steve would want to see the threads being shuffled between
> > CPUs to produce more even CPU load.
>
> I'm using OpenMPI. These are N > Ncpu processes not threads, and without
> the loss of generality let N = Ncpu + 1. It is a classic master-slave
> situation where 1 process initializes all others. The n-1 slave processes
> are then independent of each other. After 20 minutes or so of number
> crunching, each slave sends a few 10s of KB of data to the master. The
> master collects all the data, writes it to disk, and then sends the
> slaves the next set of computations to do. The computations are nearly
> identical, so each slave finishes its task in the same amount of time. The
> problem appears to be that 2 slaves are bound to the same cpu and the
> remaining N - 3 slaves are bound to a specific cpu. The N - 3 slaves
> finish their task, send data to the master, and then spin (chewing up
> nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
> This causes a stall in the computation. When a complete computation
> takes days to complete, these stalls become problematic. So, yes, I
> want the processes to get a more uniform access to cpus via migration
> to other cpus. This is what 4BSD appears to do.
>

Spinning threads are a PITA for any scheduler; it's just that in your case
4BSD computes quantums differently. Is there any way to make the software
sleep instead of spinning?

> > On the other hand, you report that your threads keep being shuffled between CPUs
> > (I presume for Nthr == Ncpu case, where Nthr is a count of the number-crunching
> > threads). And I guess that you want them to stay glued to particular CPUs.
> >
> > So how is this the same problem? In fact, it sounds like somewhat opposite.
> > The only thing in common is that you both don't like how ULE works.
>
> Well, it may be similar in that N - 2 threads are bound to N - 2
> cpus, and the remaining 2 threads are ping ponging on the last
> remaining cpu. I suspect that GotoBLAS has a large amount of
> communication between threads, and once again the computation
> stalls waiting for the 2 threads to either finish battling for the
> 1 cpu or perhaps the process uses pthread_yield() in some clever
> way to try to get load balancing.
>
> --
> Steve
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
>

--
Good, fast & cheap. Pick any two.

Received on Thu Jul 07 2011 - 13:54:52 UTC
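
For illustration only, here is a minimal sketch of the "sleep instead of spinning" idea asked about in the reply above. It assumes the waiting side can poll a completion check itself. This is not Open MPI's API and not anything posted in the thread: `wait_until`, `ready`, and the delay parameters are hypothetical names chosen for the example. The waiter sleeps between polls with a capped exponential backoff, so an idle process gives up its cpu instead of competing with the processes that are still computing.

```python
import time


def wait_until(ready, initial_delay=0.001, max_delay=0.5, timeout=None,
               sleep=time.sleep, clock=time.monotonic):
    """Poll ready() until it returns True, sleeping between polls.

    Hypothetical helper (not part of any MPI library). The delay doubles
    after every unsuccessful poll, up to max_delay, so a long wait costs
    almost no cpu time. Returns the number of polls that were made.
    """
    delay = initial_delay
    deadline = None if timeout is None else clock() + timeout
    polls = 0
    while True:
        polls += 1
        if ready():
            return polls
        if deadline is not None and clock() >= deadline:
            raise TimeoutError("condition not met before the timeout")
        sleep(delay)                       # yield the cpu instead of spinning
        delay = min(delay * 2, max_delay)  # capped exponential backoff


# Example use (assumption: results_available is whatever non-blocking
# check the application already has for "the other workers are done"):
#
#     wait_until(results_available, max_delay=0.25)
```

The trade-off is latency: with `max_delay=0.5` a waiter may notice completion up to half a second late, which is negligible next to 20-minute work units but would matter for fine-grained communication.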