on 11/07/2011 19:16 Steve Kargl said the following:
> On Mon, Jul 11, 2011 at 06:07:04PM +0300, Andriy Gapon wrote:
>> But it's not clear which of the processes are slaves and which is master.
>> It's also not clear why the master takes so much CPU (on par with the
>> slaves) - from my reading of its description (by Steve) it should be
>> doing only light periodic work.
>
> These are all slave processes.  The master process was on a different
> node in the cluster.  Each process is doing the exact same computation
> with only a small change in a coordinate from (x,y,z) to (x,y+n*dy,z)
> with n = 1, 2, 3, 4.  The small change does not cause a different
> code path, so all should complete in nearly identical times.

OK, the situation is much clearer (to me) now.

>> If it does have to do CPU-heavy work, then I'd imagine that it should
>> spawn only Ncpus - 1 slaves.
>
> And if you have M users on the system?  Also note, you can get the
> exact same loading problem by launching Ncpu+1 completely independent
> cpu-bound processes.  Ncpu-1 processes will be bound to specific cpus
> and 2 processes will ping-pong on one cpu.  This ping-ponging will
> simply kill performance.

I'd still argue that if someone cares about doing some calculations as fast
as possible then he shouldn't have more than Ncpu CPU-bound processes.  How
to achieve that is a technical/administrative issue.

But nevertheless I now see what the problem is.
I think that the best thing you can further provide (as objective evidence
for the problem at hand) is ktr(4) traces for at least the KTR_SCHED mask.
Perhaps you even already have them from your previous sessions with Jeff.

P.S. This is not a promise to actually debug this issue based on the traces :-)

-- 
Andriy Gapon

Received on Tue Jul 12 2011 - 06:05:21 UTC
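
The Ncpu+1 scenario Steve describes can be approximated without the cluster code. What follows is only a rough sketch, assuming Python 3.7 or later is installed; the names WORKER and run_workers are invented for this illustration and do not come from the thread, and the busy loop is a stand-in for the real computation. Each child does the same fixed amount of CPU work and reports its own wall-clock time, so uneven completion times among identical processes become visible.

```python
import subprocess
import sys

# Child program (hypothetical stand-in for the real computation):
# a fixed amount of pure CPU work, timed inside the child itself.
WORKER = """
import sys, time
n = int(sys.argv[1])
start = time.perf_counter()
x = 0
for i in range(n):
    x += i * i
print(time.perf_counter() - start)
"""


def run_workers(nprocs, iterations):
    """Start nprocs identical CPU-bound processes at once and
    return the wall-clock seconds each one reported."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(iterations)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(nprocs)
    ]
    # communicate() waits for the child to exit and returns what it printed.
    return [float(p.communicate()[0]) for p in procs]
```

Calling run_workers(ncpu, N) and then run_workers(ncpu + 1, N), with ncpu taken from os.cpu_count() and N large enough that each child runs for several seconds, and comparing the minimum and maximum times in each list gives a crude picture of how evenly the extra process is scheduled. It is not a substitute for the ktr(4) traces requested above.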