I've looked at this issue some more and it appears to me that sched_pickcpu() is never called for CPU-bound threads that use up their slices completely and never do any voluntary switches. A result of this is that the threads stay on their initial CPU until re-balance mechanism kicks in. There also seem to be some peculiarities that adversely affect the rebalance algorithm and potentially other CPU selection/affinity code. 1. For single-socket SMP systems our current sched topology code builds a topology with a redundant layer. Here is a real example for a single-socket dual-core Core2 Duo system: <groups> <group level="1" cache-level="0"> <cpu count="2" mask="0x3">0, 1</cpu> <children> <group level="2" cache-level="2"> <cpu count="2" mask="0x3">0, 1</cpu> </group> </children> </group> </groups> As we can see both layers contain exactly the same set of CPUs. I will try to demonstrate below that having such a topology is not only redundant, but also harmful. 2. The algorithm for searching for CPUs with highest/lowest load within a CPU group is biased towards CPUs with lower IDs. E.g. if a CPU group contains CPUs with IDs 0..3 and all the CPUs have equal load, then CPU #0 is always selected. This happens as an obvious result of iterating from lower CPU IDs towards higher ones and selecting lowest loaded CPU by "this CPU load" < "lowest load so far" condition. It's also important to note that the algorithm above also takes into account a structure of a group being searched. E.g. when searching for a lowest loaded CPU in a group that has two subgroups, then we actually search for a lowest loaded CPU in a lowest loaded sub-group which may not be a CPU that has the lowest load overall. Say, we have 4 CPUs in a group and the group is sub-divided into two sub-groups with two CPUs in each. Let CPU loads be: 5 1 | 2 3 where a number corresponds to a load of a CPU and its position corresponds to CPU ID and the pipe symbol corresponds to a boundary between the sub-groups. Then the algorithm would select CPU 2 with its load of 2 as the lowest loaded, because its subgroup has a combined load of 2+3=5 which less than combined load of the other group (5+1=6), which contains the CPU with the lowest overall load of 1. Now let's assume the following configuration: 1 package x 2 cores x 2 HTT/SMT. Let CPU IDs be 0 .. 3. And let there be 5 CPU-bound threads. Our x86 topology code will create the following scheduling topology: level 1, cachelevel 0 ("shared nothing"), CPUs 0 .. 3 level 2, cache level 2 ("L2"), CPUs 0 .. 3 level 3, cache level 1 ("L1"), CPUs 0, 1 level 3, cache level 1 ("L1"), CPUs 2, 3 Let's assume initial load distribution of 1 1 | 1 2 where a number corresponds to a load of a CPU and its position corresponds to CPU ID from 0 to 3 and the pipe symbol corresponds to a boundary between the L1 sharing groups. Rebalancing algorithm first rebalances a top level group by moving some threads from a highest loaded CPU to a lowest loaded CPU. Then the procedure is repeated for each of the child groups, then their child groups, etc, until leaf groups are rebalanced. So, when the rebalancing is done for the top shared nothing group we get: 2 1 | 1 1 this is because CPU 0 is found to have the lowest load (as described above) and CPU 3 obviously has the highest load. Then the rebalancing is done in the L2 group which contains all the same CPUs and we get: 1 1 | 2 1 as described above, the "extra" thread is moved from the higher loaded sub-group to the lower loaded sub-group. Then the rebalanacing is done in each of the L1 groups. In the first group the load is already equal between CPUs and in the seconds group the load is moved from one CPU to the other: 1 1 | 1 2 So in the end of the rebalancing we get exactly the same load distribution as at the start. The extra load is always on the same CPU. I believe that this might describe the problem that Steve observes. But I am not quite sure of that without additional objective data. The thing is that the above reasoning demonstrates the load distribution only from the point of view of per-CPU run-queues. But it doesn't demonstrate how much of CPU time each individual _thread_ gets as a result. What would we have with the extra shared-nothing level? Let's start with the same initial distribution: 1 1 | 1 2 After L2 rebalancing: 2 1 | 1 1 After L1 rebalancing: 1 2 | 1 1 What happens at the next rebalancing run? After L2 rebalancing: 1 1 | 2 1 After L1 rebalancing: 1 1 | 1 2 So in this case the extra load oscillated between CPUs 1 and 3. Better but still not perfect. Perhaps the things could be improved if we "tossed a coin" when comparing equal loads. Or maybe some load history could be taken into account. But definitely we should not have CPU groups that contain exactly the same sets of CPUs. -- Andriy GaponReceived on Mon Aug 01 2011 - 10:53:35 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:16 UTC