On 04/06/12 17:30, Attilio Rao wrote:
> On 6 April 2012 at 15:27, Alexander Motin <mav_at_freebsd.org> wrote:
>> On 04/06/12 17:13, Attilio Rao wrote:
>>> On 5 April 2012 at 19:12, Arnaud Lacombe <lacombar_at_gmail.com> wrote:
>>>> Hi,
>>>>
>>>> [Sorry for the delay, I got a bit sidetrack'ed...]
>>>>
>>>> 2012/2/17 Alexander Motin <mav_at_freebsd.org>:
>>>>> On 17.02.2012 18:53, Arnaud Lacombe wrote:
>>>>>> On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin <mav_at_freebsd.org>
>>>>>> wrote:
>>>>>>> On 02/15/12 21:54, Jeff Roberson wrote:
>>>>>>>> On Wed, 15 Feb 2012, Alexander Motin wrote:
>>>>>>>>> I've decided to stop those cache black magic practices and focus
>>>>>>>>> on things that really exist in this world -- SMT and CPU load.
>>>>>>>>> I've dropped most of the cache-related things from the patch and
>>>>>>>>> made the rest more strict and predictable:
>>>>>>>>> http://people.freebsd.org/~mav/sched.htt34.patch
>>>>>>>>
>>>>>>>> This looks great. I think there is value in considering the other
>>>>>>>> approach further, but I would like to do this part first. It would
>>>>>>>> be nice to also add priority as a greater influence in the load
>>>>>>>> balancing as well.
>>>>>>>
>>>>>>> I haven't got a good idea yet about balancing priorities, but I've
>>>>>>> rewritten the balancer itself. Since sched_lowest() /
>>>>>>> sched_highest() are more intelligent now, they allowed removing
>>>>>>> topology traversal from the balancer itself. That should fix the
>>>>>>> double-swapping problem, allow keeping some affinity while moving
>>>>>>> threads, and make balancing more fair. I did a number of tests
>>>>>>> running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. With 4, 8 and
>>>>>>> 16 threads everything is stationary, as it should be. With 9
>>>>>>> threads I see regular and random load moves between all 8 CPUs.
>>>>>>> Measurements over a 5-minute run show a deviation of only about 5
>>>>>>> seconds. It is the same deviation as I see caused just by
>>>>>>> scheduling 16 threads on 8 cores without any balancing needed at
>>>>>>> all. So I believe this code works as it should.
>>>>>>>
>>>>>>> Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch
>>>>>>>
>>>>>>> I plan this to be the final patch of this series (more to come :)),
>>>>>>> and if there are no problems or objections, I am going to commit it
>>>>>>> (except some debugging KTRs) in about ten days. So now is a good
>>>>>>> time for reviews and testing. :)
>>>>>>>
>>>>>> Is there a place where all the patches are available?
>>>>>
>>>>> All my scheduler patches are cumulative, so all you need is only the
>>>>> last one mentioned here, sched.htt40.patch.
>>>>>
>>>> You may want to have a look at the results I collected in the
>>>> `runs/freebsd-experiments' branch of:
>>>>
>>>> https://github.com/lacombar/hackbench/
>>>>
>>>> and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
>>>> available in `runs/freebsd'. On the dual-package platform, your patch
>>>> is not a definite win.
>>>>
>>>>> But in some cases, especially for multi-socket systems, to let it
>>>>> show its best you may want to apply an additional patch from avg_at_
>>>>> to better detect the CPU topology:
>>>>>
>>>>> https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd
>>>>>
>>>> The tests I conducted specifically for this patch did not show much
>>>> improvement...
>>>
>>> Can you please clarify this point?
>>> Did the tests you ran include cases where the topology was detected
>>> badly versus cases where it was detected correctly by a patched kernel
>>> (and you still didn't see a performance improvement), in terms of
>>> cache line sharing?
>>
>> At this moment SCHED_ULE does almost nothing in terms of cache line
>> sharing affinity (though it is probably worth some further
>> experiments). What this patch may improve is the opposite case --
>> reducing cache-sharing pressure for cache-hungry applications. For
>> example, proper cache topology detection (such as the lack of a global
>> L3 cache but a shared L2 per pair of cores on Core2Quad-class CPUs)
>> increases pbzip2 performance when the number of threads is less than
>> the number of CPUs (i.e. when there is room for optimization).
>
> My question is not really about your patch.
> I just wanted to know whether he correctly benchmarked a case where the
> topology was screwed up and then correctly recognized by avg's patch,
> in terms of cache level aggregation (it wasn't about your patch, btw).

I understand. I've just described a test case where properly detected
topology could give a benefit. What the test really does is indeed a good
question.

--
Alexander Motin
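To make the topology point concrete, here is a minimal, purely illustrative
sketch. It is not the actual sched_lowest() from FreeBSD's sched_ule.c; the
cpu_group structure, field names, and load model below are simplified
assumptions. It only shows why a "pick the lowest-loaded CPU" primitive that
descends an accurately described cache topology naturally spreads CPU-bound
threads across cores that do not share an L2, which is the effect described
above for pbzip2 on a Core2Quad-like layout.

```c
/*
 * Illustrative sketch only -- NOT the FreeBSD SCHED_ULE implementation.
 * A hypothetical cpu_group tree models the cache topology; picking the
 * least loaded CPU by descending the tree keeps new threads away from
 * caches that are already busy.
 */
#include <stdio.h>

#define MAX_CHILDREN 8
#define MAX_CPUS     8

struct cpu_group {
	int ncpus;			/* CPUs in this (leaf) group */
	int cpus[MAX_CPUS];		/* CPU ids for a leaf group */
	int nchildren;			/* child groups (0 for a leaf) */
	struct cpu_group *children[MAX_CHILDREN];
};

static int cpu_load[MAX_CPUS];		/* runnable threads per CPU */

/* Total load of a group: sum of the loads of the CPUs under it. */
static int
group_load(const struct cpu_group *cg)
{
	int load = 0;

	if (cg->nchildren == 0) {
		for (int i = 0; i < cg->ncpus; i++)
			load += cpu_load[cg->cpus[i]];
	} else {
		for (int i = 0; i < cg->nchildren; i++)
			load += group_load(cg->children[i]);
	}
	return (load);
}

/*
 * Descend into the least loaded child group at every level; at a leaf,
 * return the least loaded CPU.  The balancer then never has to walk the
 * topology itself, and affinity falls out of the tree structure.
 */
static int
pick_lowest(const struct cpu_group *cg)
{
	if (cg->nchildren == 0) {
		int best = cg->cpus[0];
		for (int i = 1; i < cg->ncpus; i++)
			if (cpu_load[cg->cpus[i]] < cpu_load[best])
				best = cg->cpus[i];
		return (best);
	}
	const struct cpu_group *bestg = cg->children[0];
	for (int i = 1; i < cg->nchildren; i++)
		if (group_load(cg->children[i]) < group_load(bestg))
			bestg = cg->children[i];
	return (pick_lowest(bestg));
}

int
main(void)
{
	/* Core2Quad-like layout: two L2 groups of two cores each. */
	struct cpu_group l2a = { .ncpus = 2, .cpus = { 0, 1 } };
	struct cpu_group l2b = { .ncpus = 2, .cpus = { 2, 3 } };
	struct cpu_group root = { .nchildren = 2, .children = { &l2a, &l2b } };

	cpu_load[0] = 1;	/* one busy thread already on CPU 0 */
	printf("next thread goes to CPU %d\n", pick_lowest(&root));
	return (0);
}
```

With one busy thread on CPU 0, the sketch places the next thread on CPU 2,
in the other L2 pair. If the topology were flattened into a single group,
CPU 1 (which shares its L2 with CPU 0) would look just as good, and that is
the cache-sharing pressure the thread above describes.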