On 6 April 2012 15:27, Alexander Motin <mav_at_freebsd.org> wrote:
> On 04/06/12 17:13, Attilio Rao wrote:
>>
>> On 5 April 2012 19:12, Arnaud Lacombe <lacombar_at_gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> [Sorry for the delay, I got a bit sidetracked...]
>>>
>>> 2012/2/17 Alexander Motin <mav_at_freebsd.org>:
>>>>
>>>> On 17.02.2012 18:53, Arnaud Lacombe wrote:
>>>>>
>>>>> On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin <mav_at_freebsd.org>
>>>>> wrote:
>>>>>>
>>>>>> On 02/15/12 21:54, Jeff Roberson wrote:
>>>>>>>
>>>>>>> On Wed, 15 Feb 2012, Alexander Motin wrote:
>>>>>>>>
>>>>>>>> I've decided to stop those cache black magic practices and focus on
>>>>>>>> things that really exist in this world -- SMT and CPU load. I've
>>>>>>>> dropped most of the cache-related things from the patch and made
>>>>>>>> the rest more strict and predictable:
>>>>>>>> http://people.freebsd.org/~mav/sched.htt34.patch
>>>>>>>
>>>>>>> This looks great. I think there is value in considering the other
>>>>>>> approach further, but I would like to do this part first. It would
>>>>>>> also be nice to add priority as a greater influence in the load
>>>>>>> balancing.
>>>>>>
>>>>>> I haven't got a good idea yet about balancing priorities, but I've
>>>>>> rewritten the balancer itself. Since sched_lowest() / sched_highest()
>>>>>> are more intelligent now, they allowed removing topology traversal
>>>>>> from the balancer itself. That should fix the double-swapping
>>>>>> problem, keep some affinity while moving threads, and make balancing
>>>>>> more fair. I did a number of tests running 4, 8, 9 and 16 CPU-bound
>>>>>> threads on 8 CPUs. With 4, 8 and 16 threads everything is stationary,
>>>>>> as it should be. With 9 threads I see regular and random load
>>>>>> movement between all 8 CPUs. Measurements over a 5-minute run show a
>>>>>> deviation of only about 5 seconds. It is the same deviation I see
>>>>>> caused by merely scheduling 16 threads on 8 cores without any
>>>>>> balancing needed at all. So I believe this code works as it should.
>>>>>>
>>>>>> Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch
>>>>>>
>>>>>> I plan this to be the final patch of this series (more to come :))
>>>>>> and, if there are no problems or objections, I am going to commit it
>>>>>> (except some debugging KTRs) in about ten days. So now is a good time
>>>>>> for reviews and testing. :)
>>>>>>
>>>>> Is there a place where all the patches are available?
>>>>
>>>> All my scheduler patches are cumulative, so all you need is the last
>>>> one mentioned here, sched.htt40.patch.
>>>>
>>> You may want to have a look at the results I collected in the
>>> `runs/freebsd-experiments' branch of:
>>>
>>> https://github.com/lacombar/hackbench/
>>>
>>> and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
>>> available in `runs/freebsd'. On the dual-package platform, your patch
>>> is not a definite win.
>>>
>>>> But in some cases, especially on multi-socket systems, to let it show
>>>> its best you may want to apply an additional patch from avg_at_ to
>>>> better detect the CPU topology:
>>>>
>>>> https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd
>>>>
>>> The test I conducted specifically for this patch did not show much
>>> improvement...
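A quick aside before the answer quoted below: the distinction avg's patch
is about is easy to picture for a Core2Quad-style package -- four cores,
no package-wide L3, and an L2 shared by each pair of cores. A purely
illustrative description in C, with names invented for this sketch (this
is not the kernel's struct cpu_group from sys/smp.h):

/* What the CPUs in a group share; invented names, for illustration only. */
enum cache_share { SHARE_NONE, SHARE_L2, SHARE_L1 };

struct topo_level {
	enum cache_share share;		/* cache level shared within 'mask' */
	unsigned int	 mask;		/* bitmask of the CPU ids in the group */
};

/* Core2Quad-style package: CPUs 0-3, no shared package cache, two L2 pairs. */
static const struct topo_level core2quad_topo[] = {
	{ SHARE_NONE, 0x0f },	/* CPUs 0-3: nothing shared at package level */
	{ SHARE_L2,   0x03 },	/* CPUs 0 and 1 share an L2 */
	{ SHARE_L2,   0x0c },	/* CPUs 2 and 3 share an L2 */
};

Once the scheduler can see that second level, it can keep cache-hungry
threads in separate L2 groups while there are fewer runnable threads than
CPUs, which is the pbzip2 effect described below.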
>>
>> Can you please clarify this point?
>> Did the test you did include cases where the topology was detected badly
>> versus cases where the topology was detected correctly by a patched
>> kernel (and you still didn't see a performance improvement), in terms of
>> cache line sharing?
>
> At this moment SCHED_ULE does almost nothing in terms of cache line
> sharing affinity (though it is probably worth some further experiments).
> What this patch may improve is the opposite case -- reducing cache
> sharing pressure for cache-hungry applications. For example, proper cache
> topology detection (such as the lack of a global L3 cache, but an L2
> shared per pair of cores on Core2Quad-class CPUs) increases pbzip2
> performance when the number of threads is less than the number of CPUs
> (i.e. when there is room for optimization).

My question was not really about your patch. I just wanted to know whether
he correctly benchmarked a case where the topology was first screwed up and
then correctly recognized by avg's patch, in terms of cache level
aggregation (it wasn't about your patch, btw).

Attilio

--
Peace can only be achieved by understanding - A. Einstein
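P.S.: To make the balancer change quoted earlier a bit more concrete: the
idea is that once the CPU-selection helpers understand the topology, the
balancer no longer has to walk it. A minimal sketch of that idea in plain
C -- the structure and function names below are invented for illustration
and are not the actual sched_ule.c code, whose real sched_lowest() /
sched_highest() also take priority and load limits into account:

#include <stddef.h>

/* Simplified stand-in for a topology node; not the kernel's cpu_group. */
struct topo_group {
	struct topo_group *children;	/* child groups; NULL for a leaf CPU */
	int		   nchildren;
	int		   cpu;		/* CPU id, valid when this is a leaf */
	int		   load;	/* aggregated run-queue load */
};

/* Descend the topology tree, always following the least loaded group. */
static int
topo_lowest(const struct topo_group *g)
{
	const struct topo_group *best;
	int i;

	if (g->children == NULL)
		return (g->cpu);
	best = &g->children[0];
	for (i = 1; i < g->nchildren; i++)
		if (g->children[i].load < best->load)
			best = &g->children[i];
	return (topo_lowest(best));
}

The balancer can then simply move a thread from the CPU picked by the
"highest" counterpart to the one returned here, instead of traversing the
topology itself.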