Re: [RFC/RFT] calloutng

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Tue, 08 Jan 2013 12:46:57 +0200
On 06.01.2013 17:23, Marius Strobl wrote:
> On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
>> On 26.12.2012 01:21, Marius Strobl wrote:
>>> On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
>>>> Experiments with dummynet showed ineffective support for very short
>>>> tick-based callouts. The new version fixes that, allowing as many
>>>> tick-based callout events as the hz value permits, while still being
>>>> able to aggregate events and generate a minimum of interrupts.
>>>>
>>>> This version also modifies the system load average calculation to fix
>>>> some cases existing in the HEAD and 9 branches that could be addressed
>>>> with the new direct callout functionality.
>>>>
>>>> http://people.freebsd.org/~mav/calloutng_12_17.patch
>>>>
>>>> With several important changes made last time, I am going to delay the
>>>> commit to HEAD for another week to do more testing. Comments and new
>>>> test cases are welcome. Thanks for staying tuned and commenting.
>>>
>>> FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
>>> try on sparc64 and it at least survives a buildworld there. However,
>>> with the patched kernels, buildworld times seem to increase slightly but
>>> reproducible by 1-2% (I only did four runs but typically buildworld
>>> times are rather stable and don't vary more than a minute for the
>>> same kernel and source here). Is this an expected trade-off (system
>>> time as such doesn't seem to increase)?
>>
>> I don't think the build process uses a significant number of callouts to 
>> affect results directly. I think this additional time could be a result 
>> of the deeper next-event lookup done by the new code, which is 
>> practically useless for sparc64, which effectively has no cpu_idle() 
>> routine. It wouldn't affect system time and wouldn't show up in any 
>> statistics (except PMC or something alike) because it is executed inside 
>> the timer hardware interrupt handler. If my guess is right, that is a 
>> part that probably could still be optimized. I'll look into it. Thanks.
>>
>>> Is there anything specific to test?
>>
>> Since most of the code is MI, for sparc64 I would mostly look at the 
>> related MD parts (eventtimers and timecounters) to make sure they work 
>> reliably under more stressful conditions.  I still have some worries 
>> about a possible deadlock on hardware where IPIs are used to fetch the 
>> present time from another CPU.
> 
> Well, I've just learnt two things the hard way:
> a) We really need the mutex in that path.
> b) Assuming that the initial synchronization of the counters is good
>    enough and they won't drift considerably across the CPUs, so we can
>    always use the local one, makes things go south pretty soon after
>    boot. At least with your calloutng_12_26.patch applied.

Do you think it means they are not really synchronized for some reason?

> I'm not really sure what to do about that. Earlier you already said
> that sched_bind(9) also isn't an option when td_critnest > 1.
> To be honest, I don't really understand why using a spin lock in the
> timecounter path makes sparc64 the only problematic architecture
> for your changes. The x86 i8254_get_timecount() also uses a spin lock,
> so it should be in the same boat.

The problem is not in using a spinlock, but in waiting for another CPU
while the spinlock is held. The other CPU may also hold a spinlock and
wait for something, causing a deadlock. The i8254 code uses a spinlock
just to atomically access hardware registers, so it causes no problems.

> The affected machines are equipped with an x86-style south bridge
> which exposes a power management unit (intended to be used as an SMBus
> bridge only in these machines) on the PCI bus. Actually, this device
> also includes an ACPI power management timer. However, I've just
> spent a day trying to get that one working, without success - it
> just doesn't increment. Probably its clock input isn't connected, as
> it's not intended to be used in these machines.
> That south bridge also includes 8254-compatible timers on the ISA/
> LPC side, but they are hidden from the OFW device tree. I can hack these
> devices into existence and give it a try, but even if that works, this
> would likely use the same code as the x86 i8254_get_timecount(), so I
> don't see what would be gained by that.
> 
> The last thing I can think of to avoid using the tick counter as the
> timecounter in the MP case is that the Broadcom MACs in the affected
> machines also provide a counter driven by a 1 MHz clock. If that's good
> enough for a timecounter, I can hook these up (in case these work ...)
> and hack bge(4) to not detach in that case (given that we can't detach
> timecounters ...).

The i8254 on x86 is also just a bit above 1 MHz.

>> Here is a small tool we are using to test the correctness and performance 
>> of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
>>
> 
> I've run Ian's set of tests on a v215 with and without your
> calloutng_12_26.patch and on a v210 (these use the IPI approach)
> with the latter also applied.
> I'm not really sure what to make of the numbers.
> 
>                v215 w/o             v215 w/              v210 w/     
> ---------- ----------------     ----------------     ----------------
> select          1   1999.61          1     23.87          1     29.97
> poll            1   1999.70          1   1069.61          1   1075.24
> usleep          1   1999.86          1     23.43          1     28.99
> nanosleep       1    999.92          1     23.28          1     28.66
> kqueue          1   1000.12          1   1071.13          1   1076.35
> kqueueto        1    999.56          1     26.33          1     31.34
> syscall         1      1.89          1      1.92          1      2.88

The numbers are not bad, considering the fact that, to protect against
lost interrupts, the eventtimer code on sparc64 now sets the minimal
programming interval to 15us. That was done to reduce the race window
between the timer read-modify-write and some long NMIs. Maybe by
rereading the counter after programming the comparator (the same as is
done for HPET, reading which is probably much more expensive), this
value could be reduced.

-- 
Alexander Motin
Received on Tue Jan 08 2013 - 09:47:04 UTC