Re: [RFC/RFT] calloutng

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Sun, 13 Jan 2013 21:36:11 +0200
On 13.01.2013 20:09, Marius Strobl wrote:
> On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
>> On 06.01.2013 17:23, Marius Strobl wrote:
>>> I'm not really sure what to do about that. Earlier you already said
>>> that sched_bind(9) also isn't an option in case if td_critnest > 1.
>>> To be honest, I don't really understand why using a spin lock in the
>>> timecounter path makes sparc64 the only problematic architecture
>>> for your changes. The x86 i8254_get_timecount() also uses a spin lock
>>> so it should be in the same boat.
>>
>> The problem is not in using a spinlock, but in waiting for another CPU
>> while a spinlock is held. The other CPU may itself hold a spinlock and
>> be waiting for something, causing a deadlock. The i8254 code uses its
>> spinlock just to access hardware registers atomically, so it causes no
>> problems.
> 
> Okay, but wouldn't that be a general problem then? Pretty much
> anything triggering an IPI holds smp_ipi_mtx while doing so and
> the lower level IPI stuff waits for other CPU(s), including on
> x86.

The problem is general, but the current code works because a single
smp_ipi_mtx is used in every case where an IPI result is waited on. As
long as the spinning happens with interrupts still enabled, there are no
deadlocks. The problem reappears, though, as soon as a different lock is
used or locks are nested, as illustrated below.
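
To illustrate, here is a small userland model of that scenario, with
threads standing in for CPUs and a flag for the IPI (not actual kernel
code): one thread holds a spin lock and busy-waits for an "IPI"
acknowledgement, while the other spins on the same lock with its
"interrupts" disabled, so the handler that would acknowledge never runs:

/*
 * Userland sketch of the deadlock.  Thread A holds lock_a and spins
 * waiting for B to acknowledge an "IPI".  Thread B spins on lock_a
 * inside its own critical section ("interrupts" disabled), so it never
 * reaches the handler that would post the acknowledgement.
 * Build: cc -o ipi-deadlock ipi-deadlock.c -lpthread
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int lock_a, ipi_pending, ipi_done;

static void
spin_lock(atomic_int *l)
{
	while (atomic_exchange(l, 1))
		;			/* spinning, "interrupts" off */
}

static void *
cpu_a(void *arg)
{
	spin_lock(&lock_a);		/* some driver spin lock */
	atomic_store(&ipi_pending, 1);	/* "send the IPI" to B */
	while (!atomic_load(&ipi_done))
		;			/* wait forever: B cannot respond */
	return (NULL);
}

static void *
cpu_b(void *arg)
{
	usleep(1000);			/* let A take lock_a first */
	spin_lock(&lock_a);		/* never returns... */
	/* ...so this "IPI handler" is never reached: */
	if (atomic_load(&ipi_pending))
		atomic_store(&ipi_done, 1);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, cpu_a, NULL);
	pthread_create(&b, NULL, cpu_b, NULL);
	sleep(3);			/* watchdog instead of joining */
	printf("deadlocked, ipi_done = %d\n", atomic_load(&ipi_done));
	return (0);
}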

In the existing code in HEAD and 9, timecounters are never called with a
spin mutex held; I intentionally tried to avoid that in the existing
eventtimers code. Callout code, by contrast, can be called in any
environment with any locks held, and the new callout code may need to
know the precise current time under any of those conditions. Attempting
to send an IPI and wait for it there can be fatal.
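
To make that concrete, inside the callout path the choice is roughly
between the cached and the precise kernel time interfaces, and only the
latter is accurate enough here (an illustrative sketch; the comments
give my reading of the tradeoff):

#include <sys/param.h>
#include <sys/time.h>

static void
callout_wants_time(void)	/* illustrative only */
{
	struct bintime bt;

	/* Cached timehands: safe with any locks held, but only updated
	   at clock-tick granularity -- too coarse for calloutng. */
	getbinuptime(&bt);

	/* Precise: goes through tc_get_timecount(), so the timecounter
	   read must never spin waiting for another CPU. */
	binuptime(&bt);
}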

>>> The affected machines are equipped with a x86-style south bridge
>>> which exposes a power management unit (intended to be used as an SMBus
>>> bridge only in these machines) on the PCI bus. Actually, this device
>>> also includes an ACPI power management timer. However, I've just
>>> spent a day trying to get that one working without success - it
>>> just doesn't increment. Probably its clock input isn't connected as
>>> it's not intended to be used in these machines.
>>> That south bridge also includes 8254-compatible timers on the ISA/
>>> LPC side, but they are hidden from the OFW device tree. I can hack these
>>> devices into existence and give it a try, but even if that works this
>>> likely would use the same code as the x86 i8254_get_timecount() so I
>>> don't see what would be gained with that.
>>>
>>> The last thing I can think of to avoid using the tick counter as the
>>> timecounter in the MP case is that the Broadcom MACs in the affected
>>> machines also provide a counter driven by a 1 MHz clock. If that's good
>>> enough for a timecounter, I can hook these up (in case they work ...)
>>> and hack bge(4) to not detach in that case (given that we can't detach
>>> timecounters ...).
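
For reference, hooking such a counter up would mostly be a matter of
registering it with tc_init(). A minimal sketch against the timecounter
interface (the register-read helper is a hypothetical stand-in for the
real device access):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/timetc.h>

static uint32_t
bge_read_counter(void)
{
	/* Hypothetical: real code would read the MAC's free-running
	   1 MHz counter register here. */
	return (0);
}

static u_int
bge_get_timecount(struct timecounter *tc)
{

	return (bge_read_counter());
}

static struct timecounter bge_timecounter = {
	.tc_get_timecount = bge_get_timecount,
	.tc_counter_mask = 0xffffffff,	/* counter width */
	.tc_frequency = 1000000,	/* the 1 MHz clock input */
	.tc_name = "bge",
	.tc_quality = 100,		/* relative to other counters */
};

static void
bge_tc_attach(void)			/* e.g. from the attach routine */
{

	tc_init(&bge_timecounter);
}
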
>>
>> The i8254 on x86 is also just a bit above 1 MHz.
>>
>>>> Here is a small tool we use to test the correctness and performance of
>>>> different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
>>>>
>>>
>>> I've run Ian's set of tests on a v215 with and without your
>>> calloutng_12_26.patch and on a v210 (these use the IPI approach),
>>> with the latter patch also applied.
>>> I'm not really sure what to make of the numbers.
>>>
>>>                v215 w/o             v215 w/              v210 w/     
>>> ---------- ----------------     ----------------     ----------------
>>> select          1   1999.61          1     23.87          1     29.97
>>> poll            1   1999.70          1   1069.61          1   1075.24
>>> usleep          1   1999.86          1     23.43          1     28.99
>>> nanosleep       1    999.92          1     23.28          1     28.66
>>> kqueue          1   1000.12          1   1071.13          1   1076.35
>>> kqueueto        1    999.56          1     26.33          1     31.34
>>> syscall         1      1.89          1      1.92          1      2.88
> 
> FYI, these are the results for the v215 (btw., these (ab)use a bus
> cycle counter of the host-PCI bridge as a timecounter) with your
> calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:
> select         1     23.82
> poll           1   1008.23
> usleep         1     23.31
> nanosleep      1     23.17
> kqueue         1   1010.35
> kqueueto       1     26.26
> syscall        1      1.91
> select       300    307.72
> poll         300   1008.23
> usleep       300    307.64
> nanosleep    300     23.21
> kqueue       300   1010.49
> kqueueto     300     26.27
> syscall      300      1.92
> select      3000   3009.95
> poll        3000   3013.33
> usleep      3000   3013.56
> nanosleep   3000     23.17
> kqueue      3000   3011.09
> kqueueto    3000     26.24
> syscall     3000      1.91
> select     30000  30013.51
> poll       30000  30010.63
> usleep     30000  30010.64
> nanosleep  30000     36.91
> kqueue     30000  30012.38
> kqueueto   30000     39.90
> syscall    30000      1.90
> select    300000 300017.52
> poll      300000 300013.00
> usleep    300000 300012.64
> nanosleep 300000    307.59
> kqueue    300000 300017.07
> kqueueto  300000    310.24
> syscall   300000      1.93

It seems that the extra delay is only about 10-17us.
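
(In these listings the columns are the requested interval and the
measured average time per call, both in microseconds.) For anyone
without the tool handy, a minimal harness in the spirit of testsleep.c,
covering just nanosleep(), looks roughly like this:

/*
 * Request a delay in microseconds and report the average time that
 * nanosleep() actually took.  Usage: ./sleeptest <usec> [iterations]
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(int argc, char **argv)
{
	struct timespec req, t0, t1;
	long us = (argc > 1) ? atol(argv[1]) : 1;
	int i, n = (argc > 2) ? atoi(argv[2]) : 1000;
	double total = 0.0;

	req.tv_sec = us / 1000000;
	req.tv_nsec = (us % 1000000) * 1000;
	for (i = 0; i < n; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total += (t1.tv_sec - t0.tv_sec) * 1e6 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e3;
	}
	printf("nanosleep %7ld %10.2f\n", us, total / n);
	return (0);
}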

>> The numbers are not bad, considering that, to protect against lost
>> interrupts, the eventtimer code on sparc64 now sets the minimal
>> programming interval to 15us. That was done to reduce the race window
>> between the timer read-modify-write and some long NMIs.
> 
> Uhm, there are no NMIs on sparc64. Does it make sense to bypass this
> adjustment on sparc64?

If it is not possible, or not desirable, to stop the timer while
programming it, there will always be some race window between code
execution and the timer ticking, so some minimal safety margin has to be
reserved, though it could probably be reduced significantly. In the case
of x86/HPET there is the additional factor of NMIs, which stretches the
race to an unpredictable length and so makes an additional post-read
almost mandatory.
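
The pattern is roughly the following; in this self-contained sketch the
hardware counter and comparator are simulated by stand-ins:

/*
 * One-shot programming with a post-read safety check, as done for
 * HPET.  The counter and comparator are simulated here; real code
 * would access hardware registers instead.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t comparator;

static uint64_t
read_counter(void)			/* stand-in for the counter */
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000 + ts.tv_nsec);
}

static void
write_comparator(uint64_t val)		/* stand-in for the comparator */
{
	comparator = val;
}

int
main(void)
{
	/*
	 * Between read_counter() and write_comparator() we may be
	 * delayed (interrupt, NMI on x86, ...) past the target, in
	 * which case the comparator is armed in the past and will
	 * never fire.  The safety margin makes that unlikely; the
	 * post-read catches it when it happens anyway.
	 */
	write_comparator(read_counter() + 15000); /* ~15us margin, in ns */
	if (read_counter() >= comparator)
		printf("raced past the comparator: fire the event by hand\n");
	else
		printf("armed in the future, as intended\n");
	return (0);
}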

>> Maybe by rereading the counter after programming the comparator (the
>> same as is done for HPET, which is probably much more expensive to
>> read) this value could be reduced.
> 
> I see. There are some bigger fish to fry at the moment though :)

:)

-- 
Alexander Motin