Re: [RFC/RFT] calloutng

From: Marius Strobl <marius_at_alchemy.franken.de>
Date: Sun, 13 Jan 2013 19:09:40 +0100
On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
> On 06.01.2013 17:23, Marius Strobl wrote:
> > On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
> >> On 26.12.2012 01:21, Marius Strobl wrote:
> >>> On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
> >>>> Experiments with dummynet showed ineffective support for very short
> >>>> tick-based callouts. The new version fixes that, allowing as many
> >>>> tick-based callout events as the hz value permits, while still being
> >>>> able to aggregate events and generate a minimum of interrupts.
> >>>>
> >>>> This version also modifies the system load average calculation to fix
> >>>> some cases existing in the HEAD and 9 branches that can be fixed with
> >>>> the new direct callout functionality.
> >>>>
> >>>> http://people.freebsd.org/~mav/calloutng_12_17.patch
> >>>>
> >>>> With several important changes made last time, I am going to delay the
> >>>> commit to HEAD for another week to do more testing. Comments and new
> >>>> test cases are welcome. Thanks for staying tuned and commenting.
> >>>
> >>> FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
> >>> try on sparc64 and it at least survives a buildworld there. However,
> >>> with the patched kernels, buildworld times seem to increase slightly but
> >>> reproducibly by 1-2% (I only did four runs, but typically buildworld
> >>> times are rather stable and don't vary more than a minute for the
> >>> same kernel and source here). Is this an expected trade-off (system
> >>> time as such doesn't seem to increase)?
> >>
> >> I don't think the build process uses a significant number of callouts to
> >> affect the results directly. I think this additional time could be the
> >> result of the deeper next-event lookup done by the new code, which is
> >> practically useless for sparc64 as it effectively has no cpu_idle()
> >> routine. It wouldn't affect system time and wouldn't show up in any
> >> statistics (except PMC or something alike) because it is executed inside
> >> the timer hardware interrupt handler. If my guess is right, that is a
> >> part that probably could still be optimized. I'll look into it. Thanks.
> >>
> >>> Is there anything specific to test?
> >>
> >> Since most of the code is MI, for sparc64 I would mostly look at the
> >> related MD parts (eventtimers and timecounters) to make sure they are
> >> working reliably under more stressful conditions. I still have some
> >> worries about a possible deadlock on hardware where IPIs are used to
> >> fetch the present time from another CPU.
> > 
> > Well, I've just learnt two things the hard way:
> > a) We really need the mutex in that path.
> > b) Assuming that the initial synchronization of the counters is good
> >    enough and they won't drift considerably across the CPUs, so we can
> >    always use the local one, makes things go south pretty soon after
> >    boot. At least with your calloutng_12_26.patch applied.
> 
> Do you think it means they are not really synchronized for some reason?

There's definitely no hardware in place which would synchronize them.
I've no idea how to properly measure the difference between two tick
counters, but I think it's rather their drift and not the software
synchronization we do when starting the APs that is causing problems,
mainly because I can't really think of a better algorithm for the
latter. The symptoms are that about 30 to 60 seconds after boot I
start to see weird timeouts from device drivers. I'd need to check
how long these timeouts actually are to see whether it could be a
problem right from the start, though. In any case, it seems foolish
to think that synchronizing them once would be sufficient and that
they won't drift until shutdown. Linux probably also doesn't keep
re-synchronizing them without a reason. Just using a single
timecounter source simply appears to be the better choice.
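
For what it's worth, a rough way to get a feeling for the drift would be
something like the sketch below. The rdtick() helper is hypothetical and
stands in for whatever reads the local tick register; smp_rendezvous()
only gives approximate simultaneity, so a single snapshot mostly measures
IPI latency and the interesting part is how the deltas change over time.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/smp.h>
#include <sys/pcpu.h>

static uint64_t tick_snap[MAXCPU];

/* Hypothetical helper reading the local CPU's tick counter. */
extern uint64_t rdtick(void);

static void
tick_snap_action(void *arg __unused)
{

	/* Executed on every CPU at (roughly) the same time. */
	tick_snap[curcpu] = rdtick();
}

static void
tick_drift_check(void)
{
	int cpu;

	smp_rendezvous(NULL, tick_snap_action, NULL, NULL);
	CPU_FOREACH(cpu)
		printf("CPU%d: tick delta to CPU0 = %jd\n", cpu,
		    (intmax_t)(tick_snap[cpu] - tick_snap[0]));
}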

> 
> > I'm not really sure what to do about that. Earlier you already said
> > that sched_bind(9) also isn't an option in case if td_critnest > 1.
> > To be honest, I don't really understand why using a spin lock in the
> > timecounter path makes sparc64 the only problematic architecture
> > for your changes. The x86 i8254_get_timecount() also uses a spin lock
> > so it should be in the same boat.
> 
> The problem is not in using a spinlock, but in waiting for another CPU
> while the spinlock is held. The other CPU may also hold a spinlock and
> wait for something, causing a deadlock. The i8254 code uses a spinlock
> just to atomically access hardware registers, so it causes no problems.

Okay, but wouldn't that be a general problem then? Pretty much
anything triggering an IPI holds smp_ipi_mtx while doing so and
the lower level IPI stuff waits for other CPU(s), including on
x86.
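
Just to spell out the scenario as I understand it, in an illustrative
sketch (this is not code from the patch; tc_lock and the IPI details are
made up, only mtx_lock_spin()/mtx_unlock_spin() and smp_ipi_mtx are real):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx tc_lock;		/* hypothetical timecounter lock */
static volatile int remote_done;

/* CPU A: fetch the time from another CPU while holding a spin mutex. */
static uint64_t
remote_timecounter_read(void)
{

	mtx_lock_spin(&tc_lock);	/* interrupts are now disabled */
	remote_done = 0;
	/* ...send an IPI asking CPU B for its counter value... */
	while (remote_done == 0)
		;			/* spin: progress depends on CPU B */
	mtx_unlock_spin(&tc_lock);
	return (0);
}

/*
 * If CPU B is meanwhile spinning with interrupts disabled itself -- e.g.
 * holding smp_ipi_mtx and waiting for CPU A to acknowledge an IPI of its
 * own -- it never services the request above and both CPUs are stuck.
 * The i8254 code only holds its spin mutex to touch hardware registers
 * and never waits on another CPU, so it cannot deadlock this way.
 */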

> 
> > The affected machines are equipped with a x86-style south bridge
> > which exposes a power management unit (intended to be used as an SMBus
> > bridge only in these machines) on the PCI bus. Actually, this device
> > also includes an ACPI power management timer. However, I've just
> > spent a day trying to get that one working without success - it
> > just doesn't increment. Probably its clock input isn't connected as
> > it's not intended to be used in these machines.
> > That south bridge also includes 8254-compatible timers on the ISA/
> > LPC side, but these are hidden from the OFW device tree. I can hack these
> > devices into existence and give it a try, but even if that works this
> > likely would use the same code as the x86 i8254_get_timecount() so I
> > don't see what would be gained with that.
> > 
> > The last thing I can think of in order to avoid using the tick counter
> > as timecounter in the MP case is that the Broadcom MACs in the affected
> > machines also provide a counter driven by a 1 MHz clock. If that's good
> > enough for a timecounter, I can hook these up (in case these work ...)
> > and hack bge(4) to not detach in that case (given that we can't detach
> > timecounters ...).
> 
> i8254 on x86 is also just a bit above 1MHz.
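
If the counter turns out to work, hooking it up should roughly boil down
to the sketch below. bge_read_counter() and the way the softc is passed
around are made up here; only struct timecounter and tc_init() are the
real interfaces, and the tc_quality value is an arbitrary guess:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/timetc.h>

/* Hypothetical accessor for the MAC's free-running 1 MHz counter. */
extern u_int bge_read_counter(void *priv);

static u_int
bge_get_timecount(struct timecounter *tc)
{

	return (bge_read_counter(tc->tc_priv));
}

static struct timecounter bge_timecounter = {
	.tc_get_timecount = bge_get_timecount,
	.tc_counter_mask = ~0u,		/* assuming a full 32-bit counter */
	.tc_frequency = 1000000,	/* 1 MHz clock input */
	.tc_name = "bge",
	.tc_quality = 800,		/* arbitrary, above the tick counter */
};

/* Called once from attach after the counter has been verified to run. */
static void
bge_timecounter_init(void *sc)
{

	bge_timecounter.tc_priv = sc;
	tc_init(&bge_timecounter);
}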
> 
> >> Here is a small tool we are using to test the correctness and performance
> >> of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
> >>
> > 
> > I've run Ian's set of tests on a v215 with and without your
> > calloutng_12_26.patch and on a v210 (these use the IPI approach)
> > with the latter also applied.
> > I'm not really sure what to make out of the numbers.
> > 
> >                v215 w/o             v215 w/              v210 w/     
> > ---------- ----------------     ----------------     ----------------
> > select          1   1999.61          1     23.87          1     29.97
> > poll            1   1999.70          1   1069.61          1   1075.24
> > usleep          1   1999.86          1     23.43          1     28.99
> > nanosleep       1    999.92          1     23.28          1     28.66
> > kqueue          1   1000.12          1   1071.13          1   1076.35
> > kqueueto        1    999.56          1     26.33          1     31.34
> > syscall         1      1.89          1      1.92          1      2.88

FYI, these are the results for the v215 (btw., these (ab)use a bus
cycle counter of the host-PCI-bridge as timecounter) with your
calloutng_12_17.patch and kern.timecounter.alloweddeviation=0; the
columns are the requested and the measured interval, both in
microseconds:
select         1     23.82
poll           1   1008.23
usleep         1     23.31
nanosleep      1     23.17
kqueue         1   1010.35
kqueueto       1     26.26
syscall        1      1.91
select       300    307.72
poll         300   1008.23
usleep       300    307.64
nanosleep    300     23.21
kqueue       300   1010.49
kqueueto     300     26.27
syscall      300      1.92
select      3000   3009.95
poll        3000   3013.33
usleep      3000   3013.56
nanosleep   3000     23.17
kqueue      3000   3011.09
kqueueto    3000     26.24
syscall     3000      1.91
select     30000  30013.51
poll       30000  30010.63
usleep     30000  30010.64
nanosleep  30000     36.91
kqueue     30000  30012.38
kqueueto   30000     39.90
syscall    30000      1.90
select    300000 300017.52
poll      300000 300013.00
usleep    300000 300012.64
nanosleep 300000    307.59
kqueue    300000 300017.07
kqueueto  300000    310.24
syscall   300000      1.93

 
> 
> The numbers are not bad, considering the fact that to protect from lost
> interrupts the eventtimer code on sparc64 now sets the minimal programming
> interval to 15us. That was done to reduce the race window between the
> timer read-modify-write and some long NMIs.

Uhm, there are no NMIs on sparc64. Does it make sense to bypass this
adjustment on sparc64?

> Maybe with rereading the counter after programming the comparator (the
> same as is done for HPET, reading which is probably much more expensive)
> this value could be reduced.
> 

I see. There are some bigger fish to fry at the moment though :)
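
For later reference, the rereading idea would presumably look something
like the sketch below; rd_counter() and wr_compare() are made-up
stand-ins for the actual sparc64 tick/compare accessors:

#include <sys/types.h>

/* Hypothetical accessors for the free-running counter and its comparator. */
extern uint64_t rd_counter(void);
extern void	wr_compare(uint64_t val);

/*
 * Program the comparator, then re-read the counter.  If the counter has
 * already run past the requested value while we were programming it (for
 * example because of a long interruption), reprogram a little further
 * into the future instead of waiting for a full counter wrap-around.
 */
static void
et_start_sketch(uint64_t delta)
{
	uint64_t target;

	target = rd_counter() + delta;
	wr_compare(target);

	/* The re-read: did the counter overtake the comparator? */
	if (rd_counter() >= target)
		wr_compare(rd_counter() + delta);
}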

Marius