Re: [RFC/RFT] calloutng

From: Luigi Rizzo <rizzo_at_iet.unipi.it>
Date: Sun, 6 Jan 2013 17:20:49 +0100
On Sun, Jan 06, 2013 at 04:23:13PM +0100, Marius Strobl wrote:
> On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
> > On 26.12.2012 01:21, Marius Strobl wrote:
> > > On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
> > >> Experiments with dummynet have shown ineffective support for very short
> > >> tick-based callouts. The new version fixes that, allowing as many
> > >> tick-based callout events as the hz value permits, while still being
> > >> able to aggregate events and generate a minimum of interrupts.
> > >>
> > >> This version also modifies the system load average calculation to fix
> > >> some cases existing in the HEAD and 9 branches that could be fixed with
> > >> the new direct callout functionality.
> > >>
> > >> http://people.freebsd.org/~mav/calloutng_12_17.patch
> > >>
> > >> With several important changes made last time, I am going to delay the
> > >> commit to HEAD for another week to do more testing. Comments and new
> > >> test cases are welcome. Thanks for staying tuned and commenting.
> > >
> > > FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
> > > try on sparc64 and it at least survives a buildworld there. However,
> > > with the patched kernels, buildworld times seem to increase slightly but
> > > reproducibly by 1-2% (I only did four runs, but typically buildworld
> > > times are rather stable and don't vary more than a minute for the
> > > same kernel and source here). Is this an expected trade-off (system
> > > time as such doesn't seem to increase)?
> > 
> > I don't think the build process uses a significant number of callouts, so 
> > it shouldn't affect the results directly. I think this additional time 
> > could be the result of the deeper next-event lookup done by the new code, 
> > which is practically useless on sparc64 since it effectively has no 
> > cpu_idle() routine. It wouldn't affect system time and wouldn't show up 
> > in any statistics (except PMC or something alike) because it is executed 
> > inside the timer hardware interrupt handler. If my guess is right, that 
> > is a part that could probably still be optimized. I'll look into it. Thanks.
> > 
> > > Is there anything specific to test?
> > 
> > Since most of the code is MI, for sparc64 I would mostly look at the 
> > related MD parts (eventtimers and timecounters) to make sure they are 
> > working reliably under more stressful conditions.  I still have some 
> > worries about a possible deadlock on hardware where IPIs are used to 
> > fetch the present time from another CPU.
> 
> Well, I've just learnt two things the hard way:
> a) We really need the mutex in that path.
> b) Assuming that the initial synchronization of the counters is good
>    enough and they won't drift considerably across the CPUs so we can
>    always use the local one makes things go south pretty soon after
>    boot. At least with your calloutng_12_26.patch applied.
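
To illustrate the failure mode in b) (a hypothetical sketch with invented
counter values, not the actual sparc64 code): if the timecounter always
reads the local CPU's counter, a thread that reads the time once on one
CPU and again after migrating to another can see time going backwards:

    /*
     * Hypothetical illustration of why drifting per-CPU counters break
     * a timecounter.  The values below are invented for the example.
     */
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint64_t tick_cpu0 = 1000050;   /* counter on CPU 0 */
        uint64_t tick_cpu1 = 1000000;   /* CPU 1, drifted 50 ticks behind */

        uint64_t t1 = tick_cpu0;        /* read while running on CPU 0 */
        uint64_t t2 = tick_cpu1;        /* next read, after migrating */

        /* t2 < t1: time appears to run backwards. */
        printf("t1=%ju t2=%ju delta=%jd\n",
            (uintmax_t)t1, (uintmax_t)t2, (intmax_t)(t2 - t1));
        return (0);
    }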
> 
> I'm not really sure what to do about that. Earlier you already said
> that sched_bind(9) also isn't an option in case td_critnest > 1.
> To be honest, I don't really understand why using a spin lock in the
> timecounter path makes sparc64 the only problematic architecture
> for your changes. The x86 i8254_get_timecount() also uses a spin lock
> so it should be in the same boat.
> 
> The affected machines are equipped with an x86-style south bridge
> which exposes a power management unit (intended to be used as an
> SMBus bridge only in these machines) on the PCI bus. Actually, this
> device also includes an ACPI power management timer. However, I've
> just spent a day trying to get that one working, without success -
> it just doesn't increment. Probably its clock input isn't connected,
> as it's not intended to be used in these machines.
> That south bridge also includes 8254-compatible timers on the ISA/
> LPC side, but these are hidden from the OFW device tree. I can hack
> these devices into existence and give it a try, but even if that
> works, this would likely use the same code as the x86
> i8254_get_timecount(), so I don't see what would be gained by that.
> 
> The last thing I can think of for avoiding the use of the tick counter
> as timecounter in the MP case is that the Broadcom MACs in the affected
> machines also provide a counter driven by a 1 MHz clock. If that's good
> enough for a timecounter, I can hook these up (in case they work ...)
> and hack bge(4) to not detach in that case (given that we can't detach
> timecounters ...).
> 
> > 
> > Here is a small tool we are using to test the correctness and performance 
> > of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
> > 
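
The idea behind it, for anyone who wants to reproduce the numbers below
(a minimal sketch assuming usleep(3) as the API under test; the real tool
also exercises select, poll, nanosleep, kqueue and more, and loops over
many iterations):

    /*
     * Minimal sketch of the kind of measurement testsleep.c performs:
     * request a short sleep and compare it with the time that actually
     * elapsed.  Only usleep(3) is shown here.
     */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct timespec ts, te;
        useconds_t request = 300;       /* requested sleep in microseconds */

        clock_gettime(CLOCK_MONOTONIC, &ts);
        usleep(request);
        clock_gettime(CLOCK_MONOTONIC, &te);

        double actual = (te.tv_sec - ts.tv_sec) * 1e6 +
            (te.tv_nsec - ts.tv_nsec) / 1e3;
        printf("requested %u us, got %.2f us\n", (unsigned)request, actual);
        return (0);
    }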
> 
> I've run Ian's set of tests on a v215 with and without your
> calloutng_12_26.patch, and on a v210 (which uses the IPI approach)
> with the latter also applied.
> I'm not really sure what to make of the numbers.
> 
>                v215 w/o             v215 w/              v210 w/
>            req.us    act.us     req.us    act.us     req.us    act.us
> ---------- ----------------     ----------------     ----------------
> select          1   1999.61          1     23.87          1     29.97
> poll            1   1999.70          1   1069.61          1   1075.24
> usleep          1   1999.86          1     23.43          1     28.99
> nanosleep       1    999.92          1     23.28          1     28.66
> kqueue          1   1000.12          1   1071.13          1   1076.35
> kqueueto        1    999.56          1     26.33          1     31.34
> syscall         1      1.89          1      1.92          1      2.88
> select        300   1999.72        300    326.08        300    332.24
> poll          300   1999.12        300   1069.78        300   1075.82
> usleep        300   1999.91        300    325.63        300    330.94
> nanosleep     300    999.82        300     23.25        300     26.76
> kqueue        300   1000.14        300   1071.06        300   1075.96
> kqueueto      300    999.53        300     26.32        300     31.42
> syscall       300      1.90        300      1.93        300      2.89
> select       3000   3998.18       3000   3176.51       3000   3193.86
> poll         3000   3999.29       3000   3182.21       3000   3193.12
> usleep       3000   3998.46       3000   3191.60       3000   3192.50
> nanosleep    3000   1999.79       3000     23.21       3000     27.02
> kqueue       3000   3000.12       3000   3189.13       3000   3191.96
> kqueueto     3000   1999.99       3000     26.28       3000     31.91
> syscall      3000      1.91       3000      1.91       3000      2.90
> select      30000  30990.85      30000  31489.18      30000  31548.77
> poll        30000  30995.25      30000  31518.80      30000  31487.92
> usleep      30000  30992.00      30000  31510.42      30000  31475.50
> nanosleep   30000   1999.46      30000     38.67      30000     41.95
> kqueue      30000  30006.49      30000  30991.86      30000  30996.77
> kqueueto    30000   1999.09      30000     41.67      30000     46.36
> syscall     30000      1.91      30000      1.91      30000      2.88
> select     300000 300990.65     300000 301864.98     300000 301787.01
> poll       300000 300998.09     300000 301831.36     300000 301741.62
> usleep     300000 300990.80     300000 301824.67     300000 301770.10
> nanosleep  300000   1999.15     300000    325.74     300000    331.01
> kqueue     300000 300000.87     300000 301031.11     300000 300992.28
> kqueueto   300000   1999.39     300000    328.77     300000    333.45
> syscall    300000      1.91     300000      1.91     300000      2.88

The nanosleep and kqueueto tests are probably passing the wrong
argument to the syscall: the value is meant to be microseconds, but
nanosleep takes nanoseconds, so it should probably be multiplied by
1000.
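
If so, the fix on the testsleep.c side would look something like this
(a guess at the shape of the code; names are placeholders, not quotes
from the tool):

    /*
     * Hypothetical fix: scale the microsecond timeout argument to a
     * struct timespec before handing it to nanosleep(2).  "usec" is a
     * placeholder for the tool's timeout variable.
     */
    #include <time.h>

    static void
    sleep_us(long usec)
    {
        struct timespec ts;

        ts.tv_sec = usec / 1000000;
        ts.tv_nsec = (usec % 1000000) * 1000;   /* us -> ns */
        nanosleep(&ts, NULL);
    }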

I think that for the time being it would be useful to run at least
one set of tests with kern.timecounter.alloweddeviation=0, so we can
tell how close we get to the required timeouts.

cheers
luigi

> Marius