Re: [RFC/RFT] calloutng

From: Davide Italiano <davide_at_freebsd.org>
Date: Fri, 14 Dec 2012 14:13:05 +0100

On Fri, Dec 14, 2012 at 1:57 PM, Davide Italiano <davide_at_freebsd.org> wrote:
> On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo <rizzo_at_iet.unipi.it> wrote:
>>
>> On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano <davide_at_freebsd.org>
>> wrote:
>>>
>>> Hi.
>>> This patch takes callout(9) and redesigns the KPI and the
>>> implementation. The main objective of this work is making the
>>> subsystem tickless.  In the last several years, this possibility has
>>> been discussed widely (http://markmail.org/message/q3xmr2ttlzpqkmae),
>>> but until now no one has actually implemented it.
>>> If you want a complete history of what has been done in the last
>>> months you can check the calloutng project repository
>>> http://svnweb.freebsd.org/base/projects/calloutng/
>>> For lazy people, here's a summary:
>>
>>
>> thanks for the work and the detailed summary.
>> Perhaps it would be useful if you could provide a few high level
>> details on the use and performance of the new scheme, such as:
>>
>> - is the old callout KPI still available ? (i am asking because it would
>>   help maintaining third party kernel modules that are expected to
>>   work on different FreeBSD releases)
>>
>
> Obviously the old KPI is still available. callout(9) is a very popular
> interface and I don't think removing the old interface is a good idea,
> because it could make some vendors unhappy when their code doesn't
> build anymore on FreeBSD.
>
>> - do you have numbers on what is the fastest rate at which callouts
>>   can be fired (e.g. say you have a callout which increments a
>>   counter and schedules the next callout in (struct bintime){0,1} ) ?
>>

Right now, all the services rely on the old interface. This means they
cannot be fired before 1 tick has elapsed, i.e. 1 millisecond with
hz = 1000, which is the setting on most machines.
Now that nanosleep() relies on the new interface, we measured 4-5
microseconds of latency for the processing before the callout is
actually fired. I can't say whether we can still lower this value, but
I cannot imagine, for now, a consumer that actually requests a shorter
timeout.
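
For anyone who wants to get a rough number of this kind from user
space, something along the following lines should be enough. This is
only a sketch and not the program we used for the measurement above:
ts_diff_ns() and avg_overshoot_ns() are names made up for this example,
and it assumes a POSIX system that provides CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from sample a to sample b. */
static int64_t
ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
	return ((int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL +
	    (b->tv_nsec - a->tv_nsec));
}

/*
 * Call nanosleep() `iters` times with a request of `req_ns` nanoseconds
 * and return the average overshoot (measured sleep time minus requested
 * time) in nanoseconds. Interrupted sleeps are not handled here.
 */
static int64_t
avg_overshoot_ns(long req_ns, int iters)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = req_ns };
	struct timespec t0, t1;
	int64_t total = 0;
	int i;

	for (i = 0; i < iters; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total += ts_diff_ns(&t0, &t1) - req_ns;
	}
	return (total / iters);
}
```

Calling avg_overshoot_ns(1000, 10000) and printing the result gives the
average extra delay for a 1 microsecond request. Note that this figure
also includes the syscall and scheduling overhead, not only the callout
processing.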

>>
>> - is there a possibility that if callout requests are too close to each
>>   other  (e.g. the above test) the thread dispatching callouts will
>>   run forever ? if so, is there a way to make such thread yield
>>   after a while ?
>>

Most of the processing is still done in a SWI thread, "at a later
time", so I don't think this is a problem.

>> - since you mentioned nanosleep() poll() and select() have been
>>   ported to the new callout, is there a way to guarantee that users
>>   calling these functions with a very short timeout are actually
>>   descheduled as opposed to "interval too short, don't bother" ?
>>
>> - do you have numbers on how many calls per second we can
>>   have for a process that does
>>       for (;;) { nanosleep(min_value_that_causes_descheduling); }
>>

I don't follow you here.

>> I also have some comments on the diff:
>> - can you provide a diff -p ?
>>
>> - for several functions the only change is the name of an argument
>>   from "busy" to "us". Can you elaborate on the reason for the change
>>   (and whether "us" means microseconds or the pronoun) ?
>>
>
> Please see r242905 by mav_at_.
>
>> Finally, a more substantial comment:
>> - a lot of functions which formerly had only a "timo" argument
>>   now have "timo, bt, precision, flags". Take seltdwait() as an example.
>>
>
> seltdwait() is not part of the public KPI. It has been modified to
> avoid code duplication. Having two separate functions, seltdwait()
> and seltdwait_bt(), even though they could share most of the code is
> not a clever approach, IMHO.
> As I said above, seltdwait() is not exposed, so we can modify its
> arguments without breaking anything.
>
>>   It seems that you have been undecided between two approaches:
>>   for some of these functions you have preserved the original function
>>   that deals with ticks and introduced a new one that deals with the
>> bintime,
>>   whereas in other cases you have modified the original function to add
>>   "bt, precision, flags".
>>
>
> I'm not. All the functions which are part of the public KPI (e.g.
> condvar(9), sleepq(9), *) are still available.  *_flags variants have
> been introduced so that consumers can take advantage of the new
> 'precision tolerance mechanism' implemented. Also, *_bt variants have
> been introduced. I don't see any "indecision" between the two
> approaches.
> Please note that now the callout backend deals with bintime, so every
> time callout_reset_on() is called, the 'tick' argument passed is
> silently converted to bintime.
>
>>   I would suggest a more uniform approach, namely:
>>   - preserve all the existing functions (T) that take a timeout in ticks;
>>   - add a new set of corresponding functions (BT) that take
>>     bt, precision, flags _instead_ of the ticks
>>   - the functions (T) immediately convert from ticks to
>>     bintime(s), using macros or inline functions
>>   - optionally, convert kernel components to the new (BT) functions
>>     where this makes sense (e.g. we can exploit the finer-granularity
>>     of the new calls, etc.)
>>
>

This is the strategy we followed.
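
To make the shape of it concrete, here is a user-space sketch of the
(T)-forwards-to-(BT) pattern. It is not the code in the patch: struct
bt stands in for struct bintime, and ticks_to_bt(), timer_arm_bt() and
timer_arm_ticks() are hypothetical names invented for this example.
The sketch also leaves out turning the relative timeout into an
absolute time, which requires reading the current uptime.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct bintime: seconds plus a 64-bit binary fraction. */
struct bt {
	int64_t  sec;
	uint64_t frac;		/* units of 2^-64 seconds */
};

/* Convert a non-negative tick count into a bt value for a given hz. */
static struct bt
ticks_to_bt(int ticks, int hz)
{
	struct bt r;
	uint64_t per_tick;

	assert(ticks >= 0 && hz > 0);
	per_tick = UINT64_MAX / (uint64_t)hz;	/* about 2^64 / hz */
	r.sec = ticks / hz;
	r.frac = (uint64_t)(ticks % hz) * per_tick;
	return (r);
}

/* What a (BT) entry point is asked to schedule. */
struct timer_req {
	struct bt when;
	struct bt pr;
	int	  flags;
};

/* Hypothetical (BT) entry point: just records the request. */
static struct timer_req
timer_arm_bt(struct bt when, struct bt pr, int flags)
{
	struct timer_req q = { when, pr, flags };

	return (q);
}

/* Hypothetical (T) entry point: converts ticks and forwards to (BT). */
static struct timer_req
timer_arm_ticks(int ticks, int hz)
{
	struct bt zero = { 0, 0 };

	return (timer_arm_bt(ticks_to_bt(ticks, hz), zero, 0));
}
```

With hz = 1000, timer_arm_ticks(1500, 1000) ends up as a request for
1 second plus a fraction of roughly one half, so existing tick-based
callers need no source changes while the backend only sees the
bintime-style value.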

>
>
>> cheers
>> luigi
>>
>>> 1) callout(9) is no longer constrained to the resolution a periodic
>>> "hz" clock can give. In order to do that, the eventtimers(4) subsystem
>>> is used as backend.
>>> 2) Contrary to what was discussed in the past, we kept the callwheel
>>> as the underlying data structure for keeping track of the outstanding
>>> timeouts. This choice has a couple of advantages; in particular, we
>>> still benefit from the O(1) average complexity of the wheel for
>>> all the operations. Also, we thought the code duplication that would
>>> arise from the use of a two-staged backend for callout (e.g. a wheel
>>> for coarse resolution events and another data structure, such as a
>>> heap, for high resolution events) is unacceptable. In fact, since
>>> callouts gained the ability to migrate from one CPU to another, having
>>> a double backend would mean doubling the code for the migration path.
>>> 3) A way to dispatch callouts directly from hardware interrupt context
>>> has been implemented, using a special callout flag. This has limited
>>> applicability, but avoids the dispatching of a SWI thread for handling
>>> specific callouts, which saves the wakeup of another CPU for processing
>>> and a (relatively useless) context switch.
>>> 4) Since the new callout mechanism deals with bintime and no longer
>>> with ticks, time is specified as absolute and not relative anymore. In
>>> order to get the current time, binuptime() or getbinuptime() is used,
>>> and a sysctl is introduced to selectively choose the function to use,
>>> based on a precision threshold.
>>> 5) A mechanism for specifying precision tolerance has been
>>> implemented. The callout processing mechanism has been adapted and the
>>> callout data structure augmented so that the code path can take
>>> advantage of it and aggregate events which overlap in time.
>>>
>>>
>>> The new proposed KPI for callout is the following:
>>> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ..., int
>>> flags)
>>> where the ‘time’ argument represents the time at which the callout
>>> should fire, ‘pr’ represents the precision tolerance expressed as an
>>> absolute value, and ‘flags’ can be used to specify new features, i.e.
>>> for now, the possibility to run the callout from fast interrupt
>>> context.
>>> The old KPI has been extended by introducing the callout_reset_flags()
>>> function, which is the same as callout_reset*(), but takes an
>>> additional argument ‘int flags’ that can be used in the same fashion
>>> as the ‘flags’ argument of the new KPI. Using ‘flags’, consumers
>>> can also specify a relative precision tolerance as a power-of-two
>>> fraction of the timeout passed in ticks.
>>> Using this strategy, the new precision mechanism can be used for the
>>> existing services without major modifications.
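
As an illustration of the precision idea, the following user-space
sketch shows a power-of-two relative tolerance and an overlap check
between two events. It is not taken from the patch: struct ev,
rel_precision() and can_aggregate() are hypothetical names, times are
plain nanosecond counters rather than bintime, and it assumes an event
may fire anywhere in the window [when, when + pr].

```c
#include <stdint.h>

/* Hypothetical event: may fire anywhere in [when, when + pr] (ns). */
struct ev {
	uint64_t when;
	uint64_t pr;
};

/* Relative tolerance: 1/2^shift of the timeout. */
static uint64_t
rel_precision(uint64_t timeout, unsigned shift)
{
	return (shift < 64 ? timeout >> shift : 0);
}

/*
 * If the two windows overlap, store in *fire the earliest instant that
 * satisfies both events and return 1; otherwise return 0 and leave
 * *fire untouched.
 */
static int
can_aggregate(struct ev a, struct ev b, uint64_t *fire)
{
	uint64_t start = a.when > b.when ? a.when : b.when;
	uint64_t end_a = a.when + a.pr;
	uint64_t end_b = b.when + b.pr;
	uint64_t end = end_a < end_b ? end_a : end_b;

	if (start > end)
		return (0);
	*fire = start;
	return (1);
}
```

For example, an event at 1000 ns with a 500 ns tolerance and one at
1200 ns with a 100 ns tolerance can both be served by a single wakeup
at 1200 ns, while shrinking the first tolerance to 100 ns makes the
windows disjoint and two wakeups are needed.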
>>>
>>> Some consumers have been ported to the new KPI, in particular
>>> nanosleep(), poll(), select(), because they take immediate advantage
>>> of the arbitrary precision offered by the new infrastructure.
>>> For some statistics about the outcome of the conversion to the new
>>> service, please refer to the end of this e-mail:
>>> http://lists.freebsd.org/pipermail/freebsd-arch/2012-July/012756.html
>>> We didn't measure any significant performance regressions with
>>> hwpmc(4), using some benchmark programs:
>>> http://people.freebsd.org/~davide/poll_test/poll_test.c
>>> http://people.freebsd.org/~mav/testsleep.c
>>> http://people.freebsd.org/~mav/testidle.c
>>>
>>> We tested the code on amd64, MIPS and arm. Any kind of testing or
>>> comment would be really appreciated. The full diff of the work against
>>> HEAD can be found at: http://people.freebsd.org/~davide/calloutng.diff
>>> If no one has objections, we plan to merge the repository to HEAD in a
>>> week or so.
>>>
>>> Thanks,
>>>
>>> Davide
>>
>>
>>
>>
>> --
>> -----------------------------------------+-------------------------------
>>  Prof. Luigi RIZZO, rizzo_at_iet.unipi.it  . Dip. di Ing. dell'Informazione
>>  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>>  TEL      +39-050-2211611               . via Diotisalvi 2
>>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
>> -----------------------------------------+-------------------------------
>>