Re: [RFC/RFT] calloutng

From: Luigi Rizzo <rizzo_at_iet.unipi.it> Date: Fri, 14 Dec 2012 07:41:07 +0100

On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano <davide_at_freebsd.org> wrote:

> Hi.
> This patch takes callout(9) and redesigns the KPI and the
> implementation. The main objective of this work is making the
> subsystem tickless.  In the last several years, this possibility has
> been discussed widely (http://markmail.org/message/q3xmr2ttlzpqkmae),
> but until now no one has really implemented it.
> If you want a complete history of what has been done in the last
> months you can check the calloutng project repository
> http://svnweb.freebsd.org/base/projects/calloutng/
> For lazy people, here's a summary:
>

thanks for the work and the detailed summary.
Perhaps it would be useful if you could provide a few high level
details on the use and performance of the new scheme, such as:

- is the old callout KPI still available ? (i am asking because it would
  help maintaining third party kernel modules that are expected to
  work on different FreeBSD releases)

- do you have numbers on what is the fastest rate at which callouts
  can be fired (e.g. say you have a callout which increments a
  counter and schedules the next callout in (struct bintime){0,1} ) ?

- is there a possibility that if callout requests are too close to each
  other  (e.g. the above test) the thread dispatching callouts will
  run forever ? if so, is there a way to make such thread yield
  after a while ?

- since you mentioned nanosleep(), poll() and select() have been
  ported to the new callout, is there a way to guarantee that users
  calling these functions with a very short timeout are actually
  descheduled as opposed to "interval too short, don't bother" ?

- do you have numbers on how many calls per second we can
  have for a process that does
      for (;;) { nanosleep(min_value_that_causes_descheduling); }
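
As a rough illustration of the last measurement, a userland sketch
along these lines could report calls per second for a given request.
The names ts_diff_ns() and nanosleep_rate() are made up for this
example, and the measured rate alone does not tell whether the
process was actually descheduled:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from a to b (b must not be earlier than a). */
static int64_t
ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
	return ((int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL +
	    (b->tv_nsec - a->tv_nsec));
}

/*
 * Hypothetical helper: call nanosleep() ncalls times with a request
 * of sleep_ns nanoseconds and return the measured calls per second.
 */
static double
nanosleep_rate(long sleep_ns, int ncalls)
{
	struct timespec req = { 0, sleep_ns };
	struct timespec start, end;
	int64_t elapsed;
	int i;

	assert(sleep_ns >= 0 && sleep_ns < 1000000000L && ncalls > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ncalls; i++)
		nanosleep(&req, NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);
	elapsed = ts_diff_ns(&start, &end);
	if (elapsed <= 0)
		elapsed = 1;	/* guard against a zero-length interval */
	return ((double)ncalls * 1e9 / (double)elapsed);
}
```

Calling e.g. nanosleep_rate(1000, 100000) before and after the patch
would give two figures to compare.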

I also have some comments on the diff:
- can you provide a diff -p ?

- for several functions the only change is the name of an argument
  from "busy" to "us". Can you elaborate on the reason for the change,
  and whether "us" means microseconds or the pronoun ?

Finally, a more substantial comment:
- a lot of functions which formerly had only a "timo" argument
  now have "timo, bt, precision, flags". Take seltdwait() as an example.

  It seems that you have been undecided between two approaches:
  for some of these functions you have preserved the original function
  that deals with ticks and introduced a new one that deals with the
  bintime, whereas in other cases you have modified the original
  function to add "bt, precision, flags".

  I would suggest a more uniform approach, namely:
  - preserve all the existing functions (T) that take a timeout in ticks;
  - add a new set of corresponding functions (BT) that take
    bt, precision, flags _instead_ of the ticks
  - the functions (T) immediately convert the ticks to
    bintime(s), using macros or inline functions
  - optionally, convert kernel components to the new (BT) functions
    where this makes sense (e.g. we can exploit the finer-granularity
    of the new calls, etc.)
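
  A minimal sketch of the (T)/(BT) split, assuming hypothetical names
  throughout: struct bt_sketch stands in for struct bintime, SKETCH_HZ
  for hz, and sketch_wait()/sketch_wait_bt() for any such pair of
  functions. It treats the timeout as a relative interval and only
  reports it in microseconds instead of arming a timer:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ	1000	/* assumed tick rate, stand-in for hz */

/* Hypothetical mirror of struct bintime: seconds + 2^-64 s fraction. */
struct bt_sketch {
	int64_t		sec;
	uint64_t	frac;
};

/* Convert a tick count to the bintime-like representation. */
static inline struct bt_sketch
ticks_to_bt(int ticks, int hz)
{
	struct bt_sketch bt;

	assert(ticks >= 0 && hz > 0);
	bt.sec = ticks / hz;
	/* (ticks % hz) < hz, so the product cannot overflow 64 bits. */
	bt.frac = (uint64_t)(ticks % hz) * (UINT64_MAX / (uint64_t)hz);
	return (bt);
}

/*
 * (BT) variant: takes bt, precision, flags instead of ticks.
 * Here it only returns the timeout in microseconds (truncated).
 */
static int64_t
sketch_wait_bt(struct bt_sketch bt, struct bt_sketch precision, int flags)
{
	(void)precision;
	(void)flags;
	return (bt.sec * 1000000 +
	    (int64_t)(((bt.frac >> 32) * 1000000) >> 32));
}

/* (T) variant: keeps the old signature, converts, then calls (BT). */
static inline int64_t
sketch_wait(int timo)
{
	struct bt_sketch zero = { 0, 0 };

	return (sketch_wait_bt(ticks_to_bt(timo, SKETCH_HZ), zero, 0));
}
```

  With this shape existing callers keep using the (T) signature
  unchanged, and only the callers that want finer granularity need
  to switch to (BT).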

cheers
luigi

> 1) callout(9) is no longer constrained to the resolution a periodic
> "hz" clock can give. To achieve that, the eventtimers(4) subsystem
> is used as backend.
> 2) Contrary to what was discussed in the past, we kept the callwheel
> as the underlying data structure for keeping track of the outstanding
> timeouts. This choice has a couple of advantages; in particular, we
> can still benefit from the O(1) average complexity of the wheel for
> all the operations. Also, we thought the code duplication that would
> arise from the use of a two-staged backend for callout (e.g. a wheel
> for coarse resolution events and another data structure, such as a
> heap, for high resolution events) is unacceptable. In fact, since
> callout gained the ability to migrate from one CPU to another, having
> a double backend would mean doubling the code for the migration path.
> 3) A way to dispatch callouts from hardware interrupt context has
> been implemented, using a special callout flag. This has limited
> applicability, but avoids the dispatching of a SWI thread for handling
> specific callouts, avoiding the wake up of another CPU for processing
> and a (relatively useless) context switch.
> 4) Since the new callout mechanism deals with bintime and no longer
> with ticks, time is specified as absolute rather than relative. To
> get the current time, binuptime() or getbinuptime() is used, and a
> sysctl is introduced to selectively choose the function to use, based
> on a precision threshold.
> 5) A mechanism for specifying precision tolerance has been
> implemented. The callout processing mechanism has been adapted and the
> callout data structure augmented so that the codepath can take
> advantage of it and aggregate events which overlap in time.
>
>
> The newly proposed KPI for callout is the following:
> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ...,
> int flags)
> where the ‘time’ argument represents the time at which the callout
> should fire, ‘pr’ represents the precision tolerance expressed as an
> absolute value, and ‘flags’ can be used to specify new features, i.e.
> for now, the possibility to run the callout from fast interrupt
> context.
> The old KPI has been extended by introducing the callout_reset_flags()
> function, which is the same as callout_reset*(), but takes an
> additional argument ‘int flags’ that can be used in the same way as
> the ‘flags’ argument of the new KPI. Using the ‘flags’, consumers
> can also specify relative precision tolerance in terms of a
> power-of-two portion of the timeout passed as ticks.
> Using this strategy, the new precision mechanism can be used for the
> existing services without major modifications.
>
> Some consumers have been ported to the new KPI, in particular
> nanosleep(), poll(), select(), because they take immediate advantage
> of the arbitrary precision offered by the new infrastructure.
> For some statistics about the outcome of the conversion to the new
> service, please refer to the end of this e-mail:
> http://lists.freebsd.org/pipermail/freebsd-arch/2012-July/012756.html
> We didn't measure any significant performance regressions with
> hwpmc(4), using some benchmark programs:
> http://people.freebsd.org/~davide/poll_test/poll_test.c
> http://people.freebsd.org/~mav/testsleep.c
> http://people.freebsd.org/~mav/testidle.c
>
> We tested the code on amd64, MIPS and arm. Any kind of testing or
> comment would be really appreciated. The full diff of the work against
> HEAD can be found at: http://people.freebsd.org/~davide/calloutng.diff
> If no one has objections, we plan to merge the repository to HEAD in a
> week or so.
>
> Thanks,
>
> Davide
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
>

-- 
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, rizzo_at_iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
 TEL      +39-050-2211611               . via Diotisalvi 2
 Mobile   +39-338-6809875               . 56122 PISA (Italy)
-----------------------------------------+-------------------------------