Re: ntpd segfaults on start

From: Warner Losh <imp_at_bsdimp.com>
Date: Mon, 9 Sep 2019 15:42:43 -0600
On Mon, Sep 9, 2019 at 3:12 PM Ian Lepore <ian_at_freebsd.org> wrote:

> On Mon, 2019-09-09 at 21:44 +0300, Konstantin Belousov wrote:
> > On Mon, Sep 09, 2019 at 12:13:24PM -0600, Ian Lepore wrote:
> > > On Mon, 2019-09-09 at 09:30 -0700, Rodney W. Grimes wrote:
> > > > > On Sat, 2019-09-07 at 09:28 -0700, Cy Schubert wrote:
> > > > > > In message <20190907161749.GJ2559_at_kib.kiev.ua>, Konstantin
> > > > > > Belousov writes:
> > > > > > > On Sat, Sep 07, 2019 at 08:45:21AM -0700, Cy Schubert
> > > > > > > wrote:
> > > > > > > > [...]
> > > >
> > > > Doesn't locking this memory down also protect ntpd from OOM kills?
> > > > If so that is a MUST preserve functionality, as IMHO killing ntpd
> > > > on a box that has it configured is a total no win situation.
> > > >
> > >
> > > Does it have that effect?  I don't know.  But I would argue that that's
> > > a separate issue, and we should make that happen by adding
> > > ntpd_oomprotect=YES to /etc/defaults/rc.conf
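
For reference, that's a one-line knob; as far as I recall, rc.subr
implements it by running protect(1) (procctl(2) PROC_SPROTECT) on the
daemon after it starts, though treat that detail as from memory:

  # in /etc/defaults/rc.conf, or per-host in /etc/rc.conf
  ntpd_oomprotect="YES"
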
> >
> > Wiring process memory has no effect on OOM selection. Moreover, because
> > all potentially allocated pages are allocated for real after mlockall(),
> > the size of the vmspace, as accounted by the OOM selector, is the largest
> > possible size over the whole lifetime of the process.
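
For concreteness, the call being discussed is the classic pattern below; a
minimal sketch, not ntpd's actual code (which, if memory serves, also bumps
RLIMIT_MEMLOCK before doing this):

#include <sys/mman.h>
#include <stdio.h>

/*
 * After this returns successfully, every page currently mapped (and
 * anything mapped later) is faulted in and held resident, so the vmspace
 * immediately reaches its maximum size, even though OOM selection itself
 * is unaffected.
 */
int
wire_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return (-1);
    }
    return (0);
}
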
> >
> > On the other hand, the code execution times are not predictable if the
> > process's pages can be paged out. Under severe load, the next instruction
> > might take several seconds or even minutes to start, which is quite unlike
> > scheduler delays. That introduces jitter into the local time measurements
> > and into their use in userspace. Wouldn't this affect the accuracy?
> >
>
> IMO, there is a large gap between "in theory, paging could cause
> indeterminate delays in code execution" and "time will be inaccurate on
> your system".  If there were a delay in a part of the code where it
> matters that amounted to "seconds or even minutes", what you'd end up
> with is a measurement that would be discarded by the median filter as
> an outlier.  There would be some danger that if that kind of delay
> happened for too many polling cycles in a row, you'd end up with no
> usable measurements after a while and clock accuracy would suffer.
> Sub-second delays would be more worrisome because they might not be
> rejected as outliers.
>
> There are only a couple code paths in freebsd ntpd processing where a
> paging (or scheduling) delay could cause measurement inaccuracy:
>
>  - When stepping the clock, the code that runs between calling
> clock_gettime() and calling clock_settime() to apply the step
> adjustment to the clock.
>
>  - When beginning an exchange with or replying to a peer, the code that
> runs between obtaining system time for the outgoing Transmit Timestamp
> and actually transmitting that packet.
>
> Stepping the clock typically only happens once at startup.  The ntpd
> code itself recognizes that this is a time-critical path (it has
> comments to that effect) but unfortunately the code that runs is
> scattered among several different .c files so it's hard to say what the
> likelihood is that code in the critical section will all be in the same
> page (or be already-resident because other startup-time code faulted in
> those pages).  IMO, the right fix for this would be a kernel interface
> that let you apply a step-delta to the clock with a single syscall
> (perhaps as an extension to the existing ntp_adjtime() using a new mode
> flag).
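
To make that window concrete, the step today has to look something like the
sketch below (not ntpd's actual code; assume delta is already normalized
with 0 <= tv_nsec < 1e9). Anything that faults or preempts between the two
syscalls is added straight into the step:

#include <time.h>

static int
step_clock(const struct timespec *delta)
{
    struct timespec now;

    if (clock_gettime(CLOCK_REALTIME, &now) != 0)
        return (-1);

    /* Any page fault or preemption here widens the error of the step. */
    now.tv_sec += delta->tv_sec;
    now.tv_nsec += delta->tv_nsec;
    if (now.tv_nsec >= 1000000000L) {
        now.tv_sec++;
        now.tv_nsec -= 1000000000L;
    }

    /*
     * A hypothetical ntp_adjtime() mode that took the delta itself would
     * let the kernel do this read-modify-write atomically instead.
     */
    return (clock_settime(CLOCK_REALTIME, &now));
}
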
>
> On freebsd, the Receive timestamps are captured in the kernel and
> delivered along with the packet to userland, and are retrieved by the
> ntpd code from the SCM_BINTIME control message in the packet, so there
> is no latency problem in the receive path.  There isn't a corresponding
> kernel mechanism for setting the outgoing timestamps, so whether it's
> originating a request to a peer or replying to a request from a peer,
> the transmit timestamp could be wrong due to:
>
>  - paging delays
>  - scheduler delays
>  - network stack, outgoing queues, and driver delays
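
For the receive path just described, the retrieval looks roughly like the
sketch below (from memory, not ntpd's code; it assumes the socket was set
up with setsockopt(fd, SOL_SOCKET, SO_BINTIME, ...)):

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>
#include <string.h>

/*
 * With SO_BINTIME enabled, the kernel records the arrival time of each
 * datagram and hands it up as an SCM_BINTIME control message, so userland
 * paging or scheduling delays cannot corrupt the receive timestamp.
 */
static ssize_t
recv_with_kernel_timestamp(int fd, void *buf, size_t len, struct bintime *bt)
{
    char cbuf[CMSG_SPACE(sizeof(struct bintime))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    ssize_t n = recvmsg(fd, &msg, 0);

    if (n >= 0) {
        for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm != NULL;
            cm = CMSG_NXTHDR(&msg, cm)) {
            if (cm->cmsg_level == SOL_SOCKET &&
                cm->cmsg_type == SCM_BINTIME) {
                memcpy(bt, CMSG_DATA(cm), sizeof(*bt));
                break;
            }
        }
    }
    return (n);
}
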
>
> So the primary vulnerability is on the transmit path between obtaining
> system time and the packet leaving the system.  A quick glance at that
> code makes me think that most of the data being touched has already
> been referenced pretty recently during the process of assembling the
> outgoing packet, so it's unlikely that storing the timestamp into the
> outgoing packet or the other bit of work that happens after that
> triggers a pagein unless the system is pathologically overloaded.
> Naturally, obtaining the timestamp and putting it into the packet is
> one of the last things it does before sending, so the code path is
> relatively short, but it's not clear to me whether it's likely or not
> that the code involved all lives in the same page.  Still, it's one of
> the heavily exercised paths within ntpd, which should increase the odds
> of the pages being resident because of recent use.
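
For contrast, the transmit side is stuck with something like this (again a
rough sketch, not ntpd's code; byte-order conversion elided). The timestamp
is taken in userland, and everything after clock_gettime(), including the
kernel's own output path, is delay the peer cannot see:

#include <sys/types.h>
#include <sys/socket.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Offset of the Transmit Timestamp field in a 48-byte NTP packet. */
#define XMT_OFFSET 40

static ssize_t
send_with_xmt_timestamp(int fd, unsigned char *pkt, size_t len,
    const struct sockaddr *to, socklen_t tolen)
{
    struct timespec ts;
    uint64_t xmt;

    clock_gettime(CLOCK_REALTIME, &ts);
    /* 64-bit NTP format: seconds since 1900 plus a 32-bit fraction. */
    xmt = ((uint64_t)ts.tv_sec + 2208988800ULL) << 32 |
        (uint64_t)((double)ts.tv_nsec * 4294967296.0 / 1e9);
    memcpy(pkt + XMT_OFFSET, &xmt, sizeof(xmt));  /* endianness omitted */

    /*
     * Everything from here until the packet hits the wire (paging,
     * scheduling, queueing, the driver) happens after the timestamp.
     */
    return (sendto(fd, pkt, len, 0, to, tolen));
}
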
>
> So, I'm not disputing the point that a sufficiently overloaded system
> can lead to an indeterminate delay between *any* two instructions
> executed in userland.  What I've said above is more along the lines of
> considering the usual situation, not the most pathological one.  In the
> most pathological cases, either the delays introduced are fairly minor
> and you get some minor jitter in system time (ameliorated by the median
> filtering built in to ntpd), or the delays are major (a full second or
> more) and get rejected as outliers, not affecting system time at all
> unless the situation persists and prevents getting any good
> measurements for many hours.
>

I've read through all this and agree with it as well. Paging delays can
happen, but if they do who cares: the measurements will be rejected as
outliers for long delays, but might introduce some noise if the delay is on
the order of tens of milliseconds. Shorter delays won't matter long term: they
will average out. Longer ones will definitely be rejected. And it will likely
affect just the first packet in the exchange, since the code path will be
paged/swapped into the working set after that.  The loop is a combination of
PLL/FLL so that if we think there's a big phase step, we'll also think
there's a frequency error. Both are steered out the same way: by setting
the offset. But ntpd is wise enough to know that there will be noisy
measurements, so even if the measurements make it through the filters, we
only remove a portion of the error each polling interval anyway. Any over-
or undershoot will be corrected the next measurement interval. And if you
have so much memory pressure that ntpd is paged out every single
measurement interval, your system likely needs careful tuning anyway.
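
A toy version of that partial-correction behaviour (a generic first-order
loop, not ntpd's actual discipline algorithm) shows why a single bad
measurement can't do much damage:

#include <stdio.h>

int
main(void)
{
    double clock_err = 0.0;    /* our clock's real error vs. reference, s */
    const double gain = 0.25;  /* fraction of the offset removed per poll */
    const double noise[] = { 0.010, 0, 0, 0, 0, 0 };  /* one 10 ms outlier */

    for (int i = 0; i < 6; i++) {
        double reported = clock_err + noise[i];  /* offset as measured */
        clock_err -= gain * reported;            /* partial correction */
        printf("poll %d: reported %+.4f s, real error now %+.4f s\n",
            i, reported, clock_err);
    }
    return (0);
}

The spurious 10 ms shows up as only a 2.5 ms real error, and the following
clean polls steer that back out geometrically.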

In days of yore (like the mid 90s), these defaults were set up when it took
tens or even hundreds of milliseconds to page something in, so it mattered.
Today those
numbers are submillisecond to single digit milliseconds, so in the typical
case the disruption is much less severe than it was when things were
initially locked into memory. I'm guessing that's why Linux has moved to -1
as the default. In addition, ntpd's algorithms have improved somewhat since
then as well to cope with noise. These days, good ntpd performance is
submillisecond over the internet, so the noise has to be approximately on
that order to affect the filters in ntpd. On the LAN you're good to 10's of
microseconds, typically, so any noise > several 10's of microseconds would
be eliminated as an outlier anyway.  I strongly suspect careful
measurements will declare no difference in performance except, maybe, on
the most overloaded of servers (and then it would need to be extremely
overloaded to have a delay in scheduling often enough to matter).

For people that want to be sure, by all means lock it into memory. But as a
default, I'm extremely skeptical one could measure a difference, at all,
let alone measure a difference that matters.

Warner
Received on Mon Sep 09 2019 - 19:42:56 UTC
