Re: ntpd segfaults on start

From: Ian Lepore <ian_at_freebsd.org> Date: Mon, 09 Sep 2019 15:12:18 -0600 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:21 UTC

On Mon, 2019-09-09 at 21:44 +0300, Konstantin Belousov wrote:
> On Mon, Sep 09, 2019 at 12:13:24PM -0600, Ian Lepore wrote:
> > On Mon, 2019-09-09 at 09:30 -0700, Rodney W. Grimes wrote:
> > > > On Sat, 2019-09-07 at 09:28 -0700, Cy Schubert wrote:
> > > > > In message <20190907161749.GJ2559_at_kib.kiev.ua>, Konstantin
> > > > > Belousov writes:
> > > > > > On Sat, Sep 07, 2019 at 08:45:21AM -0700, Cy Schubert
> > > > > > wrote:
> > > > > > > [...]
> > > 
> > > Doesn't locking this memory down also protect ntpd from OOM kills?
> > > If so that is a MUST preserve functionality, as IMHO killing ntpd
> > > on a box that has it configured is a total no win situation.
> > > 
> > 
> > Does it have that effect?  I don't know.  But I would argue that that's
> > a separate issue, and we should make that happen by adding
> > ntpd_oomprotect=YES to /etc/defaults/rc.conf
> 
> Wiring process memory has no effect on OOM selection. More, because
> all potentially allocated pages are allocated for real after mlockall(),
> the size of the vmspace, as accounted by OOM, is the largest possible
> size from the whole lifetime.
> 
> On the other hand, the code execution times are not predictable if the
> process's pages can be paged out. Under severe load next instruction
> might take several seconds or even minutes to start. It is quite unlike
> the scheduler delays. That introduces a jitter in the local time
> measurements and their usage as done in userspace. Wouldn't this affect
> the accuracy ?
> 

IMO, there is a large gap between "in theory, paging could cause
indeterminate delays in code execution" and "time will be inaccurate on
your system".  If there were a delay in a part of the code where it
matters that amounted to "seconds or even minutes", what you'd end up
with is a measurement that would be discarded by the median filter as
an outlier.  There would be some danger that if that kind of delay
happened for too many polling cycles in a row, you'd end up with no
usable measurements after a while and clock accuracy would suffer. 
Sub-second delays would be more worriesome because they might not be
rejected as outliers.

There are only a couple code paths in freebsd ntpd processing where a
paging (or scheduling) delay could cause measurement inaccuracy:

 - When stepping the clock, the code that runs between calling
clock_gettime() and calling clock_settime() to apply the step
adjustment to the clock.

 - When beginning an exchange with or replying to a peer, the code that
runs between obtaining system time for the outgoing Transmit Timestamp
and actually transmitting that packet.

Stepping the clock typically only happens once at startup.  The ntpd
code itself recognizes that this is a time-critical path (it has
comments to that effect) but unfortunately the code that runs is
scattered among several different .c files so it's hard to say what the
likelyhood is that code in the critical section will all be in the same
page (or be already-resident because other startup-time code faulted in
those pages).  IMO, the right fix for this would be a kernel interface
that let you apply a step-delta to the clock with a single syscall
(perhaps as an extension to the existing ntp_adjtime() using a new mode
flag).

On freebsd, the Receive timestamps are captured in the kernel and
delivered along with the packet to userland, and are retrieved by the
ntpd code from the SCM_BINTIME control message in the packet, so there
is no latency problem in the receive path.  There isn't a corresponding
kernel mechanism for setting the outgoing timestamps, so whether it's
originating a request to a peer or replying to a request from a peer,
the transmit timestamp could be wrong due to:

 - paging delays
 - scheduler delays
 - network stack, outgoing queues, and driver delays

So the primary vulnerability is on the transmit path between obtaining
system time and the packet leaving the system.  A quick glance at that
code makes me think that most of the data being touched has already
been referenced pretty recently during the process of assembling the
outgoing packet, so it's unlikely that storing the timestamp into the
outgoing packet or the other bit of work that happens after that
triggers a pagein unless the system is pathologically overloaded. 
Naturally, obtaining the timestamp and putting it into the packet is
one of the last things it does before sending, so the code path is
relatively short, but it's not clear to me whether it's likely or not
that the code involved all lives in the same page.  Still, it's one of
the heavily exercised paths within ntpd, which should increase the odds
of the pages being resident because of recent use.

So, I'm not disputing the point that a sufficiently overloaded system
can lead to an indeterminate delay between *any* two instructions
executed in userland.  What I've said above is more along the lines of
considering the usual situation, not the most pathlogical one.  In the
most pathological cases, either the delays introduced are fairly minor
and you get some minor jitter in system time (ameliorated by the median
filtering built in to ntpd), or the delays are major (a full second or
more) and get rejected as outliers, not affecting system time at all
unless the situation persists and prevents getting any good
measurements for many hours.

-- Ian