Re: ntpd segfaults on start

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Tue, 10 Sep 2019 09:09:49 +0300
On Mon, Sep 09, 2019 at 03:42:43PM -0600, Warner Losh wrote:
> On Mon, Sep 9, 2019 at 3:12 PM Ian Lepore <ian_at_freebsd.org> wrote:
> 
> > On Mon, 2019-09-09 at 21:44 +0300, Konstantin Belousov wrote:
> > > On Mon, Sep 09, 2019 at 12:13:24PM -0600, Ian Lepore wrote:
> > > > On Mon, 2019-09-09 at 09:30 -0700, Rodney W. Grimes wrote:
> > > > > > On Sat, 2019-09-07 at 09:28 -0700, Cy Schubert wrote:
> > > > > > > In message <20190907161749.GJ2559_at_kib.kiev.ua>, Konstantin
> > > > > > > Belousov writes:
> > > > > > > > On Sat, Sep 07, 2019 at 08:45:21AM -0700, Cy Schubert
> > > > > > > > wrote:
> > > > > > > > > [...]
> > > > >
> > > > > Doesn't locking this memory down also protect ntpd from OOM kills?
> > > > > If so that is a MUST preserve functionality, as IMHO killing ntpd
> > > > > on a box that has it configured is a total no win situation.
> > > > >
> > > >
> > > > Does it have that effect?  I don't know.  But I would argue that that's
> > > > a separate issue, and we should make that happen by adding
> > > > ntpd_oomprotect=YES to /etc/defaults/rc.conf
> > >
> > > Wiring process memory has no effect on OOM selection. Moreover, because
> > > all potentially allocated pages are allocated for real after mlockall(),
> > > the size of the vmspace, as accounted by OOM, is the largest possible
> > > size from the whole lifetime.
> > >
> > > On the other hand, code execution times are not predictable if the
> > > process's pages can be paged out. Under severe load the next instruction
> > > might take several seconds or even minutes to start, which is quite
> > > unlike scheduler delays. That introduces jitter into the local time
> > > measurements and their use in userspace. Wouldn't this affect
> > > the accuracy?
> > >
> >
> > IMO, there is a large gap between "in theory, paging could cause
> > indeterminate delays in code execution" and "time will be inaccurate on
> > your system".  If there were a delay in a part of the code where it
> > matters that amounted to "seconds or even minutes", what you'd end up
> > with is a measurement that would be discarded by the median filter as
> > an outlier.  There would be some danger that if that kind of delay
> > happened for too many polling cycles in a row, you'd end up with no
> > usable measurements after a while and clock accuracy would suffer.
> > Sub-second delays would be more worrisome because they might not be
> > rejected as outliers.
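[The outlier rejection described above can be sketched as follows. This is a hypothetical illustration, not ntpd's actual clock_filter code; the function name, the 64-sample cap, and the 128 ms cutoff in the usage below are inventions for the example.]

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Discard offset samples farther than `limit` seconds from the median,
 * the way a multi-second paging stall would be thrown out as an outlier
 * while small sub-second noise slips through. */

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Returns the number of samples kept; survivors are packed into out[]. */
size_t clip_outliers(const double *samples, size_t n, double limit,
                     double *out)
{
    double sorted[64];
    size_t kept = 0;

    if (n == 0 || n > 64)
        return 0;
    memcpy(sorted, samples, n * sizeof(double));
    qsort(sorted, n, sizeof(double), cmp_double);
    double median = (n & 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    for (size_t i = 0; i < n; i++)
        if (fabs(samples[i] - median) <= limit)
            out[kept++] = samples[i];
    return kept;
}
```

With millisecond-scale offsets and one 2.5 s paging stall, a 0.128 s cutoff drops only the stall; a 50 ms stall would survive and show up as noise, which is the sub-second case flagged as more worrisome.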
> >
> > There are only a couple code paths in freebsd ntpd processing where a
> > paging (or scheduling) delay could cause measurement inaccuracy:
> >
> >  - When stepping the clock, the code that runs between calling
> > clock_gettime() and calling clock_settime() to apply the step
> > adjustment to the clock.
> >
> >  - When beginning an exchange with or replying to a peer, the code that
> > runs between obtaining system time for the outgoing Transmit Timestamp
> > and actually transmitting that packet.
> >
> > Stepping the clock typically only happens once at startup.  The ntpd
> > code itself recognizes that this is a time-critical path (it has
> > comments to that effect) but unfortunately the code that runs is
> > scattered among several different .c files so it's hard to say what the
> > likelihood is that code in the critical section will all be in the same
> > page (or be already-resident because other startup-time code faulted in
> > those pages).  IMO, the right fix for this would be a kernel interface
> > that let you apply a step-delta to the clock with a single syscall
> > (perhaps as an extension to the existing ntp_adjtime() using a new mode
> > flag).
> >
> > On freebsd, the Receive timestamps are captured in the kernel and
> > delivered along with the packet to userland, and are retrieved by the
> > ntpd code from the SCM_BINTIME control message in the packet, so there
> > is no latency problem in the receive path.  There isn't a corresponding
> > kernel mechanism for setting the outgoing timestamps, so whether it's
> > originating a request to a peer or replying to a request from a peer,
> > the transmit timestamp could be wrong due to:
> >
> >  - paging delays
> >  - scheduler delays
> >  - network stack, outgoing queues, and driver delays
> >
> > So the primary vulnerability is on the transmit path between obtaining
> > system time and the packet leaving the system.  A quick glance at that
> > code makes me think that most of the data being touched has already
> > been referenced pretty recently during the process of assembling the
> > outgoing packet, so it's unlikely that storing the timestamp into the
> > outgoing packet or the other bit of work that happens after that
> > triggers a pagein unless the system is pathologically overloaded.
> > Naturally, obtaining the timestamp and putting it into the packet is
> > one of the last things it does before sending, so the code path is
> > relatively short, but it's not clear to me whether it's likely or not
> > that the code involved all lives in the same page.  Still, it's one of
> > the heavily exercised paths within ntpd, which should increase the odds
> > of the pages being resident because of recent use.
> >
> > So, I'm not disputing the point that a sufficiently overloaded system
> > can lead to an indeterminate delay between *any* two instructions
> > executed in userland.  What I've said above is more along the lines of
> > considering the usual situation, not the most pathological one.  In the
> > most pathological cases, either the delays introduced are fairly minor
> > and you get some minor jitter in system time (ameliorated by the median
> > filtering built in to ntpd), or the delays are major (a full second or
> > more) and get rejected as outliers, not affecting system time at all
> > unless the situation persists and prevents getting any good
> > measurements for many hours.
> >
> 
> I've read through all this and agree with it as well. Paging delays can
> happen, but if they do who cares: the measurements will be rejected as
> outliers for long delays, but might introduce some noise if the delay is on
> the order of tens of milliseconds. Shorter ones won't matter long term: they
> will average out. Longer will definitely be rejected. And it will likely be
> just for the first packet in the exchange since the code path will be
> paged/swapped into the working set for that.  The loop is a combination of
> PLL/FLL so that if we think there's a big phase step, we'll also think
> there's a frequency error. Both are steered out the same way: by setting
> the offset. But ntpd is wise enough to know that there will be noisy
> measurements, so even if the measurements make it through the filters, we
> only remove a portion of the error each polling interval anyway. Any over
> or undershoot will be corrected the next measurement interval. And if you
> have so much memory pressure that ntpd is paged out every single
> measurement interval, your system likely needs careful tuning anyway.
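[The "remove a portion of the error each polling interval" behavior can be modeled with a toy loop. This is not ntpd's actual PLL/FLL discipline; the function and the 0.25 gain in the test are inventions to show that a one-off bad measurement decays geometrically instead of being slammed into the clock at once.]

```c
/* Toy clock discipline: each polling interval, correct only a fixed
 * fraction (gain) of the remaining offset error. Returns the residual
 * error after `intervals` polling intervals. */
double discipline(double offset_error, double gain, int intervals)
{
    for (int i = 0; i < intervals; i++)
        offset_error -= gain * offset_error;  /* remove a portion */
    return offset_error;
}
```

With a gain of 0.25, a 10 ms injected error shrinks by a factor of 0.75 per interval, falling under 0.2 ms after 16 intervals, which is why noise that sneaks past the filters still averages out.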
> 
> In days of yore (like the mid 90s), these defaults were set up when paging
> something in took tens or even hundreds of milliseconds, so it mattered. Today those
> numbers are submillisecond to single digit milliseconds, so in the typical
> case the disruption is much less severe than it was when things were
> initially locked into memory. I'm guessing that's why Linux has moved to -1
> for default. In addition, ntpd's algorithms have improved somewhat since
> then as well to cope with noise. These days, good ntpd performance is
> submillisecond over the internet, so the noise has to be approximately on
> that order to affect the filters in ntpd. On the LAN you're good to tens of
> microseconds, typically, so any noise over several tens of microseconds would
> be eliminated as an outlier anyway.  I strongly suspect careful
> measurements will declare no difference in performance except, maybe, on
> the most overloaded of servers (and then it would need to be extremely
> overloaded to have a delay in scheduling often enough to matter).
> 
> For people that want to be sure, by all means lock it into memory. But as a
> default, I'm extremely skeptical one could measure a difference, at all,
> let alone measure a difference that matters.

From the overall system performance standpoint, I am all for no longer
wiring ntpd. One of the unfortunate consequences of the current wiring is
that rtld, libc, and libthr get wired along with it.
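[For concreteness, the "wiring" at issue comes down to a single call. This is an illustrative sketch, not ntpd's actual call site: `wire_all` is an invented wrapper, and the munlockall() is only there so the demonstration undoes itself.]

```c
#include <errno.h>
#include <sys/mman.h>

/* What "wiring" means here, in miniature: mlockall(MCL_CURRENT|MCL_FUTURE)
 * wires every currently mapped page -- including the rtld, libc, and
 * libthr mappings -- and every future allocation, which is why the
 * vmspace reaches its largest possible size up front.
 * Returns 0 on success, otherwise errno (commonly EPERM or ENOMEM
 * when run without privilege or against resource limits). */
int wire_all(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    munlockall();   /* undo the wiring for this demonstration */
    return 0;
}
```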

Small note, besides large delays caused by real pageouts (hard faults),
there are small jitter-like delays when the pagedaemon emulates access and
dirty bits on machines that lack them by unmapping still-resident
pages. The cost of reinstalling the pages is just the cost of locking
and calculating PTEs, but locking makes it dependent on other system
activities. I think that ARM is the largest suspect there.
Received on Tue Sep 10 2019 - 04:10:05 UTC