Re: ntpd dies nightly on a server with jails

From: Ian Lepore <ian_at_freebsd.org>
Date: Fri, 17 Mar 2017 12:20:15 -0600

On Fri, 2017-03-17 at 18:05 +0100, O. Hartmann wrote:
> Am Wed, 15 Mar 2017 13:12:37 -0700
> Cy Schubert <Cy.Schubert_at_komquats.com> schrieb:
> 
> > 
> > Hi O.Hartmann,
> > 
> > I'll try to answer as much as I can in the noon hour I have left.
> > 
> > In message <20170315071724.78bb0bdc_at_freyja.zeit4.iv.bundesimmobilien.de>,
> > "O. Hartmann" writes:
> > > 
> > > Running a host with several jails on recent CURRENT (12.0-CURRENT
> > > #8 r315187: Sun Mar 12 11:22:38 CET 2017 amd64) causes me trouble
> > > on a daily basis.
> > > 
> > > The box is an older two-socket Fujitsu server equipped with two
> > > four-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz.
> > > 
> > > The box has several jails; no jail runs the ntpd service. Each
> > > jail has its dedicated loopback, lo1 through lo5 (for the moment),
> > > with a dedicated IP: 127.0.1.1 - 127.0.5.1 (if this matters; I
> > > believe not).
> > > 
> > > The host itself has two main NICs, Broadcom based. bcm0 is
> > > dedicated to the host, bcm1 is shared amongst the jails: each jail
> > > has an IP bound to bcm1 via which the jails communicate with the
> > > network.
> > > 
> > > I try to capture log information via syslog, but FreeBSD's ntpd
> > > seems to be very, very sparse with such information, converging
> > > to null - I can't see anything suitable in the logs explaining why
> > > ntpd dies almost every night, leaving the system with a wild reset
> > > of time. Sometimes it is a gain of 6 hours, sometimes it is only
> > > half an hour. I leave the box at 16:00 local time usually and take
> > > care of it again at ~ 7 o'clock in the morning local time.
> > We will need to turn on debugging. Unfortunately debug code is not
> > compiled 
> > into the binary. We have two options. You can either update 
> > src/usr.sbin/ntp/config.h to enable DEBUG or build the port (it's
> > the exact 
> > same ntp) with the DEBUG option -- this is probably simpler. Then
> > enable 
> > debug with -d and -D. -D increases verbosity. I just committed a
> > debug 
> > option to both ntp ports to assist here.
> > 
> > Next question: Do you see any indication of a core dump? I'd be
> > interested 
> > in looking at it if possible.
> > 
> > > 
> > > 
> > > When the clock is floating that wildly, in all cases ntpd isn't
> > > running any more. I try to restart with options -g and -G to
> > > adjust the time quickly at the beginning, which works fine.
> > This is disconcerting. If your clock is floating wildly without
> > ntpd 
> > running there are other issues that might be at play here. At most
> > the 
> > clock might drift a little, maybe a minute or two a day but not by
> > a lot. 
> > Does the drift cause your clocks to run fast or slow?
> > 
> > > 
> > > 
> > > Apart from possible misconfigurations of the jails (I'm quite new
> > > to jails and their pitfalls), I was wondering what causes ntpd to
> > > die. I can't determine exactly the time of its death, so it might
> > > be related to diurnal/periodic processes (I use only the most
> > > vanilla configurations on periodic, except for enabling the ZFS
> > > scrub check).
> > As I'm a little rushed for time, I didn't catch whether the jails 
> > themselves were also running ntpd... just thought I'd ask. I don't
> > see how 
> > zfs scrubbing or any other periodic scripts could cause this.
> > 
> > > 
> > > 
> > > I haven't had the chance to check whether the hardware is
> > > completely all right, but from a superficial point of view there
> > > is no issue with high gain of the internal clock or other hardware
> > > issues.
> > It's probably a good idea to check. I don't think that would cause
> > ntpd any 
> > gas. I've seen RTC battery messages on my gear which haven't caused
> > ntpd 
> > any problem. I have two machines which complain about RTC battery
> > being 
> > dead, where in fact I have replaced the batteries and the messages
> > still 
> > are displayed at boot. I'm not sure if it's possible for a kernel
> > to damage 
> > the RTC. In my case that doesn't cause ntpd any problems. It's
> > probably 
> > good to check anyway.
> > 
> > > 
> > > 
> > > If there are known issues with jails (the problem occurs since I
> > > use those),
> > > advice is appreciated.  
> > Not that I know of.
> > 
> > 
> Just some strange news:
> 
> I left the server the whole day with ntpd disabled and I didn't see
> the RTC gain even one second, even when stressing the machine.
> 
> But soon after restarting ntpd, I immediately noticed the time was
> 30 minutes off! This morning, the discrepancy was almost 5 hours - it
> looked more like a weird adjustment to another time base than UTC.
> 
> Over the weekend I'll leave the server with ntpd disabled and only
> the RTC running. I have the strange feeling that something is
> intentionally readjusting the ntpd time due to a misconfiguration or
> a rogue ntp server in the X.CC.pool.ntp.org pool.
> 

The rogue server theory is a bad one, unless you have configured just a
single server in your ntp.conf and it is the rogue.  Ntpd requires
agreement among the set of configured servers; it will ignore outliers.
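
As a rough illustration of why one bad server among several gets
ignored, here is a minimal Python sketch. It is NOT ntpd's actual
selection code (ntpd uses an intersection algorithm over correctness
intervals, followed by clustering); it only keeps the servers whose
offsets lie close to the median. The server names, the offsets and the
tolerance value are all made up for the example.

```python
from statistics import median


def select_agreeing(offsets, tolerance=0.128):
    """Split servers into (kept, rejected) by distance from the median offset.

    offsets   -- dict mapping a server name to its measured offset in seconds
    tolerance -- arbitrary cut-off in seconds chosen for this illustration;
                 it is not a parameter taken from ntpd
    """
    if not offsets:
        return {}, {}
    mid = median(offsets.values())
    kept = {name: off for name, off in offsets.items()
            if abs(off - mid) <= tolerance}
    rejected = {name: off for name, off in offsets.items()
                if name not in kept}
    return kept, rejected


# Hypothetical measurements: three servers agree, one is 30 minutes off.
pool = {"a": 0.004, "b": -0.002, "c": 0.007, "rogue": 1800.0}
kept, rejected = select_agreeing(pool)
print("kept:", sorted(kept), "rejected:", sorted(rejected))

# With a single configured server there is nothing to compare against,
# so the rogue is trusted.
single_kept, single_rejected = select_agreeing({"rogue": 1800.0})
print("kept:", sorted(single_kept), "rejected:", sorted(single_rejected))
```

With four servers the outlier ends up in the rejected set; with only
one server configured, the same bad offset is accepted because no
majority exists to outvote it.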

It would help to have some actual data.  What does ntpq -p show right
after starting ntpd?  Then a few minutes later, then again 10 minutes
after that, etc.  What is in the /var/db/ntpd.drift file?  Are you
using the standard FreeBSD ntp.conf file as delivered, or have you
customized it?  Any non-default settings in your rc.conf related to
ntp?
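
If it helps with collecting those snapshots, below is a small Python
sketch that pulls the tally code, reach and offset out of the ntpq -p
peer billboard. It assumes the usual ten-column layout (remote, refid,
st, t, when, poll, reach, delay, offset, jitter, with delay, offset and
jitter in milliseconds) preceded by two header lines. The function
names and the sample hosts and values are invented for the example;
snapshot() needs ntpq in the PATH and is not called here.

```python
import subprocess


def parse_ntpq_peers(text):
    """Parse 'ntpq -p' output into a list of dicts, one per peer line."""
    peers = []
    for line in text.splitlines()[2:]:       # skip the two header lines
        if not line.strip():
            continue
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:                # unexpected layout, skip the line
            continue
        peers.append({
            "tally": tally,                  # '*' marks the current system peer
            "remote": fields[0],
            "reach": fields[6],              # octal reachability register, kept as text
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers


def snapshot():
    """Run 'ntpq -p' once and return the parsed peer list."""
    result = subprocess.run(["ntpq", "-p"], capture_output=True,
                            text=True, check=True)
    return parse_ntpq_peers(result.stdout)


# Invented sample output, only used to exercise the parser.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.net 192.0.2.10       2 u   33   64  377   12.345    0.512   0.210
+ntp2.example.net 192.0.2.20       2 u   12   64  377   15.001   -1.250   0.330
 ntp3.example.net .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""

for peer in parse_ntpq_peers(SAMPLE):
    print(peer)
```

Calling snapshot() right after starting ntpd and again at intervals,
and logging the results with a timestamp, would show whether one peer's
offset suddenly jumps or whether all peers move together.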

-- Ian