Segmentation fault running ntpd

From: David Wolfskill <david_at_catwhisker.org>
Date: Sat, 18 Jul 2015 05:09:56 -0700
Lousy timing (no pun intended -- it's early in the day for me),
given the recent MFC, but as I was booting my laptop to yesterday's
head:

FreeBSD g1-245.catwhisker.org 11.0-CURRENT FreeBSD 11.0-CURRENT #127  r285652M/285652:1100077: Fri Jul 17 04:30:16 PDT 2015     root_at_g1-245.catwhisker.org:/common/S3/obj/usr/src/sys/CANARY  amd64

to build today's head (_at_r285670; still in progress as I type), I
happened to note [Oh, great -- we can no longer copy/paste from
console now??!?  Fine, I'll transcribe by hand.... :-(]:

...
bound to 172.17.1.245 -- renewal in 43200 seconds.
pid 544 (ntpd), uid 0: exited on signal 11 (core dumped)
Starting Network: lo0 em0 iwn0 lagg0.
...

Trying to examine the /ntpd.core, I see:
root_at_g1-245:/ # gdb `which ntpd` ntpd.core 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
Core was generated by `ntpd'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libcrypto.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.7
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x00000008011cd6a0 in sbrk () from /lib/libc.so.7
[New Thread 801c07400 (LWP 100122/<unknown>)]
[New Thread 801c06400 (LWP 100120/<unknown>)]
(gdb) bt
#0  0x00000008011cd6a0 in sbrk () from /lib/libc.so.7
#1  0x00000008ccbd4f34 in ?? ()
#2  0x0000000000000005 in ?? ()
#3  0x0000000801800448 in ?? ()
#4  0x00000008011ca888 in sbrk () from /lib/libc.so.7
#5  0x00000008018000c8 in ?? ()
#6  0x00000008018000c0 in ?? ()
#7  0x0000000000000208 in ?? ()
#8  0x0000000801c32fb0 in ?? ()
#9  0x0000000000000001 in ?? ()
#10 0x0000000801cc20c8 in ?? ()
#11 0x0000000000000030 in ?? ()
#12 0x0000000801cc20c8 in ?? ()
#13 0x00007fffffffe480 in ?? ()
#14 0x00000008011cd240 in sbrk () from /lib/libc.so.7
#15 0x0000000000000280 in ?? ()
#16 0x00000008014bbc70 in malloc_message () from /lib/libc.so.7
#17 0x00000008018000c0 in ?? ()
#18 0x0000000801800448 in ?? ()
#19 0x0000000000000032 in ?? ()
#20 0x0000000801800458 in ?? ()
#21 0x00000008014bbc68 in malloc_message () from /lib/libc.so.7
#22 0x0000000801cc2000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#23 0x00000008014bba60 in malloc_message () from /lib/libc.so.7
#24 0x0000000801cc20d8 in ?? ()
#25 0x00000000000000a0 in ?? ()
#26 0x0000000000000208 in ?? ()
#27 0x00007fffffffe4d0 in ?? ()
#28 0x00000008011bdd7a in _malloc_thread_cleanup () from /lib/libc.so.7
Previous frame inner to this frame (corrupt stack?)
(gdb) 

which seems... well, not especially useful, as far as I can tell.


This is (as mentioned above) on my laptop; as such, it is expected to
"wander" from one network to another.  Accordingly:

* Since it could be connected to a network I do not control, I use a
  packet filter (IPFW, in my case) to reduce my exposure from a
  possibly-hostile network.

* Rather than enabling ntpd in /etc/rc.conf, I use
  /etc/dhclient-exit-hooks to start ntpd after the laptop has a DHCP
  lease.  (For networks I control, I also set up the DHCP server to
  advertise what NTP server the DHCP clients should use, but the code in
  dhclient-exit-hooks merely prefers that, rather than requiring it.)

* In my world-view -- at least for networks I control -- DNS zone files
  are the Source of Truth with respect to hostname <-> IP address
  correspondence, and Dynamic DNS is Evil.  I populate my zone files
  with appropriate A & PTR records so that every assignable DHCP
  address has a PTR record, and the hostname to which it points has
  an A record that points back to that IP address.  Accordingly, I
  also use /etc/dhclient-exit-hooks so the laptop can find out what
  its hostname is, and set it accordingly.
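
For illustration only, the hook arrangement in the last two items might
be sketched roughly as below.  This is not the script actually in use on
the laptop.  The helper names pick_ntp_server and ptr_to_hostname and
the fallback host ntp.example.org are hypothetical.  It also assumes
that dhclient exports the lease details as $reason, $new_ip_address and
$new_ntp_servers (the last only if the client requests that option),
and that host(1) prints PTR answers as "... domain name pointer NAME.".

```shell
#!/bin/sh
# Hypothetical sketch of /etc/dhclient-exit-hooks logic (assumptions noted above).

# Print the first DHCP-advertised NTP server, or the fallback if none was given.
# $1 = space-separated server list (may be empty), $2 = fallback host.
pick_ntp_server() {
    advertised=$1
    fallback=$2
    for s in $advertised; do
        echo "$s"
        return 0
    done
    echo "$fallback"
}

# Read host(1) output on stdin and print the first PTR target,
# without the trailing dot.
ptr_to_hostname() {
    sed -n 's/.* domain name pointer \(.*\)\.$/\1/p' | head -n 1
}

# Act only when a lease has just been obtained.
case "${reason:-}" in
BOUND|REBOOT)
    # DNS is the source of truth: ask it what this address is called.
    name=$(host "$new_ip_address" | ptr_to_hostname)
    if [ -n "$name" ]; then
        hostname "$name"
    fi

    # Prefer the DHCP-advertised server, but do not require one.
    server=$(pick_ntp_server "${new_ntp_servers:-}" "ntp.example.org")
    # How the chosen server is handed to ntpd is deliberately left out here.
    logger -t dhclient-exit-hooks "preferred NTP server: $server"

    # ntpd is not enabled in rc.conf, so use the "one" form of the rc command.
    service ntpd onestart
    ;;
esac
```

The two helpers are kept free of side effects so they can be exercised
by hand without a lease, for example by piping a saved line of host(1)
output into ptr_to_hostname.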

Mind, I've been doing the above for well over a decade, so that doesn't
qualify as "new."

And most of the time, it Just Works (which is a significant reason I
keep doing it).

A couple of other things that are more recent, and possibly of
relevance:

* As alluded to above, I have the em0 & wlan0 (iwn(4)) NICs set up using
  Link Aggregation in "failover" mode.  In practice, I rarely use
  the em0 (wired) NIC -- I had originally done that based on a
  misperception of how things were set up at work, and
  then just left the configuration alone and relied on the wireless
  NIC.  (At home, I have things set up so that the failover would
  work, but doing so would be a little awkward for reasons that
  aren't relevant here.)

* I have the laptop configured to run xdm(1)... after the DHCP lease is
  acquired and the hostname is set.  My ~/.xsession script is set
  up so it fires up ssh-agent, requests a passphrase, and then
  (among other things) establishes an SSH session to the "mail hub"
  at home and re-establishes a tmux session where I'm running mutt
  to handle my email.  I've noticed that in head, these connections
  sometimes fail to get initialized, and sometimes will time out,
  while sessions started a few minutes later will have no problem.
  That seems peculiar, but was sufficiently ... well, "nebulous" that
  I didn't think it warranted a whine of its own here.  But on the
  chance that it's related to ntpd giving up the ghost prematurely,
  it seemed but a reasonable exercise of "Full Disclosure" to mention
  it in this context -- even though it's also something I've been doing
  since the (late) 1990s.

So: Any suggestions for either diagnosing what the root cause is or
changing the configuration so that the failure no longer occurs?

Thanks!

Peace,
david
-- 
David H. Wolfskill				david_at_catwhisker.org
Those who murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.

Received on Sat Jul 18 2015 - 10:38:51 UTC
