Re: Nagios SIGSEGV on FreeBSD 8

From: Ian FREISLICH <ianf_at_clue.co.za> Date: Thu, 08 Oct 2009 15:08:34 +0200

Tim Kientzle wrote:
> Scott Lambert wrote:
> > I've posted this to FreeBSD-ports and Nagios-Users without a nibble.  
> >
> > [New Thread 28326280 (LWP 100051)]
> > [New Thread 28301140 (LWP 100222)]
> > (gdb) bt
> > #0  0x0807fe8b in get_next_comment_by_host ()
> > #1  0x08080940 in delete_host_acknowledgement_comments ()
> > #2  0x28331180 in ?? ()
> > #3  0x4aaac053 in ?? ()
> > #4  0x080cc394 in __JCR_LIST__ ()
> 
> Build with debug symbols and try again; maybe you can get
> more detail.  Also, check a couple of core dumps to
> see if it's crashing in the same place; that might
> also give a clue.
> 
> Do the "New Thread" messages mean that Nagios is running
> multiple threads?  If so, I wonder what the other
> thread is doing?
> 

We've been trying to combat a performance issue in Nagios.  One
thread handles incoming events (nsca etc) and data from the nagios.cmd
pipe file and writes files for processing in
/var/spool/nagios/checkresults.  The other thread processes these
files and updates the host state and other data.

The threading broke profiling (I think): when Nagios was compiled
with -pg it did no more than read its configuration.  Even that
alone pointed to the area where Nagios is poorly optimised: string
processing.  Reading our configuration resulted in 65,000,000 calls
to strcmp.  65 million!
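
As a rough illustration of how a comparison count like that can arise,
here is a minimal Python sketch.  It is not Nagios code, and it assumes
(without having checked the Nagios source) that named objects are found
by scanning an unsorted list with one string comparison per element.
The function names and the host names are hypothetical.

```python
def linear_lookup(names, target):
    """Scan an unsorted list; return (index, string comparisons made)."""
    comparisons = 0
    for i, name in enumerate(names):
        comparisons += 1  # one strcmp-style comparison per element visited
        if name == target:
            return i, comparisons
    return -1, comparisons


def total_comparisons(names):
    """Comparisons needed to look up every name once by linear scan."""
    return sum(linear_lookup(names, n)[1] for n in names)


def hashed_lookups(names):
    """Same lookups through a dict: count one hash probe per lookup."""
    index = {name: i for i, name in enumerate(names)}
    return sum(1 for n in names if n in index)


hosts = [f"host{i}" for i in range(1000)]  # hypothetical host names
linear_total = total_comparisons(hosts)    # n * (n + 1) / 2 comparisons
hashed_total = hashed_lookups(hosts)       # n probes
```

Under this assumption, looking up each of n names once costs
n * (n + 1) / 2 comparisons, so 5000 hosts would need 12,502,500 for a
single pass.  A few such passes over hosts, services and contacts would
reach tens of millions, whereas a hashed index needs about one probe
per lookup.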

We're battling to keep up with passive events from about 5000 hosts
every few minutes.  The nagios.cmd thread struggles to keep up
reading from the fifo when there are about 4000 writers, and the
worker thread struggles to parse the checkresults files.  They're
big, but not *that* big: maybe 80k to 120k lines, which take about 7
minutes to parse.  We also had to raise the limits to
kern.maxusers="1024" and kern.ipc.nmbclusters="131072" to keep the
system from running out of network resources.
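
To put the parse time in perspective, the figures above work out to a
few hundred lines per second.  A quick sketch of the arithmetic,
assuming the 7 minutes is an exact 420 seconds (it is only an estimate)
and using a hypothetical helper name:

```python
def lines_per_second(lines, minutes):
    """Average parse rate for a file of `lines` lines parsed in `minutes`."""
    return lines / (minutes * 60)


low = lines_per_second(80_000, 7)    # smaller checkresults file
high = lines_per_second(120_000, 7)  # larger checkresults file
print(f"{low:.0f} to {high:.0f} lines/s")  # prints "190 to 286 lines/s"
```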

Ian

--
Ian Freislich