BIND9 performance issues with SMP

From: JINMEI Tatuya / 神明達哉 <jinmei_at_isl.rdc.toshiba.co.jp>
Date: Wed, 22 Dec 2004 08:51:00 +0900
Hello,

I was recently playing with FreeBSD 5.3's SMP kernel and BIND9 to
measure the response performance using multiple threads.  Perhaps this
is already well-known, but the result showed using threads with
FreeBSD 5.3 didn't improve the performance (rather, it actually
degraded the performance as we increased threads/CPUs).

In short, it does not make sense to enable threading on FreeBSD in
any case (even with multiple CPUs).

I'm going to describe what I found in this experiment in detail.  I
hope some of the following is new information that can help improve
FreeBSD's SMP support in general.

- tested environments
  OS: FreeBSD 5.3 beta 7 and RC1 (I believe the result should be the
      same with 5.3-RELEASE)
  Machine: Xeon 700MHz x 4 / Xeon 3000MHz x 4
  BIND version: 9.3.0, built with --enable-threads

- measurement description
  named loaded the root zone file (as of around May 2003).  I measured
  the response performance in queries per second (qps) with the
  queryperf program (which comes with the BIND9 distribution).
  queryperf asked for various randomly generated host names (some of
  which result in NXDOMAIN).  All the numbers below show the resulting
  qps for a 30-second test.

- some general observations from the results

  1. BIND 9.3.0 does not create worker threads with the
     PTHREAD_SCOPE_SYSTEM attribute.  For FreeBSD, this means
     different worker threads cannot run on different CPUs, so
     multi-threading does not help performance at all (a sketch of
     the fix follows this list).

  2. generally, BIND9 takes many mutex locks to process a single DNS
     query, causing many lock contentions.  The contentions degrade
     response performance considerably.  This is true to some extent
     for any OS, but lock contentions seem particularly heavy on
     FreeBSD (see also item 4 below).

  3. the SMP support in the kernel generally performs well in terms of
     UDP input/output on a single socket.  However, the kernel uses a
     giant lock for the socket send buffer in the sosend() function,
     which can be a significant performance bottleneck with
     high-performance CPUs (the bottleneck was not revealed with
     700MHz processors, but did appear with 3000MHz CPUs).  It seems
     to me we can safely avoid the bottleneck for DNS servers, since
     UDP output does not use the socket send buffer.  I've made a
     quick-hack patch to the FreeBSD kernel and confirmed that this
     is the case.  (For those who are particularly interested in this
     patch, it's available at:
     http://www.jinmei.org/freebsd5.3-sosend-patch .
     A new socket option, SO_FAST1, on a UDP socket enables the
     optimization; a usage sketch follows this list.)

  4. mutex contentions are VERY expensive (they appear to be far more
     expensive than on other OSes), while acquiring an uncontended
     lock is reasonably cheap.  (Almost) every time a user thread
     blocks on a contended lock, it is suspended with a system call
     (kse_release), probably causing a context switch.  (I'm not
     really sure whether the system call overhead is the main cause
     of the performance penalty, though.)

  5. some standard library functions internally call
     pthread_mutex_lock(), which can also slow the server down due to
     the expensive contention tax.  For BIND9, malloc() and
     arc4random() can be heavy bottlenecks (the latter is called for
     every query if we use the "random" order for RRsets).

  6. at least so far, the ULE scheduler does not improve the
     performance (it even performs worse than the standard 4BSD
     scheduler).
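
To make item 1 concrete, here is a minimal sketch of creating a
worker thread with system scope.  The function names here are made
up for illustration and are not BIND9's own; the essential part is
the pthread_attr_setscope() call:

    #include <pthread.h>
    #include <stdio.h>

    static void *
    worker(void *arg)
    {
            /* ... process DNS queries ... */
            return (NULL);
    }

    int
    create_worker(pthread_t *tid)
    {
            pthread_attr_t attr;
            int ret;

            pthread_attr_init(&attr);
            /*
             * Without system scope, the worker threads never run on
             * separate CPUs on FreeBSD 5.3 (item 1 above).
             */
            ret = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
            if (ret != 0) {
                    fprintf(stderr, "pthread_attr_setscope: %d\n", ret);
                    return (ret);
            }
            ret = pthread_create(tid, &attr, worker, NULL);
            pthread_attr_destroy(&attr);
            return (ret);
    }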
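
Likewise, a server would enable the item-3 kernel optimization from
the application side roughly as below.  This is only a sketch: the
SO_FAST1 definition comes from the patch (I'm assuming it is an
SOL_SOCKET-level option in the patched <sys/socket.h>), so the
#ifdef keeps the code building on unpatched systems:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdio.h>

    /* Turn on the patched UDP output fast path for socket s. */
    int
    enable_fast_udp(int s)
    {
    #ifdef SO_FAST1
            int on = 1;

            if (setsockopt(s, SOL_SOCKET, SO_FAST1, &on,
                sizeof(on)) == -1) {
                    perror("setsockopt(SO_FAST1)");
                    return (-1);
            }
    #endif
            return (0);
    }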

- experiments with possible optimizations

Based on the above results, I've explored some optimizations to
improve the performance.  The first-level optimization is to create
worker threads with PTHREAD_SCOPE_SYSTEM and to avoid using malloc(3)
in the main part of query processing.  Let's call this version
"BIND+".  I also tried eliminating any possible mutex contentions in
the main part of query processing (it depends on some unrealistic
assumptions, so we cannot use this code in actual operation).  This
optimization is called "BIND++".  BIND++ also contains the
optimizations of BIND+.  Additionally, I've made a quick patch to the
kernel source code so that sosend() does not lock the socket send
buffer for some particular UDP packets.
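
To illustrate the malloc(3) side of BIND+: the idea is to give each
worker thread a preallocated scratch buffer so that steady-state
query processing never enters malloc() and its internal mutex.  The
sketch below uses made-up names and is not the actual BIND9 change
(BIND9 has its own isc_mem memory contexts); it only shows the
pattern:

    #include <pthread.h>
    #include <stdlib.h>

    #define QUERY_BUF_SIZE  512     /* per-query scratch space */

    static pthread_key_t buf_key;

    /* Call once at startup, before creating the workers. */
    void
    querybuf_init(void)
    {
            (void)pthread_key_create(&buf_key, free);
    }

    /*
     * Return this thread's scratch buffer; malloc() is entered at
     * most once per thread, on first use.
     */
    char *
    querybuf_get(void)
    {
            char *buf = pthread_getspecific(buf_key);

            if (buf == NULL) {
                    buf = malloc(QUERY_BUF_SIZE);
                    (void)pthread_setspecific(buf_key, buf);
            }
            return (buf);
    }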

The following are the test results with these optimizations:

  A. tests with FreeBSD 5.3 beta 7 on Xeon 700MHz x 4

  threads   BIND  BIND+  BIND++
  0         4818
  1         3021   3390    4474
  2         1859   2496    7781
  3          986   1450   10615
  4          774   1167   12668

  Note: "BIND" is pure BIND 9.3.0.  "0 threads" mean the result
  without threading.  Numbers in the table body show the resulting
  qps's.

  While BIND+ ran much better than pure BIND 9.3.0, it still
  performed quite poorly.  With BIND++, however, we got the real
  benefit of multiple threads/CPUs.  This result shows that if we can
  control mutex contentions in BIND9 in some realistic way, BIND can
  run faster on multiple CPUs with FreeBSD.

  B. tests with FreeBSD 5.3 RC1 on Xeon 3000MHz x 4

  threads   BIND  BIND++  BIND++kernel_patch
  0        16253
  1         7953   14600   14438
  2         3591   19840   23854
  3         1012   24268   30268
  4          533   25447   30434

  Note: "BIND++kernel_patch" means BIND++ with the kernel optimization
  I mentioned in item 3 above.

  The results show that even full optimization on the application
  side is not enough with high-speed CPUs; further kernel
  optimization can help in this area.  Even so, the performance still
  saturated at around 4 CPUs, and I could not figure out the reason
  at that time.

  C. (for comparison) SuSE Linux (kernel 2.6.4, glibc 2.3.3) on the
     same box used in experiment B

  threads  BIND  BIND++
  0        16117
  1        13707  17835
  2        16493  26946
  3        16478  32688
  4        14517  36090

  While "pure BIND9" does not provide better performance with multiple
  CPUs either (and the optimizations in BIND++ are equally effective),
  the penalty with multiple threads is much smaller.  I guess this is
  because Linux handles lock contentions much better than FreeBSD.

					JINMEI, Tatuya
					Communication Platform Lab.
					Corporate R&D Center, Toshiba Corp.
					jinmei_at_isl.rdc.toshiba.co.jp