Re: BIND9 performance issues with SMP

From: Scott Long <scottl_at_freebsd.org> Date: Tue, 21 Dec 2004 17:17:32 -0700

JINMEI Tatuya / 神明達哉 wrote:

> Hello,
> 
> I was recently playing with FreeBSD 5.3's SMP kernel and BIND9 to
> measure the response performance using multiple threads.  Perhaps this
> is already well-known, but the result showed using threads with
> FreeBSD 5.3 didn't improve the performance (rather, it actually
> degraded the performance as we increased threads/CPUs).
> 
> In short, it doesn't make sense to enable threading on FreeBSD in
> any case (even with multiple CPUs).
> 
> I'm going to describe what I found in this experiment in detail.  I
> hope some of the following contains new information that helps
> improve FreeBSD's SMP support in general.
> 
> - tested environments
>   OS: FreeBSD 5.3 beta 7 and RC1 (I believe the result should be the
>       same with 5.3-RELEASE)
>   Machine: Xeon 700MHz x 4 / Xeon 3000MHz x 4
>   BIND version: 9.3.0, built with --enable-threads
> 
> - measurement description
>   named loaded the root zone file (as of around May 2003).  I measured
>   the response performance in queries per second (qps) with the
>   queryperf program (which comes with the BIND9 distribution).
>   queryperf queried various randomly generated host names (some of
>   which result in NXDOMAIN).  All the numbers below show the resulting
>   qps for a 30-second test.
> 
> - some general observations from the results
> 
>   1. BIND 9.3.0 does not create worker threads with the
>      PTHREAD_SCOPE_SYSTEM attribute.  For FreeBSD, this means
>      different worker threads cannot run on different CPUs, so
>      multi-threading doesn't help anything in terms of performance.

This isn't really true.  All it means is that each thread will use
whatever scheduler activation is available to the UTS at the time
instead of having its own dedicated scheduler activation.  The whole
theory behind SA/KSE is that scheduling threads from userland should
be cheaper than from the kernel, and SA provides the benefit of making
more than one scheduling resource available to the userland scheduler,
unlike libc_r.  In practice, it looks like something broke between
5.2 and 5.3, and process scope threads now behave very strangely, almost
as if the UTS only had one scheduling resource to work with.  The result
is that performance degrades to that of libc_r, except that threads that
block in the kernel don't block the whole process.

I keep on hoping that Dan or Julian or David will have time to look at 
this, but that hasn't come to be yet, unfortunately.  I'd consider it a 
very high-priority bug, though.
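
For reference, requesting system scope for a worker thread is a small
change at thread-creation time.  The following is only a minimal sketch,
not BIND's actual code: the names worker and run_system_scope_workers
are made up for illustration, and only the pthread calls are standard
POSIX.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical worker body: marks its slot so the caller can see it ran. */
static void *worker(void *arg)
{
    int *ran = arg;
    *ran = 1;
    return NULL;
}

/*
 * Start nworkers threads with system contention scope and wait for them.
 * Returns 0 on success or the first pthread error code encountered.
 */
static int run_system_scope_workers(int *ran, size_t nworkers)
{
    enum { MAX_WORKERS = 64 };
    pthread_t tids[MAX_WORKERS];
    pthread_attr_t attr;
    size_t started = 0;
    int rc;

    assert(ran != NULL && nworkers <= MAX_WORKERS);

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Request system scope explicitly instead of the platform default. */
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    while (rc == 0 && started < nworkers) {
        rc = pthread_create(&tids[started], &attr, worker, &ran[started]);
        if (rc == 0)
            started++;
    }

    /* Join whatever was started, even if a later create failed. */
    for (size_t i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_attr_destroy(&attr);
    return rc;
}
```

Leaving out the pthread_attr_setscope() call gives the platform's
default scope, which is the case item 1 describes for BIND 9.3.0.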

> 
>   2. generally, BIND9 requires lots of mutex locks to process a single
>      DNS query, causing a lot of lock contention.  The contention
>      degrades response performance considerably.  This is true to some
>      extent for any OS, but lock contention seems particularly heavy
>      on FreeBSD (see also item 4 below).
> 
>   3. the SMP support in the kernel generally performs well in terms of
>      UDP input/output on a single socket.  However, the kernel uses a
>      giant lock for the socket send buffer in the sosend() function,
>      which can be a significant performance bottleneck with
>      high-performance CPUs (the bottleneck was not revealed with 700MHz
>      processors, but did appear with 3000MHz CPUs).  It seems to me we
>      can safely avoid the bottleneck for DNS servers, since UDP output
>      does not use the socket send buffer.  I've made a quick-hack patch
>      to the FreeBSD kernel and confirmed that this is the case.  (For
>      those who are particularly interested in this patch, it's
>      available at:
>      http://www.jinmei.org/freebsd5.3-sosend-patch .
>      A new socket option SO_FAST1 on a UDP socket enables the
>      optimization.)

Very interesting!

> 
>   4. mutex contentions are VERY expensive (apparently far more
>      expensive than on other OSes), while getting a lock without
>      contention is reasonably cheap.  (Almost) whenever a user thread
>      blocks due to a lock contention, it is suspended with a system
>      call (kse_release), probably causing a context switch.  (I'm not
>      really sure if the system call overhead is the main reason for
>      the performance penalty, though.)

This might be related to what I said above.  Were you observing this
with process scope or system scope threads?  Again, if scheduling
decisions are not cheap in the UTS then there really is no point to
SA/KSE.
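
One way to put numbers on how often a lock is actually contended is to
wrap the mutex and count the acquisitions that could not be taken
immediately.  This is a rough sketch under my own assumptions: struct
counted_mutex and its helper functions are hypothetical names, not
anything from BIND or libpthread, and the extra pthread_mutex_trylock()
call adds a little overhead of its own.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wrapper: a mutex plus counters for how often it was busy. */
struct counted_mutex {
    pthread_mutex_t lock;
    unsigned long acquisitions;  /* total successful locks */
    unsigned long contentions;   /* locks that found the mutex busy first */
};

static int counted_mutex_init(struct counted_mutex *m)
{
    assert(m != NULL);
    m->acquisitions = 0;
    m->contentions = 0;
    return pthread_mutex_init(&m->lock, NULL);
}

static int counted_mutex_lock(struct counted_mutex *m)
{
    int contended = 0;
    int rc = pthread_mutex_trylock(&m->lock);

    if (rc == EBUSY) {
        /* Someone else holds the lock: record it, then really block. */
        contended = 1;
        rc = pthread_mutex_lock(&m->lock);
    }
    if (rc != 0)
        return rc;

    /* Counters are only touched while the lock is held. */
    m->acquisitions++;
    m->contentions += contended;
    return 0;
}

static int counted_mutex_unlock(struct counted_mutex *m)
{
    return pthread_mutex_unlock(&m->lock);
}
```

Comparing contentions against acquisitions for the hottest locks, under
both process scope and system scope threads, would show whether the
cost comes from how often threads collide or from what each collision
costs.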

> 
>   5. some standard library functions internally call
>      pthread_mutex_lock(), which can also make the server slow due to
>      the expensive contention tax.  Regarding BIND9, malloc() and
>      arc4random() can be heavy bottlenecks (the latter is called for
>      every query if we use the "random" order for RRsets).
> 
>   6. at least so far, the ULE scheduler doesn't help improve the
>      performance (it even performs worse than the normal 4BSD
>      scheduler).

With both types of threads?  Have you tried Jeff's recent fixes to ULE?
Unfortunately we saw similar performance problems over the summer, and
that contributed to switching away from the ULE scheduler.  Hopefully
this situation improves.
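
On item 5, a common way to keep a library-internal lock such as
malloc()'s out of the hot path is to give each thread its own small
pool of fixed-size buffers.  The sketch below only illustrates that
idea.  It assumes a C11 compiler with _Thread_local, and the names
pool_get and pool_put and the pool sizes are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { POOL_SLOTS = 8, SLOT_SIZE = 512 };

/* _Thread_local gives every thread its own copy, so no lock is needed. */
static _Thread_local unsigned char pool_mem[POOL_SLOTS][SLOT_SIZE];
static _Thread_local void *pool_free[POOL_SLOTS];
static _Thread_local size_t pool_nfree;
static _Thread_local int pool_ready;

/* Fill the calling thread's free list on first use. */
static void pool_init_once(void)
{
    if (pool_ready)
        return;
    for (size_t i = 0; i < POOL_SLOTS; i++)
        pool_free[i] = pool_mem[i];
    pool_nfree = POOL_SLOTS;
    pool_ready = 1;
}

/* Take a buffer; returns NULL if this thread's pool is exhausted. */
static void *pool_get(void)
{
    pool_init_once();
    if (pool_nfree == 0)
        return NULL;
    return pool_free[--pool_nfree];
}

/* Return a buffer obtained from pool_get() on the same thread. */
static void pool_put(void *buf)
{
    assert(buf != NULL && pool_nfree < POOL_SLOTS);
    pool_free[pool_nfree++] = buf;
}
```

A caller would fall back to malloc() when pool_get() returns NULL, so
the shared lock is only touched in the uncommon case.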

> 
> - experiments with possible optimizations
> 
> Based on the above results, I've explored some optimizations to
> improve the performance.  The first-level optimization is to create
> worker threads with PTHREAD_SCOPE_SYSTEM and to avoid using malloc(3)
> in the main part of query processing.  Let's call this version
> "BIND+".  I also tried eliminating any possible mutex contentions in
> the main part of query processing (it depends on some unrealistic
> assumptions, so we cannot use this code in actual operation).  This
> optimization is called "BIND++".  BIND++ also contains the
> optimizations of BIND+.  Additionally, I've made a quick patch to the
> kernel source code so that sosend() does not lock the socket send
> buffer for some particular UDP packets.
> 
> The following are the test results with these optimizations:
> 
>   A. tests with FreeBSD 5.3 beta 7 on Xeon 700MHz x 4
> 
>   threads  BIND  BIND+  BIND++
>   0        4818
>   1        3021   3390    4474
>   2        1859   2496    7781
>   3         986   1450   10615
>   4         774   1167   12668
> 
>   Note: "BIND" is pure BIND 9.3.0.  "0 threads" means the result
>   without threading.  Numbers in the table body show the resulting
>   qps values.
> 
>   While BIND+ ran much better than pure 9.3.0, it still performed
>   quite poorly.  However, we can achieve the real benefit of
>   multi-threading/CPUs with BIND++.  This result shows that if we can
>   control mutex contentions in BIND9 in some realistic way, BIND can
>   run faster on multiple CPUs with FreeBSD.
> 
>   B. tests with FreeBSD 5.3 RC1 on Xeon 3000MHz x 4
> 
>   threads   BIND  BIND++  BIND++kernel_patch
>   0        16253
>   1         7953   14600               14438
>   2         3591   19840               23854
>   3         1012   24268               30268
>   4          533   25447               30434
> 
>   Note: "BIND++kernel_patch" means BIND++ with the kernel optimization
>   I mentioned in item 3 above.
> 
>   The results show that even the full optimization on the application
>   side is not enough with high-speed CPUs.  Further kernel
>   optimization can help in this area.  The performance still
>   saturated at around 4 CPUs.  I could not figure out the reason at
>   that time.
> 
>   C. (for comparison) SuSE Linux (kernel 2.6.4, glibc 2.3.3) on the
>       same box I used with experiment B
> 
>   threads  BIND  BIND++
>   0        16117
>   1        13707  17835
>   2        16493  26946
>   3        16478  32688
>   4        14517  36090
> 
>   While "pure BIND9" does not provide better performance with multiple
>   CPUs either (and the optimizations in BIND++ are equally effective),
>   the penalty with multiple threads is much smaller.  I guess this is
>   because Linux handles lock contentions much better than FreeBSD.
> 
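
To make the three tables easier to compare, it helps to normalize each
qps figure against the non-threaded ("0 threads") result on the same
machine.  The helpers below are a small sketch of that arithmetic.  The
function names are my own, and the figures quoted afterwards are simply
computed from the tables above.

```c
#include <assert.h>

/* Throughput relative to the non-threaded ("0 threads") build. */
static double speedup(double qps, double baseline_qps)
{
    assert(baseline_qps > 0.0);
    return qps / baseline_qps;
}

/* Fraction of ideal linear scaling achieved with nthreads workers. */
static double efficiency(double qps, double baseline_qps, int nthreads)
{
    assert(nthreads > 0);
    return speedup(qps, baseline_qps) / nthreads;
}
```

With 4 threads on the 3000MHz box, BIND++ with the kernel patch on
FreeBSD gives 30434 / 16253, or about 1.87x the non-threaded result,
and BIND++ on Linux gives 36090 / 16117, or about 2.24x.  Pure BIND on
FreeBSD gives 533 / 16253, or about 0.03x.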

Do you have any comparisons to NetBSD or Solaris?  Comparing to Linux
often results in comparing apples to oranges, since there is a
long-standing suspicion that Linux cuts corners where BSD does not.
Also, would you be able to re-run your tests using the THR thread
package?

Scott