JINMEI Tatuya / 神明達哉 wrote:
> Hello,
>
> I was recently playing with FreeBSD 5.3's SMP kernel and BIND9 to
> measure the response performance using multiple threads. Perhaps this
> is already well-known, but the result showed using threads with
> FreeBSD 5.3 didn't improve the performance (rather, it actually
> degraded the performance as we increased threads/CPUs).
>
> In very short, it doesn't make sense to enable threading on FreeBSD in
> any case (even with multiple CPUs).
>
> I'm going to describe what I found in this experience in detail. I
> hope some of the followings contain new information in order to
> improve FreeBSD's SMP support in general.
>
> - tested environments
>   OS: FreeBSD 5.3 beta 7 and RC1 (I believe the result should be the
>       same with 5.3-RELEASE)
>   Machine: Xeon 700MHz x 4 / Xeon 3000MHz x 4
>   BIND version: 9.3.0, built with --enable-threads
>
> - measurement description
>   named loaded the root zone file (as of around May 2003). I measured
>   the response performance as query-per-second (qps) with the queryperf
>   program (which comes with the BIND9 distributions). queryperf asked
>   various host names which are randomly generated (some of the host
>   names result in NXDOMAIN). All the numbers below show the resulting
>   qps for a 30-second test.
>
> - some general observations from the results
>
>   1. BIND 9.3.0 does not create worker threads with the
>      PTHREAD_SCOPE_SYSTEM attribute. For FreeBSD, this means
>      different worker threads cannot run on different CPUs, so
>      multi-threading doesn't help anything in terms of performance.

This isn't really true. All it means is that each thread will use
whatever scheduler activation is available to the UTS at the time
instead of having its own dedicated scheduler activation.
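For reference, here is a minimal sketch of how a thread can request
system scope through the POSIX attribute API. The names worker and
create_system_scope_thread are hypothetical and are not taken from BIND;
this only illustrates the attribute being discussed, not how BIND+ was
actually patched.

```c
#include <pthread.h>

/* Hypothetical worker: records that it ran by storing 1 in the int
 * it is handed. */
void *worker(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Hypothetical helper (not BIND code): start fn(arg) in a thread that
 * explicitly requests PTHREAD_SCOPE_SYSTEM.
 * Returns 0 on success or a pthread error number. */
int create_system_scope_thread(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* The default contention scope is implementation-defined, so ask
     * for system scope explicitly; this may fail with ENOTSUP. */
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}
```

A caller would pass its own start routine and join the thread as usual.
Leaving the attribute unset gives whatever scope the thread library
defaults to, which is the process scope case discussed here.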
The whole theory behind SA/KSE is that scheduling threads from the
userland should be cheaper than from the kernel, and SA provides the
benefit of making more than one scheduling resource available to the
userland scheduler, unlike libc_r. In practice, it looks like something
broke between 5.2 and 5.3, so that process scope threads behave very
strangely, almost as if the UTS thinks it has only one scheduling
resource to work with. The result is that performance degrades to that
of libc_r, except that threads that block in the kernel don't block the
whole process. I keep hoping that Dan or Julian or David will have time
to look at this, but that hasn't happened yet, unfortunately. I'd
consider it a very high-priority bug, though.

>
>   2. generally, BIND9 requires lots of mutex locks to process a single
>      DNS query, causing many lock contentions. The contentions
>      degrade response performance very much. This is true to some
>      extent for any OSes, but lock contentions seem particularly heavy
>      on FreeBSD (see also item 4 below).
>
>   3. the SMP support in the kernel generally performs well in terms of
>      UDP input/output on a single socket. However, the kernel uses a
>      giant lock for socket send buffer in the sosend() function, which
>      can be a significant performance bottleneck with high-performance
>      CPUs (the bottleneck was not revealed with 700MHz processors, but
>      did appear with 3000MHz CPUs). It seems to me we can safely
>      avoid the bottleneck for DNS servers, since UDP output does not
>      use socket send buffer. I've made a quick-hack patch in the
>      FreeBSD kernel and confirmed that this is the case. (For those
>      who are particularly interested in this patch, it's available at:
>      http://www.jinmei.org/freebsd5.3-sosend-patch .
>      A new socket option SO_FAST1 on a UDP socket enables the
>      optimization)

Very interesting!

>
>   4. mutex contentions are VERY expensive (looks like much much more
>      expensive than other OSes), while trying to get a lock without a
>      contention is reasonably cheap. (Almost) whenever a user thread
>      blocks due to a lock contention, it is suspended with a system
>      call (kse_release), probably causing context switch. (I'm not
>      really sure if the system call overhead is the main reason of the
>      performance penalty though.)

This might be related to what I said above. Were you observing this
with process scope or system scope threads? Again, if scheduling
decisions are not cheap in the UTS then there really is no point to
SA/KSE.

>
>   5. some standard libraries internally call pthread_mutex_lock(),
>      which can also make the server slow due to the expensive
>      contention tax. Regarding BIND9, malloc() and arc4random() can
>      be a heavy bottleneck (the latter is called for every query if we
>      use the "random" order for RRsets).
>
>   6. at least so far, the ULE scheduler doesn't help improve the
>      performance (it even performs worse than the normal 4BSD
>      scheduler).

With both types of threads? Have you tried Jeff's recent fixes to ULE?
Unfortunately we saw similar performance problems over the summer, and
that contributed to switching off of the ULE scheduler. Hopefully this
situation improves.

>
> - experiments with possible optimizations
>
>   Based on the above results, I've explored some optimizations to
>   improve the performance. The first-level optimization is to create
>   worker threads with PTHREAD_SCOPE_SYSTEM and to avoid using malloc(3)
>   in the main part of query processing. Let's call this version
>   "BIND+". I also tried eliminating any possible mutex contentions in
>   the main part of query processing (it depends on some unrealistic
>   assumptions, so we cannot use this code in actual operation). This
>   optimization is called "BIND++". BIND++ also contains the
>   optimizations of BIND+.
>   Additionally, I've made a quick patch to the
>   kernel source code so that sosend() does not lock the socket send
>   buffer for some particular UDP packets.
>
>   The followings are the test results with these optimizations:
>
>   A. tests with FreeBSD 5.3 beta 7 on Xeon 700MHz x 4
>
>      threads    BIND   BIND+  BIND++
>      0          4818
>      1          3021    3390    4474
>      2          1859    2496    7781
>      3           986    1450   10615
>      4           774    1167   12668
>
>      Note: "BIND" is pure BIND 9.3.0. "0 threads" mean the result
>      without threading. Numbers in the table body show the resulting
>      qps's.
>
>      While 9.3.0+ ran much better than pure 9.3.0, it still performed
>      quite poorly. However, we can achieve the real benefit of
>      multi-threading/CPUs with BIND++. This result shows if we can
>      control mutex contentions in BIND9 by some realistic way, BIND can
>      run faster on multiple CPUs with FreeBSD.
>
>   B. tests with FreeBSD 5.3 RC1 on Xeon 3000MHz x 4
>
>      threads    BIND  BIND++  BIND++
>                               kernel_patch
>      0         16253
>      1          7953   14600   14438
>      2          3591   19840   23854
>      3          1012   24268   30268
>      4           533   25447   30434
>
>      Note: "BIND++kernel_patch" means BIND++ with the kernel optimization
>      I mentioned in item 3 above.
>
>      The results show even the full optimization in the application side
>      is not enough with high-speed CPUs. Further kernel optimization can
>      help in this area. The performance was still saturated with around
>      4 CPUs. I could not figure out the reason at that time.
>
>   C. (for comparison) SuSE Linux (kernel 2.6.4, glibc 2.3.3) on the
>      same box I used with experiment B
>
>      threads    BIND  BIND++
>      0         16117
>      1         13707   17835
>      2         16493   26946
>      3         16478   32688
>      4         14517   36090
>
>      While "pure BIND9" does not provide better performance with multiple
>      CPUs either (and the optimizations in BIND++ are equally effective),
>      the penalty with multiple threads is much smaller. I guess this is
>      because Linux handles lock contentions much better than FreeBSD.
>

Do you have any comparisons to NetBSD or Solaris?
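To put a number on the "penalty with multiple threads" in the quoted
tables, one rough way is to express each threaded result as a fraction
of the unthreaded ("0 threads") figure on the same box. The sketch below
does that for the pure BIND column at 4 threads in tables B and C. The
function names are made up for illustration, and the qps values are
copied from the quoted tables.

```c
#include <stdio.h>

/* Throughput at some thread count as a fraction of the unthreaded
 * baseline. Returns 0.0 if the baseline is not positive. */
double relative_throughput(double qps, double baseline_qps)
{
    if (baseline_qps <= 0.0)
        return 0.0;
    return qps / baseline_qps;
}

/* Print the 4-thread "pure BIND" figures from tables B and C as a
 * percentage of the 0-thread result on the same machine. */
void print_pure_bind_comparison(void)
{
    /* qps values copied from the quoted tables B and C */
    const double freebsd_base = 16253.0, freebsd_4 = 533.0;
    const double linux_base = 16117.0, linux_4 = 14517.0;

    printf("FreeBSD 5.3 RC1: %.1f%% of unthreaded qps\n",
           100.0 * relative_throughput(freebsd_4, freebsd_base));
    printf("SuSE Linux:      %.1f%% of unthreaded qps\n",
           100.0 * relative_throughput(linux_4, linux_base));
}
```

By this measure pure BIND with 4 threads keeps only about 3% of its
unthreaded throughput on FreeBSD 5.3 RC1, against about 90% on the Linux
setup, which is the gap the quoted text describes.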
Comparing to Linux often results in comparing apples to oranges, since
there is a long-standing suspicion that Linux cuts corners where BSD
does not. Also, would you be able to re-run your tests using the THR
thread package?

Scott

Received on Wed Dec 22 2004 - 01:11:56 UTC