Hi,

I have just finished some profiling and analysis of the FREEBSD_5_BP code running a standard 4-port ethernet bridge (not netgraph). On the upside, some of the features such as the netperf stuff, MUTEX_PROFILING and UMA are very cool, and (I think) give the potential for a really fast bridge (or similar application). However, the current performance is still rather poor compared to 4.x. With the groundwork now in place, though, I think that some minor changes and a couple of new features can make it much, much faster. I would like to discuss some possible optimizations (suggested below), and then we are willing to take some of them on and give the code back to FreeBSD. Hopefully these changes can be made on RELENG_5 in time to be used by 5.4.

The tests that I have run so far have focused on the difference between running in polling mode (dual 2.8 GHz Xeon, two 2-port em NICs) versus interrupt mode (with debug.mpsafenet=1, and no INVARIANTS/WITNESS or anything like that). In both setups I actually get similar throughput: 300 kpps total, in and out, divided evenly over the 4 ports. I think it should be possible to get well over 1 Mpps bridging on this platform.

In the polling case there is still only one active thread, and the limiting factor seems to be simply the number of mutex operations (11 per packet according to MUTEX_PROFILING), plus overhead from UMA, bus_dma, etc. With polling disabled, I think the fact that PREEMPTION was disabled (I can't even boot with it on), and some sub-optimal mutex usage resulting in a lot of collisions, caused problems, even though in theory all 4 cores should be able to run simultaneously.

Here is a sample profile (taken while in polling mode). The cpu_idle, cpu_halt, etc. entries simply indicate that 3 of the cores have nothing to do, but the profile gives a pretty good sense of where all the time is being spent. There are definitely a lot of cycles going to UMA, mutexes, etc. (This profile only shows the top functions, and has the call tree disabled, i.e. interrupt-based profiling only, because the test slows down too much otherwise.)
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 18.4      10.25     10.25                             cpu_idle_default [1]
 13.8      17.94      7.69                             cpu_idle [2]
  6.5      21.57      3.63                             critical_exit [3]
  6.5      25.17      3.61                             _mtx_lock_spin [4]
  5.0      27.95      2.78                             uma_zalloc_arg [5]
  4.6      30.52      2.56                             cpu_halt [6]
  4.4      32.94      2.43                             uma_zfree_arg [7]
  3.9      35.12      2.18                             maybe_preempt [8]
  3.2      36.91      1.79                             bridge_in [9]
  2.8      38.46      1.55                             em_process_receive_interrupts [10]
  2.6      39.89      1.43                             _bus_dmamap_load_buffer [11]
  2.3      41.19      1.30                             bdg_forward [12]
  2.3      42.48      1.29                             mb_free_ext [13]
  1.8      43.49      1.01                             malloc_type_freed [14]
  1.7      44.44      0.95                             ether_input [15]
  1.7      45.39      0.94                             em_start [16]
  1.7      46.33      0.94                             _bus_dmamap_sync [17]
  1.5      47.18      0.84                             em_start_locked [18]
  1.2      47.85      0.68                             malloc_type_zone_allocated [19]
  1.2      48.52      0.67                             __mcount [20]
  1.2      49.17      0.65                             mb_ctor_pack [21]
  1.1      49.80      0.63                             em_encap [22]
  1.1      50.39      0.59                             free [23]
  1.0      50.94      0.56                             bus_dmamap_load_mbuf [24]
  0.9      51.46      0.51                             generic_bzero [25]
  0.9      51.96      0.50                             m_freem [26]
  0.8      52.42      0.46                             generic_bcopy [27]
  0.7      52.79      0.38                             em_get_buf [28]
  0.6      53.13      0.34                             em_clean_transmit_interrupts [29]
  0.5      53.42      0.29                             bus_dmamap_load [30]
  0.4      53.66      0.24                             m_adj [31]
  0.4      53.90      0.23                             malloc [32]
  0.4      54.11      0.22                             bus_dmamap_create [33]
  0.2      54.24      0.12                             bus_dmamem_free [35]
  0.2      54.35      0.11                             mb_dtor_pack [36]
  0.2      54.45      0.10                             em_tx_cb [37]
  0.2      54.54      0.09                             em_receive_checksum [38]
  0.1      54.61      0.08                             em_dmamap_cb [39]
  0.1      54.69      0.07                             m_tag_delete_chain [40]
  0.1      54.75      0.07                             _bus_dmamap_unload [41]
  0.1      54.82      0.06                             em_poll [42]
  0.1      54.88      0.06                             em_transmit_checksum_setup [43]
  0.1      54.93      0.05                             bus_dmamap_destroy [44]
  0.1      54.97      0.04                             _mtx_lock_sleep [47]
  0.1      55.00      0.03                             if_start [49]
  0.1      55.03      0.03                             bus_dmamap_load_uio [50]
  0.1      55.07      0.03    75189     0.00     0.00  netisr_poll [51]
  0.1      55.10      0.03                             em_smartspeed [52]
  0.1      55.13      0.03                             ithread_loop [34]

Here are the (top) results of the mutex profiling (these are basically all the locks that are taken once or twice per packet):

   max      total   count  avg  cnt_hold  cnt_lock  name
 24344   37552473  309134  121    151712    101781  if_em.c:956 (em5)           (1)
 31578   10548396  309131   34     44233     81751  if_em.c:3432 (em4)          (2)
   460    5813698  620705    9        16        79  uma_core.c:1800 (UMA pcpu)  (3)
   428    4304975  619846    6        26        24  uma_core.c:2206 (UMA pcpu)  (4)
   445    3129168  309127   10     30828     28115  bridge.c:1201 (em5)         (5)
   462    3125131  309127   10    125294    122560  bridge.c:816 (bridge)       (6)
   489    2815715  309134    9     14610     20050  if_em.c:926 (em5)           (7)
   450    2573019  309170    8     94471    101577  kern_malloc.c:185 (devbuf)  (8)
   419    2113089  309275    6     67982     65871  kern_malloc.c:210 (devbuf)  (9)

The line numbers will be close to the RELENG_5_BP code but not exactly the same because of some local modifications, so here are descriptions of the mutexes involved:

1) em_start (used for transmit)
2) em_process_receive_interrupts (re-lock just after if_input)
3) uma_zalloc_arg (per-CPU lock)
4) uma_zfree_arg (per-CPU lock)
5) bdg_forward (IFQ_HANDOFF)
6) bridge_in (global bridge lock)
7) em_start_locked (IF_DEQUEUE)
8) malloc_type_zone_allocated
9) malloc_type_freed

From these numbers, the UMA locks seem to be taken twice for every packet, but have no collisions. All the other locks have significant collision problems, resulting in a lot of overhead. Based on these stats, I have come up with the following observations/suggestions that I would like to discuss. As discussed before, there is a significant cost associated with every mutex; I'd like to be able to get down to less than one mutex operation per packet (on average) through this path.
Some of the possibilities for doing this are:

- Implement workQ's of packets (also suggested by Robert Watson in the past). This would reduce the mutex traffic for locks 1, 2, 5, 6 & 7 above, because it should be possible to take the lock once for a whole queue of packets instead of once per packet. (There is a rough sketch of this after this list.)

- Implement device-level caching for the UMA mbuf zones. If a driver could allocate one bucket of mbufs at a time, no locking would be required per allocation. The same goes for the free side: if you can allocate an empty bucket, fill it up, and then return it, only a couple of mutex operations are required per bucket. This would also reduce the per-packet function call overhead. This change should actually get rid of most of the remaining mutex overhead. (Also sketched below.)

I think that one of the major reasons that polling with one thread had about the same performance as interrupts with 4 threads/cores is that some of the mutexes are held far too long, thus reducing parallelism. The biggest culprit here is the em driver. First of all, there is only one global lock for the driver, but there should be no reason that the rx & tx paths couldn't run simultaneously. If we set up something like:

    EM_TX_LOCK()
    EM_TX_UNLOCK()
    EM_RX_LOCK()
    EM_RX_UNLOCK()
    EM_LOCK()   { EM_TX_LOCK(); EM_RX_LOCK(); }
    EM_UNLOCK() { EM_TX_UNLOCK(); EM_RX_UNLOCK(); }

this driver will run much faster. Even within the receive and transmit functions, the mutexes are held for a long time. It should be possible to code these paths so that the mutex is released before trying to free or allocate an mbuf. This should reduce the hold times, and thus the collisions, a lot.

When overloading the bridge in interrupt mode, the system becomes completely unresponsive (I can't even get into ddb) until the packet source is removed. This is highly undesirable behaviour, but it is currently the only way to use multiple kernel threads to handle the workload. Extending polling to use multiple threads instead of one should work around this problem. This is a bit of a design exercise in itself, and probably worthy of a separate discussion. We are certainly willing to give this a shot (hopefully with some external input).

The latest generation Xeons (Nocona) have a couple of new features that are very useful for optimizing code. One of them is the ability to prefetch a cache line for which the page is not yet in the TLB. It should be possible to strategically sprinkle a few prefetches in the code and get a big performance boost. This is probably pretty platform-specific, though, so I don't know how to do it in general: it will only benefit some platforms (I don't know about AMD/alpha), and may slightly hurt others.

In terms of cache efficiency, I am not sure that using the UMA mbuf packet zone is the best way to go. To be able to put a cluster on a DMA descriptor, you currently need to read the mbuf header to get the cluster pointer. It may be more efficient to keep local caches of bare clusters and mbufs separately. To allocate a cluster you then only need to read the bucket array, and can add the cluster to the descriptor without pulling anything but the array itself into cache. Once the packet is filled up, it can be coupled to an mbuf header. The other advantage is that, with the pointers for both always available in an array, they lend themselves well to s/w prefetching.

The choice of scheduler, and the use of PREEMPTION, will probably make a bit of a difference for these tests too, but I did not do much experimentation because I couldn't even boot with the ULE scheduler & PREEMPTION enabled. I suspect that preemption will help quite a bit when there are mutex collisions.

To make some of these ideas more concrete, here are a few rough sketches (all untested, with made-up names flagged as such):
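For the workQ idea, the dequeue side could look roughly like this (untested; EM_TX_BATCH and the caller's handling of the local array are made up, but IF_LOCK/IF_UNLOCK and _IF_DEQUEUE are the existing ifqueue macros). The queue lock is taken once per batch instead of once per packet:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    #define EM_TX_BATCH 32          /* illustrative batch size */

    /*
     * Drain up to EM_TX_BATCH packets from the send queue while
     * holding the queue lock once, instead of per packet.
     */
    static int
    em_dequeue_batch(struct ifnet *ifp, struct mbuf **batch)
    {
            int n;

            IF_LOCK(&ifp->if_snd);
            for (n = 0; n < EM_TX_BATCH; n++) {
                    _IF_DEQUEUE(&ifp->if_snd, batch[n]);
                    if (batch[n] == NULL)
                            break;
            }
            IF_UNLOCK(&ifp->if_snd);
            return (n);     /* caller encapsulates n packets, lock-free */
    }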
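For the device-level mbuf caching, I am picturing something like the following. Note that uma_zalloc_bucket() is hypothetical: UMA does not export a bucket interface today. The point is simply that one zone lock round-trip would cover a whole bucket's worth of allocations:

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    #define DEV_BUCKET 128

    struct dev_mbuf_cache {
            void    *dc_items[DEV_BUCKET];
            int      dc_count;
    };

    static void *
    dev_cache_alloc(struct dev_mbuf_cache *dc, uma_zone_t zone)
    {
            if (dc->dc_count == 0) {
                    /*
                     * Hypothetical bulk interface: refills the whole
                     * array under a single zone lock acquisition.
                     */
                    dc->dc_count = uma_zalloc_bucket(zone, dc->dc_items,
                        DEV_BUCKET, M_NOWAIT);
                    if (dc->dc_count == 0)
                            return (NULL);
            }
            return (dc->dc_items[--dc->dc_count]);
    }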
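Fleshed out a little, the em locking split could look like this (field names are made up; the real adapter softc has a single mutex today). The important detail is that the combined EM_LOCK() always takes the two locks in one fixed order, so there is no deadlock between the full lock and the individual paths:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    struct adapter {
            struct mtx      tx_mtx;         /* protects the transmit ring */
            struct mtx      rx_mtx;         /* protects the receive ring */
            /* ... rest of the softc ... */
    };

    #define EM_TX_LOCK(sc)          mtx_lock(&(sc)->tx_mtx)
    #define EM_TX_UNLOCK(sc)        mtx_unlock(&(sc)->tx_mtx)
    #define EM_RX_LOCK(sc)          mtx_lock(&(sc)->rx_mtx)
    #define EM_RX_UNLOCK(sc)        mtx_unlock(&(sc)->rx_mtx)
    /* Always tx before rx, so the full lock has a single order. */
    #define EM_LOCK(sc)     do { EM_TX_LOCK(sc); EM_RX_LOCK(sc); } while (0)
    #define EM_UNLOCK(sc)   do { EM_RX_UNLOCK(sc); EM_TX_UNLOCK(sc); } while (0)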
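As for releasing the mutex before freeing mbufs, the usual trick applies: unlink completed mbufs onto a local list while the lock is held, then free them after dropping it, so none of the UMA/free work happens under the driver lock. Here em_txeof_next() is a hypothetical stand-in for walking the completed part of the tx ring, and the EM_TX_* macros are from the previous sketch:

    static void
    em_clean_tx_sketch(struct adapter *sc)
    {
            struct mbuf *m, *free_list = NULL;

            EM_TX_LOCK(sc);
            while ((m = em_txeof_next(sc)) != NULL) {  /* hypothetical */
                    m->m_nextpkt = free_list;
                    free_list = m;
            }
            EM_TX_UNLOCK(sc);

            /* All the freeing now happens with no driver lock held. */
            while ((m = free_list) != NULL) {
                    free_list = m->m_nextpkt;
                    m->m_nextpkt = NULL;
                    m_freem(m);
            }
    }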
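The prefetching I have in mind is along these lines (placement is purely illustrative). __builtin_prefetch() is a gcc builtin that compiles to the platform's prefetch instruction, or to nothing, so it should be harmless where unsupported:

    /*
     * While processing rx descriptor i, warm the cache for i + 1.
     * On Nocona the prefetch can proceed even when the page is not
     * yet in the TLB, which is what makes this attractive there.
     */
    static __inline void
    em_rx_prefetch(struct mbuf *next)
    {
            if (next != NULL) {
                    __builtin_prefetch(next);               /* mbuf header */
                    __builtin_prefetch(next->m_data);       /* packet data */
            }
    }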
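And for the separate cluster/mbuf caches, the point is that the rx refill path only ever touches the pointer array, and the array makes software prefetching of the next cluster trivial (cl_refill() is hypothetical; it would do the bulk zone allocation described above):

    #define CL_CACHE 64

    struct cl_cache {
            void    *cc_ptrs[CL_CACHE];
            int      cc_count;
    };

    /*
     * Hand the next cluster straight to a DMA descriptor; nothing but
     * the array itself needs to be in cache, and no mbuf header is
     * touched until the packet actually arrives.
     */
    static void *
    cl_cache_get(struct cl_cache *cc)
    {
            if (cc->cc_count == 0 && cl_refill(cc) == 0)    /* hypothetical */
                    return (NULL);
            if (cc->cc_count > 1)
                    __builtin_prefetch(cc->cc_ptrs[cc->cc_count - 2]);
            return (cc->cc_ptrs[--cc->cc_count]);
    }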
This is all I have for now. As I mentioned previously, I'd like to generate some discussion on these points, as well as hear ideas for additional optimizations. We will definitely implement some of these features ourselves, but would much rather give the code back and make this a "cooperative effort". Also, I haven't done any testing on the netgraph side of things yet, but that will probably be next on the list.

Comments?

Thanks,

Gerrit Nagelhout