Re: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Kenneth Culver <culverk_at_sweetdreamsracing.biz>
Date: Tue, 4 May 2004 16:21:10 -0400
Quoting Gerrit Nagelhout <gnagelhout_at_sandvine.com>:

> Hi,
>
> For one of our applications in our test lab, we are running bridge(4)
> alongside several userland applications.  I have found that bridging
> performance (64-byte packets, 2-port bridge) on 5.2.1 is
> significantly lower than on RELENG_4, especially when running
> SMP.  The platform is a dual 2.8GHz Xeon with a dual-port em (100MHz
> PCI-X).  Invariants are disabled, and polling (with idle_poll
> enabled) is used.
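>
> For reference, the bridge and polling configuration is just a
> handful of sysctls; the values here match the output at the end of
> this mail:
>
> 	sysctl net.link.ether.bridge.config=em0:1,em1:1
> 	sysctl net.link.ether.bridge.enable=1
> 	sysctl kern.polling.enable=1
> 	sysctl kern.polling.idle_poll=1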

Quick stupid question: did you turn off all the debugging stuff in the kernel?

options         DDB                     # Enable the kernel debugger
options         INVARIANTS              # Enable calls of extra sanity checking
options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
options         WITNESS                 # Enable checks to detect deadlocks and cycles
options         WITNESS_SKIPSPIN

If you didn't turn all of that off, you may want to try it.

Ken
>
> Here are the various test results (packets per second, full duplex)
> [traffic generator] <=> [FreeBSD bridge] <=> [traffic generator]
>
> 	4.7 UP:     1.2Mpps
> 	4.7 SMP:    1.2Mpps
> 	5.2.1 UP:   850Kpps
> 	5.2.1 SMP:  500Kpps
>
> I believe that for RELENG_4, the hardware is the bottleneck, which
> explains why there is no difference between UP and SMP.
> In order to get these numbers for 5.2.1, I had to make a small change
> to bridge.c (change ETHER_ADDR_EQ to BDG_MATCH in bridge_in to avoid
> calling bcmp).  This change boosted performance by about 20%.
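>
> For context, the two macros in bridge.c look roughly like this (from
> memory, so treat the exact bodies as approximate); ETHER_ADDR_EQ goes
> through bcmp() while BDG_MATCH does two inline word compares on the
> 6-byte address:
>
> 	/* generic comparison: costs a bcmp() call per lookup */
> 	#define ETHER_ADDR_EQ(a, b)  (bcmp((a), (b), ETHER_ADDR_LEN) == 0)
>
> 	/* inline compare: one 32-bit and one 16-bit word, no call */
> 	#define BDG_MATCH(a, b) ( \
> 		((uint16_t *)(a))[2] == ((uint16_t *)(b))[2] && \
> 		*((uint32_t *)(a)) == *((uint32_t *)(b)) )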
>
> I ran the kernel profiler for both UP and SMP (5.2.1) and included
> the results for the top functions below.  In the past I have also run
> the profiler against RELENG_4, and the main difference relative to it
> (which explains the reduced UP performance) is extra overhead from
> bus_dma and mbuf handling.  When I compare the UP and SMP results
> (5.2.1), all the functions using mutexes get much more expensive, and
> critical_exit takes more cycles.  A quick count of mutexes in the
> bridge code path showed 10-20 lock/unlock operations for each packet.
> When, as a quick test, I added 10 more lock/unlock pairs to the code
> path (sketched below), SMP performance went down to 330Kpps.  This
> indicates that mutexes are much more expensive under SMP than UP.
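>
> The extra-locks test was nothing fancier than bouncing a dummy mutex
> in the forwarding path (test_mtx is a made-up name here; it was
> mtx_init()ed once at setup):
>
> 	struct mtx test_mtx;
> 	int i;
>
> 	/* add 10 uncontested lock/unlock pairs per forwarded packet */
> 	for (i = 0; i < 10; i++) {
> 		mtx_lock(&test_mtx);
> 		mtx_unlock(&test_mtx);
> 	}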
>
> I would like to move to CURRENT for new hardware support and the
> ability to properly use multi-threading in user space, but I can't do
> this until the performance bottlenecks are solved.  I realize that
> 5.x is still a work in progress and hasn't been tuned as well as 4.7
> yet, but are there any plans for optimizations in this area?  Does
> anyone have suggestions on what else I can try?
>
> Thanks,
>
> Gerrit
>
> (wheel)# sysctl net.link.ether.bridge
> net.link.ether.bridge.version: $Revision: 1.72 $ $Date: 2003/10/31 18:32:08 $
> net.link.ether.bridge.debug: 0
> net.link.ether.bridge.ipf: 0
> net.link.ether.bridge.ipfw: 0
> net.link.ether.bridge.copy: 0
> net.link.ether.bridge.ipfw_drop: 0
> net.link.ether.bridge.ipfw_collisions: 0
> net.link.ether.bridge.packets: 1299855421
> net.link.ether.bridge.dropped: 0
> net.link.ether.bridge.predict: 0
> net.link.ether.bridge.enable: 1
> net.link.ether.bridge.config: em0:1,em1:1
>
> (wheel)# sysctl kern.polling
> kern.polling.burst: 19
> kern.polling.each_burst: 80
> kern.polling.burst_max: 1000
> kern.polling.idle_poll: 1
> kern.polling.poll_in_trap: 0
> kern.polling.user_frac: 5
> kern.polling.reg_frac: 120
> kern.polling.short_ticks: 0
> kern.polling.lost_polls: 4297586
> kern.polling.pending_polls: 0
> kern.polling.residual_burst: 0
> kern.polling.handlers: 3
> kern.polling.enable: 1
> kern.polling.phase: 0
> kern.polling.suspect: 1030517
> kern.polling.stalled: 40
> kern.polling.idlepoll_sleeping: 0
>
>
> Here are some of the interesting parts of the config file:
> options         HZ=2500
> options         NMBCLUSTERS=32768
> #options        GDB_REMOTE_CHAT
> #options        INVARIANTS
> #options        INVARIANT_SUPPORT
> #options        DIAGNOSTIC
>
> options         DEVICE_POLLING
>
>
>
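> For anyone who wants to reproduce this: the numbers came from the
> stock kernel profiling tools, roughly as follows (kernel built with
> profiling enabled via "config -p"):
>
> 	kgmon -r	# reset the profile buffers
> 	kgmon -b	# begin collecting
> 	sleep 10
> 	kgmon -h	# halt collection
> 	kgmon -p	# dump buffers to gmon.out
> 	gprof /boot/kernel/kernel gmon.out
>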
> The following profiles show only the top functions (more than 0.2%):
>
> UP:
>
> granularity: each sample hit covers 16 byte(s) for 0.01% of 10.01 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  20.3       2.03     2.03                             ether_input [1]
>  10.5       3.09     1.06                             mb_free [2]
>   5.8       3.67     0.58                             _bus_dmamap_load_buffer [3]
>   5.6       4.23     0.56                             m_getcl [4]
>   5.3       4.76     0.53                             em_encap [5]
>   5.1       5.27     0.51                             m_free [6]
>   5.1       5.78     0.51                             mb_alloc [7]
>   4.9       6.27     0.49                             bdg_forward [8]
>   4.9       6.76     0.49                             em_process_receive_interrupts [9]
>   4.1       7.17     0.41                             bridge_in [10]
>   3.6       7.53     0.36                             generic_bcopy [11]
>   3.6       7.89     0.36                             m_freem [12]
>   2.6       8.14     0.26                             em_get_buf [13]
>   2.2       8.37     0.22                             em_clean_transmit_interrupts [14]
>   2.2       8.59     0.22                             em_start_locked [15]
>   2.0       8.79     0.20                             bus_dmamap_load_mbuf [16]
>   1.9       8.99     0.19                             bus_dmamap_load [17]
>   1.3       9.11     0.13                             critical_exit [18]
>   1.1       9.23     0.11                             em_start [19]
>   1.0       9.32     0.10                             bus_dmamap_create [20]
>   0.8       9.40     0.08                             em_receive_checksum [21]
>   0.6       9.46     0.06                             em_tx_cb [22]
>   0.5       9.52     0.05                             __mcount [23]
>   0.5       9.57     0.05                             em_transmit_checksum_setup [24]
>   0.5       9.62     0.05                             m_tag_delete_chain [25]
>   0.5       9.66     0.05                             m_adj [26]
>   0.3       9.69     0.03                             mb_pop_cont [27]
>   0.2       9.71     0.02                             bus_dmamap_destroy [28]
>   0.2       9.73     0.02                             mb_reclaim [29]
>   0.2       9.75     0.02                             ether_ipfw_chk [30]
>   0.2       9.77     0.02                             em_dmamap_cb [31]
>
> SMP:
>
> granularity: each sample hit covers 16 byte(s) for 0.00% of 20.14 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  47.9       9.64     9.64                             cpu_idle_default [1]
>   4.9      10.63     0.99                             critical_exit [2]
>   4.6      11.56     0.93                             mb_free [3]
>   4.3      12.41     0.86                             bridge_in [4]
>   4.2      13.26     0.84                             bdg_forward [5]
>   4.1      14.08     0.82                             mb_alloc [6]
>   3.9      14.87     0.79                             em_process_receive_interrupts [7]
>   3.2      15.52     0.65                             em_start [8]
>   3.1      16.15     0.63                             m_free [9]
>   3.0      16.76     0.61                             _bus_dmamap_load_buffer [10]
>   2.5      17.27     0.51                             m_getcl [11]
>   2.1      17.69     0.42                             em_start_locked [12]
>   1.9      18.07     0.37                             ether_input [13]
>   1.5      18.38     0.31                             em_encap [14]
>   1.1      18.61     0.23                             bus_dmamap_load [15]
>   1.0      18.82     0.21                             generic_bcopy [16]
>   0.9      19.00     0.18                             bus_dmamap_load_mbuf [17]
>   0.8      19.16     0.17                             __mcount [18]
>   0.6      19.29     0.13                             em_get_buf [19]
>   0.6      19.41     0.12                             em_clean_transmit_interrupts [20]
>   0.5      19.52     0.11                             em_receive_checksum [21]
>   0.4      19.60     0.09                             m_gethdr_clrd [22]
>   0.4      19.69     0.08                             bus_dmamap_create [23]
>   0.3      19.75     0.06                             em_tx_cb [24]
>   0.2      19.80     0.05                             m_freem [25]
>   0.2      19.83     0.03                             m_adj [26]
>   0.1      19.85     0.02                             m_tag_delete_chain [27]
>   0.1      19.87     0.02                             bus_dmamap_destroy [28]
>   0.1      19.89     0.02                             mb_pop_cont [29]
>   0.1      19.91     0.02                             em_dmamap_cb [30]
>   0.1      19.92     0.02                             em_transmit_checksum_setup [31]
>   0.1      19.94     0.01                             mb_alloc_wait [32]
>   0.1      19.95     0.01                             em_poll [33]
>