For those of you who haven't been using pmc for kernel profiling, you definitely ought to consider it. Joseph Koshy's tools are very easy to use, and the results are quite informative. The output mode to convert pmc samples to something gmon understands is particularly neat. In thirty seconds, here's how to get it working: (1) Compile your kernel with device hwpmc, options HWPMC_HOOKS. (2) Run "pmcstat -S instructions -O /tmp/sample.out" to start sampling of instruction retirement events, saving the results to /tmp/sample.out. (3) Exercise your kernel code. (4) Exit pmcstat -- I hit ctrl-c, but there are probably more mature ways of setting this up. (5) Run "pmcstat -R /tmp/sample.out -k /zoo/tiger-2/boot/kernel/kernel -g" to convert the results to gmon output format so it can be processed by gprof. Obviously, you need to set the right path to your boot kernel -- by default, pmcstat uses /boot/kernel/kernel for kernel sample results. (6) View the results using gprof, "gprof /zoo/tiger-2/boot/kernel/kernel p4-instr-retired/gmon.out". Again, update the path as needed. Since there is no call graph information in the sample, the first few pages of gprof output will be of limited utility, but the summary table by function is the bit I found most useful: % cumulative self self total time seconds seconds calls ms/call ms/call name 13.6 7251.00 7251.00 0 100.00% _mtx_lock_sleep [1] 3.7 9213.00 1962.00 0 100.00% _mtx_lock_spin [2] 3.3 10956.00 1743.00 0 100.00% bus_dmamap_load_mbuf_sg [3] 2.7 12370.00 1414.00 0 100.00% tcp_input [4] 2.6 13781.00 1411.00 0 100.00% tcp_output [5] 2.6 15172.00 1391.00 0 100.00% spinlock_exit [6] 2.5 16496.00 1324.00 0 100.00% uma_zalloc_arg [7] 2.0 17555.00 1059.00 0 100.00% spinlock_enter [8] 1.9 18553.00 998.00 0 100.00% uma_zfree_arg [9] 1.6 19409.00 856.00 0 100.00% em_rxeof [10] 1.6 20260.00 851.00 0 100.00% sleepq_signal [11] 1.5 21071.00 811.00 0 100.00% em_get_buf [12] 1.4 21821.00 750.00 0 100.00% rn_match [13] 1.3 22531.00 710.00 0 100.00% ether_demux [14] 1.3 23222.00 691.00 0 100.00% ip_output [15] ... In this sample, there's a mutex contention problem, which shows up clearly. Without the call graph, there's more work yet to do to identify the source of contention, but the fact that pmc exhibits a relatively low overhead, not to mention higher accuracy, than the existing kernel profiling is great. And, this isn't limited to instruction retirement. You can also profile by cache line miss, mis-predicted branches, etc. Here's an excerpt from the resource stall sample on the same workload: % cumulative self self total time seconds seconds calls ms/call ms/call name 3.9 3225.00 3225.00 0 100.00% m_freem [1] 3.7 6253.00 3028.00 0 100.00% ether_input [2] 3.5 9116.00 2863.00 0 100.00% cpu_switch [3] 2.9 11470.00 2354.00 0 100.00% mb_ctor_pack [4] 2.7 13681.00 2211.00 0 100.00% uma_zalloc_arg [5] 2.7 15882.00 2201.00 0 100.00% em_rxeof [6] 2.6 18045.00 2163.00 0 100.00% em_intr_fast [7] 2.6 20148.00 2103.00 0 100.00% intr_event_schedule_thread [8] 2.2 21941.00 1793.00 0 100.00% if_handoff [9] 2.2 23727.00 1786.00 0 100.00% sleepq_signal [10] 1.8 25222.00 1495.00 0 100.00% _mtx_lock_sleep [11] 1.7 26612.00 1390.00 0 100.00% tcp_output [12] 1.7 27997.00 1385.00 0 100.00% bus_dmamap_load_mbuf_sg [13] 1.6 29309.00 1312.00 0 100.00% uma_zfree_arg [14] So if you're doing kernel performance work, and not already using pmc, you probably should be. Robert N M WatsonReceived on Thu Feb 23 2006 - 13:51:46 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:52 UTC