Anomalous performance increase from mutex profiling

From: Kris Kennaway <kris_at_obsecurity.org>
Date: Mon, 17 Apr 2006 01:14:40 -0400
I have been measuring performance with the supersmack benchmark on a
12-cpu sparc64 E4500.  This is in the context of a patch I made to do
exponential backoff during mutex spins, which has a large positive
effect (~17%) in this case.

However, even with CVS sources I found something very odd: when
MUTEX_PROFILING is compiled in but not enabled (which usually still
makes things slower), I see a clear 38% +/- 3% performance *increase*
on this benchmark.

The effects of MUTEX_PROFILING are (as far as I can see):
	- Increases size of struct mtx
	- Enables MUTEX_NOINLINE
	- Enables unconditional atomic_add_int at kern_mutex.c:484
	  (this may be unintentional; I think it should be conditional on
	  mutex_prof_enable = 1)
	- Adds extra reads/writes of the struct mtx in the mutex
	  lock/unlock ops
	- Adds a chunk of code to _mtx_unlock_flags

But:
	- Removing the atomic_add_int from kern_mutex.c:484 seems to
	  make it about 5% *slower*.  I didn't measure this carefully
	  yet though.

When MUTEX_PROFILING is disabled:
	- MUTEX_NOINLINE doesn't help performance noticeably
	- Unconditionally changing size of struct mtx doesn't help

- Compiling -O2 instead of -O doesn't help

- Gutting the bulk of the mutex profiling code and leaving only the
  extra struct mtx accesses did not help

- Here is the readelf output of the function table in kern_mutex.o in
  the non-mutex profiling case:

   Num:    Value          Size Type    Bind   Vis      Ndx Name
     9: 0000000000000900   404 FUNC    LOCAL  DEFAULT    1 db_show_mtx
    18: 0000000000000000    68 FUNC    GLOBAL DEFAULT    1 _mtx_lock_flags
    19: 0000000000000220   420 FUNC    GLOBAL DEFAULT    1 _mtx_lock_sleep
    20: 0000000000000060    64 FUNC    GLOBAL DEFAULT    1 _mtx_unlock_flags
    21: 0000000000000500   508 FUNC    GLOBAL DEFAULT    1 _mtx_unlock_sleep
    22: 00000000000000a0   112 FUNC    GLOBAL DEFAULT    1 _mtx_lock_spin_flags
    24: 00000000000003e0   236 FUNC    GLOBAL DEFAULT    1 _mtx_lock_spin
    25: 0000000000000120    92 FUNC    GLOBAL DEFAULT    1 _mtx_unlock_spin_flags
    27: 0000000000000180   132 FUNC    GLOBAL DEFAULT    1 _mtx_trylock
    30: 00000000000004e0     8 FUNC    GLOBAL DEFAULT    1 do_nothing
    42: 0000000000000700    36 FUNC    GLOBAL DEFAULT    1 mtx_sysinit
    43: 0000000000000740   140 FUNC    GLOBAL DEFAULT    1 mtx_init
    45: 00000000000007e0    96 FUNC    GLOBAL DEFAULT    1 mtx_destroy
    48: 0000000000000840   188 FUNC    GLOBAL DEFAULT    1 mutex_init

and in the mutex profiling case:
     9: 0000000000000ec0   404 FUNC    LOCAL  DEFAULT    1 db_show_mtx
    44: 0000000000000000    40 FUNC    LOCAL  DEFAULT    1 nanoseconds
    46: 0000000000000040   584 FUNC    LOCAL  DEFAULT    1 dump_mutex_prof_stats
    51: 00000000000002a0   192 FUNC    LOCAL  DEFAULT    1 reset_mutex_prof_stats
    68: 0000000000000ce0    36 FUNC    GLOBAL DEFAULT    1 mtx_sysinit
    69: 0000000000000dc0    96 FUNC    GLOBAL DEFAULT    1 mtx_destroy
    71: 0000000000000740    92 FUNC    GLOBAL DEFAULT    1 _mtx_unlock_spin_flags
    75: 00000000000006c0   112 FUNC    GLOBAL DEFAULT    1 _mtx_lock_spin_flags
    81: 0000000000000360   132 FUNC    GLOBAL DEFAULT    1 _mtx_lock_flags
    82: 0000000000000840   484 FUNC    GLOBAL DEFAULT    1 _mtx_lock_sleep
    83: 0000000000000400   684 FUNC    GLOBAL DEFAULT    1 _mtx_unlock_flags
    86: 0000000000000b60   360 FUNC    GLOBAL DEFAULT    1 _mtx_unlock_sleep
    88: 0000000000000a40   236 FUNC    GLOBAL DEFAULT    1 _mtx_lock_spin
    90: 00000000000007a0   132 FUNC    GLOBAL DEFAULT    1 _mtx_trylock
    93: 0000000000000b40     8 FUNC    GLOBAL DEFAULT    1 do_nothing
   105: 0000000000000d20   160 FUNC    GLOBAL DEFAULT    1 mtx_init
   109: 0000000000000e20   160 FUNC    GLOBAL DEFAULT    1 mutex_init

Most mutex operations grow a bit with MUTEX_PROFILING (except
_mtx_unlock_sleep which shrinks because of the MUTEX_NOINLINE, see
above), but the biggest growth is from the extra blob of unused (in
this case) code in _mtx_unlock_flags().  This function is 0x100
aligned in the mutex profiling case.

* I tried forcing 0x100 alignment on kern_mutex.o in the !mutex
profiling case, which did not help (also _mtx_unlock_flags() should
not be taking much CPU time; the bulk of it should be in the mutex
spins since there are some very heavily contended mutexes)

* Kernel profiling is apparently not supported on sparc, so I can't
easily work out where it is spending time.

* Our best guess is that mutex profiling is doing something that
reduces contention on this very heavily contended mutex (unp), but I'd
like to know what is happening precisely so I can maybe make use of
it.

Can anyone think of what may be happening that I've missed?

Kris

Received on Mon Apr 17 2006 - 03:14:42 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:54 UTC