RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Bruce Evans <bde_at_zeta.org.au>
Date: Wed, 5 May 2004 23:32:18 +1000 (EST)

On Tue, 4 May 2004, Gerrit Nagelhout wrote:

> I ran the following fragment of code to determine the cost of a LOCK &
> UNLOCK on both UP and SMP:
>
> #define	EM_LOCK(_sc)		mtx_lock(&(_sc)->mtx)
> #define	EM_UNLOCK(_sc)		mtx_unlock(&(_sc)->mtx)
>
>     unsigned int startTime, endTime, delta;
>     startTime = rdtsc();
>     for (i = 0; i < 100; i++)
>     {
>         EM_LOCK(adapter);
>         EM_UNLOCK(adapter);
>     }
>     endTime = rdtsc();
>     delta = endTime - startTime;
>     printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
> endTime);
>
> On a single hyperthreaded xeon 2.8Ghz, it took ~30 cycles (per LOCK&UNLOCK,
> and dividing by 100) under UP, and ~300 cycles for SMP.  Assuming 10
> locks for every packet(which is conservative), at 500Kpps, this accounts
> for:
> 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8 billion cycles)

300 cycles seems far too much.  I get the following times for slightly
simpler locking in userland:

%%%
#define _KERNEL
#include ...

int slock;
...
	for (i = 0; i < 1000000; i++) {
		while (atomic_cmpset_acq_int(&slock, 0, 1) == 0)
			;
		atomic_store_rel_int(&slock, 0);
	}
%%%

Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
Celeron 366 SMP system:              35                    48

The extra cycles for the SMP case are just the extra cost of one lock
instruction.  Note that SMP should cost twice as much extra, but the
non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl,
which always locks the bus.  After fixing this:

Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
Celeron 366 SMP system:              10                    48
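
For anyone who wants to repeat the comparison outside the kernel, the
following is a minimal sketch, not the code used for the numbers above.
It assumes C11 <stdatomic.h> in place of the kernel's atomic_*()
functions and clock_gettime() in place of the TSC, so it reports
nanoseconds rather than cycles.  The names spin_acquire,
spin_release_store, spin_release_xchg and time_pairs are made up for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

static atomic_int slock;	/* 0 = free, 1 = held */

/* Acquire: spin until the lock word changes from 0 to 1. */
static void
spin_acquire(void)
{
	int expected = 0;

	while (!atomic_compare_exchange_weak_explicit(&slock, &expected, 1,
	    memory_order_acquire, memory_order_relaxed))
		expected = 0;	/* a failed exchange overwrites expected */
}

/* Release with a plain release store (typically an ordinary mov on x86). */
static void
spin_release_store(void)
{
	atomic_store_explicit(&slock, 0, memory_order_release);
}

/* Release with an atomic exchange (typically xchg on x86, always locked). */
static void
spin_release_xchg(void)
{
	(void)atomic_exchange_explicit(&slock, 0, memory_order_release);
}

/* Monotonic clock reading in nanoseconds. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

/*
 * Time n uncontested acquire/release pairs and return the elapsed
 * nanoseconds.  The indirect call adds a little overhead to both
 * variants equally.
 */
static uint64_t
time_pairs(long n, void (*release)(void))
{
	uint64_t start;
	long i;

	assert(n > 0);
	start = now_ns();
	for (i = 0; i < n; i++) {
		spin_acquire();
		release();
	}
	return (now_ns() - start);
}
```

Calling time_pairs(1000000, spin_release_store) and
time_pairs(1000000, spin_release_xchg) from main() and dividing each
result by the iteration count gives a per-pair time for the two release
styles.  The absolute values depend on the CPU and compiler, so only
the difference between the two is meaningful.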

Mutexes take longer than simple locks, but not much longer unless the
lock is contested.  In particular, they don't lock the bus any more
often than simple locks do, and the cycles spent locking the bus
dominate (even in the !SMP case, due to the pessimization).

So there seems to be something wrong with your benchmark.  Locking the
bus for the SMP case always costs about 20+ cycles, but this hasn't
changed since RELENG_4 and mutexes can't be made much faster in the
uncontested case since their overhead is dominated by the bus lock
time.
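
To put the two per-lock figures on the same scale, here is a rough
sketch of the budget arithmetic from the quoted message.  The function
names are hypothetical.  Using 37 cycles as the second input is an
assumption: that is the simple-lock time measured on the Athlon above,
not a mutex time measured on the Xeon.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per second spent on uncontested lock/unlock pairs. */
static uint64_t
lock_cycles_per_sec(uint64_t cycles_per_pair, uint64_t locks_per_packet,
    uint64_t packets_per_sec)
{
	return (cycles_per_pair * locks_per_packet * packets_per_sec);
}

/* The same cost as a percentage of a CPU running at cpu_hz. */
static double
lock_cpu_percent(uint64_t cycles_per_sec, uint64_t cpu_hz)
{
	assert(cpu_hz > 0);
	return (100.0 * (double)cycles_per_sec / (double)cpu_hz);
}
```

With 300 cycles per pair, 10 locks per packet and 500000 packets per
second, lock_cycles_per_sec() gives 1.5 billion cycles, or about 54% of
a 2.8 GHz CPU.  With 37 cycles per pair it gives 185 million cycles, or
about 6.6%.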

-current is slower than RELENG_4, especially for networking, because
it does much more locking and may contest locks more, and because it
does slow context switches when it hits a held lock and for some other
operations.  Your profile didn't seem to show much of the latter two,
so the problem for bridging may be that there is just too much
fine-grained locking.

The profile didn't seem quite right.  It was missing all the call counts
and times.  The times are not useful for short runs unless high
resolution profiling is used, but the call counts are.  Profiling has
been broken in -current since last November, so some garbage needs to
be ignored to interpret profiles.

Bruce