RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Bruce Evans <bde_at_zeta.org.au>
Date: Thu, 6 May 2004 18:02:10 +1000 (EST)
On Wed, 5 May 2004, Andrew Gallatin wrote:

> Bruce Evans writes:
>  > So there seems to be something wrong with your benchmark.  Locking the
>  > bus for the SMP case always costs about 20+ cycles, but this hasn't
>  > changed since RELENG_4 and mutexes can't be made much faster in the
>  > uncontested case since their overhead is dominated by the bus lock
>  > time.
>
> Actually, I think his tests are accurate and bus locked instructions
> take an eternity on P4.  See
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>
> For example, with your test above, I see 212 cycles for the UP case on
> a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
> simple slock = 0; reduces that count to 18 cycles.

This seems to be right, unfortunately.  I wonder if this has anything to
do with freebsd.org having no P4 machines.
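
For reference, the two stores being compared above are roughly the
following (a paraphrase of the idea, not copied from
i386/include/atomic.h, and the function names are invented):

%%%
/* Roughly what the xchgl-based release store boils down to (bus locked). */
static __inline void
store_rel_xchg(volatile u_int *p, u_int v)
{
	__asm __volatile("xchgl %1,%0"		/* implicit bus lock */
	    : "+m" (*p), "+r" (v)
	    :
	    : "memory");
}

/* The plain store substituted for the UP case (the 18-cycle case above). */
static __inline void
store_plain(volatile u_int *p, u_int v)
{
	*p = v;
}
%%%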

> If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
> then I think you should do it.  Of course, that still leaves mutexes
> as very expensive on SMP (253 cycles on the 2.53GHz from above).

I forgot (again) that there are memory access ordering issues.  A lock
may be needed to get everything synced.  See the comment before the i386
versions in i386/include/atomic.h.  A single lock may be enough.  The
best example I could think of easily is:

%%%
int foo;			/* supposedly protected by mtx */
	...
	mtx_lock(&mtx);
	if (foo == 0)
		foo++;
	mtx_unlock(&mtx);
	KASSERT(foo == 1, ("oops"));
%%%

On at least amd64's, reads can be done out of order relative to all
other reads and relative to writes to different memory locations.
mtx_lock(&mtx) doesn't go near foo's memory location, so foo may be
read before the lock is acquired.  If this code is interrupted after
foo is read but before the lock is acquired, then the interrupt
handler may run the same code and bump foo to 1.  Then on return, if
the out of order read is still valid, then the above will bump foo
again.  However, the lock in the mtx_unlock() in the interrupt handler
presumably makes the out of order read invalid (does it?), so on return
from the interrupt handler foo will be read again and found to be 1
(since even if the read is out of order relative to the mtx locking,
it is ordered relative to the write to foo's memory location).
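
Spelled out, the interleaving that worries me is something like this
(assuming the early read of foo survives the interrupt):

%%%
	main code				interrupt handler
	---------				-----------------
	read foo (hoisted above the lock, sees 0)
	<interrupt>
						mtx_lock(&mtx);
						if (foo == 0)
							foo++;	/* foo = 1 */
						mtx_unlock(&mtx);
	<return from interrupt>
	mtx_lock(&mtx);
	if (foo == 0)			/* uses the stale 0 */
		foo++;			/* foo = 2 */
	mtx_unlock(&mtx);
	KASSERT(foo == 1, ("oops"));	/* fires */
%%%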

If this is correct, then someone's home-made locking using
atomic_cmpset for unlock was more than a style bug :-).
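
The pattern I mean is roughly the following (function names invented;
only atomic_cmpset_int and atomic_store_rel_int are real interfaces):

%%%
/* Home made unlock: plain cmpset promises no release ordering. */
static void
homemade_unlock(volatile u_int *lk)
{
	atomic_cmpset_int(lk, 1, 0);
}

/* The _rel store gives the release semantics that an unlock needs. */
static void
proper_unlock(volatile u_int *lk)
{
	atomic_store_rel_int(lk, 0);
}
%%%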

To possibly reduce the locking overhead for at least the non-SMP case
on i386's and amd64's, there are the [lms]fence instructions and
serializing instructions.  On AthlonXPs, movl + sfence takes about
half as many cycles as xchgl.  I think lfence is actually needed, but
AthlonXPs only have sfence.  All the serializing instructions seem to
be too heavyweight to help here.  On sledge's amd64, the saving using
lfence is smaller (atomic_cmpset_acq_int + movl: 6 cycles; + lfence:
15 cycles; atomic_cmpset_acq_int + atomic_store_rel_int: 21 cycles)
(all including loop overhead).
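
The movl + sfence variant is roughly this (my reconstruction; whether
the fence should really be lfence, and on which side of the store it
belongs, is part of the question above):

%%%
/* movl + sfence: about half the cost of xchgl on an AthlonXP. */
static __inline void
store_rel_fence(volatile u_int *p, u_int v)
{
	__asm __volatile("sfence" : : : "memory");	/* order earlier stores */
	*p = v;						/* plain movl */
}
%%%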

Bruce