Re: Atomic operations on i386/amd64

From: Bruce Evans <bde_at_zeta.org.au> Date: Wed, 11 Aug 2004 18:56:23 +1000 (EST)

On Thu, 5 Aug 2004, John Baldwin wrote:

> On Thursday 05 August 2004 01:04 am, Tim Robbins wrote:
> > Is there any particular reason why atomic_load_acq_*() and
> > atomic_store_rel_*() are implemented with CMPXCHG and XCHG instead of
> > MOV on i386/amd64 UP?

It is because something like a locked instruction must be used for
synchronization on old i386's, and "LOCK; MOV" is an invalid instruction.

I ran AthlonXP and Celeron UP systems for some time with MOV instead of
XCHG in atomic_store_rel_*(), but changed back to XCHG since it didn't
make much difference and I'm not sure what is best for my other systems
(Celeron SMP and amd64).
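
To make the comparison concrete, here is a rough sketch of the two
store_rel variants in GCC-style inline asm.  The sketch_* names are
made up for illustration and this is not the code in the tree; the
non-x86 branch is only there so that the sketch compiles elsewhere.

```c
#include <stdint.h>

/*
 * Hypothetical store_rel using XCHG.  XCHG with a memory operand is
 * locked implicitly, so no LOCK prefix is written.
 */
static inline void
sketch_store_rel_xchg(volatile uint32_t *p, uint32_t v)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("xchgl %1, %0"
        : "+m" (*p), "+r" (v)   /* v receives the old value; discarded */
        :
        : "memory");
#else
    /* Assumption: compiler builtin as a stand-in on other CPUs. */
    (void)__atomic_exchange_n(p, v, __ATOMIC_SEQ_CST);
#endif
}

/*
 * Hypothetical store_rel using a plain MOV.  The "memory" clobber is
 * only a compiler barrier; no locked or fence instruction is emitted.
 */
static inline void
sketch_store_rel_mov(volatile uint32_t *p, uint32_t v)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("movl %1, %0"
        : "=m" (*p)
        : "r" (v)
        : "memory");
#else
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
#endif
}
```

Both leave the same value in *p; they differ only in whether a locked
instruction is executed on the way.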

> Actually, using mov instead of lock xchg for store_rel reduced performance in
> some benchmarks Scott ran on an SMP machine, I'm guessing due to the higher
> latency of locks becoming available to other CPUs.  I'm still waiting for
> benchmark results on UP to see if the change should be made under #ifndef SMP
> or some such.

I don't believe unlocked instructions could be slower, and using
unlocked && unfenced instructions is just broken in the SMP case.
Perhaps there is enough synchronization provided by the lock in load_acq
(which in theory needs less locking than store_rel) for missing
synchronization in store_rel to sort of work.
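
For reference, the locked load_acq being discussed has roughly the
following shape.  Again this is only a sketch under assumptions: the
name sketch_load_acq_cmpxchg is invented, and the non-x86 branch is a
stand-in so that it compiles elsewhere.

```c
#include <stdint.h>

/*
 * Hypothetical load_acq using LOCK CMPXCHG as a "locked load".
 * CMPXCHG compares %eax with *p.  If they are equal it stores the
 * source register, which is also %eax here, so *p is unchanged.  If
 * they differ it loads *p into %eax.  Either way %eax ends up holding
 * the value of *p.
 */
static inline uint32_t
sketch_load_acq_cmpxchg(volatile uint32_t *p)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t res = 0;       /* initial value is arbitrary */

    __asm__ __volatile__("lock; cmpxchgl %0, %1"
        : "+a" (res), "+m" (*p)
        :
        : "memory", "cc");
    return (res);
#else
    /* Assumption: compiler builtin as a stand-in on other CPUs. */
    return (__atomic_load_n(p, __ATOMIC_ACQUIRE));
#endif
}
```

Note that this "load" is a read-modify-write as far as the hardware is
concerned, so the target must be writable.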

> > Also, could we use MFENCE/LFENCE/SFENCE in combination with MOV on
> > SMP systems instead of LOCK CMPXCHG / (implied LOCK) XCHG?

It isn't clear to me (from the amd64 manuals) that *FENCE affects caches
other than the ones seen by the current CPU.  I think it does, and that
the fences can be used (MFENCE might be needed for both).  They should
work for the same reason that "LOCK MOV" is an invalid instruction: MOV
is inherently atomic (?).  Apparently we are using fake "LOCK MOV"s just
for the side effects of the lock (at least on amd64's, a locked
instruction does *FENCE implicitly).
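
A MOV-plus-fence version might look something like the sketch below.
It uses MFENCE for both operations, which is the conservative guess
above and not a claim about what is actually required.  The sketch_*
and SKETCH_MFENCE names are hypothetical, and it assumes a CPU that
has MFENCE at all.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#define SKETCH_MFENCE() __asm__ __volatile__("mfence" : : : "memory")
#else
/* Assumption: compiler builtin as a stand-in on other CPUs. */
#define SKETCH_MFENCE() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* Hypothetical store_rel: fence first, then an ordinary store. */
static inline void
sketch_store_rel_fence(volatile uint32_t *p, uint32_t v)
{
    SKETCH_MFENCE();    /* order earlier accesses before the store */
    *p = v;             /* aligned 32-bit store, a plain MOV on x86 */
}

/* Hypothetical load_acq: an ordinary load, then a fence. */
static inline uint32_t
sketch_load_acq_fence(volatile uint32_t *p)
{
    uint32_t v;

    v = *p;             /* aligned 32-bit load, a plain MOV on x86 */
    SKETCH_MFENCE();    /* order the load before later accesses */
    return (v);
}
```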

> MFENCE and LFENCE only exist on the P4.  SFENCE only exists on P3+, so to do
> so you'd lose the ability to run on PII's and earlier.  Also, if you use more
> than SFENCE you lose PIII's.  Note that amd64 could probably be changed
> though since they might all have fences, in which case that might be
> something to benchmark on both UP and SMP to see what kind of difference it
> makes.

amd64 does have all fences.
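
On i386 the choice would have to be made from the CPUID feature bits.
A minimal sketch, assuming the SSE bit (EDX bit 25 of CPUID leaf 1) as
the test for SFENCE and the SSE2 bit (EDX bit 26) as the test for
LFENCE and MFENCE; the struct and function names are invented here:

```c
#include <stdint.h>

#define SKETCH_CPUID_SSE    (1u << 25)  /* CPUID.1:EDX bit 25 */
#define SKETCH_CPUID_SSE2   (1u << 26)  /* CPUID.1:EDX bit 26 */

/* Hypothetical summary of which fence instructions may be used. */
struct sketch_fences {
    int sfence;
    int lfence;
    int mfence;
};

/* Decode an EDX value obtained from CPUID leaf 1. */
static inline struct sketch_fences
sketch_fences_from_edx(uint32_t edx)
{
    struct sketch_fences f;

    f.sfence = (edx & SKETCH_CPUID_SSE) != 0;
    f.lfence = (edx & SKETCH_CPUID_SSE2) != 0;
    f.mfence = f.lfence;    /* LFENCE and MFENCE both came with SSE2 */
    return (f);
}
```

The EDX value would come from executing CPUID with %eax = 1, for
example through __get_cpuid() in GCC's <cpuid.h>.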

See the thread "RE: 4.7 vs 5.2.1 SMP/UP bridging performance" in
freebsd-current for some benchmarks.  Locked instructions seem to be
relatively fast on amd64's (about the same number of cycles as on old
systems), with fences not much faster (about 15 cycles instead of about
30, IIRC).  Using fences on P4 makes locking only twice as slow as on
amd64, instead of 5-10 times slower.

Bruce