Re: [PATCH] Maintaining turnstile aligned to 128 bytes in i386 CPUs

From: Bruce Evans <bde_at_zeta.org.au> Date: Wed, 17 Jan 2007 23:00:42 +1100 (EST)

On Wed, 17 Jan 2007, I wrote:

> ...
> P4 (nosedive's Xeon): movdqa 17% faster than movsl, but all other cached
>   moves slower using MMX or SSE[1-2]; movnt with block prefetch 60% faster
>   than movsl with no prefetch, but < 5% faster with no prefetch for both.
> AXP: (my 5 year old system with a newer CPU): movq through MMX is 60%
>   faster than movsl for cached moves, but movdqa through XMM is only 4%
>   faster.  movnt with block prefetch is 155% faster than movsl with no
>   prefetch, and 73% faster with no prefetch for both.
> A64 in 32-bit mode: in between P4 and AXP (closer to AXP).  movsl doesn't
>   lose by so much, and prefetchnta actually works so block prefetch is
>   not needed and there is a better chance of prefetching helping more
>   than benchmarks.

And MMX/XMM registers are not needed to get movnt on machines with SSE2,
since movnti (which stores from a general register) is part of SSE2.  This
reduces the advantages of using MMX/XMM registers on P4's and A64's in
32-bit mode to the non-nt parts of the above (fully cached case), which I
think are less important than the nt parts.
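
To illustrate the movnti point, here is a minimal userland sketch in C,
not the kernel code under discussion.  It assumes a compiler that provides
the SSE2 intrinsics in <emmintrin.h>; the function name copy_words_nt is
made up for this example, and the plain-store fallback is only there so
that the sketch still builds where SSE2 is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32() and _mm_sfence() */
#endif

/*
 * Hypothetical helper: copy n ints using non-temporal stores issued from
 * general-purpose registers, so no MMX/XMM register is touched.
 */
static void copy_words_nt(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], src[i]);  /* maps to movnti */
    _mm_sfence();  /* order the weakly-ordered nt stores before later stores */
#else
    /* Assumption: no SSE2 available, so fall back to ordinary stores. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
#endif
}
```

Only the stores are non-temporal here; the loads are ordinary cached
loads, and no prefetching is attempted.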

Another complication with movnt is that its semantics are very machine-
dependent.  On AXP, movnt to a target that happens to be in the L1
cache goes at L1 cache speed, so it is probably good to use movnt
blindly there (except that movnti doesn't exist on AXP, so you can't just
replace movl with movnti and must use XMM registers with all their
complications).  But on P4 and A64, movnt to a cached target goes at main
memory speed, so you only want to use it intentionally to avoid thrashing
the caches.
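
One way to use movnt "intentionally" is to select it only for copies too
large to benefit from the caches.  The sketch below is an assumption on my
part rather than anything measured above: the 256K threshold and the names
NT_COPY_THRESHOLD, choose_copy_path and copy_ints are hypothetical, and a
real cutoff would have to be tuned per CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32() and _mm_sfence() */
#endif

/* Hypothetical tuning knob; not a value taken from the benchmarks. */
#define NT_COPY_THRESHOLD ((size_t)256 * 1024)

enum copy_path { COPY_CACHED, COPY_NONTEMPORAL };

/* Policy only: large copies bypass the caches, small ones stay cached. */
static enum copy_path choose_copy_path(size_t nbytes)
{
    return nbytes >= NT_COPY_THRESHOLD ? COPY_NONTEMPORAL : COPY_CACHED;
}

/*
 * Copy n ints and return the path chosen by the policy.  Without SSE2
 * the copy itself falls back to memcpy() whatever the policy says.
 */
static enum copy_path copy_ints(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    enum copy_path path = choose_copy_path(n * sizeof(int));
#if defined(__SSE2__)
    if (path == COPY_NONTEMPORAL) {
        for (size_t i = 0; i < n; i++)
            _mm_stream_si32(&dst[i], src[i]);  /* movnti store */
        _mm_sfence();
        return path;
    }
#endif
    memcpy(dst, src, n * sizeof(int));
    return path;
}
```

A size cutoff is only a guess at whether the target is cached; on a
machine that behaves like AXP the policy could simply always pick the nt
path.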

Bruce