On Wed, 17 Jan 2007, I wrote:

> ...
> P4 (nosedive's Xeon): movdqa 17% faster than movsl, but all other cached
> moves slower using MMX or SSE[1-2]; movnt with block prefetch 60% faster
> than movsl with no prefetch, but < 5% faster with no prefetch for both.
> AXP: (my 5 year old system with a newer CPU): movq through MMX is 60%
> faster than movsl for cached moves, but movdqa through XMM is only 4%
> faster. movnt with block prefetch is 155% faster than movsl with no
> prefetch, and 73% faster with no prefetch for both.
> A64 in 32-bit mode: in between P4 and AXP (closer to AXP). movsl doesn't
> lose by so much, and prefetchnta actually works so block prefetch is
> not needed and there is a better chance of prefetching helping more
> than benchmarks.

And MMX/XMM registers are not needed to get movnt on machines with SSE2,
since movnti is part of SSE2. This reduces the advantages of using
MMX/XMM registers on P4's and A64's in 32-bit mode to the non-nt parts
of the above (fully cached case), which I think are less important than
the nt parts.

Another complication with movnt is that its semantics are very
machine-dependent. On AXP, movnt to a target that happens to be in the
L1 cache goes at L1 cache speed, so it is probably good to use movnt
blindly (except movnti doesn't exist there, so you can't just substitute
movnti for movl and must use XMM registers with all their
complications), but on P4 and A64, movnt to a cached target goes at main
memory speed, so you only want to use it intentionally to avoid
thrashing the caches.

Bruce

Received on Wed Jan 17 2007 - 11:00:46 UTC
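To illustrate the movnti point above, here is a minimal sketch (not the code that was benchmarked) of a word copy that issues non-temporal stores from general-purpose registers, with no MMX/XMM state involved. The helper name copy_words_nt is hypothetical. It assumes a compiler that provides <emmintrin.h>, where _mm_stream_si32 normally compiles to movnti, and it falls back to ordinary stores when SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (movnti) and _mm_sfence */
#endif

/*
 * Hypothetical helper: copy n 32-bit words from src to dst.
 * With SSE2 enabled, each store is a non-temporal store from a
 * general-purpose register, so no MMX/XMM registers are needed.
 * Without SSE2, it degrades to ordinary stores.
 */
static void copy_words_nt(int32_t *dst, const int32_t *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32((int *)&dst[i], (int)src[i]);
#else
        dst[i] = src[i];
#endif
    }
#if defined(__SSE2__)
    /* Non-temporal stores are weakly ordered; fence before dst is read. */
    _mm_sfence();
#endif
}
```

Given the P4 and A64 behaviour described above, a routine like this would only be worth calling deliberately, for copies large enough that keeping the target out of the caches is the goal.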