Re: Optimized copy&move (was: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs)

From: Bruce Evans <bde_at_zeta.org.au>
Date: Sun, 21 Jan 2007 15:03:48 +1100 (EST)

On Sat, 20 Jan 2007, David Malone wrote:

> On Thu, Jan 18, 2007 at 11:16:19AM +1100, Bruce Evans wrote:
>> - the FPU routines are faster on Athlons (XP and 64 at least), but these
>>   didn't exist until 2001.  The introduction of these CPUs may have
>>   been the trigger for turning off the FPU routines in -current in 2001.
>>   Until then problems were limited to Pentium-1's since the dynamic
>>   configuration prevented the routines being used on all other machines.
>
> I think a very quirky K6-2 machine that I had let us reproduce the
> problem fairly dependably and may have been part of the reason it
> was finally turned off.

I just looked again at your old (2001) mail about this.  The userland
benchmark was flawed.  It tried 3 methods sequentially without warming
up caches, so all methods did unintended testing of I-cache misses
(including branch target cache misses) and the first method (userland
bzero) warmed up the D-cache for the other 2.  The kernel runtime
configuration also fails to either warm or cool the caches initially.
It assumes P1 cache sizes and depends on a 1MB buffer being much larger
than caches.  Maybe this was not enough for K6-2.  It is certainly not
enough for Athlon64, but I think it would mostly cause false negatives
so I don't understand why it gave a false positive for the K6-2.
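
To illustrate the kind of fix meant here, the following is a minimal
sketch of a userland timing harness that runs untimed warm-up passes
before timing each method, so that no method pays the I-cache and
D-cache misses on behalf of the others.  It is not the original
benchmark or the kernel code: the names zero_memset, zero_words, now_sec
and bench_zero are hypothetical, and the two zeroing methods merely
stand in for whatever is being compared.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*zero_fn)(void *, size_t);

/* Hypothetical candidate 1: whatever libc provides. */
void
zero_memset(void *p, size_t n)
{
	memset(p, 0, n);
}

/* Hypothetical candidate 2: a plain word-at-a-time loop.
 * Assumes p is suitably aligned for unsigned long. */
void
zero_words(void *p, size_t n)
{
	unsigned long *w = p;
	size_t i, nw = n / sizeof(*w);

	for (i = 0; i < nw; i++)
		w[i] = 0;
	/* Clear any tail bytes that do not fill a whole word. */
	memset((char *)p + nw * sizeof(*w), 0, n % sizeof(*w));
}

double
now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (ts.tv_sec + ts.tv_nsec / 1e9);
}

/*
 * Return the average seconds per call of fn(buf, len) over `iters`
 * timed passes.  The `warmup` untimed passes run first so that every
 * method starts from the same (warm) cache state, instead of the
 * first method tested warming the caches for the later ones.
 */
double
bench_zero(zero_fn fn, void *buf, size_t len, int warmup, int iters)
{
	double t0;
	int i;

	assert(iters > 0);
	for (i = 0; i < warmup; i++)
		fn(buf, len);
	t0 = now_sec();
	for (i = 0; i < iters; i++)
		fn(buf, len);
	return ((now_sec() - t0) / iters);
}
```

Calling bench_zero() once per method with the same buffer and a nonzero
warmup count measures the warm-cache case.  To measure the uncached
case instead, len has to be much larger than the largest cache on the
machine under test, which is the assumption that a fixed 1MB buffer no
longer satisfies on newer CPUs.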

After fixing the userland benchmark, userland bzero did much better
and your benchmark agreed with mine that FPU methods for bzero are
just pessimizations on A64-AXP.  However, the behaviour for bcopy
is quite different on A64-AXP -- even the old FPU methods are small
optimizations in some cases (on A64, about 25% in the fully-L2 cached
case; little difference for other large copies).

Bruce
Received on Sun Jan 21 2007 - 03:03:52 UTC
