Re: Optimized copy&move (was: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs)

From: Bruce Evans <bde_at_zeta.org.au>
Date: Thu, 18 Jan 2007 11:16:19 +1100 (EST)

On Wed, 17 Jan 2007, Attilio Rao wrote:

> 2007/1/17, Ivan Voras <ivoras_at_fer.hr>:
>> Bruce Evans wrote:
>> 
>> > And MMX/XMM registers are not needed to get movnt on machines with SSE2,
>> > since movnti is part of SSE2.  This reduces the advantages of using
>> > MMX/XMM registers on P4's and A64's in 32-bit mode to the non-nt parts
>> > of the above (fully cached case), which I think are less important than
>> > the nt parts.

One more point on movnt*:

- The i386 (32-bit) kernel already uses movnti in the one place where it is
   certain to be an optimization (for zeroing pages, but not for copying
   anything), if and only if movnti is known for sure to be available
   (essentially, if the CPU supports SSE2).  See i386/pmap.c and
   i386/support.s.
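
As a rough illustration of the page-zeroing idea above, here is a userland
sketch in C.  It is not the code in support.s: the names pagezero_sketch
and PAGE_SIZE_SKETCH are made up for this example, the 4096-byte page size
is an assumption, and a compile-time __SSE2__ check stands in for the
kernel's runtime CPU feature test.  The _mm_stream_si32() intrinsic is the
usual way to get a movnti from C.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>		/* _mm_stream_si32(), SSE2 */
#include <xmmintrin.h>		/* _mm_sfence() */
#endif

#define PAGE_SIZE_SKETCH 4096	/* assumed page size for this sketch */

/*
 * Zero one page.  With SSE2 compiled in, use non-temporal 32-bit stores
 * so that the zeroed page does not displace useful data from the cache.
 * Otherwise fall back to an ordinary cached memset().
 */
static void
pagezero_sketch(void *page)
{
#if defined(__SSE2__)
	int *p = page;
	size_t i;

	for (i = 0; i < PAGE_SIZE_SKETCH / sizeof(int); i++)
		_mm_stream_si32(&p[i], 0);	/* non-temporal store (movnti) */
	_mm_sfence();	/* make the nt stores visible before later stores */
#else
	memset(page, 0, PAGE_SIZE_SKETCH);
#endif
}
```

Copying is deliberately left out of the sketch, matching the point above
that zeroing is the one case where movnti is certain to be a win.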

>> Hmm, I'm looking at i386/i386/support.s and there are several versions
>> of bcopy and bmove functions, including some that optimize by using FPU
>> registers (large_i586_bcopy_loop), and a version that uses movnti
>> (sse2_pagezero), but I can't find the bit of magic which glues them to
>> bzero() call.

sse2_pagezero() is the SSE2 optimization mentioned above.  It is glued
in pmap.c.  The i586 bcopy functions are my old mistakes which I'm
trying to bury :-).  They are glued in isa/npx.c.  There is a runtime
test for their efficiency there, and config-time configuration by npx
flags (see NOTES).  All use of the FPU routines is disabled in all
released versions of FreeBSD later than FreeBSD-4 (the config-time
configuration is ignored and the dynamic test and the glueing are not
done; only the actual routines are compiled, so in theory you could
enable them by glueing them in a kmod).
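
The kind of glue described above can be sketched in C as a function
pointer that starts out pointing at the generic routine and is switched
once, at attach time, if configuration allows it and a timing test says
the alternative is faster.  All names here (bcopy_vector_sketch,
select_bcopy_sketch and so on) are invented for the example and are not
the identifiers used in npx.c, and the "fast" routine is only a stand-in.

```c
#include <stddef.h>
#include <string.h>

typedef void bcopy_fn(const void *src, void *dst, size_t len);

/* Generic routine: always safe to use. */
static void
generic_bcopy_sketch(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* Stand-in for an optimized routine (for example one using FPU registers). */
static void
fast_bcopy_sketch(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* The "glue": callers go through this pointer, which defaults to generic. */
static bcopy_fn *bcopy_vector_sketch = generic_bcopy_sketch;

/*
 * Run once at attach time.  The fast routine is installed only if the
 * config-time flag allows it and the measured time is actually lower.
 */
static void
select_bcopy_sketch(int config_allows, long generic_usec, long fast_usec)
{
	if (config_allows && fast_usec < generic_usec)
		bcopy_vector_sketch = fast_bcopy_sketch;
}

/* Public entry point: one indirect call, no per-call feature test. */
static void
bcopy_sketch(const void *src, void *dst, size_t len)
{
	bcopy_vector_sketch(src, dst, len);
}
```

If select_bcopy_sketch() is never called, which corresponds to the
disabled state described above, the fast routine is still compiled in but
nothing ever reaches it.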

>> Also, as far as I can tell by the comments, the FPU version works by
>> manually saving context... why is this possible (i.e. won't something
>> preempt it?)

In RELENG_4, the kernel is not preemptible so preemption isn't a problem.
In later versions, preemption is a problem so the FPU routines are disabled.

RELENG_4 has a limited amount of preemption for interrupt handlers, and
the FPU routines have a limited amount of recursive saving of contexts
to support this.  No bugs are known in this under RELENG_4.  Under later
versions, the recursive saving doesn't quite work.

RELENG_4 has a limited amount of support for SMP, and the FPU routines
have a limited amount of locking to support this.  No bugs are known
in this under RELENG_4, but I wouldn't trust it without testing.  It
has probably been tested enough under RELENG_4 by now, but it might
never have been tested before 2001 when the FPU routines were turned
off in -current, because there were no machines with the critical
combination of features until relatively recently:
- SMP machines with Pentium-1's were rare and are now rarer
- the FPU routines are slower on P2-P4, K5 and K6 so the dynamic
   configuration should prevent them being used on machines with P2-P4,
   K5 and K6.
- the FPU routines are faster on Athlons (XP and 64 at least), but these
   didn't exist until 2001.  The introduction of these CPUs may have
   been the trigger for turning off the FPU routines in -current in 2001.
   Until then problems were limited to Pentium-1's since the dynamic
   configuration prevented the routines being used on all other machines.

> They are just broken.
> My implementation, which follows DragonFlyBSD patterns, just uses a bts
> (which is atomic) in order to set a "lock" and avoid thread migration
> with scheduler pinning. This is enough to solve concurrency problems.

There is a bit more to it than that :-).

The old implementation uses a sar <mem> instruction for the same purpose.
Neither bts nor sar <mem> is atomic, but both can be made atomic using
a lock prefix.  The old implementation neglects to do this, so the
instruction is only atomic with respect to interrupts.  If it works at
all for the SMP case under RELENG_4, then it is because Giant locking
prevents all types of preemption.  Giant locking certainly prevents
process preemption, but it is less clear that it prevents interrupt
handlers running on other CPUs from getting far enough to clobber the
lock.  I think it does.  The unlocked sar just doesn't work under
-current, especially since the kernel became fully preemptible (much
later than 2001).  (I like to use sar instead of bts/cmpxchg/whatever
since it is more portable.)
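
For comparison, the general shape of such a try-lock can be sketched in
portable C11.  This is neither the old sar-based code nor the bts-based
patch: it swaps in atomic_flag, whose test-and-set is atomic with respect
to other CPUs as well as interrupts, and the names fpu_lock_sketch and
bcopy_locked_sketch are invented for the example.  Thread pinning and the
actual saving of FPU state are not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* One global lock guarding the (imaginary) shared FPU save area. */
static atomic_flag fpu_lock_sketch = ATOMIC_FLAG_INIT;

/*
 * Copy len bytes.  Returns 1 if the "fast" path ran with the lock held,
 * 0 if the lock was busy and the generic fallback was used instead.
 */
static int
bcopy_locked_sketch(const void *src, void *dst, size_t len)
{
	/* Atomic test-and-set: returns the previous value of the flag. */
	if (atomic_flag_test_and_set_explicit(&fpu_lock_sketch,
	    memory_order_acquire)) {
		/* Someone else holds the lock: never spin, just fall back. */
		memmove(dst, src, len);
		return (0);
	}

	/*
	 * Lock acquired.  A kernel would pin the thread and save the FPU
	 * context here before using the FPU registers; this sketch only
	 * does a plain copy in their place.
	 */
	memmove(dst, src, len);

	atomic_flag_clear_explicit(&fpu_lock_sketch, memory_order_release);
	return (1);
}
```

Falling back instead of spinning means the routine never waits on the
lock, so a holder that gets interrupted cannot deadlock an interrupt
handler that also wants to copy.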

Bruce
Received on Wed Jan 17 2007 - 23:16:24 UTC
