Re: [PATCH] Maintaining turnstile aligned to 128 bytes in i386 CPUs

From: Bruce Evans <bde_at_zeta.org.au>
Date: Thu, 18 Jan 2007 18:03:20 +1100 (EST)
On Wed, 17 Jan 2007, Matthew Dillon wrote:

>    The cost of using the FPU can simply be thought of in terms of how
>    many bytes you have to copy for it to become worth using
>    the FPU over a far less complex integer copy loop.

It's not that simple...

> ...
>      In fact, I would say that if userland is not using the FP unit,
>      that is npxthread == NULL or npxthread != curthread, you should
>      *DEFINITELY* use the FP unit.  Hands down, no question about it.

Only if the raw FPU is actually faster.

>    * First, raw memory bandwidth is governed by RAS cycles.  The fewer RAS
>      cycles you have, the higher the bandwidth.
>
>      This means that the more data you can load into the cpu on the 'read'
>      side of the copy before transitioning to the 'write' side, the better.
>
>      With XMM you can load 128 *BYTES* a shot (8 128 bit registers).  For
>      large copies, nothing beats it.

No, this is very machine-dependent.  Read-write transitioning is slow, but
it is almost a non-problem since it can be avoided by prefetching the
read side.  Dirtying the cache with the read side is unavoidable on
current i386-amd64 CPUs.  Prefetching makes this a feature:

 	prefetch N bytes, preferably using a working prefetchnta (1)
 	write N bytes, possibly to the cache (2),
 		       possibly much less than one XMM register's
 		       worth at a time (3)

Just make N much larger than 128 for this to easily beat simply loading
and storing 128 bytes at a time (4) (5).

(1) If the source is already in the cache, the extra instructions for
     the prefetch are a bit wasteful.  Using prefetchnta limits the
     waste, perhaps to 0 depending on whether prefetchnta can be
     executed in an otherwise unused pipeline.  However, a working
     prefetchnta is required for the prefetch to actually occur in
     the case that the source isn't already in the cache.  A64's have
     a working prefetchnta, but AXP's don't.

(2) Reads and writes through the L1 cache give much the same benefits as
     reads and writes through many registers.  Reads must go through
     the L1 cache.

(3) Write combining of sequential writes makes the number of memory
     accesses independent of the instruction operand size, at least for
     cached writes.  If writes are direct, then the issues are different.
     I've never seen 128-bit nontemporal writes go faster than 64-bit ones
     so maybe they are combined too.  (I may have seen this on AXP while
     writing this.)

(4) Using only registers 128 bytes at a time may beat prefetch in cases
     where the prefetch is too wasteful.  However, I think these cases
     are limited ones where the copy size is small so you wouldn't want
     to use prefetch or nontemporal writes and any benefits from using
     XMM registers are marginal (because of the context switching
     overheads).  Such cases are handled well by caching.  If the cache
     line size is 128, then read/write through the cache gives RAS cycles
     that are independent of the instruction operand sizes.  Using 8*16
     bytes from registers can only win if the cache line size is 64.

(5) A64's have 16 not-very-general 64-bit integer registers and
     16 XMM registers, so you can get close to 128 bytes using integer
     registers (16*8) and can get 256 bytes using XMM registers (16*16).
     Caching and/or write buffering turns writes of the "small" 64-bit
     registers or even 32-bit registers into much larger writes, so levels
     of the memory system below L1 don't notice if we write tiny amounts
     at a time provided there is enough instruction bandwidth to keep up.
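
For concreteness, here is a minimal sketch of the prefetch-N-then-write-N
scheme above, in C with SSE intrinsics rather than the assembler actually
used for benchmarking; the block size and the plain cached 64-bit stores
are my choices, not anything measured above:

  #include <stddef.h>
  #include <stdint.h>
  #include <xmmintrin.h>          /* _mm_prefetch(), _MM_HINT_NTA */

  #define BLK 4096                /* N: much larger than 128 */

  /* Copy len bytes; len is assumed to be a multiple of BLK and both
   * pointers are assumed to be 8-byte aligned. */
  static void
  copy_blockprefetch(void *dst, const void *src, size_t len)
  {
      const char *s = src;
      char *d = dst;
      size_t i;

      while (len >= BLK) {
          /* Prefetch the whole read-side block first (1). */
          for (i = 0; i < BLK; i += 64)
              _mm_prefetch(s + i, _MM_HINT_NTA);
          /* Then write it; ordinary cached 64-bit stores are enough,
           * since write combining hides the small operand size (2)(3). */
          for (i = 0; i < BLK; i += 8)
              *(uint64_t *)(d + i) = *(const uint64_t *)(s + i);
          s += BLK; d += BLK; len -= BLK;
      }
  }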

>    * Modern cpu hardware uses a 128 bit data path for 128 bit media
>      instructions and can optimize the 128 bit operation all the way through
>      to a cache line or to main memory.  It can't be beat.

Actually, integer instructions beat it easily on A64:

Fully cached case:
% copy0: 16907105855 B/s ( 242265 us) (movsq)
% ...
% copyQ: 13658478027 B/s ( 299887 us) (movdqa)
% ...
% copyT: 16456408196 B/s ( 248900 us) (movq (i))

This doesn't contradict your claim since main memory is not really involved.
Everything should be limited by instruction and cache bandwidth, but for
some reason movdqa is quite slow despite or because of my attempts to
schedule it perfectly.

Fully uncached case:
% copy0:  829571300 B/s ( 493749 us) (movsq)
% ...
% copyQ:  835822845 B/s ( 490056 us) (movdqa)
% copyR:  908423544 B/s ( 450891 us) (movdqa with prefetchnta)
% copyS:  919026496 B/s ( 445689 us) (movdqa with block prefetch)
% copyT:  828993732 B/s ( 494093 us) (movq (i))
% copyU:  905141362 B/s ( 452526 us) (movq (i) with prefetchnta)
% copyV:  910626945 B/s ( 449800 us) (movq (i) with block prefetch)
% copyW: 1350397932 B/s ( 303318 us) (movnti)
% copyX: 1662594069 B/s ( 246362 us) (movnti with prefetchnta)
% copyY: 1608204355 B/s ( 254694 us) (movnti with block prefetch)

Now movdqa beats movsq by a bit, at least with prefetch, but has the
same speed as integer moves of half the size (or MMX moves of half the
size (benchmark data not shown)).  It is necessary to use movnt to get
anywhere near memory bandwidth, and any form of movnt can be used for
this (I test mainly movntps in addition to the above, since it is most
portable).  There is apparently some magic to combine 8-byte writes
if necessary for efficient main memory accesses.  The above was run
on an old A64 overclocked a lot to 2204 MHz, with mediocre DDR-1 PC3200
memory overclocked a bit and tuned a lot.  The maximum bandwidth
achieved is almost exactly PC3200 but the theoretical maximum is more
like PC3400.
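
The "movnti with prefetchnta" loop (copyX) is roughly the following,
modulo the fact that the benchmarks are hand-written assembler; the
unroll factor and prefetch distance here are my guesses, and
_mm_stream_si64 (movnti) is only available in 64-bit mode:

  #include <stddef.h>
  #include <emmintrin.h>          /* SSE2: _mm_stream_si64(), _mm_sfence() */

  /* Nontemporal 64-bit stores with prefetchnta on the read side.
   * len is assumed to be a multiple of 64, pointers 8-byte aligned. */
  static void
  copy_nt(void *dst, const void *src, size_t len)
  {
      const long long *s = src;
      long long *d = dst;

      while (len >= 64) {
          _mm_prefetch((const char *)s + 256, _MM_HINT_NTA);
          _mm_stream_si64(d + 0, s[0]);
          _mm_stream_si64(d + 1, s[1]);
          _mm_stream_si64(d + 2, s[2]);
          _mm_stream_si64(d + 3, s[3]);
          _mm_stream_si64(d + 4, s[4]);
          _mm_stream_si64(d + 5, s[5]);
          _mm_stream_si64(d + 6, s[6]);
          _mm_stream_si64(d + 7, s[7]);
          s += 8; d += 8; len -= 64;
      }
      _mm_sfence();               /* order the nontemporal stores */
  }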

>
>      Alignment is critical.  If the data is not aligned, don't bother.  128
>      bits means 16 byte alignment.

The above benchmark output is for aligned data :-).  I don't try hard to
optimize or benchmark misaligned cases.

>    * No extraneous memory writes, no uncached extraneous memory reads.
>      If you do any writes to memory other than to the copy destination
>      in your copy loop you screw up the cpu's write fifo and destroy
>      performance.
>
>      Systems are so sensitive to this that it is even better to spend the
>      time linearly mapping large copy spaces into KVM and do a single
>      block copy than to have an inner per-PAGE loop.

I haven't tried this, but have seen, and partly worked around, sensitivity
to linear KVA maps not being physically (non)linear enough.  Some CPUs
and/or memory systems are remarkably sensitive to bank interleave.
FreeBSD's page coloring doesn't know anything about banks, and
accidentally starts up with perfect miscoloring for banks.  This can
make a difference of 30% for bzero bandwidth in benchmarks (not so
much for bcopy bandwidth, and an insignificant amount for normal use).
After the system warms up, the coloring becomes random with respect
to banks, and random coloring works much better than perfect miscoloring.

>
>    * Use of prefetch or use of movntdq instead of movdqa is highly
>      problematic.  It is possible to use these to optimize very particular
>      cases but the problem is they tend to nerf all OTHER cases.

This is very machine-dependent:
- prefetchnta seems to work fine on A64 but not on AXP or P4 (see above
   and other postings).  I haven't tested it much for small copy sizes.
   I never saw the older prefetch instructions that give more control do
   anything useful.
- movntps may be free (once you use XMM and if ps instead of dq is OK)
   on AXP, since it runs at cache speed if the target is already cached.
   It is not free on A64 since it always runs at memory speed.  I think
   ps instead of dq is not OK since ps is slow if there are denormals
   in the data, but I don't remember if benchmarks support this.
   Unfortunately, the only movnt instruction for XMM registers on AXP
   is movntps.
- movnt* is a clear win in at least one case: zeroing pages in idle.
   FreeBSD on i386 uses movnti for all zeroing of pages in the SSE2 case.
   This might not be best for non-idle zeroing.  Maybe it should be used
   only for multiple pages.  For single pages, it is probably for a
   pagefault and you want at least the data near the fault address in
   the cache.  FreeBSD on amd64 also uses movnti for all copying of pages.

I think SunOS on amd64 uses movnt* whenever the copy size is >= 128.  This
seems too small.
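
As a rough illustration of the idle-time page zeroing case mentioned
above (a sketch only, not the actual FreeBSD assembler routine, and using
movntdq via _mm_stream_si128 instead of movnti for brevity):

  #include <emmintrin.h>

  #define PAGE_SIZE 4096

  /* Zero one page with nontemporal stores so that idle-time zeroing
   * does not evict useful data from the caches. */
  static void
  pagezero_nt(void *page)
  {
      __m128i zero = _mm_setzero_si128();
      char *p = page;
      int i;

      for (i = 0; i < PAGE_SIZE; i += 64) {
          _mm_stream_si128((__m128i *)(p + i +  0), zero);
          _mm_stream_si128((__m128i *)(p + i + 16), zero);
          _mm_stream_si128((__m128i *)(p + i + 32), zero);
          _mm_stream_si128((__m128i *)(p + i + 48), zero);
      }
      _mm_sfence();
  }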

>      I've given up trying to use either mechanism.  Instead, I prefer
>      copying as large a block as possible to remove these variables from
>      the cpu pipeline entirely.  The cpu has a write fifo anyway, you
>      don't need prefetch instructions if you can use instructions to write
>      to memory faster than available L2 cache bandwidth.  On some cpus
>      this mandates the use of 64 or 128 bit media instructions or the
>      cpu can't keep the write FIFO full and starts interleaving reads
>      and writes on the wrong boundaries (creating more RAS cycles, which
>      is what kills copy bandwidth).

Which modern CPUs can't keep up with L2?

On amd64 for the fully-L2-cached case:
% copy0: 4382675165 B/s ( 934589 us) (movsq)
% ...
% copyN: 3432351420 B/s (1193351 us) (movntq)
% copyO: 1777008820 B/s (2304997 us) (movntq with prefetchnta)
% copyP: 3278961491 B/s (1249176 us) (movntq with block prefetch)
% copyQ: 4369052686 B/s ( 937503 us) (movdqa)
% copyR: 2785886655 B/s (1470268 us) (movdqa with prefetchnta)
% copyS: 4553271385 B/s ( 899573 us) (movdqa with block prefetch)
% copyT: 4401466152 B/s ( 930599 us) (movq (i))
% copyU: 2817507895 B/s (1453767 us) (movq (i) with prefetchnta)
% copyV: 4523562615 B/s ( 905481 us) (movq (i) with block prefetch)
% copyW: 3432429080 B/s (1193324 us) (movnti)
% copyX: 1776507852 B/s (2305647 us) (movnti with prefetchnta)
% copyY: 3136767185 B/s (1305803 us) (movnti with block prefetch)

So here is a case where prefetchnta is bad.  OTOH, movnti is quite fast
provided prefetchnta is not used with it.  movnti is even more competitive
with L2 if main memory is DDR-2.

>    * RAS transitions also have to be aligned or you get boundary cases
>      when the memory address transitions a RAS line.  This again mandates
>      maximal alignment (even more than 16 bytes, frankly, which is why
>      being able to do 128 byte blocks with XMM registers is so nice).
>      Even though reads and writes are reblocked to the cache line size
>      by the cpu, your inner loop can still transition a RAS boundary in
>      the middle of a large block read if it isn't aligned.
>
>      But at this point the alignment requirements start to get kinda
>      silly.  128 byte alignment requirement?  I don't think so.  I
>      do a 16-byte alignment check in DragonFly as a pre-req for using
>      XMM and that's it.

:-).  There are just too many combinations to even think about.  I prefer
to reduce the number of cases by depending on the prefetching working
reasonably well.

>    I don't know about AMD64.  You only have 64 bit general registers in 64
>    bit mode so you may not be able to keep the write pipeline full.  But
>    you do have 8 of them so you are roughly equivalent to MMX (but not
                  16, almost
>    XMM's 8 128 bit registers).

On A64 and AXP, most things are limited by the load/store unit (LSU),
the 3 integer pipelines, and SSE latency (but latency is not a problem
here).

Load/store-unit:

On A64, the LSU can do the following number of interesting L1 cache accesses
per cycle:
     2 64-bit loads
     2 64-bit stores
     1 64-bit load and 1 64-bit store
all to the L1 cache.  When you load or store a 128-bit XMM register, it
takes the same load/store-unit resources as 2 64-bit integer loads or
stores of the same type.  Thus XMM and integer operations can both
achieve the maximum L1 cache bandwidth, but no more of course, so
XMM vs integer is irrelevant for bandwidth considerations for the L1
level, and since lower levels need less bandwidth, it is irrelevant
for all levels except possibly the non-cache-level given by nontemporal
stores.
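
To make the accounting concrete, both of the following inner loops consume
the same L1 load/store bandwidth on A64 -- each 128-bit access counts as
two 64-bit LSU slots.  (A sketch; the loop structure is mine and a
compiler is free to reschedule it.)

  #include <stddef.h>
  #include <stdint.h>
  #include <emmintrin.h>

  /* 16 bytes/iteration as two 64-bit integer load/store pairs. */
  static void
  copy16_int(uint64_t *d, const uint64_t *s, size_t n)
  {
      while (n--) {
          d[0] = s[0];            /* 1 64-bit load + 1 64-bit store */
          d[1] = s[1];
          s += 2; d += 2;
      }
  }

  /* 16 bytes/iteration as one movdqa load/store pair; the LSU splits
   * each 128-bit access into two 64-bit halves, so the L1 bandwidth
   * used is the same as above. */
  static void
  copy16_xmm(__m128i *d, const __m128i *s, size_t n)
  {
      while (n--)
          _mm_store_si128(d++, _mm_load_si128(s++));
  }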

On AXP, the LSU can do the following number of interesting L1 cache accesses
per cycle:
     2 64-bit loads
     2 32-bit stores
     1 64-bit store
     1 64-bit load and 1 64-bit store
Since there are no 64-bit integer registers, you have to use MMX or
XMM to get the 64-bit accesses.  It doesn't matter whether you use
2*MMX or 1*XMM to get to 128 bits -- both become 2*64 bits.  32-bit
integer stores can only use half of the L1 cache bandwidth.  32-bit
integer loads are also limited to 2 per cycle since they apparently
aren't combined at this level.  Combining also doesn't happen for movsl
(this is apparently implemented as a 32-bit load/store).

On A64 in 32-bit mode, the LSU can do 2 64-bit stores/cycle, but this only
helps using XMM registers -- 32-bit accesses apparently never get combined
at this level, so they always go at half speed.

The asymmetric load/store capabilities on the AXP can't be good for using
128-bit XMM registers, especially using 8 loads followed by 8 stores in
the instruction stream.  They make 128-bit stores want to take twice as
long as 2 64-bit ones, and it takes a large amount of out-of-order
execution to reorder things to reach the "1 64-bit load and 1 64-bit
store" per cycle pattern.  Reordering is exactly what you don't want for
controlling the interleaving of access to main memory.  Even with a
simple load-store-load-store pattern with 64-bit accesses in the
instruction stream, there will be some reordering -- the reads will get
1 ahead of the writes.  This explains why my movdqa benchmark is slow on
AXP but not why it is slow on A64.

Benchmark output on AXP, fully cached case:
% copy0: 5275909727 B/s ( 776359 us) (1417037051 tsc) (movsl)
% ...
% copyD: 6301722666 B/s ( 649981 us) (1175886987 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
% copyH: 4336676894 B/s ( 944502 us) (1708211441 tsc) (movntps)
% ...
% copyK: 8750902650 B/s ( 468066 us) (846786401 tsc) (movq)
% ...
% copyN: 3029184746 B/s (1352179 us) (2445280543 tsc) (movntq)
% ...
% copyQ: 5487174266 B/s ( 746468 us) (1350423254 tsc) (movdqa)

This shows:
- movsl is quite slow
- even the old kernel bcopy is barely worth using on AXP
- movntps isn't too bad in the fully cached case (AXP implementation
   detail)
- movq (through 64-bit MMX) is quite fast, but not quite twice as fast
   as movsl.  I thought that mainly instruction execution bandwidth limits
   prevented it being twice as fast, but maybe it is affected by the
   asymmetry.  It does load-load-store-store repeated through 4 pairs
   of different MMX registers.  I think it was written before I knew
   anything about the CPU resource issues (I keep some mistakes for
   comparison).  Now I think using different registers is unnecessary,
   and the limit of 8 MMX or 16 XMM registers need not be hit.
   load-store repeated to the same register, or load-store-load-store
   repeated actually uses a large machine-dependent number of registers
   via register renaming.  Even old AXP's have this.  I think the
   register renaming works across loop boundaries and gives you many
   more than the 8 registers that you might think you are using.  This
   goes with out of order execution and might be controllable only in
   the same way, by stalling at page boundaries.
- movntq is much slower than movntps.  movntps does 4 128-bit loads
   followed by 4 128-bit stores.  movntq does 8 64-bit loads followed
   by 8 64-bit stores.  This seems to show that 128-bit accesses are
   faster.
- movdqa is quite slow.  It uses load-store... of a single 128-bit
   register.  Perhaps it is too optimized away from using multiple
   registers.  In fact, rewriting it to use 8 loads followed by 8
   stores shows one case where it beats movnti (64 bits) -- in the
   fully-L2-cached case, this speeds it up from 4.37GB/S to 4.77GB/S,
   making it about 0.4GB/S faster than both the other version of itself
   and movnti.  But in the fully-L1-cached case, this slows it down
   from 13.7GB/S to 9.2GB/S, making it much slower than itself and
   almost twice as slow as movsq.  This is hard to explain, especially
   for the L1 case.  In the L2 case, it might be explained by the 8
   loads followed by 8 stores not being reordered very much, so that
   they work almost like you want (see the sketch after this list).
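
The two movdqa schedulings discussed in the last item are, in intrinsic
form (the benchmarks themselves are assembler, and a compiler is free to
reschedule this, so take it only as a sketch):

  #include <emmintrin.h>

  /* 128 bytes as 8 movdqa loads followed by 8 movdqa stores (the
   * variant that was faster in the fully-L2-cached case). */
  static void
  copy128_8loads_8stores(__m128i *d, const __m128i *s)
  {
      __m128i x0 = _mm_load_si128(s + 0), x1 = _mm_load_si128(s + 1);
      __m128i x2 = _mm_load_si128(s + 2), x3 = _mm_load_si128(s + 3);
      __m128i x4 = _mm_load_si128(s + 4), x5 = _mm_load_si128(s + 5);
      __m128i x6 = _mm_load_si128(s + 6), x7 = _mm_load_si128(s + 7);

      _mm_store_si128(d + 0, x0); _mm_store_si128(d + 1, x1);
      _mm_store_si128(d + 2, x2); _mm_store_si128(d + 3, x3);
      _mm_store_si128(d + 4, x4); _mm_store_si128(d + 5, x5);
      _mm_store_si128(d + 6, x6); _mm_store_si128(d + 7, x7);
  }

  /* 128 bytes as load-store pairs through a single register (the
   * variant that was faster in the fully-L1-cached case). */
  static void
  copy128_loadstore(__m128i *d, const __m128i *s)
  {
      int i;

      for (i = 0; i < 8; i++)
          _mm_store_si128(d + i, _mm_load_si128(s + i));
  }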

Integer pipeline resources:

Both AXP and A64 have 3 integer pipelines (they are remarkably similar
in this and many other respects).  It's easy to do a load/store in
parallel using 2 of the pipelines.  This saturates the LSU (loads must
be at least 1 in advance of stores).  The load/store can also be from
FPU/SSE pipelines, but not independently, and on AXP the integer
pipelines lose LSU bandwidth since they can only do 32-bit accesses.
Other integer operations can run in the 3rd pipeline, but there are
many restrictions which I don't quite understand or remember.  These
result in it being hard to keep the 3rd pipeline full.   The above
movq (i) benchmark for A64 indicates that it may be being used to hide
all the loop control overhead -- I only unrolled the loop to 64 bytes,
and that gives almost full speed.  The loop must be unrolled a bit
since a single load-store runs too fast for loop control.  For the
movnti benchmarks, the load-store runs slower than the loop control
so unrolling is not needed.
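
The movq (i) loop discussed above is essentially the following, unrolled
to 64 bytes (a sketch; the benchmark itself is assembler):

  #include <stddef.h>
  #include <stdint.h>

  /* 64-bit integer copy unrolled to 64 bytes; the loop control overhead
   * may be hidden by the third integer pipeline. */
  static void
  copy_movq_i(uint64_t *d, const uint64_t *s, size_t len)
  {
      while (len >= 64) {
          d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
          d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
          s += 8; d += 8; len -= 64;
      }
  }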

Bruce