Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs

From: Matthew Dillon <dillon_at_apollo.backplane.com>
Date: Thu, 18 Jan 2007 11:48:39 -0800 (PST)
:Fully cached case:
:% copy0: 16907105855 B/s ( 242265 us) (movsq)
:% ...
:% copyQ: 13658478027 B/s ( 299887 us) (movdqa)
:% ...
:% copyT: 16456408196 B/s ( 248900 us) (movq (i))
:
:This doesn't contradict your claim since main memory is not really involved.
:Everything should be limited by instruction and cache bandwidth, but for
:some reason movdqa is quite slow despite or because of my attempts to
:schedule it perfectly.

    Well, that's one of the problems.  Actually two of the problems.
    First, prefetching tends to add more confusion than it solves when
    used in a copy loop, because for all intents and purposes you are
    already keeping the memory pipeline full simply by the fact that
    the writes are buffered.  By the time the loop gets back around to
    the read it has to stall anyway because the writes are still being
    pushed out to memory.  Read-before-write ordering doesn't help,
    because all that accomplishes is to force the write buffer to stay
    perpetually full (possibly in a non-optimal fashion), and the cpu
    stalls anyway.

    I've tried every combination of prefetching and non-temporal 
    instructions imaginable and it has reduced performance more often
    than it has improved it.  I can write an optimal copy loop for
    particular situations but the problem is that the same loop becomes
    extremely inefficient under any *OTHER* conditions.

    The behavior of any given copy algorithm is radically different
    if either the source or target is present in the L1 or L2 caches,
    versus if they aren't.  The algorithm one uses to do tiny mostly
    in-cache copies is totally different from the algorithm one uses
    to do large bulk copies.  The algorithm one uses where the
    source data is partially or fully cached is totally different from the
    one used if the target data is partially or fully cached, or if neither
    is cached, or if both are cached.

:Fully uncached case:
:% copy0:  829571300 B/s ( 493749 us) (movsq)
:% ...
:% copyQ:  835822845 B/s ( 490056 us) (movdqa)
:% copyR:  908423544 B/s ( 450891 us) (movdqa with prefetchnta)
:% copyS:  919026496 B/s ( 445689 us) (movdqa with block prefetch)
:% copyT:  828993732 B/s ( 494093 us) (movq (i))
:% copyU:  905141362 B/s ( 452526 us) (movq (i) with prefetchnta)
:% copyV:  910626945 B/s ( 449800 us) (movq (i) with block prefetch)
:% copyW: 1350397932 B/s ( 303318 us) (movnti)
:% copyX: 1662594069 B/s ( 246362 us) (movnti with prefetchnta)
:% copyY: 1608204355 B/s ( 254694 us) (movnti with block prefetch)
:
:Now movdqa beats movsq by a bit, at least with prefetch, but has the
:same speed as integer moves of half the size (or MMX moves of half the
:size (benchmark data not shown)).  It is necessary to use movnt to get
:anywhere near memory bandwidth, and any form of movnt can be used for
:this (I test mainly movntps in addition to the above, since it is most
:portable).  There is apparently some magic to combine 8-byte writes
:if necessary for efficient main memory accesses.  The above was run
:on an old A64 overclocked a lot to 2204 MHz, with mediocre DDR-1 PC3200
:memory overclocked a bit and tuned a lot.  The maximum bandwidth
:achieved is almost exactly PC3200 but the theoretical maximum is more
:like PC3400.

    But you can't safely use *ANY* non-temporal instruction unless you
    know, specifically, that the data is uncached.  If you use a non-temporal
    instruction in a situation where the data is cached or can be easily
    cached, you destroy the performance for that case by causing the cpu
    not to cache data that it really should cache.  Use of non-temporal
    instructions results in a non-self-healing algorithm... it is possible
    to get into a situation that is permanently non-optimal.  That's why
    I don't use those instructions.

    Sometimes I wish the non-temporal instructions had a random component.
    That is, that they randomly operated as normal moves instead of nt
    moves for a small percentage of executions (~1-25%), in order to create
    a self-healing (performance-wise) algorithm.
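
    Something like the following userland sketch of that idea, using SSE2
    intrinsics (the helper name, the ~12% mix ratio, and the alignment
    assumptions are purely illustrative; no such instruction or kernel
    routine actually exists):

        #include <emmintrin.h>
        #include <stdlib.h>
        #include <stddef.h>

        /*
         * Copy with mostly non-temporal stores, but let a small random
         * fraction of the stores go through the cache as normal stores
         * so that data which turns out to be hot can re-warm the cache
         * (performance-wise self-healing).  Assumes 16-byte aligned
         * buffers and len a multiple of 16; rand() in the loop is only
         * for illustration.
         */
        static void
        copy_nt_selfheal(void *dst, const void *src, size_t len)
        {
            __m128i *d = dst;
            const __m128i *s = src;

            for (size_t i = 0; i < len / 16; ++i) {
                __m128i v = _mm_load_si128(s + i);
                if ((rand() & 7) == 0)          /* ~12%: normal cached store */
                    _mm_store_si128(d + i, v);
                else                            /* rest: bypass the cache */
                    _mm_stream_si128(d + i, v);
            }
            _mm_sfence();                       /* order the NT stores */
        }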

:>
:>      Alignment is critical.  If the data is not aligned, don't bother.  128
:>      bits means 16 byte alignment.
:
:The above benchmark output is for aligned data :-).  I don't try hard to
:optimize or benchmark misaligned cases.

    Neither do I.  I expect that if a program wants to copy a large amount
    of data, the programmer is smart enough to align it.

:...
:- movnt* is a clear win in at least one case: zeroing pages in idle.
:   FreeBSD on i386 uses movnti for all zeroing of pages in the SSE2 case.
:   This might not be best for non-idle zeroing.  Maybe it should be used
:   only for multiple pages.  For single pages, it is probably for a
:   pagefault and you want at least the data near the fault address in
:   the cache.  FreeBSD on amd64 also uses movnti for all copying of pages.

    Use of non-temporal instructions for non-performance-critical operations
    is a clear win when executed from the idle loop, I agree.  Use of NT
    instructions in any other case is difficult because it is virtually 
    impossible to predict the conditions under which other cases occur.
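
    For reference, a rough userland approximation of that kind of idle-time
    page zeroing, using non-temporal (movnti-style) stores via SSE2
    intrinsics.  The function name and the flat 4096-byte PAGE_SIZE here
    are illustrative; the real routines are written in assembly in the
    platform support code:

        #include <emmintrin.h>
        #include <stddef.h>

        #define PAGE_SIZE 4096

        /*
         * Zero one page with non-temporal stores: the zeros go out
         * through the write-combining buffers without displacing
         * anything useful in the L1/L2 caches.  Pages are naturally
         * 16-byte aligned.
         */
        static void
        pagezero_nt(void *page)
        {
            __m128i zero = _mm_setzero_si128();
            __m128i *p = page;

            for (size_t i = 0; i < PAGE_SIZE / sizeof(__m128i); ++i)
                _mm_stream_si128(p + i, zero);
            _mm_sfence();
        }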

:I think SunOS on amd64 uses movnt* whenever the copy size is >= 128.  This
:seems too small.
:
:>      I've given up trying to use either mechanism.  Instead, I prefer
:>      copying as large a block as possible to remove these variables from
:>      the cpu pipeline entirely.  The cpu has a write fifo anyway, you
:>      don't need prefetch instructions if you can use instructions to write
:>      to memory faster than available L2 cache bandwidth.  On some cpus
:>      this mandates the use of 64 or 128 bit media instructions or the
:>      cpu can't keep the write FIFO full and starts interleaving reads
:>      and writes on the wrong boundaries (creating more RAS cycles, which
:>      is what kills copy bandwidth).
:
:Which modern CPUs can't keep up with L2?

    The instruction pipeline is the problem, not the L2 cache bandwidth.
    Things get a bit weird when every single instruction is reading or
    writing memory.  I try to design algorithms that work reasonably well
    across a large swath of cpus.  It turns out that algorithms which do
    block loads followed by block stores tend to maintain consistent and
    good performance across a large swath of cpus.  I outline the reasons
    why later on.

:On amd64 for the fully-L2-cached case:
:% copy0: 4382675165 B/s ( 934589 us) (movsq)
:% ...
:% copyN: 3432351420 B/s (1193351 us) (movntq)
:% copyO: 1777008820 B/s (2304997 us) (movntq with prefetchnta)
:% copyP: 3278961491 B/s (1249176 us) (movntq with block prefetch)
:% copyQ: 4369052686 B/s ( 937503 us) (movdqa)
:% copyR: 2785886655 B/s (1470268 us) (movdqa with prefetchnta)
:% copyS: 4553271385 B/s ( 899573 us) (movdqa with block prefetch)
:% copyT: 4401466152 B/s ( 930599 us) (movq (i))
:% copyU: 2817507895 B/s (1453767 us) (movq (i) with prefetchnta)
:% copyV: 4523562615 B/s ( 905481 us) (movq (i) with block prefetch)
:% copyW: 3432429080 B/s (1193324 us) (movnti)
:% copyX: 1776507852 B/s (2305647 us) (movnti with prefetchnta)
:% copyY: 3136767185 B/s (1305803 us) (movnti with block prefetch)
:
:So here is a case where prefetchnta is bad.  OTOH, movnti is quite fast
:provided prefetchnta is not used with it.  movnti is even more competitive
:with L2 if main memory is DDR-2.

    It also depends *WHERE* in the loop you put the prefetchnta.  I was
    able to get fairly close to block prefetching speeds using
    prefetchnta by carefully locating the instruction and adjusting 
    the read-ahead point.  For example, I located the instruction just
    before the write side and indexed the prefetched memory address
    128 bytes ahead of the read point.  The placement of the instruction is
    sensitive down to the instruction slot... moving it one forward or
    one back can result in crazy differences in performance (due to
    forced RAS cycles I believe, in particular with prefetchnta).
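
    To illustrate the placement, here is a userland sketch using SSE2
    intrinsics, 64 bytes per iteration.  The exact prefetch distance and
    the slot the prefetch sits in are the fragile parts, and the helper
    name and alignment assumptions are mine:

        #include <emmintrin.h>
        #include <stddef.h>

        /*
         * Copy 64 bytes per iteration with prefetchnta issued just
         * before the store side, pointing 128 bytes ahead of the
         * current read position.  Assumes 16-byte aligned buffers and
         * len a multiple of 64.  The prefetch may run past the end of
         * the source; it is only a hint and does not fault.
         */
        static void
        copy_prefetchnta(void *dst, const void *src, size_t len)
        {
            char *d = dst;
            const char *s = src;

            for (size_t off = 0; off + 64 <= len; off += 64) {
                __m128i a = _mm_load_si128((const __m128i *)(s + off));
                __m128i b = _mm_load_si128((const __m128i *)(s + off + 16));
                __m128i c = _mm_load_si128((const __m128i *)(s + off + 32));
                __m128i e = _mm_load_si128((const __m128i *)(s + off + 48));
                _mm_prefetch(s + off + 128, _MM_HINT_NTA);
                _mm_store_si128((__m128i *)(d + off), a);
                _mm_store_si128((__m128i *)(d + off + 16), b);
                _mm_store_si128((__m128i *)(d + off + 32), c);
                _mm_store_si128((__m128i *)(d + off + 48), e);
            }
        }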

    So there are two variables: how far ahead you try to prefetch (which
    depends heavily on the memory architecture, banking, instruction
    pipeline size, etc.), and where you locate the instruction.  Two
    impossible variables.

    I could optimize a particular case, but I wound up nerfing every *OTHER*
    case.  I finally gave up.

:...
:>      But at this point the alignment requirements start to get kinda
:>      silly.  128 byte alignment requirement?  I don't think so.  I
:>      do a 16-byte alignment check in DragonFly as a pre-req for using
:>      XMM and that's it.
:
::-).  There are just too many combinations to even think about.  I prefer
:to reduce the number of cases by depending on the prefetching working
:reasonably well.

    Yup.

:>    I don't know about AMD64.  You only have 64 bit general registers in 64
:>    bit mode so you may not be able to keep the write pipeline full.  But
:>    you do have 8 of them so you are roughly equivalent to MMX (but not
:                  16, almost
:>    XMM's 8 128 bit registers).
:
:On A64 and AXP, most things are limited by the load/store unit (LSU),
:the 3 integer pipelines, and SSE latency (but latency is not a problem
:here).
:
:Load/store-unit:
:
: (lots of good information about load-store units)

    Well, but these are for very specific flavor-of-the-day hardware tweaks.
    The designers change these parameters literally every other month.

    All that can really be said here is: Avoid 32 bit accesses if you can,
    64 bit accesses seem to be the most dependable, and 128 bit accesses
    are the up-and-coming thing.

    In particular I remember reading that AMD had been spending a lot 
    of time optimizing XMM register loads and stores.  I presume that
    Intel is doing the same thing since both are going face forward
    into gaming.

    I'm not surprised at all about movs* performance.  That instruction was
    originally microcoded.  AMD tried harder to optimize it than Intel,
    but it still doesn't work very well simply due to instruction pipeline
    constraints.  The hardware goes nuts trying to optimize movs*.

:The asymmetric load/store capabilities on the AXP can't be good for using
:128-bit XMM registers, especially using 8 loads followed by 8 stores in
:the instruction stream.  They make 128-bit stores want to take twice as
:long as 64-bit ones, and it takes a large amount of out-of-order execution
:to reorder things to reach the "1 64-bit load and 1 64-bit store" per
:cycle pattern.  Reordering is exactly what you don't want for controlling
:the interleaving of access to main memory.  Even with a simple load-store-
:load-store with 64-bit accesses in the instruction stream, there
:will be some reordering -- the reads will get 1 ahead of the writes.
:This explains why my movdqa benchmark is slow on AXP but not why it is
:slow on A64.

    The advantages of doing N loads followed by N stores (where N >= 4
    and operates on 64 or 128 bit entities) are as follows:

    * There is no instruction or register reordering (or very little)
      because the data loaded by any given instruction is not used until
      e.g. 8 instructions later.

    * The write buffer is ordered (either through to the L2 cache or to
      main memory, it doesn't matter).  This enforces a degree of 
      resynchronization of the instruction stream.  That is, it creates
      a self-healing situation that makes it possible to write code
      that works very well under all sorts of conditions.

    * The read instructions should theoretically tend not to be reordered
      simply because the memory pipeline imposes a stall due to handling
      previously buffered writes and because of the burst nature of linear
      reads from main memory.  Even if the cpu DID reorder the reads,
      it can only actually retire instructions as the data actually comes
      in from main memory.  Such data will always be fairly well ordered
      in either 0-1-2-3 or 3-2-1-0 bursts, or at the minimum cache-line
      bursts.

      Some reordering might occur if some of the read instructions access
      cached data, but even in that case I just don't think it happens.
      There is simply nothing there for the cpu to reorder.

    * I don't think the cpu could reorder reads or reorder reads and writes
      under these conditions even if it wanted to, and we don't want it
      to.

    * Write combining is overrated.  It serves an important purpose, but
      it is a bad idea to actually depend on it unless your algorithm
      is self-healing from the point of view of resynchronizing the 
      pipeline and/or memory read and write ordering.  Instruction
      pipelines and cache pipelines are quite evenly matched these
      days (usually no more than 1:2 performance) so depending on write
      combining will actually cause chain stalls in the cache pipeline
      (not using a cache cycle that was potentially available to use)
      when the instruction pipeline stalls on a memory read, because
      the cpu wanted to try to combine the writes.

      These are boundary conditions for sure.  It is best to remember 
      that the instruction pipeline is stalling several times on every
      loop due to main memory accesses.  You can't avoid it, but you can
      try to manage it.

    I am not familiar enough with the architectures to know whether the cpu
    does burst reads from main memory into the cache... e.g. 4 cache line
    bursts at a time.  My assumption has always been that the cpu does do
    burst reads, because it's so easy to do.  Any sort of burst reading 
    into the cache by the cpu only creates an advantage for this
    architecture because there are no dependencies between instructions
    for 8 instruction slots and because it enforces an ordering to both
    read and write instructions that tends to self-heal the algorithm
    from a performance standpoint.

    So, in summary... the reason I use block loads and stores (sequences of
    8 MMX or XMM loads followed by 8 MMX or XMM stores) is that these sorts
    of inner loops appear to generate very good results under all operating
    conditions.  When I have attempted to use prefetcha, prefetchnta, or
    non-temporal moves (movnt* instead of movdq*) the result has been 
    better performance only under certain very carefully controlled
    circumstances, and terrible performance otherwise.  IMHO, I do believe
    that 64 bit loads and stores are plenty good enough for today's cpus,
    but it is also clear that 128 bit loads and stores will be better for
    tomorrow's cpus and work at least as well on today's cpus as 64 bit
    loads and stores, at least insofar as 128 bit registers can be made
    available to the kernel.
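
    As a concrete illustration, the inner loop I am describing looks
    roughly like this in userland C with SSE2 intrinsics (the kernel
    versions are hand-written assembly and also deal with the FP context;
    the function name and alignment assumptions here are illustrative):

        #include <emmintrin.h>
        #include <stddef.h>

        /*
         * 8 aligned 128-bit loads followed by 8 aligned 128-bit stores,
         * 128 bytes per iteration, no prefetch and no non-temporal
         * stores.  No loaded register is consumed until all eight reads
         * have been issued, and the stores drain through the write
         * buffer in order.  Assumes 16-byte aligned buffers and len a
         * multiple of 128.
         */
        static void
        copy_block128(void *dst, const void *src, size_t len)
        {
            __m128i *d = dst;
            const __m128i *s = src;

            for (size_t i = 0; i < len / 16; i += 8) {
                __m128i r0 = _mm_load_si128(s + i + 0);
                __m128i r1 = _mm_load_si128(s + i + 1);
                __m128i r2 = _mm_load_si128(s + i + 2);
                __m128i r3 = _mm_load_si128(s + i + 3);
                __m128i r4 = _mm_load_si128(s + i + 4);
                __m128i r5 = _mm_load_si128(s + i + 5);
                __m128i r6 = _mm_load_si128(s + i + 6);
                __m128i r7 = _mm_load_si128(s + i + 7);
                _mm_store_si128(d + i + 0, r0);
                _mm_store_si128(d + i + 1, r1);
                _mm_store_si128(d + i + 2, r2);
                _mm_store_si128(d + i + 3, r3);
                _mm_store_si128(d + i + 4, r4);
                _mm_store_si128(d + i + 5, r5);
                _mm_store_si128(d + i + 6, r6);
                _mm_store_si128(d + i + 7, r7);
            }
        }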

    --

    It should be possible to use MMX/XMM registers in the kernel simply as
    call-saved registers by issuing a FWAIT or equivalent at the user-kernel
    boundary to take care of any exceptions (instead of saving the state
    and calling clts ; fnclex).  Presumably by the time the cpu actually
    gets the integer and cpu state context pushed any FP operations will
    have long since completed.  But I haven't timed it to find out whether
    that actually works.

    The nice thing about most MMX/XMM media instructions is that they don't
    actually care about what state the FP unit is in because they aren't 
    actually doing any FP math.  It is just a matter of synchronizing the
    free FP registers from some prior user FP operation which may not have
    posted results yet.

    If this IS possible then it removes most of the NPXTHREAD junk from the
    kernel bcopy path for the case where NPXTHREAD != NULL, giving you the
    ability to use the FP optimally whether NPXTHREAD == NULL or
    NPXTHREAD != NULL.  And, in such a case, it might be more beneficial
    for small copies to only save two or four XMM registers instead of
    all eight.

    What do you think?

					-Matt
					Matthew Dillon 
					<dillon_at_backplane.com>

:Bruce
: