:Fully cached case:
:% copy0: 16907105855 B/s ( 242265 us) (movsq)
:% ...
:% copyQ: 13658478027 B/s ( 299887 us) (movdqa)
:% ...
:% copyT: 16456408196 B/s ( 248900 us) (movq (i))
:
:This doesn't contradict your claim since main memory is not really involved.
:Everything should be limited by instruction and cache bandwidth, but for
:some reason movdqa is quite slow despite or because of my attempts to
:schedule it perfectly.

Well, that's one of the problems.  Actually, two of the problems.

First, prefetching tends to add more confusion than it solves when used in a copy loop, because for all intents and purposes you are already keeping the memory pipeline full simply by the fact that the writes are buffered.  By the time the loop gets back around to the read it has to stall anyway, because the writes are still being pushed out to memory.  Read-before-write ordering doesn't help either; all that accomplishes is to force the write buffer to stay perpetually full (possibly in a non-optimal fashion), and the cpu stalls anyway.  I've tried every combination of prefetching and non-temporal instructions imaginable and it has reduced performance more often than it has improved it.

I can write an optimal copy loop for particular situations, but the problem is that the same loop becomes extremely inefficient under any *OTHER* conditions.  The behavior of any given copy algorithm is radically different if either the source or the target is present in the L1 or L2 cache versus if it isn't.  The algorithm one uses to do tiny, mostly in-cache copies is totally different from the algorithm one uses to do large bulk copies.  The algorithm one uses when the source data is partially or fully cached is totally different from the one used if the target data is partially or fully cached, or if neither is cached, or if both are cached.

:Fully uncached case:
:% copy0: 829571300 B/s ( 493749 us) (movsq)
:% ...
:% copyQ: 835822845 B/s ( 490056 us) (movdqa)
:% copyR: 908423544 B/s ( 450891 us) (movdqa with prefetchnta)
:% copyS: 919026496 B/s ( 445689 us) (movdqa with block prefetch)
:% copyT: 828993732 B/s ( 494093 us) (movq (i))
:% copyU: 905141362 B/s ( 452526 us) (movq (i) with prefetchnta)
:% copyV: 910626945 B/s ( 449800 us) (movq (i) with block prefetch)
:% copyW: 1350397932 B/s ( 303318 us) (movnti)
:% copyX: 1662594069 B/s ( 246362 us) (movnti with prefetchnta)
:% copyY: 1608204355 B/s ( 254694 us) (movnti with block prefetch)
:
:Now movdqa beats movsq by a bit, at least with prefetch, but has the
:same speed as integer moves of half the size (or MMX moves of half the
:size (benchmark data not shown)).  It is necessary to use movnt to get
:anywhere near memory bandwidth, and any form of movnt can be used for
:this (I test mainly movntps in addition to the above, since it is most
:portable).  There is apparently some magic to combine 8-byte writes
:if necessary for efficient main memory accesses.  The above was run
:on an old A64 overclocked a lot to 2204 MHz, with mediocre DDR-1 PC3200
:memory overclocked a bit and tuned a lot.  The maximum bandwidth
:achieved is almost exactly PC3200 but the theoretical maximum is more
:like PC3400.

But you can't safely use *ANY* non-temporal instruction unless you know, specifically, that the data is uncached.  If you use a non-temporal instruction in a situation where the data is cached or can be easily cached, you destroy the performance for that case by causing the cpu not to cache data that it really should cache.
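To make the distinction concrete, here is a minimal sketch of the two store strategies (C with SSE2 intrinsics rather than the actual kernel assembly; the function names and the 16-byte-aligned, multiple-of-16 length assumptions are illustrative only).  The cached variant leaves the destination in the cache; the non-temporal variant deliberately bypasses it, which is exactly the wrong thing to do if the data is about to be referenced again.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Both routines assume dst/src are 16-byte aligned and len is a multiple of 16. */

    static void
    copy_cached(void *dst, const void *src, size_t len)
    {
        __m128i *d = dst;
        const __m128i *s = src;

        for (; len >= 16; len -= 16)
            _mm_store_si128(d++, _mm_load_si128(s++));   /* movdqa load, movdqa store */
    }

    static void
    copy_uncached(void *dst, const void *src, size_t len)
    {
        __m128i *d = dst;
        const __m128i *s = src;

        for (; len >= 16; len -= 16)
            _mm_stream_si128(d++, _mm_load_si128(s++));  /* movdqa load, movntdq store */
        _mm_sfence();                                    /* drain the NT write buffers */
    }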
Use of non-temporal instructions results in a non-self-healing algorithm... it is possible to get into a situation that is permanently non-optimal.  That's why I don't use those instructions.  Sometimes I wish the non-temporal instructions had a random component, that is, that they randomly operated as normal moves instead of NT moves for a small percentage of executions (~1-25%) in order to create a self-healing (performance-wise) algorithm.

:>
:> Alignment is critical.  If the data is not aligned, don't bother.  128
:> bits means 16 byte alignment.
:
:The above benchmark output is for aligned data :-).  I don't try hard to
:optimize or benchmark misaligned cases.

Neither do I.  I expect that if a program wants to copy a large amount of data, the programmer is smart enough to align it.
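A minimal sketch of that kind of alignment gate (illustrative C, not DragonFly's actual bcopy; copy_cached() is the routine from the sketch above and the fallback is a plain memcpy): take the XMM path only when source, destination, and length are all 16-byte aligned.

    #include <stdint.h>
    #include <string.h>

    static void
    bcopy_sketch(const void *src, void *dst, size_t len)
    {
        if ((((uintptr_t)src | (uintptr_t)dst | len) & 15) == 0)
            copy_cached(dst, src, len);     /* all 16-byte aligned: XMM path */
        else
            memcpy(dst, src, len);          /* otherwise: generic fallback */
    }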
:...
:- movnt* is a clear win in at least one case: zeroing pages in idle.
:  FreeBSD on i386 uses movnti for all zeroing of pages in the SSE2 case.
:  This might not be best for non-idle zeroing.  Maybe it should be used
:  only for multiple pages.  For single pages, it is probably for a
:  pagefault and you want at least the data near the fault address in
:  the cache.  FreeBSD on amd64 also uses movnti for all copying of pages.

Use of non-temporal instructions for non-performance-critical operations is a clear win when executed from the idle loop, I agree.  Use of NT instructions in any other case is difficult because it is virtually impossible to predict the conditions under which those other cases occur.

:I think SunOS on amd64 uses movnt* whenever the copy size is >= 128.  This
:seems too small.
:
:> I've given up trying to use either mechanism.  Instead, I prefer
:> copying as large a block as possible to remove these variables from
:> the cpu pipeline entirely.  The cpu has a write fifo anyway, you
:> don't need prefetch instructions if you can use instructions to write
:> to memory faster than available L2 cache bandwidth.  On some cpus
:> this mandates the use of 64 or 128 bit media instructions or the
:> cpu can't keep the write FIFO full and starts interleaving reads
:> and writes on the wrong boundaries (creating more RAS cycles, which
:> is what kills copy bandwidth).
:
:Which modern CPUs can't keep up with L2?

The instruction pipeline is the problem, not the L2 cache bandwidth.  Things get a bit weird when every single instruction is reading or writing memory.  I try to design algorithms that work reasonably well across a large swath of cpus.  It turns out that algorithms which do block loads followed by block stores tend to maintain consistent and good performance across that entire range.  I outline the reasons why later on.

:On amd64 for the fully-L2-cached case:
:% copy0: 4382675165 B/s ( 934589 us) (movsq)
:% ...
:% copyN: 3432351420 B/s (1193351 us) (movntq)
:% copyO: 1777008820 B/s (2304997 us) (movntq with prefetchnta)
:% copyP: 3278961491 B/s (1249176 us) (movntq with block prefetch)
:% copyQ: 4369052686 B/s ( 937503 us) (movdqa)
:% copyR: 2785886655 B/s (1470268 us) (movdqa with prefetchnta)
:% copyS: 4553271385 B/s ( 899573 us) (movdqa with block prefetch)
:% copyT: 4401466152 B/s ( 930599 us) (movq (i))
:% copyU: 2817507895 B/s (1453767 us) (movq (i) with prefetchnta)
:% copyV: 4523562615 B/s ( 905481 us) (movq (i) with block prefetch)
:% copyW: 3432429080 B/s (1193324 us) (movnti)
:% copyX: 1776507852 B/s (2305647 us) (movnti with prefetchnta)
:% copyY: 3136767185 B/s (1305803 us) (movnti with block prefetch)
:
:So here is a case where prefetchnta is bad.  OTOH, movnti is quite fast
:provided prefetchnta is not used with it.  movnti is even more competitive
:with L2 if main memory is DDR-2.

It also depends *WHERE* in the loop you put the prefetchnta.  I was able to get fairly close to block prefetching speeds using prefetchnta by carefully locating the instruction and adjusting the read-ahead point, for example by locating the instruction before the write side and indexing the prefetched memory address 128 bytes ahead of the curve.  The placement of the instruction is sensitive down to the instruction slot... moving it one slot forward or back can result in crazy differences in performance (due to forced RAS cycles, I believe, in particular with prefetchnta).

So there are two variables: how far ahead you try to prefetch (which depends heavily on the memory architecture, banking, instruction pipeline size, etc.), and where you locate the instruction.  Two impossible variables.  I could optimize a particular case, but wound up nerfing every *OTHER* case.  I finally gave up.
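A sketch of the sort of placement being described (illustrative C with intrinsics, not the tuned assembly; assumes 16-byte aligned pointers and a length that is a nonzero multiple of 64): one prefetchnta per 64-byte chunk, issued just before the store side and indexed 128 bytes ahead of the current read position.  The 128-byte lookahead and the exact slot the prefetch occupies are precisely the fragile tuning knobs in question.

    #include <emmintrin.h>
    #include <stddef.h>

    /* dst/src 16-byte aligned, len a nonzero multiple of 64. */
    static void
    copy_prefetchnta(void *dst, const void *src, size_t len)
    {
        __m128i *d = dst;
        const __m128i *s = src;

        for (; len >= 64; len -= 64) {
            __m128i r0 = _mm_load_si128(s + 0);
            __m128i r1 = _mm_load_si128(s + 1);
            __m128i r2 = _mm_load_si128(s + 2);
            __m128i r3 = _mm_load_si128(s + 3);
            /* Prefetch 128 bytes ahead, placed just before the write side. */
            _mm_prefetch((const char *)(s + 8), _MM_HINT_NTA);
            _mm_store_si128(d + 0, r0);
            _mm_store_si128(d + 1, r1);
            _mm_store_si128(d + 2, r2);
            _mm_store_si128(d + 3, r3);
            s += 4;
            d += 4;
        }
    }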
:...
:> But at this point the alignment requirements start to get kinda
:> silly.  128 byte alignment requirement?  I don't think so.  I
:> do a 16-byte alignment check in DragonFly as a pre-req for using
:> XMM and that's it.
:
::-).  There are just too many combinations to even think about.  I prefer
:to reduce the number of cases by depending on the prefetching working
:reasonably well.

Yup.

:> I don't know about AMD64.  You only have 64 bit general registers in 64
:> bit mode so you may not be able to keep the write pipeline full.  But
:> you do have 8 of them so you are roughly equivalent to MMX (but not
: 16, almost
:> XMM's 8 128 bit registers).
:
:On A64 and AXP, most things are limited by the load/store unit (LSU),
:the 3 integer pipelines, and SSE latency (but latency is not a problem
:here).
:
:Load/store-unit:
:
: (lots of good information about load-store units)

Well, but these are very specific flavor-of-the-day hardware tweaks.  The designers change these parameters literally every other month.  All that can really be said here is: avoid 32 bit accesses if you can, 64 bit accesses seem to be the most dependable, and 128 bit accesses are the up-and-coming thing.  In particular, I remember reading that AMD had been spending a lot of time optimizing XMM register loads and stores.  I presume that Intel is doing the same thing since both are going face forward into gaming.

I'm not surprised at all about movs* performance.  That instruction was originally microcoded.  AMD tried harder than Intel to optimize it, but it still doesn't work very well simply due to instruction pipeline constraints.  The hardware goes nuts trying to optimize movs*.

:The asymmetric load/store capabilities on the AXP can't be good for using
:128-bit XMM registers, especially using 8 loads followed by 8 stores in
:the instruction stream.  They make 128-bit stores want to take twice as
:long as 2 64-bit ones, and it takes a large amount of out of order execution
:to reorder things to reach the "1 64-bit load and 1 64-bit store" per
:cycle pattern.  Reordering is exactly what you don't want for controlling
:the interleaving of access to main memory.  Even with a simple load-store-
:load-store with 64-bit accesses in the instruction stream, there will
:be some reordering -- the reads will get 1 ahead of the writes.
:This explains why my movdqa benchmark is slow on AXP but not why it is
:slow on A64.

The advantages of doing N loads followed by N stores (where N >= 4 and each operates on 64 or 128 bit entities) are as follows:

* There is no instruction or register reordering (or very little) because the data loaded by any given instruction is not used until, e.g., 8 instructions later.

* The write buffer is ordered (either through to the L2 cache or to main memory, it doesn't matter).  This enforces a degree of resynchronization of the instruction stream.  That is, it creates a self-healing situation that makes it possible to write code that works very well under all sorts of conditions.

* The read instructions should theoretically tend not to be reordered, simply because the memory pipeline imposes a stall due to handling previously buffered writes and because of the burst nature of linear reads from main memory.  Even if the cpu DID reorder the reads, it can only actually retire instructions as the data actually comes in from main memory.  Such data will always be fairly well ordered in either 0-1-2-3 or 3-2-1-0 bursts, or at minimum in cache-line bursts.  Some reordering might occur if some of the read instructions access cached data, but even in that case I just don't think it happens.  There is simply nothing there for the cpu to reorder.

* I don't think the cpu could reorder reads, or reorder reads and writes, under these conditions even if it wanted to, and we don't want it to.

* Write combining is overrated.  It serves an important purpose, but it is a bad idea to actually depend on it unless your algorithm is self-healing from the point of view of resynchronizing the pipeline and/or memory read and write ordering.  Instruction pipelines and cache pipelines are quite evenly matched these days (usually no more than 1:2 in performance), so depending on write combining will actually cause chain stalls in the cache pipeline (not using a cache cycle that was potentially available to use) when the instruction pipeline stalls on a memory read, because the cpu wanted to try to combine the writes.  These are boundary conditions for sure.  It is best to remember that the instruction pipeline is stalling several times on every loop due to main memory accesses.  You can't avoid it, but you can try to manage it.

I am not familiar enough with the architectures to know whether the cpu does burst reads from main memory into the cache... e.g. 4 cache line bursts at a time.  My assumption has always been that the cpu does do burst reads, because it's so easy to do.  Any sort of burst reading into the cache by the cpu only creates an advantage for this architecture, because there are no dependencies between instructions for 8 instruction slots and because it enforces an ordering on both read and write instructions that tends to self-heal the algorithm from a performance standpoint.

So, in summary... the reason I use block loads and stores (sequences of 8 MMX or XMM loads followed by 8 MMX or XMM stores) is that these sorts of inner loops appear to generate very good results under all operating conditions.  When I have attempted to use prefetcha, prefetchnta, or non-temporal moves (movnt* instead of movdq*), the result has been better performance only under certain very carefully controlled circumstances, and terrible performance otherwise.
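For reference, a minimal sketch of that inner loop (illustrative C with SSE2 intrinsics rather than the hand-written assembly; assumes 16-byte aligned pointers and a length that is a multiple of 128): eight 16-byte loads followed by eight 16-byte stores per iteration, so no loaded value is consumed until eight instruction slots after it was loaded.

    #include <emmintrin.h>
    #include <stddef.h>

    /* dst/src 16-byte aligned, len a multiple of 128. */
    static void
    copy_block8(void *dst, const void *src, size_t len)
    {
        __m128i *d = dst;
        const __m128i *s = src;

        for (; len >= 128; len -= 128) {
            /* 8 loads... */
            __m128i r0 = _mm_load_si128(s + 0);
            __m128i r1 = _mm_load_si128(s + 1);
            __m128i r2 = _mm_load_si128(s + 2);
            __m128i r3 = _mm_load_si128(s + 3);
            __m128i r4 = _mm_load_si128(s + 4);
            __m128i r5 = _mm_load_si128(s + 5);
            __m128i r6 = _mm_load_si128(s + 6);
            __m128i r7 = _mm_load_si128(s + 7);
            /* ...followed by 8 stores. */
            _mm_store_si128(d + 0, r0);
            _mm_store_si128(d + 1, r1);
            _mm_store_si128(d + 2, r2);
            _mm_store_si128(d + 3, r3);
            _mm_store_si128(d + 4, r4);
            _mm_store_si128(d + 5, r5);
            _mm_store_si128(d + 6, r6);
            _mm_store_si128(d + 7, r7);
            s += 8;
            d += 8;
        }
    }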
IMHO, 64 bit loads and stores are plenty good enough for today's cpus, but it is also clear that 128 bit loads and stores will be better for tomorrow's cpus, and they work at least as well on today's cpus as 64 bit loads and stores, at least insofar as 128 bit registers can be made available to the kernel.

--

It should be possible to use MMX/XMM registers in the kernel simply as call-saved registers, by issuing an FWAIT or equivalent at the user-kernel boundary to take care of any exceptions (instead of saving the state and calling clts ; fnclex).  Presumably, by the time the cpu actually gets the integer and cpu state context pushed, any FP operations will have long since completed.  But I haven't timed it to find out whether that actually works.

The nice thing about most MMX/XMM media instructions is that they don't actually care what state the FP unit is in, because they aren't actually doing any FP math.  It is just a matter of synchronizing the free FP registers from some prior user FP operation which may not have posted results yet.

If this IS possible, then it removes most of the NPXTHREAD junk from the kernel bcopy path for the case where NPXTHREAD != NULL, giving you the ability to use the FP optimally whether NPXTHREAD == NULL or NPXTHREAD != NULL.  And, in such a case, it might be more beneficial for small copies to only save two or four XMM registers instead of all eight.

What do you think?
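As a rough illustration of the "save only a couple of XMM registers" idea (GCC-style inline assembly, purely hypothetical; this is not kernel code, the name is made up, and it ignores preemption and the pending-exception issue entirely): spill %xmm0/%xmm1 to the stack, use them for the copy, and restore them afterward.

    #include <stddef.h>
    #include <stdint.h>

    /* dst/src 16-byte aligned, len a nonzero multiple of 32.  amd64 + GCC/Clang only. */
    static void
    small_copy_two_xmm(void *dst, const void *src, size_t len)
    {
        uint8_t save[32] __attribute__((aligned(16)));

        __asm__ __volatile__(
            "movdqa %%xmm0, (%[sv])\n\t"    /* preserve the caller's xmm0/xmm1 */
            "movdqa %%xmm1, 16(%[sv])\n\t"
            "1:\n\t"
            "movdqa (%[s]), %%xmm0\n\t"     /* copy 32 bytes per iteration */
            "movdqa 16(%[s]), %%xmm1\n\t"
            "movdqa %%xmm0, (%[d])\n\t"
            "movdqa %%xmm1, 16(%[d])\n\t"
            "addq $32, %[s]\n\t"
            "addq $32, %[d]\n\t"
            "subq $32, %[n]\n\t"
            "jnz 1b\n\t"
            "movdqa (%[sv]), %%xmm0\n\t"    /* restore the saved registers */
            "movdqa 16(%[sv]), %%xmm1\n\t"
            : [s] "+r" (src), [d] "+r" (dst), [n] "+r" (len)
            : [sv] "r" (save)
            : "memory", "cc");
    }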
-Matt
Matthew Dillon <dillon_at_backplane.com>

:Bruce