On Wed, 17 Jan 2007, Matthew Dillon wrote:

> The cost of using the FPU can simply be thought of in terms of how
> many bytes you have to have to copy for it to become worth using
> the FPU over a far less complex integer copy loop.

It's not that simple...

> ...
> In fact, I would say that if userland is not using the FP unit,
> that is npxthread == NULL or npxthread != curthread, you should
> *DEFINITELY* use the FP unit.  Hands down, no question about it.

Only if the raw FPU is actually faster.

> * First, raw memory bandwidth is governed by RAS cycles.  The fewer RAS
>   cycles you have, the higher the bandwidth.
>
>   This means that the more data you can load into the cpu on the 'read'
>   side of the copy before transitioning to the 'write' side, the better.
>
>   With XMM you can load 128 *BYTES* a shot (8 128 bit registers).  For
>   large copies, nothing beats it.

No, this is very machine-dependent.  Read-write transitioning is slow, but it is almost a non-problem, since it can be avoided by prefetching the read side.  Dirtying the cache with the read side is unavoidable on current i386-amd64 CPUs.  Prefetching makes this a feature:

    prefetch N bytes, preferably using a working prefetchnta (1)
    write N bytes, possibly to the cache (2), possibly much less than one
        XMM register's worth at a time (3)

Just make N much larger than 128 for this to easily beat simply loading and storing 128 bytes at a time (4) (5).  A rough sketch of the scheme follows the notes below.

(1) If the source is already in the cache, the extra instructions for the prefetch are a bit wasteful.  Using prefetchnta limits the waste, perhaps to 0, depending on whether prefetchnta can be executed in an otherwise unused pipeline.  However, a working prefetchnta is required for the prefetch to actually occur in the case where the source isn't already in the cache.  A64's have a working prefetchnta, but AXP's don't.

(2) Reads and writes through the L1 cache give much the same benefits as reads and writes through many registers.  Reads must go through the L1 cache.

(3) Write combining of sequential writes makes the number of memory accesses independent of the instruction operand size, at least for cached writes.  If writes are direct, then the issues are different.  I've never seen 128-bit nontemporal writes go faster than 64-bit ones, so maybe they are combined too.  (I may have seen this on AXP while writing this.)

(4) Using only registers, 128 bytes at a time, may beat prefetch in cases where the prefetch is too wasteful.  However, I think these cases are limited to ones where the copy size is small, so you wouldn't want to use prefetch or nontemporal writes there, and any benefits from using XMM registers are marginal (because of the context switching overheads).  Such cases are handled well by caching.  If the cache line size is 128, then read/write through the cache gives RAS cycles that are independent of the instruction operand sizes.  Using 8*16 bytes from registers can only win if the cache line size is 64.

(5) A64's have 16 not-very-general 64-bit integer registers and 16 XMM registers, so you can get close to 128 bytes using integer registers (16*8) and can get 256 bytes using XMM registers (16*16).  Caching and/or write buffering turns writes of the "small" 64-bit registers or even 32-bit registers into much larger writes, so levels of the memory system below L1 don't notice if we write tiny amounts at a time, provided there is enough instruction bandwidth to keep up.
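For concreteness, the scheme looks roughly like this, written as a C sketch using the SSE prefetch intrinsic rather than the hand-rolled asm a kernel copy routine would actually use.  The function name and the BLOCK, LINE and alignment assumptions are mine, purely for illustration, not tuned values:

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>		/* _mm_prefetch(), _MM_HINT_NTA */

/*
 * Illustrative sketch of "prefetch N, then write N".  Assumes both
 * pointers are 8-byte aligned and len is a multiple of BLOCK; a real
 * routine would check alignment, handle the tail and fall back to a
 * plain copy for small sizes.
 */
#define	BLOCK	4096		/* N: much larger than 128 */
#define	LINE	64		/* assumed cache line size */

static void
prefetch_copy(uint64_t *dst, const uint64_t *src, size_t len)
{
	size_t i;

	for (; len != 0; len -= BLOCK, src += BLOCK / 8, dst += BLOCK / 8) {
		/* Read side: pull the whole block in with prefetchnta. */
		for (i = 0; i < BLOCK / 8; i += LINE / 8)
			_mm_prefetch((const char *)(src + i), _MM_HINT_NTA);

		/*
		 * Write side: plain 64-bit loads and stores.  Write
		 * combining makes the store width mostly irrelevant,
		 * so no XMM registers are needed here.
		 */
		for (i = 0; i < BLOCK / 8; i++)
			dst[i] = src[i];
	}
}

The only point of the structure is that the read side of a whole block is issued before the write side touches memory, so the read/write interleaving (and hence the RAS cycle pattern) is controlled by N rather than by the size of the register file.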
> * Modern cpu hardware uses a 128 bit data path for 128 bit media
>   instructions and can optimize the 128 bit operation all the way through
>   to a cache line or to main memory.  It can't be beat.

Actually, integer instructions beat it easily on A64:

Fully cached case:

% copy0: 16907105855 B/s ( 242265 us) (movsq)
% ...
% copyQ: 13658478027 B/s ( 299887 us) (movdqa)
% ...
% copyT: 16456408196 B/s ( 248900 us) (movq (i))

This doesn't contradict your claim, since main memory is not really involved.  Everything should be limited by instruction and cache bandwidth, but for some reason movdqa is quite slow, despite or because of my attempts to schedule it perfectly.

Fully uncached case:

% copy0:  829571300 B/s ( 493749 us) (movsq)
% ...
% copyQ:  835822845 B/s ( 490056 us) (movdqa)
% copyR:  908423544 B/s ( 450891 us) (movdqa with prefetchnta)
% copyS:  919026496 B/s ( 445689 us) (movdqa with block prefetch)
% copyT:  828993732 B/s ( 494093 us) (movq (i))
% copyU:  905141362 B/s ( 452526 us) (movq (i) with prefetchnta)
% copyV:  910626945 B/s ( 449800 us) (movq (i) with block prefetch)
% copyW: 1350397932 B/s ( 303318 us) (movnti)
% copyX: 1662594069 B/s ( 246362 us) (movnti with prefetchnta)
% copyY: 1608204355 B/s ( 254694 us) (movnti with block prefetch)

Now movdqa beats movsq by a bit, at least with prefetch, but it has the same speed as integer moves of half the size (or MMX moves of half the size; benchmark data not shown).  It is necessary to use movnt to get anywhere near memory bandwidth, and any form of movnt can be used for this (I test mainly movntps in addition to the above, since it is the most portable).  There is apparently some magic to combine 8-byte writes where necessary for efficient main memory accesses.

The above was run on an old A64 overclocked a lot to 2204 MHz, with mediocre DDR-1 PC3200 memory overclocked a bit and tuned a lot.  The maximum bandwidth achieved is almost exactly PC3200, but the theoretical maximum is more like PC3400.

> Alignment is critical.  If the data is not aligned, don't bother.  128
> bits means 16 byte alignment.

The above benchmark output is for aligned data :-).  I don't try hard to optimize or benchmark misaligned cases.

> * No extranious memory writes, no uncached extranious memory reads.
>   If you do any writes to memory other then to the copy destination
>   in your copy loop you screw up the cpu's write fifo and destroy
>   performance.
>
>   Systems are so sensitive to this that it is even better to spend the
>   time linearly mapping large copy spaces into KVM and do a single
>   block copy then to have an inner per-PAGE loop.

I haven't tried this, but I have seen, and partly worked around, sensitivity to linear KVA maps not being physically (non)linear enough.  Some CPUs and/or memory systems are remarkably sensitive to bank interleave.  FreeBSD's page coloring doesn't know anything about banks, and accidentally starts up with perfect miscoloring for banks.  This can make a difference of 30% for bzero bandwidth in benchmarks (not so much for bcopy bandwidth, and an insignificant amount for normal use).  After the system warms up, the coloring becomes random with respect to banks, and random coloring works much better than perfect miscoloring.

> * Use of prefetch or use of movntdq instead of movdqa is highly
>   problematic.  It is possible to use these to optimize very particular
>   cases but the problem is they tend to nerf all OTHER cases.

This is very machine-dependent:

- prefetchnta seems to work fine on A64 but not on AXP or P4 (see above and other postings).
  I haven't tested it much for small copy sizes.  I never saw the older prefetch instructions that give more control doing anything useful.

- movntps may be free (once you use XMM, and if ps instead of dq is OK) on AXP, since it runs at cache speed if the target is already cached.  It is not free on A64, since it always runs at memory speed.  I think ps instead of dq is not OK, since ps is slow if there are denormals in the data, but I don't remember whether benchmarks support this.  Unfortunately, the only movnt instruction on AXP is movntps.

- movnt* is a clear win in at least one case: zeroing pages in idle.  FreeBSD on i386 uses movnti for all zeroing of pages in the SSE2 case.  This might not be best for non-idle zeroing.  Maybe it should be used only for multiple pages.  For single pages, the zeroing is probably for a page fault and you want at least the data near the fault address in the cache.  FreeBSD on amd64 also uses movnti for all copying of pages.  I think SunOS on amd64 uses movnt* whenever the copy size is >= 128.  This seems too small.

> I've given up trying to use either mechanism.  Instead, I prefer
> copying as large a block as possible to remove these variables from
> the cpu pipeline entirely.  The cpu has a write fifo anyway, you
> don't need prefetch instructions if you can use instructions to write
> to memory faster then available L2 cache bandwidth.  On some cpus
> this mandates the use of 64 or 128 bit media instructions or the
> cpu can't keep the write FIFO full and starts interleaving reads
> and writes on the wrong boundaries (creating more RAS cycles, which
> is what kills copy bandwidth).

Which modern CPUs can't keep up with L2?  On amd64, for the fully-L2-cached case:

% copy0: 4382675165 B/s ( 934589 us) (movsq)
% ...
% copyN: 3432351420 B/s (1193351 us) (movntq)
% copyO: 1777008820 B/s (2304997 us) (movntq with prefetchnta)
% copyP: 3278961491 B/s (1249176 us) (movntq with block prefetch)
% copyQ: 4369052686 B/s ( 937503 us) (movdqa)
% copyR: 2785886655 B/s (1470268 us) (movdqa with prefetchnta)
% copyS: 4553271385 B/s ( 899573 us) (movdqa with block prefetch)
% copyT: 4401466152 B/s ( 930599 us) (movq (i))
% copyU: 2817507895 B/s (1453767 us) (movq (i) with prefetchnta)
% copyV: 4523562615 B/s ( 905481 us) (movq (i) with block prefetch)
% copyW: 3432429080 B/s (1193324 us) (movnti)
% copyX: 1776507852 B/s (2305647 us) (movnti with prefetchnta)
% copyY: 3136767185 B/s (1305803 us) (movnti with block prefetch)

So here is a case where prefetchnta is bad.  OTOH, movnti is quite fast provided prefetchnta is not used with it.  movnti is even more competitive with L2 if main memory is DDR-2.

> * RAS transitions also have to be aligned or you get boundary cases
>   when the memory address transitions a RAS line.  This again mandates
>   maximal alignment (even more then 16 bytes, frankly, which is why
>   being able to do 128 byte blocks with XMM registers is so nice).
>   Even though reads and writes are reblocked to the cache line size
>   by the cpu, your inner loop can still transition a RAS boundary in
>   the middle of a large block read if it isn't aligned.
>
>   But at this point the alignment requirements start to get kinda
>   silly.  128 byte alignment requirement?  I don't think so.  I
>   do a 16-byte alignment check in DragonFly as a pre-req for using
>   XMM and that's it.

:-).  There are just too many combinations to even think about.  I prefer to reduce the number of cases by depending on the prefetching working reasonably well.

> I don't know about AMD64.
> You only have 64 bit general registers in 64
> bit mode so you may not be able to keep the write pipeline full.  But
> you do have 8 of them so you are roughly equivalent to MMX (but not

16, almost (see (5) above).

> XMM's 8 128 bit registers).

On A64 and AXP, most things are limited by the load/store unit (LSU), the 3 integer pipelines, and SSE latency (but latency is not a problem here).

Load/store unit:

On A64, the LSU can do the following numbers of interesting L1 cache accesses per cycle:

    2 64-bit loads
    2 64-bit stores
    1 64-bit load and 1 64-bit store

all to the L1 cache.  When you load or store a 128-bit XMM register, it takes the same load/store-unit resources as 2 64-bit integer loads or stores of the same type.  Thus XMM and integer operations can both achieve the maximum L1 cache bandwidth, but no more of course, so XMM vs integer is irrelevant for bandwidth considerations at the L1 level, and since lower levels need less bandwidth, it is irrelevant for all levels except possibly the non-cache level given by nontemporal stores.

On AXP, the LSU can do the following numbers of interesting L1 cache accesses per cycle:

    2 64-bit loads
    2 32-bit stores
    1 64-bit store
    1 64-bit load and 1 64-bit store

Since there are no 64-bit integer registers, you have to use MMX or XMM to get the 64-bit accesses.  It doesn't matter whether you use 2*MMX or 1*XMM to get to 128 bits -- both become 2*64 bits.  32-bit integer stores can only use half of the L1 cache bandwidth.  32-bit integer loads are also limited to 2 per cycle, since they apparently aren't combined at this level.  Combining also doesn't happen for movsl (this is apparently implemented as a 32-bit load/store).

On A64 in 32-bit mode, the LSU can do 2 64-bit stores/cycle, but this only helps when using XMM registers -- 32-bit accesses apparently never get combined at this level, so they always go at half speed.

The asymmetric load/store capabilities on the AXP can't be good for using 128-bit XMM registers, especially using 8 loads followed by 8 stores in the instruction stream.  They make 128-bit stores want to take twice as long as 128-bit loads (2 64-bit loads can issue per cycle, but only 1 64-bit store), and it takes a large amount of out-of-order execution to reorder things to reach the "1 64-bit load and 1 64-bit store" per cycle pattern.  Reordering is exactly what you don't want for controlling the interleaving of accesses to main memory.  Even with a simple load-store-load-store sequence of 64-bit accesses in the instruction stream, there will be some reordering -- the reads will get 1 ahead of the writes.  This explains why my movdqa benchmark is slow on AXP, but not why it is slow on A64.

Benchmark output on AXP, fully cached case:

% copy0: 5275909727 B/s ( 776359 us) (1417037051 tsc) (movsl)
% ...
% copyD: 6301722666 B/s ( 649981 us) (1175886987 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
% copyH: 4336676894 B/s ( 944502 us) (1708211441 tsc) (movntps)
% ...
% copyK: 8750902650 B/s ( 468066 us) ( 846786401 tsc) (movq)
% ...
% copyN: 3029184746 B/s (1352179 us) (2445280543 tsc) (movntq)
% ...
% copyQ: 5487174266 B/s ( 746468 us) (1350423254 tsc) (movdqa)

This shows:

- movsl is quite slow.

- Even the old kernel bcopy is barely worth using on AXP.

- movntps isn't too bad in the fully cached case (an AXP implementation detail).

- movq (through 64-bit MMX) is quite fast, but not quite twice as fast as movsl.  I thought that mainly instruction execution bandwidth limits prevented it from being twice as fast, but maybe it is affected by the asymmetry.  It does load-load-store-store repeated through 4 pairs of different MMX registers (roughly the pattern sketched below).
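In C-like form, the pattern is roughly the following, with plain 64-bit integer temporaries standing in for the MMX registers (the function name and calling convention are made up for illustration; the real routine is asm using movq):

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the load-load-store-store pattern, rotated through 4
 * pairs of registers, 64 bytes per iteration.  Assumes 8-byte
 * alignment and len a multiple of 64; on AXP there are no 64-bit
 * integer registers, so the real code uses MMX (or XMM) registers.
 */
static void
copy_llss(uint64_t *dst, const uint64_t *src, size_t len)
{
	uint64_t r0, r1, r2, r3, r4, r5, r6, r7;

	for (; len != 0; len -= 64, src += 8, dst += 8) {
		r0 = src[0]; r1 = src[1]; dst[0] = r0; dst[1] = r1;
		r2 = src[2]; r3 = src[3]; dst[2] = r2; dst[3] = r3;
		r4 = src[4]; r5 = src[5]; dst[4] = r4; dst[5] = r5;
		r6 = src[6]; r7 = src[7]; dst[6] = r6; dst[7] = r7;
	}
}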
  I think that routine was written before I knew anything about the CPU resource issues (I keep some mistakes for comparison).  Now I think using different registers is unnecessary, and the limit of 8 or 16 MMX or XMM registers need not be hit.  Load-store repeated to the same register, or load-store-load-store repeated, actually uses a large machine-dependent number of registers via register renaming.  Even old AXP's have this.  I think the register renaming works across loop boundaries and gives you many more than the 8 registers that you might think you are using.  This goes with out-of-order execution and might be controllable only in the same way, by stalling at page boundaries.

- movntq is much slower than movntps.  movntps does 4 128-bit loads followed by 4 128-bit stores.  movntq does 8 64-bit loads followed by 8 64-bit stores.  This seems to show that 128-bit accesses are faster.

- movdqa is quite slow.  It uses load-store... of a single 128-bit register.  Perhaps it is too optimized away from using multiple registers.  In fact, rewriting it to use 8 loads followed by 8 stores shows one case where it beats movnti (64 bits) -- in the fully-L2-cached case, this speeds it up from 4.37 GB/s to 4.77 GB/s, making it about 0.4 GB/s faster than both the other version of itself and movnti.  But in the fully-L1-cached case, this slows it down from 13.7 GB/s to 9.2 GB/s, making it much slower than itself and almost twice as slow as movsq.  This is hard to explain, especially for the L1 case.  In the L2 case, it might be explained by the 8 loads followed by 8 stores not being reordered very much, so that they work almost like you want.

Integer pipeline resources:

Both AXP and A64 have 3 integer pipelines (they are remarkably similar in this and many other respects).  It's easy to do a load/store in parallel using 2 of the pipelines.  This saturates the LSU (loads must be at least 1 in advance of stores).  The load/store can also be from the FPU/SSE pipelines, but not independently, and on AXP the integer pipelines lose LSU bandwidth since they can only do 32-bit accesses.  Other integer operations can run in the 3rd pipeline, but there are many restrictions which I don't quite understand or remember.  These result in it being hard to keep the 3rd pipeline full.  The above movq (i) benchmark for A64 indicates that the 3rd pipeline may be being used to hide all the loop control overhead -- I only unrolled the loop to 64 bytes, and that gives almost full speed.  The loop must be unrolled a bit, since a single load-store runs too fast for the loop control.  For the movnti benchmarks, the load-store runs slower than the loop control, so unrolling is not needed.

Bruce