Re: fast bcopy...

From: Steven Atreju <snatreju_at_googlemail.com> Date: Thu, 3 May 2012 12:28:44 +0200

K. Macy wrote [2012-05-03 02:58+0200]:
> It's highly chipset and processor dependent what works best.

Yes, of course.
Though i was kind of shocked, even, when i first saw this:

  http://marc.info/?l=dragonfly-commits&m=132241713812022&w=2

So we don't use our assembler version for new gccs and HAMMER or
SSE3+ (the choice of these was rather arbitrary, except that they
already existed and so allowed an instant implementation).

> Intel now has non-temporal loads and stores which work much
> better in some cases but provide little benefit in others.

Yes, our 2002 tests have shown that these were *extremely*
dependent upon alignment.  (Note: 2002. o-)
Hmm, it doesn't really matter, but i guess this is a good time to
thank the FreeBSD hackers for that FPU stack FILD/FISTP idea!
I'll append the copy-related notes of our doc/memperf.txt.
Thanks,

> -Kip

Steven.

I. x86 (AMD Athlon 1600+, 256MB DDR, 133/133 FSB)
-------------------------------------------------

COPY
....

The basic idea is always the same:
- Branch off to REPZ MOVSB if less than 16 bytes to go.
- Align at least one pointer on a nice boundary (&3 or &7).
  (Done by a byte loop; one 4/8 store is more expensive here.)
  We always align the _from pointer due to test experience.
- DEPENDENT.
- Do the remaining maximally 3 bytes in an unrolled MOVSB way.
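
A rough C sketch of the steps above, for readers who prefer code to
prose.  It is not the original assembler: the name sketch_copy is
hypothetical, the DEPENDENT step is shown only in its simplest form
(a 32-bit word loop standing in for REPZ MOVSL), and the trailing
bytes are done in a plain loop rather than unrolled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a portable stand-in for the assembler routine. */
void *sketch_copy(void *to, const void *from, size_t len)
{
    unsigned char *d = (unsigned char *)to;
    const unsigned char *s = (const unsigned char *)from;

    /* Fewer than 16 bytes: plain byte copy (REPZ MOVSB in the notes). */
    if (len < 16) {
        for (; len > 0; --len)
            *d++ = *s++;
        return to;
    }

    /* Byte loop until the _from pointer is 4-byte aligned (&3).
     * At most 3 bytes are consumed, so len stays >= 13 here. */
    while (((uintptr_t)s & 3) != 0) {
        *d++ = *s++;
        --len;
    }

    /* DEPENDENT, simplest variant: one 32-bit word per iteration.
     * memcpy keeps the possibly unaligned store to d well defined. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        s += 4;
        d += 4;
        len -= 4;
    }

    /* At most 3 trailing bytes. */
    for (; len > 0; --len)
        *d++ = *s++;
    return to;
}
```

Like memcpy, the sketch assumes non-overlapping buffers and returns
the destination pointer.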

DEPENDENT:
- !SF_FPU && !defined(SF_X86_MMX): just a matter of REPZ MOVSL.
- Otherwise we use three different loops over 64, 16 and 8 bytes,
  respectively.  If more than 4 bytes remain after that we use one
  additional MOVSL.
  Note that the 8 byte loop is not a loop but executes once only.

  The big loop uses pairs of MOVNTQ/MOVQ, MOVQ/MOVQ and FILD/FISTP, if
  _SSE, _MMX or _FPU, respectively.  The _SSE loop exists in addition to
  the others and is only used if the pointer we did not align (the _to
  pointer) happens to be aligned as well.
  The two smaller ones never use SSE's non-temporal moves; this way we
  can simply go ahead no matter whether the _to pointer is aligned or not.
  Tests demonstrated that non-temporal is no win for them anyway.

  At the end we issue an additional SFENCE (if _SSE) and EMMS (_MMX) or FEMMS
  (if _3DNOW) to serialize the non-temporal moves and clear the MMX state,
  respectively.  The SFENCE should not be needed, however.
  Prefetching is not used (very bad on Athlon (or i don't understand it)).
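
The tiered structure of the non-trivial DEPENDENT case might look
roughly like this in portable C.  This is only an illustration under
stated assumptions: sketch_bulk and move8 are hypothetical names, each
8-byte move stands in for one MOVQ or FILD/FISTP load/store pair, the
non-temporal variant and the SFENCE/EMMS epilogue are left out, and
"more than 4 bytes remain" is read here as "at least 4", so that at
most 3 bytes are left over for the trailing byte moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 8-byte move; stands in for a MOVQ or FILD/FISTP pair. */
static void move8(unsigned char *d, const unsigned char *s)
{
    uint64_t q;
    memcpy(&q, s, sizeof q);
    memcpy(d, &q, sizeof q);
}

/* Hypothetical name.  Copies as much of len as the 64/16/8/4 byte
 * steps allow and returns the number of bytes left over (0..3). */
size_t sketch_bulk(unsigned char *d, const unsigned char *s, size_t len)
{
    size_t done = 0;

    /* Big loop: 64 bytes per iteration (unrolled in assembler). */
    while (len - done >= 64) {
        for (size_t k = 0; k < 64; k += 8)
            move8(d + done + k, s + done + k);
        done += 64;
    }

    /* Medium loop: 16 bytes per iteration. */
    while (len - done >= 16) {
        move8(d + done, s + done);
        move8(d + done + 8, s + done + 8);
        done += 16;
    }

    /* The "8 byte loop": executes at most once. */
    if (len - done >= 8) {
        move8(d + done, s + done);
        done += 8;
    }

    /* One additional 4-byte move (MOVSL in the notes). */
    if (len - done >= 4) {
        uint32_t w;
        memcpy(&w, s + done, sizeof w);
        memcpy(d + done, &w, sizeof w);
        done += 4;
    }

    return len - done;
}
```

The caller would finish with byte moves for the returned remainder.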

1. !_MMX && !_FPU
2. _MMX
3. _FPU (thanks to the FreeBSD crew for this idea!)
4. _MMX+_3DNOW+_SSE implementation (all we have).
   ([*] times in brackets show the time measured if the _from
   pointer alignment loop has a leading '.ALIGN 2' statement; note
   especially the value for 4096...  note this value in general.)

UNT: unaligned pointers, to pointer alignment goal
UNF: unaligned pointers, from pointer alignment goal
1000 loops; times in (averaged) microseconds
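
For anyone who wants to redo such measurements, a minimal timing
harness could look like the following.  It is a guess at the method,
not the original test program: the notes do not say what exactly was
averaged, so this sketch assumes the figure is the time for all
`loops` copies of one size, averaged over several runs.  The names
sketch_time_us and copy_fn are hypothetical, and clock() measures CPU
time at a resolution that may be too coarse for the smallest sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any function with the memcpy signature can be measured. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical name.  Returns the time in microseconds needed for
 * `loops` copies of `len` bytes, averaged over `runs` runs. */
double sketch_time_us(copy_fn fn, void *to, const void *from,
                      size_t len, int loops, int runs)
{
    double total = 0.0;

    for (int r = 0; r < runs; ++r) {
        clock_t t0 = clock();
        /* Calling through a function pointer keeps the compiler
         * from simply dropping the repeated copies. */
        for (int i = 0; i < loops; ++i)
            fn(to, from, len);
        clock_t t1 = clock();
        total += (double)(t1 - t0) * 1e6 / CLOCKS_PER_SEC;
    }
    return total / runs;
}
```

A call such as sketch_time_us(memcpy, dst, src, 4096, 1000, 5) would
then give one cell of a table like the one below, for the library
memcpy on the machine at hand.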

P.S.: 03-04-01: SSE stuff disabled because speed for smaller ranges is
considered more important than for large and very large ranges.
(And the difference is small for non-perfect ranges and non-aligned pointers.)

---------------------------------------------------------------------------
|bytes|   1./ UNT/ UNF |   2./ UNT/ UNF |   3./ UNT/ UNF |   4.[*]  / UNF |
|--------------------------------------------------------------------------
|16   |   34/    /     |   19/    /  37 |   21/    /  37 |   24[ 26]/  37 |
|15   |   40/    /     |   39/    /  35 |   37/    /  35 |   38[ 39]/  35 |
|32   |   36/    /     |   23/    /  30 |   23/    /  30 |   27[ 30]/  33 |
|31   |   43/    /     |   37/    /  28 |   36/    /  28 |   38[ 42]/  31 |
|64   |   45/    /     |   17/    /  38 |   17/    /  36 |   21[ 23]/  39 |
|63   |   50/    /     |   46/    /  35 |   44/    /  34 |   47[ 50]/  37 |
|128  |   59/  70/  74 |   31/    /  45 |   34/    /  47 |   34[ 36]/  50 |
|127  |   67/  82/  62 |   53/    /  45 |   51/    /  44 |   62[ 63]/  50 |
|256  |   89/ 111/ 108 |   52/    /  74 |   53/    /  77 |   50[ 50]/  76 |
|255  |   99/ 123/  96 |   67/    /  73 |   73/    /  75 |   68[ 70]/  74 |
|512  |  151/ 197/ 177 |   95/    / 131 |   98/    / 137 |   84[103]/ 137 |
|511  |  158/ 208/ 166 |  100/    / 132 |  117/    / 134 |   99[112]/ 135 |
|1024 |  274/ 395/ 314 |  179/    / 255 |  211/    / 270 |  166[207]/ 257 |
|1023 |  280/ 408/ 303 |  196/    / 253 |  225/    / 267 |  184[185]/ 253 |
|2048 |  579/ 765/ 966 |  350/    / 485 |  394/    / 511 |  389[388]/ 486 |
|2047 |  585/ 777/ 942 |  368/    / 484 |  410/    / 520 |  323[398]/ 484 |
|4096 | 1009/1385/1140 |  704/    /1036 |  761/    /1040 |  671[583]/1038 |
|4095 | 1027/1386/1130 |  721/    /1034 |  776/    /1037 |  602[604]/1035 |
|--------------------------------------------------------------------------

P.S.: ooops - i've really forgotten that the SSE stuff has been
completely disabled at a later time!  I guess we'll have to redo
some testing eventually!