On 2013-01-23 17:32, Luigi Rizzo wrote:

> Probably our compiler folks have some ideas on this...
>
> When doing netmap i found that on FreeBSD memcpy/bcopy was expensive,
> __builtin_memcpy() was even worse,

Which compilation flags did you use to test this? When I compiled your testcase program with clang 3.2, gcc 4.2 and gcc 4.7 at -O2, with all other settings at their defaults, all three compilers just called libc's memcpy() for the __builtin_memcpy tests.

For example, with gcc 4.7, the loop in test_builtin_memcpy becomes:

.L116:
        movq    %rbx, %rax
        addq    $1, %rbx
        andl    $262143, %eax
        movq    %rax, %rdx
        salq    $12, %rax
        salq    $8, %rdx
        leaq    huge(%rdx,%rax), %rsi
        movq    %r12, %rdx
        call    memcpy
        movq    24(%rbp), %rax
        movq    0(%rbp), %rdi
        addq    $1, %rax
        cmpq    %rbx, 4096(%rdi)
        movq    %rax, 24(%rbp)
        jg      .L116

The other routines are emitted as similar code. For test_bcopy() the loop becomes:

.L123:
        movq    %rbx, %rax
        addq    $1, %rbx
        andl    $262143, %eax
        movq    %rax, %rdx
        salq    $12, %rax
        salq    $8, %rdx
        leaq    huge(%rdx,%rax), %rsi
        movq    %r12, %rdx
        call    bcopy
        movq    24(%rbp), %rax
        movq    0(%rbp), %rdi
        addq    $1, %rax
        cmpq    %rbx, 4096(%rdi)
        movq    %rax, 24(%rbp)
        jg      .L123

and similarly, for test_memcpy() it becomes:

.L109:
        movq    %rbx, %rax
        addq    $1, %rbx
        andl    $262143, %eax
        movq    %rax, %rdx
        salq    $12, %rax
        salq    $8, %rdx
        leaq    huge(%rdx,%rax), %rdi
        movq    %r12, %rdx
        call    memcpy
        movq    24(%rbp), %rax
        movq    0(%rbp), %rsi
        addq    $1, %rax
        cmpq    %rbx, 4096(%rsi)
        movq    %rax, 24(%rbp)
        jg      .L109

In our libc, bcopy and memcpy are implemented from the same source file, with just the arguments swapped around. So I fail to see what could cause the performance difference between __builtin_memcpy, memcpy and bcopy you are seeing. Also, on amd64, this is implemented in lib/libc/amd64/string/bcopy.S, so the compiler does not have any influence on its performance. Note the routine uses "rep movsq" as its main loop, which is apparently not the best way on modern CPUs.
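As an aside, the argument-swap relationship between the two interfaces can be sketched in C as follows (a minimal illustration with a hypothetical name, not the actual bcopy.S implementation; historical bcopy also permits overlapping buffers, hence memmove rather than memcpy):

```c
#include <stddef.h>
#include <string.h>

/* bcopy(src, dst, len) is memcpy(dst, src, len) with the first two
 * arguments swapped; since bcopy must handle overlapping buffers,
 * memmove is the closest C equivalent. */
static void my_bcopy(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}
```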
Maybe you have found another instance where hand-rolled assembly is slower than compiler-optimized code... :-)

With gcc 4.7, your fast_bcopy() gets inlined to this:

.L131:
        movq    (%rax), %rdx
        subl    $64, %ecx
        movq    %rdx, (%rsi)
        movq    8(%rax), %rdx
        movq    %rdx, 8(%rsi)
        movq    16(%rax), %rdx
        movq    %rdx, 16(%rsi)
        movq    24(%rax), %rdx
        movq    %rdx, 24(%rsi)
        movq    32(%rax), %rdx
        movq    %rdx, 32(%rsi)
        movq    40(%rax), %rdx
        movq    %rdx, 40(%rsi)
        movq    48(%rax), %r9
        movq    %r9, 48(%rsi)
        movq    56(%rax), %r9
        addq    $64, %rax
        movq    %r9, 56(%rsi)
        addq    $64, %rsi
        testl   %ecx, %ecx
        jg      .L131

while clang 3.2 produces:

.LBB14_5:
        movq    (%rdi), %rcx
        movq    %rcx, (%rsi)
        movq    8(%rdi), %rcx
        movq    %rcx, 8(%rsi)
        movq    16(%rdi), %rcx
        movq    %rcx, 16(%rsi)
        addl    $-64, %eax
        movq    24(%rdi), %rcx
        movq    %rcx, 24(%rsi)
        testl   %eax, %eax
        movq    32(%rdi), %rcx
        movq    %rcx, 32(%rsi)
        movq    40(%rdi), %rcx
        movq    %rcx, 40(%rsi)
        movq    48(%rdi), %rcx
        movq    %rcx, 48(%rsi)
        movq    56(%rdi), %rcx
        leaq    64(%rdi), %rdi
        movq    %rcx, 56(%rsi)
        leaq    64(%rsi), %rsi
        jg      .LBB14_5

Both are most likely faster than the "rep movsq" logic in bcopy.S.

> and so i ended up writing
> my custom routine, (called pkt_copy() in the program below).
> This happens with gcc 4.2.1, clang, gcc 4.6.4
>
> I was then surprised to notice that on a recent ubuntu using
> gcc 4.6.2 (if that matters) the __builtin_memcpy beats other
> methods by a large factor.

On Ubuntu, I see the same thing as on FreeBSD; __builtin_memcpy just calls the regular memcpy.
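For reference, the unrolled copies shown above correspond roughly to a C loop like the following (a sketch assuming the length is a positive multiple of 64 and both pointers are 8-byte aligned; similar in spirit to, but not necessarily identical with, the pkt_copy()/fast_bcopy() routines from the test program):

```c
#include <stdint.h>

/* Copy len bytes (assumed a positive multiple of 64) eight 64-bit
 * words at a time; gcc and clang unroll this into the movq
 * sequences shown above. */
static void copy64(void *dst, const void *src, int len)
{
    uint64_t *d = dst;
    const uint64_t *s = src;

    for (; len > 0; len -= 64) {
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8;
        s += 8;
    }
}
```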
However, eglibc's memcpy looks to be more highly optimized; there are several CPU-specific implementations, for example for the i386 and amd64 arches:

  sysdeps/i386/i586/memcpy_chk.S
  sysdeps/i386/i586/memcpy.S
  sysdeps/i386/i686/memcpy_chk.S
  sysdeps/i386/i686/memcpy.S
  sysdeps/i386/i686/multiarch/memcpy_chk.S
  sysdeps/i386/i686/multiarch/memcpy.S
  sysdeps/i386/i686/multiarch/memcpy-ssse3-rep.S
  sysdeps/i386/i686/multiarch/memcpy-ssse3.S
  sysdeps/x86_64/memcpy_chk.S
  sysdeps/x86_64/memcpy.S
  sysdeps/x86_64/multiarch/memcpy_chk.S
  sysdeps/x86_64/multiarch/memcpy.S
  sysdeps/x86_64/multiarch/memcpy-ssse3-back.S
  sysdeps/x86_64/multiarch/memcpy-ssse3.S

Most likely, your test program on Ubuntu is calling the ssse3 version, which should be much faster than any of the above loops.

> Here are the number in millions of calls per second. Is the test
> program flawed, or the compiler is built with different options ?

I think the test program looks fine after lightly skimming it. FreeBSD's memcpy is probably just slower for the CPUs you have been testing on.

Received on Wed Jan 23 2013 - 18:26:17 UTC