On 2013-01-23 17:32, Luigi Rizzo wrote:

> Probably our compiler folks have some ideas on this...
>
> When doing netmap i found that on FreeBSD memcpy/bcopy was expensive,
> __builtin_memcpy() was even worse,

Which compilation flags did you use to test this? When I compiled your testcase program with clang 3.2, gcc 4.2 and gcc 4.7 at -O2, with all other settings at their defaults, all three compilers just called libc's memcpy() for the __builtin_memcpy tests.

For example, with gcc 4.7, the loop in test_builtin_memcpy becomes:

.L116:
        movq    %rbx, %rax
        addq    $1, %rbx
        andl    $262143, %eax
        movq    %rax, %rdx
        salq    $12, %rax
        salq    $8, %rdx
        leaq    huge(%rdx,%rax), %rsi
        movq    %r12, %rdx
        call    memcpy
        movq    24(%rbp), %rax
        movq    0(%rbp), %rdi
        addq    $1, %rax
        cmpq    %rbx, 4096(%rdi)
        movq    %rax, 24(%rbp)
        jg      .L116

The other routines are emitted as similar code. For test_bcopy() the loop becomes:

.L123:
        movq    %rbx, %rax
        addq    $1, %rbx
        andl    $262143, %eax
        movq    %rax, %rdx
        salq    $12, %rax
        salq    $8, %rdx
        leaq    huge(%rdx,%rax), %rsi
        movq    %r12, %rdx
        call    bcopy
        movq    24(%rbp), %rax
        movq    0(%rbp), %rdi
        addq    $1, %rax
        cmpq    %rbx, 4096(%rdi)
        movq    %rax, 24(%rbp)
        jg      .L123

and similarly, for test_memcpy() it becomes:

.L109:
        movq    %rbx, %rax
        addq    $1, %rbx
        andl    $262143, %eax
        movq    %rax, %rdx
        salq    $12, %rax
        salq    $8, %rdx
        leaq    huge(%rdx,%rax), %rdi
        movq    %r12, %rdx
        call    memcpy
        movq    24(%rbp), %rax
        movq    0(%rbp), %rsi
        addq    $1, %rax
        cmpq    %rbx, 4096(%rsi)
        movq    %rax, 24(%rbp)
        jg      .L109

In our libc, bcopy and memcpy are implemented from the same source file, with just the arguments swapped around. So I fail to see what could cause the performance difference between __builtin_memcpy, memcpy and bcopy you are seeing. Also, on amd64, this is implemented in lib/libc/amd64/string/bcopy.S, so the compiler does not have any influence on its performance. Note the routine uses "rep movsq" as its main loop, which is apparently not the best way on modern CPUs.
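As an aside, the argument-swap relationship between the two interfaces can be sketched in C as follows (a minimal illustration with a hypothetical name, not the actual bcopy.S implementation; historical bcopy also permits overlapping buffers, hence memmove rather than memcpy):

```c
#include <stddef.h>
#include <string.h>

/* bcopy(src, dst, len) is memcpy(dst, src, len) with the first two
 * arguments swapped; since bcopy must handle overlapping buffers,
 * memmove is the closest C equivalent. */
static void my_bcopy(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}
```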
Maybe you have found another instance where hand-rolled assembly is slower than compiler-optimized code... :-)

With gcc 4.7, your fast_bcopy() gets inlined to this:

.L131:
        movq    (%rax), %rdx
        subl    $64, %ecx
        movq    %rdx, (%rsi)
        movq    8(%rax), %rdx
        movq    %rdx, 8(%rsi)
        movq    16(%rax), %rdx
        movq    %rdx, 16(%rsi)
        movq    24(%rax), %rdx
        movq    %rdx, 24(%rsi)
        movq    32(%rax), %rdx
        movq    %rdx, 32(%rsi)
        movq    40(%rax), %rdx
        movq    %rdx, 40(%rsi)
        movq    48(%rax), %r9
        movq    %r9, 48(%rsi)
        movq    56(%rax), %r9
        addq    $64, %rax
        movq    %r9, 56(%rsi)
        addq    $64, %rsi
        testl   %ecx, %ecx
        jg      .L131

while clang 3.2 produces:

.LBB14_5:
        movq    (%rdi), %rcx
        movq    %rcx, (%rsi)
        movq    8(%rdi), %rcx
        movq    %rcx, 8(%rsi)
        movq    16(%rdi), %rcx
        movq    %rcx, 16(%rsi)
        addl    $-64, %eax
        movq    24(%rdi), %rcx
        movq    %rcx, 24(%rsi)
        testl   %eax, %eax
        movq    32(%rdi), %rcx
        movq    %rcx, 32(%rsi)
        movq    40(%rdi), %rcx
        movq    %rcx, 40(%rsi)
        movq    48(%rdi), %rcx
        movq    %rcx, 48(%rsi)
        movq    56(%rdi), %rcx
        leaq    64(%rdi), %rdi
        movq    %rcx, 56(%rsi)
        leaq    64(%rsi), %rsi
        jg      .LBB14_5

Both are most likely faster than the "rep movsq" logic in bcopy.S.

> and so i ended up writing
> my custom routine, (called pkt_copy() in the program below).
> This happens with gcc 4.2.1, clang, gcc 4.6.4
>
> I was then surprised to notice that on a recent ubuntu using
> gcc 4.6.2 (if that matters) the __builtin_memcpy beats other
> methods by a large factor.

On Ubuntu, I see the same thing as on FreeBSD; __builtin_memcpy just calls the regular memcpy.
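For reference, the unrolled copies shown above correspond roughly to a C loop like the following (a sketch assuming the length is a positive multiple of 64 and both pointers are 8-byte aligned; similar in spirit to, but not necessarily identical with, the pkt_copy()/fast_bcopy() routines from the test program):

```c
#include <stdint.h>

/* Copy len bytes (assumed a positive multiple of 64) eight 64-bit
 * words at a time; gcc and clang unroll this into the movq
 * sequences shown above. */
static void copy64(void *dst, const void *src, int len)
{
    uint64_t *d = dst;
    const uint64_t *s = src;

    for (; len > 0; len -= 64) {
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8;
        s += 8;
    }
}
```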
However, eglibc's memcpy looks to be more highly optimized; there are several CPU-specific implementations, for example for the i386 and amd64 arches:

  sysdeps/i386/i586/memcpy_chk.S
  sysdeps/i386/i586/memcpy.S
  sysdeps/i386/i686/memcpy_chk.S
  sysdeps/i386/i686/memcpy.S
  sysdeps/i386/i686/multiarch/memcpy_chk.S
  sysdeps/i386/i686/multiarch/memcpy.S
  sysdeps/i386/i686/multiarch/memcpy-ssse3-rep.S
  sysdeps/i386/i686/multiarch/memcpy-ssse3.S
  sysdeps/x86_64/memcpy_chk.S
  sysdeps/x86_64/memcpy.S
  sysdeps/x86_64/multiarch/memcpy_chk.S
  sysdeps/x86_64/multiarch/memcpy.S
  sysdeps/x86_64/multiarch/memcpy-ssse3-back.S
  sysdeps/x86_64/multiarch/memcpy-ssse3.S

Most likely, your test program on Ubuntu is calling the ssse3 version, which should be much faster than any of the above loops.

> Here are the number in millions of calls per second. Is the test
> program flawed, or the compiler is built with different options ?

I think the test program looks fine after lightly skimming it. FreeBSD's memcpy is probably just slower for the CPUs you have been testing on.

Received on Wed Jan 23 2013 - 18:26:17 UTC