false alarm (Re: __builtin_memcpy() slower than memcpy/bcopy (and on linux it is the opposite) ?)

From: Luigi Rizzo <rizzo_at_iet.unipi.it> Date: Thu, 24 Jan 2013 03:54:42 +0100

On Wed, Jan 23, 2013 at 05:32:38PM +0100, Luigi Rizzo wrote:
> Probably our compiler folks have some ideas on this...
> 
> When doing netmap i found that on FreeBSD memcpy/bcopy was expensive,
> __builtin_memcpy() was even worse, and so i ended up writing
> my custom routine, (called pkt_copy() in the program below).
> This happens with gcc 4.2.1, clang, gcc 4.6.4
> 
> I was then surprised to notice that on a recent ubuntu using
> gcc 4.6.2 (if that matters) the __builtin_memcpy beats other
> methods by a large factor.

so, it turns out that in my test program I had swapped the
source and destination operands for __builtin_memcpy(), and
this substantially changed the memory access pattern.

With the correct operands, __builtin_memcpy == memcpy == bcopy
on both FreeBSD and Linux.
On FreeBSD pkt_copy is still faster than the other methods for
small packets, whereas on Linux they are equivalent.

If you are curious why swapping source and dst changed things
so dramatically:

the test was supposed to read from a large chunk of
memory (over 1GB) to avoid always hitting L1 or L2.
Swapping the operands causes the reads to always hit the same
cache line, thus saving a lot of misses. The difference between
the two machines is then probably due to how the cache is used
on writes.
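
For the curious, here is a minimal sketch of the two access patterns. It is not the original test program: the function names, PKT_LEN and the wrap-around logic are all made up for illustration. The first loop is what the test was meant to do (reads walk through the big buffer), the second is what the swapped operands effectively did (reads always come from the same small buffer, only the writes walk).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packet size, not taken from the original test program. */
enum { PKT_LEN = 64 };

/*
 * Intended pattern: the source pointer walks through a large buffer,
 * so with a big enough buffer the reads keep missing L1/L2.
 * The destination is always the same small buffer.
 * Returns a checksum so the copies cannot be optimized away.
 */
static uint64_t
copy_walking_reads(uint8_t *small, const uint8_t *big, size_t big_len,
    size_t iters)
{
	uint64_t sum = 0;
	size_t off = 0;

	assert(big_len >= PKT_LEN);
	for (size_t i = 0; i < iters; i++) {
		memcpy(small, big + off, PKT_LEN);	/* dst, src */
		sum += small[0];
		off += PKT_LEN;
		if (off + PKT_LEN > big_len)
			off = 0;	/* wrap to the start of the buffer */
	}
	return sum;
}

/*
 * Swapped operands: every read comes from the same small buffer
 * (which stays in cache), and only the writes walk the large buffer.
 */
static uint64_t
copy_fixed_reads(uint8_t *big, size_t big_len, const uint8_t *small,
    size_t iters)
{
	uint64_t sum = 0;
	size_t off = 0;

	assert(big_len >= PKT_LEN);
	for (size_t i = 0; i < iters; i++) {
		memcpy(big + off, small, PKT_LEN);	/* dst, src */
		sum += big[off];
		off += PKT_LEN;
		if (off + PKT_LEN > big_len)
			off = 0;	/* wrap to the start of the buffer */
	}
	return sum;
}
```

To see any cache effect, big would have to be much larger than the last-level cache (the test used over 1GB) and the loops would have to be timed; the sketch only shows which side of the copy moves.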

sorry for the noise.
luigi