On Sat, Jun 11, 2011 at 02:41:50AM +0200, Luigi Rizzo wrote:
> just for the records: the AMD motherboard works fine and can reach
> 14.88Mpps, i was just doing a couple of mistakes in my AMD tests,
> including the use of a slot with 16x form factor but only 4 lanes
> connected.

> This said, the i7-870 is about twice as fast as the Athlon II X4-635
> in generating packets for the same clock speed.
> I think the different cache size might have some impact on the
> result given the Athlon has no L3 cache and the test program surely
> overflows the 512k L2 cache (i am using a total of 8k packet buffers,
> touching 64 bytes each for the payload, plus 24 bytes each for
> descriptors).

> Unfortunately at these speeds even small things matter a lot!

It may help to use non-temporal stores to fill the packet buffers.
Because this data will never be read again by the CPU, caching it is
useless. Also, non-temporal stores may help avoid reading a cache line
only to overwrite it completely.

With SSE, this could be done with a loop of four MOVUPS and four MOVNTPS
instructions, transferring 64 bytes per iteration, and an SFENCE at the
end (or the corresponding intrinsics from <xmmintrin.h>, _mm_loadu_ps(),
_mm_stream_ps(), _mm_sfence()).

For the receive side, there are also various non-temporal loads and
prefetch instructions.

On the other hand, because generating small packets only writes to 64
bytes of each 2048 byte aligned block, only a small portion of the cache
will be polluted. This is because caches are usually not fully
associative. This small portion could contain other important data,
however. When generating full 1500 byte packets, most of the cache will
be polluted.

Because caching is not useful for the ring buffers, it is probably not a
problem that they are laid out in such a way that they cannot be cached
efficiently.

-- 
Jilles Tjoelker

Received on Sat Jun 11 2011 - 21:02:50 UTC
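
The SSE loop described in the message could look roughly like the sketch below. It is only an illustration, not code from the thread: the function name copy_nt64 is hypothetical, and it assumes the destination is 16-byte aligned (which _mm_stream_ps() requires) and that the length is a multiple of 64 bytes. The source may be unaligned, hence _mm_loadu_ps().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/*
 * Copy len bytes from src to dst with non-temporal stores, 64 bytes per
 * iteration. Hypothetical helper; assumes dst is 16-byte aligned and len
 * is a multiple of 64. src may have any alignment.
 */
static void
copy_nt64(void *dst, const void *src, size_t len)
{
    float *d = dst;
    const float *s = src;
    size_t i;

    assert((uintptr_t)dst % 16 == 0);
    assert(len % 64 == 0);

    for (i = 0; i < len / 64; i++) {
        /* Four unaligned 16-byte loads (MOVUPS). */
        __m128 r0 = _mm_loadu_ps(s + 0);
        __m128 r1 = _mm_loadu_ps(s + 4);
        __m128 r2 = _mm_loadu_ps(s + 8);
        __m128 r3 = _mm_loadu_ps(s + 12);

        /* Four non-temporal 16-byte stores (MOVNTPS). */
        _mm_stream_ps(d + 0, r0);
        _mm_stream_ps(d + 4, r1);
        _mm_stream_ps(d + 8, r2);
        _mm_stream_ps(d + 12, r3);

        s += 16;    /* 16 floats = 64 bytes */
        d += 16;
    }

    /* Order the non-temporal stores before any later stores (SFENCE). */
    _mm_sfence();
}
```

Whether this actually beats ordinary stores for 64-byte payloads depends on the CPU and would need to be measured. The SFENCE matters when a later store, such as a descriptor update, tells the NIC that the buffer is ready.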