Re: How to determine the L2 cache size on non-AMD CPUs (automatic page queue color tuning)?

From: Bruce Evans <bde_at_zeta.org.au>
Date: Thu, 17 Jun 2004 12:45:10 +1000 (EST)

On Wed, 16 Jun 2004, Martin Nilsson wrote:

> Alexander Leidinger wrote:
> > Now I need to know how to determine those properties on at least some
> > Intel CPUs (e.g. P3 & P4).
>
> The more expensive Intel processors also have L3 caches of 1-4MB.
> Since Intel's processors are built with inclusive caches (data in L2
> cache is also present in L3), shouldn't the value used be that of the
> largest cache, be it L2 or L3?
>
> How much effect on performance does a wrong cache size value have?

Closer to 0.1% than to 10%.  The whole page coloring optimization was
worth a few percent at best, except in unusual/unlucky cases, when it
was first implemented, which was when hardware caches mostly had less
associativity.  Without explicit page coloring, the colors of pages
assigned to an object are almost random.  This causes unnecessary
cache conflicts, but random allocation isn't too bad and on average
only gives a small number of cache conflicts, which are compensated
for by associativity.
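
To make the role of associativity concrete, here is a minimal sketch in
Python of the usual way to count page colors and of a deliberately crude
conflict model.  It is not the FreeBSD VM code.  The function names, the
4K page size and the example cache geometries are assumptions chosen for
illustration, and the model treats a page as conflicting only when more
pages share its color than the cache has ways.

```python
import random
from collections import Counter


def num_colors(cache_bytes, ways, page_bytes=4096):
    """Number of page colors: how many pages fit in one way of the cache."""
    return max(1, cache_bytes // (ways * page_bytes))


def page_color(pfn, colors):
    """Color of a physical page frame number (hypothetical simple mapping)."""
    return pfn % colors


def excess_pages(pfns, colors, ways):
    """Count pages beyond what associativity can hold for each color."""
    per_color = Counter(page_color(p, colors) for p in pfns)
    return sum(max(0, n - ways) for n in per_color.values())


if __name__ == "__main__":
    # Assumed geometry: 256K cache, 2-way, 4K pages.
    ways = 2
    colors = num_colors(256 * 1024, ways)
    working_set = colors * ways          # exactly one cache full of pages

    # Colored allocation: consecutive frames cycle through every color.
    colored = range(working_set)

    # Uncolored allocation: frames picked almost at random.
    rng = random.Random(0)
    uncolored = rng.sample(range(1 << 16), working_set)

    print("colors:", colors)
    print("excess pages, colored:  ", excess_pages(colored, colors, ways))
    print("excess pages, uncolored:", excess_pages(uncolored, colors, ways))
```

Rerunning with the same cache size but ways=16 leaves only 4 colors and
lets each color hold 16 pages before any page counts as excess in this
model, which is the sense in which associativity compensates for random
allocation.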

The effects of coloring are easiest to see in microbenchmarks.  E.g.:

%%%
                 L M B E N C H  2 . 0   S U M M A R Y
                 ------------------------------------

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
bes4.bde. FreeBSD 5.0-C 1509 757. 223.  514.2  923.0  373.2  372.2 742. 648.2
besplex.b FreeBSD 5.0-C 1531 736. 285.  527.2  922.1  417.9  420.2 741. 781.4
besplex.b Linux 2.4.0-t 962. 657. 731.  533.4  928.5  387.1  388.0 789. 687.6

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host                 OS   Mhz  L1 $   L2 $    Main mem    Guesses
--------- -------------  ---- ----- ------    --------    -------
bes4.bde. FreeBSD 5.0-C  1533 1.958   13.1   98.5
besplex.b FreeBSD 5.0-C  1533 1.957   13.1   98.5
besplex.b Linux 2.4.0-t  1533 1.957   13.1  111.6
%%%

This is on an Athlon XP1600 overclocked by 146/133 with 2*256MB DDR PC2100
memory, running fairly old kernels.  The Linux "Main mem" latency is higher
(worse) entirely because Linux (at least Linux-2.4.0-test.mumble) doesn't
implement page coloring.  bes4 is running plain -current and besplex is
running my version of -current, which has finely tuned page coloring
(actually tuned for a Celeron, not for the Athlon) and extra color bits
corresponding to the bank organization (tuned for both a Celeron and the
Athlon).  The besplex "Mem write" bandwidth is higher entirely because of
the coloring for banks.  This optimization has little effect for reads.  I
don't know why Linux is faster for "Mem read" and faster than bes4 for
"Mem write".  The bcopy bandwidths show the same optimizations as the
read/write bandwidths.  The other bandwidths are determined more by
software than by memory speed or color.  All of the numbers shown above
except the TCP bandwidth have a low variance (something in the TCP
bandwidth benchmark or FreeBSD's handling of it gives a high variance and
often a low performance under FreeBSD).

The bank coloring optimization is worth less than 1% for makeworld although
it is worth 20% here.
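
The percentages can be checked directly against the lmbench summary
above.  The following is only a small arithmetic sketch; the helper name
is made up, and the figures are copied from the table.

```python
def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# "Mem write" bandwidth in MB/s: besplex (bank coloring) vs bes4 (plain).
mem_write = percent_change(781.4, 648.2)

# "Bcopy (libc)" bandwidth in MB/s: besplex vs bes4.
bcopy_libc = percent_change(417.9, 373.2)

# "Main mem" latency in ns: Linux vs either FreeBSD kernel.
main_mem = percent_change(111.6, 98.5)

print(f"Mem write:    {mem_write:+.1f}%")
print(f"Bcopy (libc): {bcopy_libc:+.1f}%")
print(f"Main mem:     {main_mem:+.1f}%")
```

The "Mem write" figure comes out at roughly 20%, matching the number
quoted above.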

Bruce
Received on Thu Jun 17 2004 - 00:45:18 UTC
