Re: HTT on current

From: Garrett Wollman <wollman_at_khavrinen.lcs.mit.edu>
Date: Mon, 25 Aug 2003 13:35:09 -0400 (EDT)
<<On Mon, 25 Aug 2003 14:14:18 -0300, "Daniel C. Sobral" <dcs_at_tcoip.com.br> said:

> There are two problems with HTT. First, L1/L2 cache issues. Second, the 
> virtual CPUs are not independent, and there are many cases where 
> instructions in one virtual CPU stall the other. So take, for example, 
> the case of a userland application on CPU0 stalling the kernel on CPU1.

I don't think that this is quite stated right.  The problem is that
the P4 is not very wide to begin with, and it's very hard to optimize
well for that 23-stage pipeline.[1]  So if you have a thread with lots
of latent ILP (either because you did a good job optimizing it for a
four-way superscalar, or because you did a bad job scheduling it and
are depending on the processor to make up for the naive optimization),
it is bound to run more slowly when some of the functional units it
could have used are taken by another thread of execution.  But some
sorts of applications can benefit, if the application can be
decomposed into threads that exercise different FUs (for example, one
thread that is memory intensive and one thread that is compute
intensive).  The challenge then is to make sure that they always get
scheduled on the same processor at the same time.

The key to getting good performance on an SMT architecture with an
arbitrary instruction mix is more functional units.  The never-built
Alpha EV8, which was to be an eight-way superscalar with four-way SMT
and a wide memory bus, would have made optimum performance much
easier to achieve.
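
To make the functional-unit argument concrete, here is a toy model
(a sketch only, not a description of any real scheduler or of the
actual P4 or EV8 issue logic).  The unit names, the slot counts, and
the proportional-sharing rule are all assumptions chosen for
illustration: each thread asks for some number of issue slots per
cycle on each kind of unit, and an oversubscribed unit is split in
proportion to demand.

```python
def issue_per_cycle(demands, capacity):
    """Ops issued per cycle by each thread sharing one core.

    demands:  one dict per thread, mapping unit kind -> slots wanted per cycle
    capacity: dict mapping unit kind -> slots the core has per cycle
    An oversubscribed unit is shared in proportion to demand (assumption).
    """
    issued = [0.0] * len(demands)
    for unit, slots in capacity.items():
        wanted = [d.get(unit, 0) for d in demands]
        total = sum(wanted)
        # Scale everyone back only when the unit is oversubscribed.
        scale = 1.0 if total <= slots else slots / total
        for i, w in enumerate(wanted):
            issued[i] += w * scale
    return issued


# Hypothetical cores: slot counts are illustrative, not real hardware figures.
NARROW = {"alu": 2, "mem": 1}
WIDE = {"alu": 4, "mem": 2}

# Hypothetical threads: one saturates the ALUs, one only touches memory.
COMPUTE = {"alu": 2}
MEMORY = {"mem": 1}

if __name__ == "__main__":
    print("compute alone, narrow:    ", issue_per_cycle([COMPUTE], NARROW))
    print("compute+compute, narrow:  ", issue_per_cycle([COMPUTE, COMPUTE], NARROW))
    print("compute+memory, narrow:   ", issue_per_cycle([COMPUTE, MEMORY], NARROW))
    print("compute+compute, wide:    ", issue_per_cycle([COMPUTE, COMPUTE], WIDE))
```

In this model two compute-bound threads on the narrow core each drop
from 2 ops/cycle to 1, while a compute-bound and a memory-bound thread
run side by side at full speed; on the wider core even the two
compute-bound threads keep their full 2 ops/cycle.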

-GAWollman

[1] That's why the Athlon gets more instructions per cycle: it has a
much shallower pipeline and more functional units, so it can execute
naively-optimized, ILP-heavy code much faster without stalling.
Received on Mon Aug 25 2003 - 08:35:14 UTC
