Amdahl's Law web toy
Posted: Fri Mar 07, 2008 8:30 pm
by Jonathan
Posted: Sat Mar 08, 2008 8:01 am
by quantus
Perhaps you could shed some light on how the perf of some older cores, say a Pentium 1, Pro, 2, or 3, would compare to a Core 2 or Atom, assuming they were manufactured at the same node? That would let one guess at how the perf equation might be modeled. Also, if you could give the relative area or power (or whatever constraint you want to use) of each processor at a modern process node, that would allow you to calculate the relative BCE cost as well, which you need to do anyway before calculating the perf equation.
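A minimal sketch of the kind of perf equation / BCE accounting being asked for, assuming the usual convention where a core built from r BCEs delivers perf(r) ≈ sqrt(r); the sqrt assumption, the function names, and the example parameters are illustrative, not numbers measured from the parts above:

Code:
# Sketch of a perf-vs-BCE model; the perf(r) = sqrt(r) assumption and the
# example parameters are illustrative, not derived from real measurements.
from math import sqrt

def perf(r):
    # assumed single-thread performance of one core built from r BCEs,
    # normalized so a 1-BCE core has perf 1
    return sqrt(r)

def symmetric_speedup(f, n, r):
    # Amdahl's Law for a budget of n BCEs split into n/r identical cores,
    # where f is the parallelizable fraction of the work
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

# e.g. 90% parallel code on a 64-BCE budget spent as sixteen 4-BCE cores
print(symmetric_speedup(f=0.9, n=64, r=4))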
The idea of a dynamic processor is interesting and the payoff is clear from the graphs. How to actually do it is as opaque as hell though. I guess one way to do it is what I think Nowatzyk was alluding to near the end of the multiprocessor class. I didn't get it at the time, but building a set of processors of differing complexities (and therefore sizes) out of cellular automata would allow you to dynamically reconfigure the entire chip to match the parallelism required of the software at any given time. How to generate a reconfigure command is an interesting problem in itself. Is it the compiler, or is there some heuristic the hardware could use? How do you halt processing enough to be able to reconfigure?
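For what it's worth, under the same assumed perf(r) ≈ sqrt(r) model as the sketch above, the payoff of a dynamic chip is what you get if all n BCEs can be fused into one big core for the serial fraction and split back into n small cores for the parallel fraction; this is a sketch of that comparison, not the actual model behind the graphs:

Code:
# Dynamic vs. symmetric speedup under an assumed perf(r) = sqrt(r) model.
# The dynamic chip is modeled as running serial code on all n BCEs fused
# into one core and parallel code on n one-BCE cores; values illustrative.
from math import sqrt

def perf(r):
    return sqrt(r)

def symmetric_speedup(f, n, r):
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def dynamic_speedup(f, n):
    return 1.0 / ((1 - f) / perf(n) + f / n)

f, n = 0.9, 64
print(symmetric_speedup(f, n, r=4))  # ~12.8
print(dynamic_speedup(f, n))         # ~37.6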
Dynamic recompilation in hardware, a la Transmeta, to find parallelism may be a more realistic way to achieve a dynamic multiprocessor. You always have a lot of small cores and map the load onto them as the need arises. You can tell how parallel the code is by seeing how many cores are idle. If a lot are idle, then you use them to work harder at trying to find parallelism; otherwise, you're already doing well.
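A rough sketch of that idle-core heuristic; the threshold, the core model, and the recompile hook are all hypothetical, just to make the control loop concrete:

Code:
# Hypothetical control loop for the idle-core heuristic described above:
# if many small cores sit idle, spend them on more aggressive attempts to
# extract parallelism; if most are busy, the code is already parallel enough.
# The threshold, Core class, and recompile hook are made up for illustration.
from dataclasses import dataclass

IDLE_THRESHOLD = 0.5  # assumed cutoff: more than half the cores idle

@dataclass
class Core:
    busy: bool

def rebalance(cores, recompile_harder):
    idle = [c for c in cores if not c.busy]
    if len(idle) / len(cores) > IDLE_THRESHOLD:
        # plenty of spare capacity: put the idle cores to work looking for
        # more parallelism (e.g. speculative retranslation of hot regions)
        recompile_harder(idle)

# toy usage: 6 of 8 cores idle, so the hook fires
cores = [Core(busy=(i < 2)) for i in range(8)]
rebalance(cores, lambda idle: print(f"retranslating with {len(idle)} idle cores"))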
Posted: Sun Mar 09, 2008 2:00 am
by Jonathan
quantus wrote:Perhaps you could shed some light on what the perf of some older cores, say a pentium1, pro, 2 or 3 would be compared to a core2 or atom assuming they were manufactured at the same node?
I think what you're asking is how IPC compares between architectures, ignoring frequency. Please clarify if that's not what you mean.
Atom and a Pentium 1 would have about the same IPC on a single thread of integer code. Atom is significantly faster on multiple threads and at floating point. A Core 2 on a single thread has a maximum of 33% more integer throughput than a Pentium 3 (four instructions per clock versus three, so 4/3 ≈ 1.33). The actual average IPC increase would be significantly larger due to the improvements in branch prediction, branch misprediction recovery, the bus, and the cache. Again, the floating point on C2D is way beefier than on the P3.
The tick-tock model aims for a 15% IPC improvement with each new architecture, ignoring platform and frequency changes. It's a decent rule of thumb nowadays.
There are cost issues with any type of asymmetrical or dynamically scaling die. I'm not saying they're insurmountable, but they factor heavily into the decision of which ideas to green light.
It's pleasant to say that the compiler/optimizer will take care of everything, whether dynamically or not. But when you take into account the amount of ILP the hardware already extracts by itself, finding a significant amount of additional parallelism is a difficult problem.