A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors
Rakesh Kumar
Keith Farkas (HP Labs), Norman Jouppi (HP Labs), Partha Ranganathan (HP Labs), Dean Tullsen (UCSD)
Motivation
Power is an important issue for processors
Going up every successive generation (with complexity)
-Up to 150W for Alpha 21464!
Past Techniques for Power Reduction
Voltage/frequency scaling
-Limitation: limited by technology; also not possible below a certain feature size
Architectural adaptation
-Shut off portions of the core when not needed
-Dynamic speculation control
-Reconfigurable caches
Limitations:
-Very few choices to make
-Only dynamic power is saved
-Has associated overhead
Our Proposal
Have multiple heterogeneous cores on the same die
Match workload (or workload phase) to core that achieves best efficiency according to some objective function
(Ensure that the new core has acceptable performance)
Power down the unused cores
Motivation Hypotheses
Performance difference between cores varies based on workload or workload phases
Different cores have varying relative energy efficiencies for the same workload
Implication: possibility of dynamically changing “best” core
Goals of the Paper
Validate the hypotheses
Get an idea of the design space
Get an idea of the potential benefits
Outline of Talk
Motivation
Past Work
Our Work
-Assumptions
-Decisions
-Methodology
Results and Conclusions
Summary and Future Work
Choice of Cores on the Die
Five cores on the die:
-In-order: QED R4700, EV4 (Alpha 21064), EV5 (Alpha 21164)
-Out-of-order: EV6 (Alpha 21264), “EV8-”
All cores assumed to be without L2 cache.
“EV8-”: issue width is the same as EV8 (Alpha 21464)
-Resources reduced to account for a single thread
-Core power dissipation: 100W
Properties of the Cores
Processor      R4700         EV4          EV5          EV6              EV8-
Issue width    1             2            4            6 (OOO)          8 (OOO)
I-Cache        2-way, 16KB   DM, 8KB      DM, 8KB      2-way, 64KB      4-way, 64KB
D-Cache        2-way, 16KB   DM, 8KB      DM, 8KB      2-way, 64KB      4-way, 64KB
Branch pred.   None          2KB/1-bit    2K-gshare    Hybrid 2-level   Hybrid 2-level
MSHRs          1             2            4            8                16
Notice the gradation!
Properties of Cores (contd.)
Assume all cores implemented in 0.1um
-Scaled area and power accordingly
Clock Speed?
-All Alpha cores assumed to run at 2.1GHz (EV6 frequency at 0.10 micron)
-R4700 assumed to run at 1GHz
Core Power and Area
Peak power of core estimated from data sheets
-minus that used by L2 caches and pins
-then scaled for 0.1um process
Area of core estimated from die photos
-minus that of I/O pads, wires, L2 cache & control
-then scaled for 0.1um process
L2 cache area and power
-estimated using CACTI
Core Power and Area (contd.)
Processor   Core power (W)   Core area (mm^2)
R4700       0.45             3
EV4         4.97             3
EV5         9.83             5
EV6         17.80            24
EV8-        92.88            260
EV8- consumes more than 200 times the power of R4700! It is also more than 85 times bigger!
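The two ratios quoted above follow directly from the Core Power and Area table. A minimal sketch of the arithmetic, using only the table values (the dictionary layout and the helper name `ratio` are illustrative, not from the talk):

```python
# Core power (W) and core area (mm^2), copied from the Core Power and Area table.
CORES = {
    "R4700": {"power_w": 0.45, "area_mm2": 3},
    "EV4": {"power_w": 4.97, "area_mm2": 3},
    "EV5": {"power_w": 9.83, "area_mm2": 5},
    "EV6": {"power_w": 17.80, "area_mm2": 24},
    "EV8-": {"power_w": 92.88, "area_mm2": 260},
}


def ratio(metric: str, big: str = "EV8-", small: str = "R4700") -> float:
    """Return how many times larger `metric` is on `big` than on `small`."""
    return CORES[big][metric] / CORES[small][metric]


power_ratio = ratio("power_w")   # 92.88 / 0.45
area_ratio = ratio("area_mm2")   # 260 / 3

print(f"EV8- vs R4700 power: {power_ratio:.1f}x")  # 206.4x
print(f"EV8- vs R4700 area:  {area_ratio:.1f}x")   # 86.7x
```

The same helper can compare any other pair, for example `ratio("power_w", big="EV6", small="EV5")`.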
Methodology
Simulator used: SMTSIM
ROB size, active-list size, and load-store queue size always kept big enough to ensure no conflicts.
Benchmarks used: 14 chosen randomly out of SPEC2000 suite
Fast-forwarded for 2 billion instructions, simulated for 1 billion instructions.
Data collected after every 1 million instructions.
Validating Hypotheses
Performance difference between cores varies based on workload or workload phases (IPS)
Different cores have varying relative energy efficiencies for the same workload (IPS/W)
Performance Variation with Time
[Figure: IPS versus committed instructions (in millions), one curve each for EV8-, EV6, EV5, EV4, and R4700]
Ah! Those clear, distinct phases!
Variation of Energy Efficiency with Time
[Figure: IPS/W versus committed instructions (in millions), one curve each for R4700, EV4, EV5, EV6, and EV8-]
Power dominates IPS/W numbers!
Energy-delay Product Profile
[Figure: IPS^2/W versus committed instructions (in millions), one curve each for R4700, EV4, EV5, EV6, and EV8-]
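The plotted quantities are inverses of the usual energy metrics, so higher is better. A short derivation, where the symbols P (average power in W), E (energy per instruction), and D (time per instruction) are introduced here for illustration and do not appear on the slides:

```latex
\begin{align*}
E &= \frac{P}{\mathrm{IPS}} && \text{(energy per instruction, in J)} \\
D &= \frac{1}{\mathrm{IPS}} && \text{(time per instruction, in s)} \\
\frac{\mathrm{IPS}}{P} &= \frac{1}{E} && \text{(inverse of energy)} \\
E \cdot D &= \frac{P}{\mathrm{IPS}^2}
  \quad\Longrightarrow\quad \frac{\mathrm{IPS}^2}{P} = \frac{1}{E \cdot D} \\
E \cdot D^2 &= \frac{P}{\mathrm{IPS}^3}
  \quad\Longrightarrow\quad \frac{\mathrm{IPS}^3}{P} = \frac{1}{E \cdot D^2}
\end{align*}
```

Maximizing IPS^2/W is therefore the same as minimizing the energy-delay product, and the corresponding figure of merit for energy-delay-squared would be IPS^3/W.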
Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss<50%)
[Figure: IPS^2/W versus committed instructions (in millions), one curve each for R4700, EV4, EV5, EV6, and EV8-, plus the Best-path curve]
Notice the regions where best-path is not along the best energy-delay product!
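One way to read the best-path selection is as a per-interval choice: among the cores that satisfy the performance bound, take the one with the highest IPS^2/W. The sketch below is a guess at that policy, not the authors' implementation. It assumes the bound is measured against the fastest core in the same interval, it uses the peak core power from the table as a stand-in for per-interval power, and the IPS numbers in the usage example are invented for illustration.

```python
from typing import Dict, List

# Peak core power in W from the Core Power and Area table
# (assumption: used here in place of measured per-interval power).
CORE_POWER_W = {"R4700": 0.45, "EV4": 4.97, "EV5": 9.83, "EV6": 17.80, "EV8-": 92.88}


def best_core(ips: Dict[str, float], max_perf_loss: float = 0.5) -> str:
    """Pick the core with the highest IPS^2/W among cores within the performance bound.

    `ips` maps core name to its throughput in one sampling interval.
    A core is eligible if its IPS is at least (1 - max_perf_loss) times
    the IPS of the fastest core in that interval (assumed definition).
    """
    floor = (1.0 - max_perf_loss) * max(ips.values())
    eligible = [core for core in ips if ips[core] >= floor]
    return max(eligible, key=lambda core: ips[core] ** 2 / CORE_POWER_W[core])


def best_path(trace: List[Dict[str, float]], max_perf_loss: float = 0.5) -> List[str]:
    """Apply best_core independently to every interval of a trace."""
    return [best_core(interval, max_perf_loss) for interval in trace]


def count_switches(path: List[str]) -> int:
    """Count how often the chosen core changes between consecutive intervals."""
    return sum(1 for prev, cur in zip(path, path[1:]) if prev != cur)


# Illustrative (made-up) IPS values for two kinds of phase.
phase_a = {"R4700": 0.3, "EV4": 0.5, "EV5": 1.0, "EV6": 1.2, "EV8-": 1.6}
phase_b = {"R4700": 0.1, "EV4": 0.3, "EV5": 0.4, "EV6": 1.4, "EV8-": 1.5}

path = best_path([phase_a, phase_b, phase_a])
print(path, count_switches(path))
```

In `phase_a` the R4700 has the best IPS^2/W of all five cores, but it falls below the performance floor, so a wider core is chosen instead. That is the kind of region where the best path departs from the best energy-delay curve. Because each interval is decided on its own, this greedy choice is only locally best.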
Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss<50%) [Summary of Results]
           Energy-Delay Savings (%)   Performance Degradation (%)
Maximum    97.9                       8.5
Minimum    0.1                        0.1
Mean       65.4                       18.2
Number of switchings: Maximum = 387 (art), Minimum = 0, Median = 1
Dissecting the Results
More improvements possible:
-locally-best decisions not necessarily globally-best
-there was a performance constraint
-choice of cores not the best for this objective function
-cache configurations not necessarily the best
Even with the present improvements, beats voltage scaling handsomely (44.2% ED^2 improvement)
Conclusion
Enormous potential for power savings
No leakage-power solution
Does considerable IP reuse
Complexity-appropriate: every application is matched to the core of “appropriate” complexity
Tip of the iceberg?
Current/Future Work
Cores can be non-ordered
Some cores can be multithreaded
Throughput impact of the architecture