single-isa heterogeneous multi-core architecture zvika guz [email protected] november, 2004
TRANSCRIPT
Single-ISA Heterogeneous Multi-Core Architecture
Zvika GuzZvika Guz
[email protected]@tx.technion.ac.il
November, 2004November, 2004
2
Outline Motivation
Heterogeneous multi-core architecture
ToDo list and open questions Different Objective functions
SMT as building blocks
Phase detection
Summary
3
References
“Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload Performance”Rakesh Kumar, Dean M. Tullsen, Parthasarath Ranganathan, Norman P.Jouppi, Keith I. Farkas In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04),June, 2004
“Single-ISA Heterogeneous Multi-Core Architecture: The Potential for Processor Power Reduction”Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the 36st International Symposium on Microarchitecure, December 2003
“A Multi-Core Approach to Addressing the Energy-Complexity Problem In Microprocessor” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003
“Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architecture” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, Computer Architecture Letters, Volume 2, Apr. 2003
Heterogeneous multi-core architectureHeterogeneous multi-core architecture
4
Diminishing performance return per chip area
The infamous power/performance ratio
The power wall
Chip area is bounded
Tomer’s assumptions:
perf area
VLSI sad facts of life:VLSI sad facts of life:
power area
Processor’s characteristics
5
Processor’s characteristics Few different generations of Alpha’s cores
All scaled to 0.10 micron
EV6+
6
ProcessorEV4EV5EV6EV8-
Issue-width246 (OOO)8 (OOO)
I-Cache8 KB, DM64 KB, 2-way64 KB, 2-way64 KB, 4-way
D-Cache8 KB, DM64 KB, 2-way64 KB, 2-way64 KB, 4-way
Branch Pred.2K gsharehybrid 2 levelhybrid 2 levelhybrid 2 level (2X EV6 size)
Threads1111
Area (mm2)2.875.0624.5236
Peak-power (Watt)
4.979.8317.8092.88
Typical Power (Watt)
3.736.8810.6846.44
Processor’s characteristics Few different generations of Alpha’s cores
All scaled to 0.10 micron
4.8x
1.5x
7
Processor’s characteristics
⇒ large number of small processors is better than small number of large processors
Processor are expected to supply competing objectives: High throughput for multi-thread environments
Good single thread performance
But what if TLP isn’t large enough ?
8
workloads characteristics
Different amount of ILP
Different TLP Among different applications
Among different workloads
Vary with time
Legacy code
Many applications under-utilize the hardware Suffer little performance loss when run on a less aggressive processor
Great diversity among different applicationsGreat diversity among different applications
9
workloads characteristicsWildly different intra-thread behaviorWildly different intra-thread behavior
Programs fall into phases, each phase presents different behavior Variation in resources demands
Memory/computation bound
Branch mispredictions
Cache misses
During many phases the processor is under-utilized
10
workloads characteristicsWildly different intra-thread behaviorWildly different intra-thread behavior
gzip
11
workloads characteristicsWildly different intra-thread behaviorWildly different intra-thread behavior
gcc
12
Main IdeaMain Idea A multiprocessor composed of asymmetric cores
Better area-efficient coverage of the different workloads demands:
Single thread performance (legacy code)
Elevated throughput for high TLP
Single-ISA heterogeneous Multi-Core
13
Main IdeaMain Idea A multiprocessor composed of asymmetric cores
Better area-efficient coverage of the different workloads demands:
Single thread performance (legacy code)
Elevated throughput for high TLP
Use a smart dynamic task-to-core assignment Assign each application to the core best suite to meet its performance demands
Exploit the variations in resource demands between different application's phases
Single-ISA heterogeneous Multi-Core
14
Main IdeaMain Idea A multiprocessor composed of asymmetric cores
Better area-efficient coverage of the different workloads demands:
Single thread performance (legacy code)
Elevated throughput for high TLP
Use a smart dynamic task-to-core assignment Assign each application to the core best suite to meet its performance demands
Exploit the variations in resource demands between different application's phases
Use of-the-shelf cores Amortize design and verification effort
Single-ISA heterogeneous Multi-Core
16
Architecture Model 3 different multi-core systems were compared:
4 EV6 cores (homogeneous MP)
20 EV5 cores (homogeneous MP)
3 EV6 and 5 EV5 cores (heterogeneous MP)
Each core has its own L1 caches
All cores share an on chip 4 MB L2 cache
Chip area of all 3 configurations is roughly the same Using the correct power model, so is the total power…
17
Scheduling issues OS scheduler is responsible for thread scheduling and assignment
Core-switch at OS timeslice intervals. (10-100msec)
The core-switch overhead is piggybacked with OS context switch
Application phase length are typically large, hence suite this timeslices
18
Scheduling issues OS scheduler is responsible for thread scheduling and assignment
Core-switch at OS timeslice intervals. (10-100msec)
The core-switch overhead is piggybacked with OS context switch
Application phase length are typically large, hence suite this timeslices
Sampling-based : During the Sampling phase
Thread migrate between different cores
Statistics is gathered for every allocation
During the Steady Phase:
The most beneficial allocation is used
19
Evaluation Metric: weighted speedup
Maximizing average performance gain over all applications
The jobs assigned to the EV5 are those that are least affected by its inferior capabilities
( _ )( _ )
i
i
IPC current coreIPC baseline core
all runningi threads
ws
Scheduling issues
Objective functionObjective function
20
Scheduling issues
sample-one: run each thread on each core once
sample-avg: run each thread on each core at least twice
sample-sched: constrained to choose only assignment that were actually sampled
Sampling StrategySampling Strategy
23
Scheduling issues
individual-event: whenever a thread’s IPC changes by more than 50%
global-event: whenever the total change in IPC for all threads exceeds 100%
bounded-global-event: the same as the global-event with minimum and maximum thresholds
Trigger MechanismTrigger Mechanism
25
Priorities among different threads Heterogeneous architecture ideally suite these task
Exploring different objective functionsExploring different objective functions
( _ )( _ )
i
i
IPC current corei IPC baseline core
all runningi threads
ws w
Todo list (open questions)
26
Priorities among different threads. Heterogeneous architecture ideally suite these task.
Minimize energy consumption
Use performance threshold
Exploring different objective functionsExploring different objective functions
( _ )( _ )
i
i
IPC current corei IPC baseline core
all runningi threads
ws w
Todo list (open questions)
*
( _ )
( _ )i
i
energy per instr baseline core
enrgy per instr current coreall runningi threads
ws
27
Priorities among different threads. Heterogeneous architecture ideally suite these task.
Minimize energy consumption.
Use performance threshold
Minimize the energy-delay product
Exploring different objective functionsExploring different objective functions
( _ )( _ )
i
i
IPC current corei IPC baseline core
all runningi threads
ws w
Todo list (open questions)
*
( _ )
( _ )i
i
energy per instr baseline core
enrgy per instr current coreall runningi threads
ws
2
2
_
_
i
i
i
i
IPScurrent core
Watt
IPSbaseline coreall runningi Wattthreads
ws
28
Energy-delay product during Energy-delay product during appluapplu life-time life-time
0
0.04
0.08
0.12
0.16
0.2
1 201 401 601 801
Committed instructions(in millions)
IPS
^2
/W
R4700EV4EV5EV6EV8-
1
enrgy delay
Todo list (open questions)
29
Great potential for energy saving: Pervious work, considering only one thread at a time, achieved more
than 30% of energy saving
Idle cores can be shut down
Objective function may change on the fly according to changing power conditions
Todo list (open questions)Exploring different objective functionsExploring different objective functions
30
Using SMT Cores
Enlarge flexibility and throughput with only modest area and power penalty
MotivationMotivation
No free lunches:No free lunches:
Interaction between threads can no longer be ignored Thread compete for virtually all processor resources
Only sampled assignments can be used
Permutations space of potential assignments is huge Can not sample all the assignments
The sampling space must be pruned
The sampling strategy is much more important
32
References
“Symbiotic Jobscheduling for a Simultaneous Multithreading Processor”Allan Snavely, Dean M. Tullsen, In the Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX),Novemeber, 2000
“Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”Allan Snavely, Dean M. Tullsen, In proceedisng of the 9th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June, 2002
SOS (Sample, Optimize, Symbios)SOS (Sample, Optimize, Symbios)
33
SOS - (Sample, Optimize, Symbios) A first attempt to take into account threads interaction in SMT
symbiosis – the effectiveness with which multiple jobs achieve speedup when coexcecuted on multithreaded machines Throughput may actually go down
Plundered from Uri’s slides.
34
SOS - (Sample, Optimize, Symbios) Using sampling phases to profile execution
Choose combination that maximize overall weighted speedup
Which predictor to use ?
Extremely architecture dependent Encapsulation of hardware details from software
IPC and Dcache are inconsistent performers
35
The moral of the SMT case
When considering more ‘clustered’ architectures, groups of cores may share resources: L1 caches
Memory hierarchy
FPU , TLB ?
SMT and CMP are just two extremes of a viable spectrum
Threads interaction significantly complicates our life
36
The moral of the SMT case
A scheduler has 2 tasks: Define a running set – jobs to be executed during the upcoming
timeslice
Assign jobs from the running set to the different cores
We have overlooked the first task and simplified the second
We’ll have to tackle both
If only memory is to be shared our life may be easier Not by that much, though
Threads interaction significantly complicates our life
37
References
“Phase Tracking and Prediciton” Timothy Sherwood, Suleyman Sair, Brad Calder, Proceedings of the 30th annual international symposium on Computer architecture, IEEE CS Press, 2003, pp.336-349
“Discovering and Exploiting Program Phases” Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder, IEEE Micro : Micro's Top Picks from Computer Architecture Conferences, Nov./Dec. 2003
Phase trackingPhase tracking
38
It is Profitable to accurately identify program phases React quickly to phase changes
Spare samplings overheads
IPC is not necessarily the most appropriate representative
Smart phase detectionSmart phase detection
Todo list (open questions)
39
Phases are a direct function of the way program traverse its code during execution Use basic blocks ratios to identify phases
Architectural independent
Basic Block Vector One dimension array with an index for every basic block in the
program
Each element represent the execution frequencies of the basic blocks weighted by instruction count, normalized
Phase TrackingMain IdeaMain Idea
40
Phase TrackingMain IdeaMain Idea
Basic Block Vector One dimension array with an index for every basic block in the
program
Each element represent the execution frequencies of the basic blocks weighted by instruction count, normalized
42
Work’s Innovation Using heterogeneous muti-cores to gain superior performance
Previous works targeted only power consumption
First real simulation results
General-purpose processors Previous works concentrated on SoC with known workloads
Dynamic task scheduling and task-to-core assignment Most of the works use static scheduling and a full knowledge of the
application characteristic
43
Summary Heterogeneous multi-core architecture can
provide significantly higher performance Covers a wide spectrum of workloads Dynamic core assignments policy exploit
intra-thread and inter-thread diversity Open issues:
Optimize energy consumption Thread interactions Phase detection A lot more…
Any questions ?
44
References of the day “Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload
Performance”Rakesh Kumar, Dean M. Tullsen, Parthasarath Ranganathan, Norman P.Jouppi, Keith I. Farkas In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04), June, 2004
“Single-ISA Heterogeneous Multi-Core Architecture: The Potential for Processor Power Reduction”Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the 36st International Symposium on Microarchitecure, December 2003
“A Multi-Core Approach to Addressing the Energy-Complexity Problem In Microprocessor” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003
“Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architecture” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, Computer Architecture Letters, Volume 2, Apr. 2003
“Phase Tracking and Prediciton” Timothy Sherwood, Suleyman Sair, Brad Calder, Proceedings of the 30th annual international symposium on Computer architecture, IEEE CS Press, 2003,
“Discovering and Exploiting Program Phases” Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder, IEEE Micro : Micro's Top Picks from Computer Architecture Conferences, Nov./Dec. 2003
“Symbiotic Jobscheduling for a Simultaneous Multithreading Processor”Allan Snavely, Dean M. Tullsen, In the Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), Novemeber, 2000
45
References of the day “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”
Allan Snavely, Dean M. Tullsen, In proceedisng of the 9th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June, 2002
“Conjoind-core Chip Multiprocessing”Rakesh Kumar, Norman P.Jouppi, Dean M. Tullsen, In Proceedings of the 37st International Symposium on Microarchitecure, December 2004
47
Sample 2n configuration for n threads workload.
Pruning strategies: pref-EV6 : assumes it is best to run 2 thread on each EV6 before using
EV5
pref-EV5 : assumes it’s best to run on EV5 rather than put 2 threads on EV6
pref-nigther : sample random schedule
pref-similar: sampling is biased toward a configuration similar to the current one used.
Sampling strategiesSampling strategies
Using Multithreaded Cores