Mixed Speculative Multithreaded Execution Models
Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester - 2010
Context and Motivation
Multi-cores are here to stay and many-cores are coming
Excellent performance for embarrassingly parallel or throughput workloads. Otherwise, ...
[Figure: Core Duo (Yonah, 2006); Core i7 (Lynnfield, 2010); SCC (2015?)]
Context and Motivation
Future many-cores will have many idle cores too often
– Not enough applications
– Not enough benefit from using more “explicit user threads”
Our proposal: use spare cores to accelerate whatever threads are available
– Create implicit threads to run in parallel with main explicit user threads
– Accelerate user threads through increased coarse-grain overlap (i.e., TLP) or increased fine-grain overlap (i.e., ILP)
– Combine previously proposed speculative multithreading techniques: thread-level speculation (TLS), helper threads (HT), run-ahead execution (RA), and multi-path execution (MP)
Context and Motivation
Why is combining SM schemes a good idea?
– No speculative multithreading scheme alone is good enough
– Hardware support for all schemes is very similar
Expected end-result?
– Better performance
– More “effective” many-core experience: with 4-way speculative multithreading (i.e., 4 implicit SM threads for each explicit user thread) a 64 core “unwieldy” many-core is as “easy to handle” as a 16 core system
How about power efficiency?
– Speculation can be made less inefficient (we’re working on it)
– Power can be smartly allocated (see our IPDPS’10 paper)
Contributions
Introduce mixed Speculative Multithreading (SM) Execution Models
Design and evaluate two combinations: TLS+HT+RA [ICS’09] and TLS+MP [HPCA’10]
Propose a performance model able to quantify ILP and TLP benefits
Combined approaches outperform state-of-the-art SM models:
– TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
– TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)
Outline
Introduction
Speculative multithreading models
Combined TLS+HT+RA scheme
Combined TLS+MP scheme
Performance model
Experimental setup and results
Conclusions
Speculative Multithreading
Basic Idea: Use idle cores/contexts to speculate on future application needs
– TLS: speculatively execute parallel threads
– HT/RA: speculatively perform future memory operations
– MP: speculatively execute along multiple branch targets
Speculative threads supported in hardware
Compiler support not essential, but can be useful
Hardware infrastructure is very similar
Thread Level Speculation
Compiler deals with:
– Task selection
– Code generation
HW deals with:
– Different context
– Spawn threads
– Detecting violations
– Replaying
– Arbitrate commit
Benefit: TLP/ILP
– TLP (Overlapped Execution) + ILP (Prefetching)
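The hardware duties listed above (detecting violations, replaying, arbitrating commit) can be illustrated with a toy software model. This is a hedged sketch, not the hardware from the talk: `TLSTask`, `run_tls`, and the per-task read/write address sets are hypothetical names introduced here. For simplicity it assumes every task ran speculatively before any older task committed, and it checks for violations at commit time, whereas real hardware would typically check as stores happen.

```python
from dataclasses import dataclass, field


@dataclass
class TLSTask:
    """One speculative task; reads/writes are sets of memory addresses."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    squashes: int = 0


def run_tls(tasks):
    """Commit tasks in program order and count replays caused by violations.

    Assumption: each task read its inputs before older tasks committed,
    so a read of an address written by an older task saw a stale value.
    """
    commit_order = []
    for i, task in enumerate(tasks):
        # Commit is arbitrated in program order: the oldest task goes first.
        commit_order.append(task.name)
        for later in tasks[i + 1:]:
            # Violation: a more speculative task consumed data this task produces.
            if task.writes & later.reads:
                later.squashes += 1  # squash and replay the later task
    return commit_order


tasks = [
    TLSTask("T0", reads={"a"}, writes={"x"}),
    TLSTask("T1", reads={"x"}, writes={"y"}),  # depends on T0's write to x
    TLSTask("T2", reads={"b"}, writes={"z"}),  # independent of T0 and T1
]
order = run_tls(tasks)
```

In this example only T1 is replayed, because it is the only task that reads an address written by an older task; T0 and T2 overlap freely, which is where the TLP benefit comes from.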
Helper Threads
Compiler deals with:
– Memory ops miss/hard-to-predict branches
– Backward slices
HW deals with:
– Spawn threads
– Different context
– Discard when finished
Benefit:
– ILP (Prefetch/Warmup)
RunAhead Execution
Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do RA
– VP Memory
– Commit/Discard
Benefit:
– ILP (Prefetch/Warmup)
MultiPath Execution
Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do MP
– Discard wrong path
Benefit:
– ILP (Branch Pred.)
[Figure: main thread over time, with correct paths and wrong paths; the annotation marks the branch misprediction cost]
Combining TLS, HT and RA
Start with TLS
Provide support to clone TLS threads and convert them to HT
Conversion to HT means:
– Put them in RA mode
– Suppress squashes and do not cause additional squashes
– Discard them when they finish
No compiler slicing: purely HW approach
Intricacies to be Handled
HT may not prefetch effectively!
Dealing with contention
– HT threads are much faster and can saturate BW
Dealing with thread ordering
– TLS imposes total thread order
– HT killed squashes TLS threads
Creating and Terminating HT
Create a HT on an L2 miss that we can value-predict (VP)
– Use mem. address based confidence estimator
– VP only if confident
Create a HT if we have a free processor
Only allow most speculative thread to clone
– Seamless integration of HT with TLS
– BUT: if parent is no longer the most spec. TLS thread, the HT has to be killed
Additionally kill HT when:
– Parent/HT thread finishes
– HT causes exception
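The creation and termination rules above can be written down as two predicates plus a small address-indexed last-value predictor with a confidence counter. This is a minimal sketch under stated assumptions: the class and function names, the 2-bit saturating counter, and the threshold of 2 are all hypothetical choices made for illustration, not details taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class LastValueEntry:
    value: int = 0
    confidence: int = 0  # saturating counter (assumed width: 2 bits)


class LastValuePredictor:
    """Last-value predictor indexed by memory address (illustrative only)."""

    def __init__(self, threshold=2, max_conf=3):
        self.table = {}
        self.threshold = threshold
        self.max_conf = max_conf

    def predict(self, addr):
        """Return the predicted value, or None when confidence is too low."""
        entry = self.table.get(addr)
        if entry is not None and entry.confidence >= self.threshold:
            return entry.value
        return None

    def train(self, addr, actual):
        """Raise confidence when the value repeats, reset it when it changes."""
        entry = self.table.setdefault(addr, LastValueEntry(value=actual))
        if entry.value == actual:
            entry.confidence = min(entry.confidence + 1, self.max_conf)
        else:
            entry.value, entry.confidence = actual, 0


def should_create_ht(l2_miss, addr, predictor, free_processor, is_most_speculative):
    """Clone into a helper thread only on a confidently predictable L2 miss."""
    return (l2_miss and free_processor and is_most_speculative
            and predictor.predict(addr) is not None)


def should_kill_ht(parent_is_most_speculative, parent_finished,
                   ht_finished, ht_exception):
    """Kill conditions listed on the slide, combined with a logical OR."""
    return (not parent_is_most_speculative or parent_finished
            or ht_finished or ht_exception)


predictor = LastValuePredictor()
predictor.train(0x40, 7)
predictor.train(0x40, 7)  # second matching value reaches the assumed threshold
can_clone = should_create_ht(True, 0x40, predictor, True, True)
```

Keeping the decision in pure predicates makes the policy easy to vary, for example to try a stricter confidence threshold when the helper threads start saturating bandwidth.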
Mixed Execution Model
When there are idle resources:
– Try MP on top of TLS!!
Map TLS threads on empty cores
Map MP threads on empty contexts (same core)
Minimal extra HW:
– Branch confidence estimator
– MP bit: thread is in MP mode
– PATHS: how many outstanding branches
– DIR: which path the thread followed
Combined TLS/MP Model
[Figure sequence: Thread 1 and Thread 2 drawn against a time axis, with a second axis labelled "Speculative"]
1. Low confidence branch in Thread 1
– Thread 1: MP: 0, PATHS: 000, DIR: 000
2. Multi-path mode: Thread 1 continues as Thread 1a and Thread 1b
– Thread 1a: MP: 1, PATHS: 001, DIR: 000
– Thread 1b: MP: 1, PATHS: 001, DIR: 001
3. Branch resolved
– Thread 1a: MP: 1, PATHS: 001, DIR: 000
– Thread 1b: MP: 0, PATHS: 000, DIR: 000
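The MP, PATHS and DIR values shown in the figure sequence can be reproduced with a small state model. This is a sketch under explicit assumptions that the slides do not confirm: PATHS is treated as a count of outstanding low-confidence branches, DIR as a bit vector with one direction bit per outstanding branch, and the thread whose direction bit disagrees with the resolved outcome is simply discarded. The names `MPState`, `fork_on_low_confidence` and `resolve_oldest_branch` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MPState:
    mp: int = 0         # MP bit: 1 while the thread is in multi-path mode
    paths: int = 0      # PATHS: number of outstanding low-confidence branches
    direction: int = 0  # DIR: bit k is the direction taken at outstanding branch k


def fork_on_low_confidence(state):
    """Split a thread at a low-confidence branch into its two paths."""
    k = state.paths  # bit position assigned to the new branch
    not_taken = replace(state, mp=1, paths=k + 1)
    taken = replace(state, mp=1, paths=k + 1,
                    direction=state.direction | (1 << k))
    return not_taken, taken


def resolve_oldest_branch(state, outcome):
    """Return the surviving state, or None if this thread was on the wrong path."""
    if (state.direction & 1) != outcome:
        return None  # wrong path: discard the thread
    paths = state.paths - 1
    return MPState(mp=1 if paths else 0, paths=paths,
                   direction=state.direction >> 1)


thread_1 = MPState()                                  # MP: 0, PATHS: 000, DIR: 000
thread_1a, thread_1b = fork_on_low_confidence(thread_1)
survivor = resolve_oldest_branch(thread_1b, outcome=1)
```

With the branch resolving in Thread 1b's direction, the survivor returns to MP: 0, PATHS: 000, DIR: 000, which matches the values shown for Thread 1b in the last step of the figure.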
Intricacies to be Handled
How do we map TLS/MP threads?
– Different mapping policies for TLS threads
Dealing with thread ordering
– Correct data forwarding
Dealing with violations
– While in “MP-Mode” delay restarts/kills/commits
– No squashes on the wrong path
Thread spawning:
– Delayed as well, to keep contention low
Understanding Performance Benefits
Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify ILP and TLP contributions to bottom-line performance
Proposed model:
– Able to break benefits into ILP/TLP contributions
Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup: $S_{all} = T_{seq} / T_{mt}$
2. Compute sequential TLS speedup: $S_{seq} = T_{seq} / T_{1p}$
3. Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$
4. Use everything to compute TLP: $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
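The four steps of the performance model can be checked numerically with a few lines of code. This is a hedged sketch: the slides do not define T1, T2, T1' and T2', so the code assumes they are per-thread execution times of the TLS code on one processor and in the multithreaded run, respectively. The function name, parameter names and the timing values are made up for illustration and are not measurements from the talk.

```python
def decompose_speedup(t_seq, t_1p, t_mt, thread_times_1p, thread_times_mt):
    """Split the overall speedup into sequential, ILP and overlap (TLP) factors.

    t_seq: time of the original sequential code
    t_1p:  time of the TLS code on one processor
    t_mt:  time of the multithreaded (speculative) run
    thread_times_1p / thread_times_mt: assumed per-thread times (T1, T2 and T1', T2')
    """
    s_all = t_seq / t_mt                                  # step 1
    s_seq = t_seq / t_1p                                  # step 2
    s_ilp = sum(thread_times_1p) / sum(thread_times_mt)   # step 3
    s_ovl = s_all / (s_seq * s_ilp)                       # step 4
    return {"all": s_all, "seq": s_seq, "ilp": s_ilp, "ovl": s_ovl}


# Hypothetical timings in arbitrary units.
factors = decompose_speedup(
    t_seq=100.0, t_1p=110.0, t_mt=60.0,
    thread_times_1p=[60.0, 50.0],
    thread_times_mt=[50.0, 40.0],
)
```

In this made-up case the TLS code is slower than the sequential code on one processor (a sequential factor below 1), each thread runs faster in the multithreaded run (an ILP factor above 1), and the remaining gain is attributed to overlapped execution.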
Experimental Setup
Simulator, Compiler and Benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al. PPoPP ’06)
– Spec 2000 Int.
Architecture: (for TLS+HT+RA scheme)
– Four way CMP, 4-Issue cores
– 16KB L1 Data (multi-versioned) and Instruction Caches
– 1MB unified L2 Caches
– Inst. window/ROB: 80/104 entries
– 16KB Last Value Predictor
Experimental Setup
Simulator, Compiler and Benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al. PPoPP ’06)
– Spec 2000 Int.
Architecture: (for TLS+MP scheme)
– Four way CMP, 4-Issue cores, 6 contexts / core
– 32K-bit OGEHL, 1KByte BTB, 32-Entry RAS
– 8 Kbit enhanced JRS confidence estimator
– 32KB L1 Data (multi-versioned) and Instruction Caches
– 1MB unified L2 Caches
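The TLS+MP setup above relies on a JRS confidence estimator to decide which branches are worth multi-path execution. The sketch below shows the basic JRS idea only: a table of resetting counters indexed by the branch address XORed with the global history. It is not the "enhanced" variant used in the talk, and the class name, the 4-bit counter width and the threshold of 15 are assumptions chosen so that 2048 entries add up to 8 Kbit.

```python
class ResettingCounterEstimator:
    """JRS-style confidence estimator built from resetting counters."""

    def __init__(self, entries=2048, counter_bits=4, threshold=15):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc, history):
        # XOR the branch address with the global history, then fold into the table.
        return (pc ^ history) % self.entries

    def is_high_confidence(self, pc, history):
        """High confidence after a long enough run of correct predictions."""
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0  # reset on a misprediction


estimator = ResettingCounterEstimator()
pc, history = 0x400123, 0b1011
for _ in range(15):
    estimator.update(pc, history, prediction_correct=True)
confident_before = estimator.is_high_confidence(pc, history)
estimator.update(pc, history, prediction_correct=False)
confident_after = estimator.is_high_confidence(pc, history)
```

A branch is flagged low confidence again immediately after a single misprediction, which is the property that makes this kind of estimator a natural trigger for forking both paths.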
Results I
TLS + HT + RA
Comparing TLS, RunAhead and Unified Scheme
Almost additive benefits
10.2% over TLS, 18.3% over RA
Understanding the extra ILP
Improvements of ILP come from:
– Mainly memory
– Branch prediction (improvement 0.5%)
Focus on memory:
– Miss rate on committed path
– Clustering of misses (different cost)
Normalized Shared Cache Misses
All schemes better than sequential
Unified 41% better than sequential
Isolated vs. Clustered Misses
Both TLS + RA: large window machines
Unified does even better
Results II
TLS + MP
Impact of Branch Prediction on TLS
TLS emulates wider processor:
– Removing mispredictions important (Amdahl)
Branch Entropy for TLS
Much harder for TLS:
– History partitioning
– History re-order
Increasing the Size of the Branch Predictor
Aliasing not much of a problem
Fundamental limitation is lack of history
Designing a Better Predictor
Predictors that exploit longer histories are not necessarily better
Comparing TLS, MP and Combined TLS/MP
Additive benefits; no point in doubling the predictor
9.2% over TLS, 28.2% over MP
Pipeline Flushes
Significant amount of flush reductions
More than base MP!
Also in the ICS’09 paper …
Dealing with the load of the system
Converting TLS threads to HT
Multiple HT
Effect of a better VP
Detailed comparison of performance model against existing models (Renau et al. ICS ’05)
Also in the HPCA’10 paper …
Detailed HW description
Impact of scheduling
Limiting MP to DP
Effect of scaling
Effect of a better CE
Conclusions
CMPs are here to stay:
– What about single threaded apps. and apps with significant seq. sections?
– We advocate the use of speculative multithreading
Different apps. require different SM techniques
– Even within apps. different phases
We propose the first mixed execution model
– TLS is nicely complemented by HT, RA, and MP
Combined approaches outperform existing SM models:
– TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
– TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)
Backup Slides
Effect of Prefetching
Our HTs do a better job than an aggressive prefetcher!
Prefetching helps our scheme as much as it helps base TLS
System Utilization of Base TLS
90% of the time TLS uses 1 or 2 cores
Hardware Cost
Last Value predictor: 16KB
– Can be made smaller, although it helps a lot
Confidence Estimator: 2Kb
– Helps mainly on data dependent branches
Extra bit in thread context information: 1bit/thread
Prediction Stats
Stat. (%)  Bzip2  Crafty  Gap   Gzip  Mcf   Parser  Twolf  Vortex  Vpr   Avg.
Misp.      5.7    5.2     3.3   5.1   3.9   3.4     10     0.3     6.6   4.8
PVN        22.8   16.9    19.5  24.1  27.9  20.8    23.2   11.6    24.4  21.3
PVP        98.2   97.6    98.8  98.6  99.2  98.9    96.4   99.8    98    98.4
SPEC       90.7   89.1    89.7  91.4  91.8  90      91.3   88.5    91    90.4
SENS       95     96      97.5  95.4  96.6  97.3    89.5   99.8    93.9  95.7