
Mixed Speculative Multithreaded Execution Models

Marcelo Cintra

University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester - 2010

Context and Motivation

Multi-cores are here to stay and many-cores are coming

Excellent performance for embarrassingly parallel or throughput workloads. Otherwise, ...

[Figure: Core Duo (Yonah - 2006), Core i7 (Lynnfield - 2010), SCC (2015?)]



Future many-cores will have many idle cores too often
– Not enough applications
– Not enough benefit from using more “explicit user threads”

Our proposal: use spare cores to accelerate whatever threads are available
– Create implicit threads to run in parallel with main explicit user threads
– Accelerate user threads through increased coarse-grain overlap (i.e., TLP) or increased fine-grain overlap (i.e., ILP)
– Combine previously proposed speculative multithreading techniques: thread-level speculation (TLS), helper threads (HT), run-ahead execution (RA), and multi-path execution (MP)



Why is combining SM schemes a good idea?
– No speculative multithreading scheme alone is good enough
– Hardware support for all schemes is very similar

Expected end-result?
– Better performance
– More “effective” many-core experience: with 4-way speculative multithreading (i.e., 4 implicit SM threads for each explicit user thread), a 64-core “unwieldy” many-core is as “easy to handle” as a 16-core system

How about power efficiency?
– Speculation can be made less inefficient (we’re working on it)
– Power can be smartly allocated (see our IPDPS’10 paper)


Contributions

Introduce mixed Speculative Multithreading (SM) Execution Models

Design and evaluate two combinations: TLS+HT+RA [ICS’09] and TLS+MP [HPCA’10]

Propose a performance model able to quantify ILP and TLP benefits

Combined approaches outperform state-of-the-art SM models:
– TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
– TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)


Outline

Introduction
Speculative multithreading models
Combined TLS+HT+RA scheme
Combined TLS+MP scheme
Performance model
Experimental setup and results
Conclusions

Speculative Multithreading

Basic idea: use idle cores/contexts to speculate on future application needs
– TLS: speculatively execute parallel threads
– HT/RA: speculatively perform future memory operations
– MP: speculatively execute along multiple branch targets
Speculative threads supported in hardware
Compiler support not essential, but can be useful
Hardware infrastructure is very similar



Thread Level Speculation

Compiler deals with:
– Task selection
– Code generation
HW deals with:
– Different context
– Spawn threads
– Detecting violations
– Replaying
– Arbitrate commit
Benefit: TLP/ILP
– TLP (Overlapped Execution) + ILP (Prefetching)

Helper Threads

Compiler deals with:
– Memory ops miss/hard-to-predict branches
– Backward slices
HW deals with:
– Spawn threads
– Different context
– Discard when finished
Benefit:
– ILP (Prefetch/Warmup)

RunAhead Execution

Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do RA
– VP Memory
– Commit/Discard
Benefit:
– ILP (Prefetch/Warmup)


MultiPath Execution

Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do MP
– Discard wrong path
Benefit:
– ILP (Branch Pred.)

[Figure: timeline of the main thread over time, showing correct paths, wrong paths, and the branch misprediction cost]





Combining TLS, HT and RA

Start with TLS
Provide support to clone TLS threads and convert them to HT
Conversion to HT means:
– Put them in RA mode
– Suppress squashes and do not cause additional squashes
– Discard them when they finish
No compiler slicing: purely HW approach

Intricacies to be Handled

HT may not prefetch effectively!
Dealing with contention
– HT threads run much faster and saturate BW
Dealing with thread ordering
– TLS imposes total thread order
– HT killed squashes TLS threads


Creating and Terminating HT

Create an HT on an L2 miss we can VP
– Use mem. address based confidence estimator
– VP only if confident
Create an HT if we have a free processor
Only allow most speculative thread to clone
– Seamless integration of HT with TLS
– BUT: if parent is no longer the most spec. TLS thread, the HT has to be killed
Additionally kill HT when:
– Parent/HT thread finishes
– HT causes exception
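The creation and termination rules on this slide can be read as two predicates plus a value predictor with a confidence gate. The following is a minimal Python sketch under stated assumptions, not the hardware design from the talk: the predictor is modeled as a table indexed by memory address, and the class name, function names, counter threshold, and saturation value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class LastValuePredictor:
    """Address-indexed last-value predictor with a saturating confidence counter."""
    threshold: int = 2   # hypothetical: confidence required before predicting
    max_conf: int = 3    # hypothetical: saturation point of the counter
    table: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def predict(self, addr: int) -> Optional[int]:
        """Return the predicted value, or None when not confident."""
        entry = self.table.get(addr)
        if entry is None:
            return None
        value, conf = entry
        return value if conf >= self.threshold else None

    def train(self, addr: int, actual: int) -> None:
        """Raise confidence when the value repeats, reset it otherwise."""
        entry = self.table.get(addr)
        if entry is not None and entry[0] == actual:
            conf = min(entry[1] + 1, self.max_conf)
        else:
            conf = 0
        self.table[addr] = (actual, conf)


def should_create_ht(l2_miss: bool, predicted: Optional[int],
                     free_processor: bool, parent_most_speculative: bool) -> bool:
    """All creation conditions listed on the slide must hold together."""
    return (l2_miss and predicted is not None
            and free_processor and parent_most_speculative)


def should_kill_ht(parent_most_speculative: bool, parent_finished: bool,
                   ht_finished: bool, ht_exception: bool) -> bool:
    """Any single termination condition listed on the slide is enough."""
    return (not parent_most_speculative or parent_finished
            or ht_finished or ht_exception)
```

In this sketch a helper thread is only spawned once the same address has returned the same value often enough for `predict` to answer, which mirrors the "VP only if confident" bullet.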





Mixed Execution Model

When idle resources:
– Try MP on top of TLS!!
Map TLS threads on empty cores
Map MP threads on empty contexts (same core)
Minimal extra HW:
– Branch confidence estimator
– MP bit – thread on MP mode
– PATHS – how many outstanding branches
– DIR – which path thread followed
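One way to picture the MP, PATHS and DIR fields is as a small per-thread record that is copied when a thread forks at a low-confidence branch. The Python sketch below is an interpretation, not the hardware described in the talk. It assumes PATHS and DIR are three-bit vectors with one bit per outstanding branch (suggested by the three-digit values shown on the Combined TLS/MP Model slide), and the names `MPState`, `fork` and `resolve` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_BRANCHES = 3  # assumption: one bit per outstanding branch, three bits wide


@dataclass(frozen=True)
class MPState:
    mp: bool = False     # MP bit: thread is in multi-path mode
    paths: int = 0       # PATHS: bit i is set while branch i is unresolved
    direction: int = 0   # DIR: bit i is the path followed at branch i (1 = taken)


def fork(state: MPState) -> Optional[Tuple[MPState, MPState]]:
    """Split a thread at a low-confidence branch into taken / not-taken copies."""
    free = [i for i in range(MAX_BRANCHES) if not state.paths & (1 << i)]
    if not free:
        return None  # no room to track another branch; stay on a single path
    bit = 1 << free[0]
    taken = MPState(True, state.paths | bit, state.direction | bit)
    not_taken = MPState(True, state.paths | bit, state.direction & ~bit)
    return taken, not_taken


def resolve(state: MPState, branch: int, taken: bool) -> Optional[MPState]:
    """Return the updated state, or None if this copy followed the wrong path."""
    bit = 1 << branch
    if not state.paths & bit:
        return state  # this thread never forked on that branch
    if bool(state.direction & bit) != taken:
        return None   # wrong path: the thread is discarded
    paths = state.paths & ~bit
    return MPState(paths != 0, paths, state.direction & ~bit)
```

Under these assumptions, the surviving copy drops back to `MP: 0, PATHS: 000, DIR: 000` once its last outstanding branch resolves.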


Combined TLS/MP Model

[Figure, four animation steps with axes labeled Speculative and Time: (1) Thread 1 and Thread 2 running under TLS; (2) Thread 1 reaches a low-confidence branch, with state MP: 0, PATHS: 000, DIR: 000; (3) multi-path mode, where Thread 1 continues as Thread 1a and Thread 1b alongside Thread 2; (4) the branch is resolved for Thread 1a and Thread 1b]


Intricacies to be Handled

How do we map TLS/MP threads?
– Different mapping policies for TLS threads
Dealing with thread ordering
– Correct data forwarding
Dealing with violations
– While in “MP-Mode” delay restarts/kills/commits
– No squashes on the wrong path
Thread spawning:
– Delayed as well – keep contention low
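The "delay while in MP-Mode" rule can be illustrated as a per-thread queue of pending TLS events. This is a simplified Python sketch under assumptions, not the mechanism from the HPCA’10 paper: events are plain strings, and the class `MPThread` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MPThread:
    mp_mode: bool = False
    deferred: List[str] = field(default_factory=list)  # held back while in MP mode
    applied: List[str] = field(default_factory=list)   # events that took effect

    def on_event(self, event: str) -> None:
        """Queue restarts/kills/commits/spawns in MP mode; apply them otherwise."""
        if self.mp_mode:
            self.deferred.append(event)
        else:
            self.applied.append(event)

    def on_branch_resolved(self, on_correct_path: bool) -> None:
        """Leave MP mode; replay queued events only for the correct path."""
        self.mp_mode = False
        if on_correct_path:
            self.applied.extend(self.deferred)
        self.deferred.clear()
```

Because wrong-path copies drop their queue without applying it, nothing they observed can squash or spawn other threads, which matches the "no squashes on the wrong path" bullet.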






Understanding Performance Benefits

Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify ILP and TLP contributions to bottom-line performance
Proposed model:
– Able to break benefits into ILP/TLP contributions

Performance Model

Sall = Sseq x Silp x Sovl

1. Compute overall speedup (Sall): Tseq/Tmt
2. Compute sequential TLS speedup (Sseq): Tseq/T1p
3. Compute speedup due to ILP (Silp): (T1+T2)/(T1’+T2’)
4. Use everything to compute TLP (Sovl): Sall/(Sseq x Silp)
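The four steps of the performance model translate directly into a small calculation. The Python sketch below assumes that T1 and T2 are per-thread execution times in the single-processor TLS run and that T1’ and T2’ are the corresponding times in the multithreaded run; the slides do not define these terms, so treat that reading, the function name, and the sample numbers as illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpeedupBreakdown:
    s_all: float  # overall speedup, Tseq/Tmt
    s_seq: float  # sequential TLS speedup, Tseq/T1p
    s_ilp: float  # speedup due to ILP, (T1+T2)/(T1'+T2')
    s_ovl: float  # speedup due to overlap (TLP), Sall/(Sseq x Silp)


def decompose_speedup(t_seq: float, t_mt: float, t_1p: float,
                      thread_times: Sequence[float],
                      thread_times_mt: Sequence[float]) -> SpeedupBreakdown:
    """Follow steps 1 to 4: measure three speedups, derive the fourth."""
    s_all = t_seq / t_mt
    s_seq = t_seq / t_1p
    s_ilp = sum(thread_times) / sum(thread_times_mt)
    s_ovl = s_all / (s_seq * s_ilp)
    return SpeedupBreakdown(s_all, s_seq, s_ilp, s_ovl)


# Illustrative numbers only; they are not measurements from the talk.
example = decompose_speedup(t_seq=100.0, t_mt=50.0, t_1p=110.0,
                            thread_times=[60.0, 50.0],
                            thread_times_mt=[50.0, 40.0])
```

Since Sovl is defined as the remainder, the product of the three factors always equals Sall; the value of the model lies in the two factors that are measured independently, Sseq and Silp.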





Experimental Setup

Simulator, Compiler and Benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al. PPoPP ’06)
– SPEC 2000 Int.

Architecture (for TLS+HT+RA scheme):
– Four-way CMP, 4-issue cores
– 16KB L1 Data (multi-versioned) and Instruction Caches
– 1MB unified L2 Caches
– Inst. window/ROB – 80/104 entries
– 16KB Last Value Predictor

Experimental Setup

Simulator, Compiler and Benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al. PPoPP ’06)
– SPEC 2000 Int.

Architecture (for TLS+MP scheme):
– Four-way CMP, 4-issue cores, 6 contexts/core
– 32K-bit OGEHL, 1KByte BTB, 32-entry RAS
– 8 Kbit enhanced JRS confidence estimator
– 32KB L1 Data (multi-versioned) and Instruction Caches
– 1MB unified L2 Caches


Results I

TLS + HT + RA


Comparing TLS, RunAhead and Unified Scheme



Almost additive benefits
10.2% over TLS, 18.3% over RA

Understanding the extra ILP

Improvements of ILP come from:
– Mainly memory
– Branch prediction (improvement 0.5%)
Focus on memory:
– Miss rate on committed path
– Clustering of misses (different cost)


Normalized Shared Cache Misses

All schemes better than sequential
Unified 41% better than sequential


Isolated vs. Clustered Misses

Both TLS + RA: large window machines
Unified does even better



Results II

TLS + MP

Impact of Branch Prediction on TLS

TLS emulates wider processor:
– Removing mispredictions important (Amdahl)
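The Amdahl remark can be made concrete with a short calculation. The Python sketch below assumes that misprediction stalls are the part of execution that TLS does not speed up; that is one reading of the slide, and the function names and example numbers are purely illustrative.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of the run time is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0 or factor <= 0.0:
        raise ValueError("fraction must be in [0, 1] and factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def remaining_share(accelerated_fraction: float, factor: float) -> float:
    """Share of the new run time spent in the part that was not accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0 or factor <= 0.0:
        raise ValueError("fraction must be in [0, 1] and factor must be positive")
    untouched = 1.0 - accelerated_fraction
    return untouched / (untouched + accelerated_fraction / factor)


# Illustrative: 20% of time lost to mispredictions, the rest sped up 4x.
overall = amdahl_speedup(0.8, 4.0)        # about 2.5x instead of 4x
stall_share = remaining_share(0.8, 4.0)   # stalls grow to about half the run time
```

With these made-up numbers, a stall component that was one fifth of the original run time becomes roughly half of the accelerated one, which is why removing mispredictions matters more as TLS makes the rest of the execution faster.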


Branch Entropy for TLS

Much harder for TLS:
– History partitioning
– History re-order


Increasing the Size of the Branch Predictor

Aliasing not much of a problem
Fundamental limitation is lack of history


Designing a Better Predictor

Predictors that exploit longer histories are not necessarily better


Comparing TLS, MP and Combined TLS/MP

Additive benefits; no point in doubling the predictor
9.2% over TLS, 28.2% over MP

Pipeline Flushes

Significant reduction in pipeline flushes
More than base MP!





Also in the ICS’09 paper …

Dealing with the load of the system
Converting TLS threads to HT
Multiple HT
Effect of a better VP
Detailed comparison of performance model against existing models (Renau et al. ICS ’05)


Also in the HPCA’10 paper …

Detailed HW description
Impact of scheduling
Limiting MP to DP
Effect of scaling
Effect of a better CE



Conclusions

CMPs are here to stay:
– What about single threaded apps. and apps with significant seq. sections?
– We advocate the use of speculative multithreading

Different apps. require different SM techniques
– Even within apps. different phases

We propose the first mixed execution model
– TLS is nicely complemented by HT, RA, and MP

Combined approaches outperform existing SM models:
– TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
– TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)


Backup Slides



Effect of Prefetching

Our HTs do a better job than an aggressive prefetcher!
Prefetching helps our scheme as much as it helps base TLS

System Utilization of Base TLS

90% of the time TLS uses 1 or 2 cores


Hardware Cost

Last Value predictor – 16KB
– Can be made smaller, although it helps a lot
Confidence Estimator – 2Kb
– Helps mainly on data dependent branches
Extra bit in thread context information – 1bit/thread


Prediction Stats


Stat. (%)  Bzip2  Crafty  Gap   Gzip  Mcf   Parser  Twolf  Vortex  Vpr   Avg.
Misp.      5.7    5.2     3.3   5.1   3.9   3.4     10     0.3     6.6   4.8
PVN        22.8   16.9    19.5  24.1  27.9  20.8    23.2   11.6    24.4  21.3
PVP        98.2   97.6    98.8  98.6  99.2  98.9    96.4   99.8    98    98.4
SPEC       90.7   89.1    89.7  91.4  91.8  90      91.3   88.5    91    90.4
SENS       95     96      97.5  95.4  96.6  97.3    89.5   99.8    93.9  95.7
