
Mixed Speculative Multithreaded Execution Models
Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester - 2010

Context and Motivation

Multi-cores are here to stay and many-cores are coming

Excellent performance for embarrassingly parallel or throughput workloads. Otherwise, ...

[Figure: Core Duo (Yonah - 2006); Core i7 (Lynnfield - 2010); SCC (2015?)]

Context and Motivation
Future many-cores will have many idle cores too often
– Not enough applications
– Not enough benefit from using more “explicit user threads”
Our proposal: use spare cores to accelerate whatever threads are available
– Create implicit threads to run in parallel with main explicit user threads
– Accelerate user threads through increased coarse-grain overlap (i.e., TLP) or increased fine-grain overlap (i.e., ILP)
– Combine previously proposed speculative multithreading techniques: thread-level speculation (TLS), helper threads (HT), run-ahead execution (RA), and multi-path execution (MP)

Context and Motivation
Why is combining SM schemes a good idea?
– No speculative multithreading scheme alone is good enough
– Hardware support for all schemes is very similar
Expected end-result?
– Better performance
– More “effective” many-core experience: with 4-way speculative multithreading (i.e., 4 implicit SM threads for each explicit user thread), a 64-core “unwieldy” many-core is as “easy to handle” as a 16-core system
How about power efficiency?
– Speculation can be made less inefficient (we’re working on it)
– Power can be smartly allocated (see our IPDPS’10 paper)

Contributions
Introduce mixed Speculative Multithreading (SM) execution models
Design and evaluate two combinations: TLS+HT+RA [ICS’09] and TLS+MP [HPCA’10]
Propose a performance model able to quantify ILP and TLP benefits
Combined approaches outperform state-of-the-art SM models:
– TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
– TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)

Outline
Introduction
Speculative multithreading models
Combined TLS+HT+RA scheme
Combined TLS+MP scheme
Performance model
Experimental setup and results
Conclusions

Speculative Multithreading
Basic idea: use idle cores/contexts to speculate on future application needs
– TLS: speculatively execute parallel threads
– HT/RA: speculatively perform future memory operations
– MP: speculatively execute along multiple branch targets
Speculative threads supported in hardware
Compiler support not essential, but can be useful
Hardware infrastructure is very similar

Thread Level Speculation
Compiler deals with:
– Task selection
– Code generation
HW deals with:
– Different context
– Spawn threads
– Detecting violations
– Replaying
– Arbitrate commit
Benefit: TLP/ILP
– TLP (Overlapped Execution) + ILP (Prefetching)
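To make "detecting violations" and "replaying" more concrete, here is a minimal software sketch of a TLS dependence check. It is a generic textbook simplification, not the hardware described in this talk: the names `TLSThread`, `speculative_load`, and `speculative_store` are hypothetical, and per-thread read/write sets stand in for the multi-versioned cache state.

```python
class TLSThread:
    """One speculative thread; `order` is its position in the total thread order."""

    def __init__(self, order):
        self.order = order        # 0 = least speculative
        self.read_set = set()     # addresses read before being written locally
        self.write_set = set()    # addresses written speculatively
        self.restarts = 0         # how many times this thread was replayed


def speculative_load(thread, addr):
    """Record an exposed read (a read not satisfied by the thread's own write)."""
    if addr not in thread.write_set:
        thread.read_set.add(addr)


def speculative_store(threads, writer, addr):
    """Perform a store and squash any more speculative thread that read too early.

    Returns the orders of the squashed threads (empty list if no violation).
    """
    writer.write_set.add(addr)
    violators = [t.order for t in threads
                 if t.order > writer.order and addr in t.read_set]
    if not violators:
        return []
    first = min(violators)
    # Squash the first violator and every thread more speculative than it.
    squashed = [t for t in threads if t.order >= first]
    for t in squashed:
        t.read_set.clear()
        t.write_set.clear()
        t.restarts += 1           # the thread is replayed from its start
    return [t.order for t in squashed]
```

For example, if thread 2 loads an address and thread 0 later stores to it, `speculative_store` returns `[2]` and thread 2 restarts; a violation by thread 1 squashes threads 1 and 2 together, since the total order makes thread 2 dependent on thread 1.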

Helper Threads
Compiler deals with:
– Memory ops that miss/hard-to-predict branches
– Backward slices
HW deals with:
– Spawn threads
– Different context
– Discard when finished
Benefit:
– ILP (Prefetch/Warmup)

RunAhead Execution
Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do RA
– VP Memory
– Commit/Discard
Benefit:
– ILP (Prefetch/Warmup)

MultiPath Execution
Compiler deals with:
– Nothing
HW deals with:
– Different context
– When to do MP
– Discard wrong path
Benefit:
– ILP (Branch Pred.)
[Figure: main thread over time, with correct paths, wrong paths, and the branch misprediction cost labeled]


Combining TLS, HT and RA
Start with TLS
Provide support to clone TLS threads and convert them to HT
Conversion to HT means:
– Put them in RA mode
– Suppress squashes and do not cause additional squashes
– Discard them when they finish
No compiler slicing: purely HW approach

Intricacies to be Handled
HT may not prefetch effectively!
Dealing with contention
– HT threads run much faster and can saturate BW
Dealing with thread ordering
– TLS imposes total thread order
– Killing an HT squashes TLS threads

Creating and Terminating HT
Create an HT on an L2 miss that we can value-predict (VP)
– Use mem. address based confidence estimator
– VP only if confident
Create an HT if we have a free processor
Only allow most speculative thread to clone
– Seamless integration of HT with TLS
– BUT: if parent is no longer the most spec. TLS thread, the HT has to be killed
Additionally kill HT when:
– Parent/HT thread finishes
– HT causes exception
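The creation and termination rules above can be sketched as two predicates plus a small value predictor. This is only an illustration under stated assumptions: the names (`LastValuePredictor`, `should_create_ht`, `should_kill_ht`) are hypothetical, the 2-bit counter and its threshold are assumed sizes, and Python dictionaries stand in for the finite, address-indexed hardware tables.

```python
from dataclasses import dataclass, field

CONF_MAX = 3        # 2-bit saturating counter (assumed width)
CONF_THRESHOLD = 2  # counter value at which a prediction counts as confident (assumed)


@dataclass
class LastValuePredictor:
    """Last-value predictor with a confidence counter per memory address."""
    last: dict = field(default_factory=dict)   # address -> last value loaded
    conf: dict = field(default_factory=dict)   # address -> saturating counter

    def predict(self, addr):
        """Return (predicted_value, confident) for a load from `addr`."""
        confident = self.conf.get(addr, 0) >= CONF_THRESHOLD
        return self.last.get(addr), confident

    def update(self, addr, actual):
        """Train on the value the load really returned."""
        if self.last.get(addr) == actual:
            self.conf[addr] = min(CONF_MAX, self.conf.get(addr, 0) + 1)
        else:
            self.conf[addr] = 0    # a wrong last value resets confidence
        self.last[addr] = actual


def should_create_ht(l2_miss, vp_confident, free_processor, most_speculative):
    """Clone into an HT only when every creation condition on the slide holds."""
    return l2_miss and vp_confident and free_processor and most_speculative


def should_kill_ht(parent_most_speculative, parent_finished, ht_finished, ht_exception):
    """Kill the HT when its parent loses most-speculative status or either thread ends."""
    return (not parent_most_speculative) or parent_finished or ht_finished or ht_exception
```

In this sketch an address becomes predictable only after the same value has been seen three times in a row, which mirrors the "VP only if confident" rule; the real estimator's policy may differ.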


Mixed Execution Model
When there are idle resources:
– Try MP on top of TLS!!
Map TLS threads on empty cores
Map MP threads on empty contexts (same core)
Minimal extra HW:
– Branch confidence estimator
– MP bit: thread on MP mode
– PATHS: how many outstanding branches
– DIR: which path thread followed

Combined TLS/MP Model
[Figure, built up over four slides: Thread 1 and Thread 2 on a time axis, with a “Speculative” label]
Low-confidence branch reached:
– Thread 1: MP: 0, PATHS: 000, DIR: 000
Multi-path mode:
– Thread 1a: MP: 1, PATHS: 001, DIR: 000
– Thread 1b: MP: 1, PATHS: 001, DIR: 001
Branch resolved:
– Thread 1a: MP: 1, PATHS: 001, DIR: 000
– Thread 1b: MP: 0, PATHS: 000, DIR: 000
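The MP, PATHS, and DIR values shown in this walk-through can be reproduced with a small state sketch. It rests on assumptions not stated on the slides: PATHS is treated as a bitmask with one bit per outstanding branch, DIR as one bit per branch recording the path taken, and the thread whose DIR bit matches the branch outcome is the one that survives. The names `MPContext`, `fork_on_low_confidence`, and `resolve_branch` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MPContext:
    name: str
    mp: int = 0         # 1 while the thread is in multi-path mode
    paths: int = 0      # bit i set: branch i is still unresolved (assumed encoding)
    direction: int = 0  # bit i set: thread followed the alternate path of branch i


def fork_on_low_confidence(parent, slot=0):
    """Split `parent` into two contexts, one per direction of branch `slot`."""
    bit = 1 << slot
    a = replace(parent, name=parent.name + "a", mp=1,
                paths=parent.paths | bit, direction=parent.direction & ~bit)
    b = replace(parent, name=parent.name + "b", mp=1,
                paths=parent.paths | bit, direction=parent.direction | bit)
    return a, b


def resolve_branch(contexts, alternate_taken, slot=0):
    """Discard wrong-path contexts and clear the resolved branch in the survivors."""
    bit = 1 << slot
    wanted = bit if alternate_taken else 0
    survivors = []
    for ctx in contexts:
        if (ctx.direction & bit) != wanted:
            continue                      # wrong path: context is discarded
        paths = ctx.paths & ~bit
        survivors.append(replace(ctx, mp=1 if paths else 0,
                                 paths=paths, direction=ctx.direction & ~bit))
    return survivors
```

Forking `MPContext("Thread 1")` yields contexts with (MP, PATHS, DIR) of (1, 001, 000) and (1, 001, 001); resolving the branch in favour of the alternate path leaves only "Thread 1b" with all three fields cleared, matching the final slide of the sequence.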

Intricacies to be Handled
How do we map TLS/MP threads?
– Different mapping policies for TLS threads
Dealing with thread ordering
– Correct data forwarding
Dealing with violations
– While in “MP-Mode” delay restarts/kills/commits
– No squashes on the wrong path
Thread spawning:
– Delayed as well, to keep contention low


Understanding Performance Benefits
Complex TLS thread interactions obscure performance benefits
Even more true for mixed execution models
We need a way to quantify ILP and TLP contributions to bottom-line performance
Proposed model:
– Able to break benefits into ILP/TLP contributions

Performance Model
$S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup: $S_{all} = T_{seq} / T_{mt}$
2. Compute sequential TLS speedup: $S_{seq} = T_{seq} / T_{1p}$
3. Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$
4. Use everything to compute TLP: $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
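The four steps of the model can be written out as a short function. This is a sketch of the arithmetic only: the slides do not define the individual time terms, so the parameter names, and the reading of T1, T2 and T1', T2' as two sets of per-thread execution times, are assumptions. The timing values in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeedupBreakdown:
    s_all: float   # overall speedup
    s_seq: float   # sequential TLS speedup
    s_ilp: float   # speedup due to ILP
    s_ovl: float   # speedup due to overlap (TLP)


def decompose_speedup(t_seq, t_mt, t_1p, thread_times, thread_times_prime):
    """Split the overall speedup into sequential, ILP, and overlap factors."""
    s_all = t_seq / t_mt                                  # step 1: Tseq / Tmt
    s_seq = t_seq / t_1p                                  # step 2: Tseq / T1p
    s_ilp = sum(thread_times) / sum(thread_times_prime)   # step 3: (T1+T2)/(T1'+T2')
    s_ovl = s_all / (s_seq * s_ilp)                       # step 4: the remainder
    return SpeedupBreakdown(s_all, s_seq, s_ilp, s_ovl)
```

With the illustrative inputs `decompose_speedup(100, 50, 125, [70, 55], [50, 50])`, the overall speedup of 2.0 splits into a sequential factor of 0.8, an ILP factor of 1.25, and an overlap factor of about 2.0. By construction the three factors always multiply back to the overall speedup.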


Experimental Setup
Simulator, Compiler and Benchmarks:
– SESC (http://sesc.sourceforge.net/)
– POSH (Liu et al. PPoPP ’06)
– Spec 2000 Int.
Architecture (for TLS+HT+RA scheme):
– Four-way CMP, 4-issue cores
– 16KB L1 data (multi-versioned) and instruction caches
– 1MB unified L2 caches
– Inst. window/ROB: 80/104 entries
– 16KB last value predictor

Experimental Setup
Simulator, compiler and benchmarks as above (SESC, POSH, Spec 2000 Int.)
Architecture (for TLS+MP scheme):
– Four-way CMP, 4-issue cores, 6 contexts/core
– 32K-bit OGEHL, 1KByte BTB, 32-entry RAS
– 8 Kbit enhanced JRS confidence estimator
– 32KB L1 data (multi-versioned) and instruction caches
– 1MB unified L2 caches

Results I
TLS + HT + RA

Comparing TLS, RunAhead and Unified Scheme
Almost additive benefits
10.2% over TLS, 18.3% over RA

Understanding the extra ILP
Improvements of ILP come from:
– Mainly memory
– Branch prediction (improvement 0.5%)
Focus on memory:
– Miss rate on committed path
– Clustering of misses (different cost)

Normalized Shared Cache Misses
All schemes better than sequential
Unified 41% better than sequential

Isolated vs. Clustered Misses
Both TLS and RA behave like large-window machines
Unified does even better

Results II
TLS + MP

Impact of Branch Prediction on TLS
TLS emulates wider processor:
– Removing mispredictions important (Amdahl)
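The Amdahl argument can be illustrated with a few lines of arithmetic. The sketch assumes, as a simplification, that TLS speeds up only the useful work and leaves misprediction recovery time untouched; the function names and all numbers are illustrative and do not come from the talk's measurements.

```python
def amdahl_speedup(improved_fraction, factor):
    """Amdahl's law: overall speedup when only part of the time is improved."""
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)


def misprediction_share(useful_time, mispred_time, tls_speedup):
    """Share of execution time spent on mispredictions once TLS overlaps the rest."""
    overlapped = useful_time / tls_speedup
    return mispred_time / (overlapped + mispred_time)
```

With 80 time units of useful work and 20 of misprediction recovery, the misprediction share is 20% without TLS, one third when TLS doubles the speed of the useful work, and 50% at a factor of four, while the overall speedup at that point is only `amdahl_speedup(0.8, 4)`, that is 2.5. Under these assumptions, the better TLS works, the more each remaining misprediction costs.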

Branch Entropy for TLS
Much harder for TLS:
– History partitioning
– History re-order

Increasing the Size of the Branch Predictor
Aliasing not much of a problem
Fundamental limitation is lack of history

Designing a Better Predictor
Predictors that exploit longer histories not necessarily better ...

Comparing TLS, MP and Combined TLS/MP
Additive benefits; no point in doubling the predictor
9.2% over TLS, 28.2% over MP

Pipeline Flushes
Significant amount of flush reductions
More than base MP!


Also in the ICS’09 paper …
Dealing with the load of the system
Converting TLS threads to HT
Multiple HT
Effect of a better VP
Detailed comparison of performance model against existing models (Renau et al. ICS ’05)

Also in the HPCA’10 paper …
Detailed HW description
Impact of scheduling
Limiting MP to DP
Effect of scaling
Effect of a better CE

Conclusions
CMPs are here to stay:
– What about single threaded apps. and apps with significant seq. sections?
– We advocate the use of speculative multithreading
Different apps. require different SM techniques
– Even within apps. different phases
We propose the first mixed execution model
– TLS is nicely complemented by HT, RA, and MP
Combined approaches outperform existing SM models:
– TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
– TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)


Backup Slides

Effect of Prefetching
Our HTs do a better job than an aggressive prefetcher!
Prefetching helps our scheme as much as it helps base TLS

System Utilization of Base TLS
90% of the time TLS uses 1 or 2 cores

Hardware Cost
Last value predictor: 16KB
– Can be made smaller, although it helps a lot
Confidence estimator: 2Kb
– Helps mainly on data dependent branches
Extra bit in thread context information: 1 bit/thread

Prediction Stats

Stat. (%)  Bzip2  Crafty  Gap   Gzip  Mcf   Parser  Twolf  Vortex  Vpr   Avg.
Misp.      5.7    5.2     3.3   5.1   3.9   3.4     10     0.3     6.6   4.8
PVN        22.8   16.9    19.5  24.1  27.9  20.8    23.2   11.6    24.4  21.3
PVP        98.2   97.6    98.8  98.6  99.2  98.9    96.4   99.8    98    98.4
SPEC       90.7   89.1    89.7  91.4  91.8  90      91.3   88.5    91    90.4
SENS       95     96      97.5  95.4  96.6  97.3    89.5   99.8    93.9  95.7