Improving Value Communication for Thread-Level Speculation

Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry

School of Computer Science

Carnegie Mellon University


Multithreaded Machines Are Everywhere

How can we use them? Parallelism!

[Figure: multithreaded machines drawn as blocks labeled P and C running multiple threads. Examples: Sun MAJC, IBM Power4, SiByte SB-1250; Alpha 21464, Intel Xeon]


Automatic Parallelization

Proving independence of threads is hard:
– complex control flow
– complex data structures
– pointers, pointers, pointers
– run-time inputs

How can we make the compiler’s job feasible?

Thread-Level Speculation (TLS)


Thread-Level Speculation

exploit available thread-level parallelism

[Figure: epochs E1, E2, and E3 (Epoch1, Epoch2, Epoch3) executing in parallel under TLS along a time axis; labels: Load, Store, Retry]


Speculate

good when p != q

[Figure: epoch E1 performs Store *p and epoch E2 performs Load *q, both through memory]


Synchronize (and forward)

good when p == q

[Figure: E1 performs Store *p and then signals; E2 waits (stalls) before Load *q. Shown for comparison: (Speculate), where E1's Store *p and E2's Load *q proceed without waiting]
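
The trade-off between speculating and synchronizing can be illustrated with a toy two-epoch model. This is only a sketch: the function name `run_epochs`, the dict-based memory, the stored value 42, and the fixed event ordering are assumptions made for illustration and do not model the actual hardware.

```python
def run_epochs(p, q, synchronize):
    """Toy model: epoch E1 does `Store *p`, epoch E2 does `Load *q`.

    Returns (value E2 finally observes, retries, stalls).
    Hypothetical model for illustration only.
    """
    memory = {p: 0, q: 0}
    retries = stalls = 0
    if synchronize:
        # E2 waits for E1's signal, so it never reads a stale value.
        stalls = 1
        memory[p] = 42          # E1: Store *p
        loaded = memory[q]      # E2: Load *q (after the wait)
    else:
        loaded = memory[q]      # E2: Load *q runs early, speculatively
        memory[p] = 42          # E1: Store *p happens later
        if p == q:
            # The dependence was violated: E2 must be re-executed.
            retries = 1
            loaded = memory[q]
    return loaded, retries, stalls
```

In this model, speculation costs nothing when p != q and one retry when p == q, while synchronization always pays one stall, which matches the "good when" conditions on the two slides.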


Reduce the Critical Forwarding Path

decreases execution time

[Figure: panels Overview, Big Critical Path, Small Critical Path; labels: Wait, Load X, Store X, Signal, critical path, stall, execution time]


Predict

good when p == q and *q is predictable

[Figure: E2's Load *q takes its value from a value predictor instead of waiting for E1's Store *p. Shown for comparison: (Synchronize), with Signal and Wait (stall), and (Speculate)]


Improving on Compile-Time Decisions

[Figure: the compiler chooses between Speculate and Synchronize; the hardware chooses among Predict, Speculate, and Synchronize; both can reduce the critical forwarding path]

is there any potential benefit?


Potential for Improving Value Communication

efficient value communication is key

U=Un-optimized, P=Perfect Prediction (4 Processors)


Outline

• Our Support for Thread-Level Speculation
  – Compiler Support
  – Experimental Framework
  – Baseline Performance
• Techniques for Improving Value Communication
• Combining the Techniques
• Conclusions


Compiler Support (SUIF1.3 and gcc)

1) Where to speculate
– use profile information, heuristics, loop unrolling

2) Transforming to exploit TLS
– insert new TLS-specific instructions
– synchronizes/forwards register values

3) Optimization
– eliminate dependences due to loop induction variables
– algorithm to schedule the critical forwarding path

compiler plays a crucial role
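
One way to picture the induction-variable optimization: if every epoch receives the loop counter from its predecessor, the counter itself becomes a cross-epoch dependence, whereas an epoch that derives the counter from its own epoch number needs nothing forwarded. The sketch below contrasts the two; the function names and the idea of an explicit epoch number are assumptions for illustration, not the compiler's actual transformation.

```python
def induction_values_forwarded(start, step, num_epochs):
    """Each epoch receives i from its predecessor (a cross-epoch dependence)."""
    values, i = [], start
    for _ in range(num_epochs):
        values.append(i)   # the epoch uses the forwarded value
        i += step          # and must produce the next one for its successor
    return values


def induction_values_local(start, step, num_epochs):
    """Each epoch derives i from its own epoch number; nothing is forwarded."""
    return [start + epoch * step for epoch in range(num_epochs)]
```

Both functions yield the same sequence of values, but only the first one serializes the epochs on the counter.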


Experimental Framework

Benchmarks
– from SPECint95 and SPECint2000, -O3 optimization

Underlying architecture
– 4-processor, single-chip multiprocessor
– speculation supported by coherence

Simulator
– superscalar, similar to MIPS R10K
– models all bandwidth and contention

detailed simulation!

[Figure: single-chip multiprocessor; blocks labeled P and C connected by a crossbar]


Compiler Performance

S=Sequential



overheads of TLS compilation can be significant

S=Sequential, T=TLS Run Sequentially



much failed speculation and sync. stall

S=Seq., T=TLS Seq., U=Un-optimized



compiler optimization is effective

S=Seq., T=TLS Seq., U=Un-optimized, B=Compiler Optimized


Outline

Techniques for Improving Value Communication
– When Prediction is Best
  • Memory Value Prediction
  • Forwarded Value Prediction
  • Silent Stores
– When Synchronization is Best




Memory Value Prediction

avoid failed speculation if *q is predictable

[Figure: without prediction, E1's Store *p and E2's Load *q both go through memory; with value prediction, E2's Load *q takes its value from a value predictor]


Value Predictor Configuration

Aggressive hybrid predictor
– 1K x 3-entry context and 1K-entry stride
– 2-bit, up/down, saturating confidence counters

predict only when confident

[Figure: the load PC indexes the context and stride predictors, each with a confidence value; a comparison (>?) yields either the predicted value or no prediction]
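
The "predict only when confident" rule can be sketched with the stride half of such a predictor. This is a simplified illustration, not the configuration evaluated in the talk: it omits the context predictor entirely, and the class name, the modulo indexing, and the confidence threshold of 2 are assumptions.

```python
class StridePredictor:
    """Stride value predictor with 2-bit up/down saturating confidence."""

    def __init__(self, entries=1024, threshold=2):
        self.entries = entries
        self.threshold = threshold      # assumed confidence threshold
        self.table = {}                 # index -> [last_value, stride, confidence]

    def _index(self, load_pc):
        return load_pc % self.entries   # assumed indexing scheme

    def predict(self, load_pc):
        """Return a predicted value, or None when not confident."""
        entry = self.table.get(self._index(load_pc))
        if entry is None or entry[2] < self.threshold:
            return None
        return entry[0] + entry[1]

    def update(self, load_pc, actual):
        """Train with the value the load actually returned."""
        idx = self._index(load_pc)
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = [actual, 0, 0]
            return
        last, stride, conf = entry
        if last + stride == actual:
            conf = min(conf + 1, 3)     # saturate at 3 (2-bit counter)
        else:
            conf = max(conf - 1, 0)     # saturate at 0
            stride = actual - last
        self.table[idx] = [actual, stride, conf]
```

With this sketch, a load that returns 10, 14, 18, 22 produces no prediction until the stride of 4 has been confirmed twice, after which the predictor offers 26.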


Throttling Prediction

Only predict exposed loads
– hardware tracks which words are speculatively modified
– use to determine whether a load is exposed

predict only exposed loads

[Figure: in E1, Load X follows Store X and is not exposed; in E2, Load X has no earlier store to X and is exposed]
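
The exposed-load test can be sketched as a per-epoch set of speculatively modified words. The class and method names below are hypothetical, and a Python set stands in for whatever per-word state the hardware actually keeps.

```python
class EpochState:
    """Tracks which words an epoch has speculatively modified."""

    def __init__(self):
        self.modified = set()   # word addresses written by this epoch

    def store(self, addr):
        """Record a speculative store to a word."""
        self.modified.add(addr)

    def is_exposed(self, addr):
        """A load is exposed if this epoch has not already written the word."""
        return addr not in self.modified
```

A load that follows the epoch's own store to the same word reads a locally produced value, so only the exposed loads are candidates for prediction.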



exposed loads are fairly predictable



must throttle further

B=Baseline, E=Predict Exposed Loads



effective if properly throttled

B=Baseline, E=Predict Exposed Lds, V=Predict Violating Loads


Forwarded Value Prediction

avoid synchronization stall if X is predictable

[Figure: without prediction, E1 performs Store X and signals while E2 waits (stalls) before Load X; with value prediction, E2's Load X takes its value from a value predictor]



forwarded values are also fairly predictable



B=Baseline, F=Predict Forwarded Values



only predict loads that have caused stalls

B=Baseline, F=Predict Forwarded Val’s, S=Predict Stalling Val’s


Silent Stores

avoid failed speculation if store is silent

[Figure: memory holds X=5; E1 performs Store X=5 and E2 performs Load X. When exploiting silent stores, E1 first checks Load X==5? and the store is shown in parentheses: (Store X=5)]



silent stores are prevalent
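
A silent store writes the value that is already in memory, so dropping it cannot change what a later epoch loads. The check from the slide ("Load X==5?") can be sketched as follows; the function name and the dict-based memory are assumptions for illustration.

```python
def store_with_silent_check(memory, addr, value):
    """Perform a store; return True if it was silent (value unchanged).

    A silent store is dropped, so it cannot violate a later epoch's load.
    """
    if memory.get(addr) == value:   # corresponds to "Load X == 5?"
        return True                 # silent: do not perform the store
    memory[addr] = value
    return False
```

For example, with `memory = {"X": 5}`, storing 5 to X is reported as silent and leaves memory untouched, while storing 6 is performed normally.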


Impact of Exploiting Silent Stores

most of the benefits of memory value prediction

B=Baseline, SS=Exploit Silent Stores


Outline

– When Prediction is Best
– When Synchronization is Best
  • Hardware-Inserted Dynamic Synchronization
  • Reducing the Critical Forwarding Path




Hardware-Inserted Dynamic Synchronization

avoid failed speculation

[Figure: without dynamic synchronization, E1's Store *p and E2's Load *q both go through memory; with dynamic synchronization, E2's Load *q stalls until after E1's Store *p]



B=Baseline, D=Synchronize Violating Loads



B=Baseline, D=Sync. Violating Ld.s, R=D+Reset



overall average improvement of 9%

B=Baseline, D=Sync. Violating Ld.s, R=D+Reset, M=R+Minimum


Reduce the Critical Forwarding Path

decreases execution time

[Figure: panels Overview, Big Critical Path, Small Critical Path; labels: Wait, Load X, Store X, Signal, critical path, stall, execution time]


Prioritizing the Critical Forwarding Path

• mark the input-chain of the critical store
• give marked instructions high issue priority

[Figure: instructions Load r1=X, op r2=r1,r3, op r5=r6,r7, op r6=r5,r8, Store r2,X, and Signal, with the critical path highlighted; shown without and with prioritization]
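
Marking the input chain of the critical store amounts to a backward walk over register dependences. The sketch below applies that idea to the instructions from the figure; the tuple encoding, the function name, and the program order chosen for the example are assumptions, and the real hardware mechanism is not described on the slide.

```python
def mark_critical_chain(instrs, critical_index):
    """Mark instructions that feed the critical store (a backward slice).

    instrs: list of (dest_register_or_None, [source_registers]) in program order.
    Returns the set of indices that would get high issue priority.
    """
    marked = {critical_index}
    needed = set(instrs[critical_index][1])
    for i in range(critical_index - 1, -1, -1):
        dest, srcs = instrs[i]
        if dest in needed:
            marked.add(i)
            needed.discard(dest)
            needed.update(srcs)
    return marked


# Instructions from the figure, in an assumed program order.
instrs = [
    ("r1", []),            # Load r1=X
    ("r2", ["r1", "r3"]),  # op r2=r1,r3
    ("r5", ["r6", "r7"]),  # op r5=r6,r7
    ("r6", ["r5", "r8"]),  # op r6=r5,r8
    (None, ["r2"]),        # Store r2,X (the critical store)
]
critical = mark_critical_chain(instrs, 4)
```

Here `critical` contains the load, `op r2=r1,r3`, and the store, while the two operations on r5 and r6 are left unmarked because the store does not depend on them.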


Critical Path Prioritization

some reordering


Impact of Prioritizing the Critical Path

not much benefit, given the complexity

B=Baseline, S=Prioritizing Critical Path


Outline

Combining the Techniques




Techniques are orthogonal with one exception:

Memory value prediction and dynamic sync.
– only synchronize memory values that are unpredictable
– dynamic sync. logic checks prediction confidence
– synchronize if not confident
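
The interaction between the two techniques reduces to a single decision per load that has previously caused a violation. A minimal sketch, assuming the predictor exposes a predicted value and a confidence flag (the function name and return encoding are hypothetical):

```python
def handle_violating_load(predicted_value, confident):
    """Decide how to treat a load that has previously caused a violation.

    Returns ("predict", value) when the predictor is confident,
    otherwise ("synchronize", None) so the load stalls instead.
    """
    if confident:
        return ("predict", predicted_value)
    return ("synchronize", None)
```

Prediction takes precedence because it avoids the stall altogether; synchronization is the fallback for values the predictor cannot vouch for.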



B=Baseline



B=Baseline, A=All But Dynamic Synchronization

significant improvement



B=Baseline, A=All But Dynamic Synchronization, D=All

good for some, bad for others



close to ideal for m88ksim and vpr

B=Baseline, A=All But Dyn. Sync., D=All, P=Perfect Prediction


Conclusions

Prediction
– memory value prediction: effective when throttled
– forwarded value prediction: effective when throttled
– silent stores: prevalent and effective

Synchronization
– dynamic synchronization: can help or hurt
– hardware prioritization: ineffective, if compiler is good

prediction is effective

synchronization has mixed results


BACKUPS


Goals

1) Parallelize general-purpose programs
– difficult problem

2) Keep hardware support simple and minimal
– avoid large, specialized structures
– preserve the performance of non-TLS workloads

3) Take full advantage of the compiler
– region selection, synchronization, optimization


Potential for Further Improvement


Pipeline Parameters

Issue Width            4
Functional Units       2 Int, 2 FP, 1 Mem, 1 Branch
Reorder Buffer Size    128
Integer Multiply       12 cycles
Integer Divide         76 cycles
All Other Integer      1 cycle
FP Divide              15 cycles
FP Square Root         20 cycles
All Other FP           2 cycles
Branch Prediction      GShare (16KB, 8 history bits)


Memory Parameters

Cache Line Size                            32B
Instruction Cache                          32KB, 4-way set-assoc
Data Cache                                 32KB, 2-way set-assoc, 2 banks
Unified Secondary Cache                    2MB, 4-way set-assoc, 4 banks
Miss Handlers                              16 for data, 2 for insts
Crossbar Interconnect                      8B per cycle per bank
Minimum Miss Latency to Secondary Cache    10 cycles
Minimum Miss Latency to Local Memory       75 cycles
Main Memory Bandwidth                      1 access per 20 cycles


When Prediction is Best

Predicting under TLS
– only update predictor for successful epochs
– cost of misprediction is high: must re-execute epoch
– each epoch requires a logically-separate predictor

Differentiation from previous work:
– loop induction variables optimized by compiler
– larger regions of code, hence larger number of memory dependences between epochs


Benchmark Statistics: SPECint2000

Application   Portion of Dynamic       Number of Unique       Average Epoch Size   Average Number of Epochs
Name          Execution Parallelized   Parallelized Regions   (dynamic insts)      Per Dynamic Region Instance
BZIP2         98.1%                    1                      251.5                451596.0
CRAFTY        36.1%                    34                     30.8                 1315.7
GZIP          70.4%                    1                      1307.0               2064.8
MCF           61.0%                    9                      206.2                198.9
PARSER        36.4%                    41                     271.1                19.4
PERLBMK       10.3%                    10                     65.1                 2.4
VORTEX2K      12.7%                    6                      1994.3               3.4
VPR           80.1%                    6                      90.2                 6.3


Benchmark Statistics: SPECint95

Application   Portion of Dynamic       Number of Unique       Average Epoch Size   Average Number of Epochs
Name          Execution Parallelized   Parallelized Regions   (dynamic insts)      Per Dynamic Region Instance
COMPRESS      75.5%                    7                      188.2                68.4
GO            31.3%                    40                     2252.7               56.2
IJPEG         90.6%                    23                     1499.8               33.8
LI            17.0%                    3                      176.4                124.9
M88KSIM       56.5%                    6                      840.4                50.2
PERL          43.9%                    4                      137.3                2.2




Application   Avg. Exposed      Incorrect   Correct   Not
Name          Loads per Epoch                         Confident
COMPRESS      12.0              0.3%        31.8%     67.9%
CRAFTY        4.5               3.0%        48.6%     48.3%
GO            7.8               2.5%        41.2%     56.2%
GZIP          66.6              1.4%        52.8%     45.7%
M88KSIM       7.5               1.2%        90.9%     7.7%
MCF           2.5               1.7%        34.9%     63.3%
PARSER        3.6               3.2%        48.7%     48.0%
VORTEX2K      25.4              2.8%        64.9%     32.2%
VPR           6.3               3.6%        49.8%     46.4%

exposed loads are quite predictable


Throttling Prediction Further

only predict violating loads

[Figure: on an exposed load, the load PC is recorded in the Exposed Load Table, which is indexed by cache tag; on a dependence violation, the cache tag selects load PCs from the Exposed Load Table, which are added to the Violating Loads List]
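
The two events in the figure can be sketched as a small bookkeeping structure. This is a loose illustration under several assumptions: one load PC is kept per cache line, the line address stands in for the cache tag, the 32B line size is taken from the Memory Parameters slide, and all class and method names are hypothetical.

```python
class ViolationTracker:
    """Remembers which load PCs have caused dependence violations."""

    def __init__(self, line_size=32):
        self.line_size = line_size      # 32B cache lines
        self.exposed_load_table = {}    # line address (stand-in for tag) -> load PC
        self.violating_loads = set()    # load PCs that caused violations

    def _tag(self, addr):
        return addr // self.line_size

    def on_exposed_load(self, load_pc, addr):
        """On an exposed load: record its PC under the line it touched."""
        self.exposed_load_table[self._tag(addr)] = load_pc

    def on_violation(self, addr):
        """On a dependence violation: move the recorded PC to the list."""
        load_pc = self.exposed_load_table.get(self._tag(addr))
        if load_pc is not None:
            self.violating_loads.add(load_pc)

    def should_predict(self, load_pc):
        """Only loads that have caused a violation are predicted."""
        return load_pc in self.violating_loads
```

Restricting prediction to the loads in the list keeps well-behaved exposed loads from risking a misprediction, whose cost under TLS is re-executing the epoch.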



Application   Incorrect   Correct   Not
Name                                Confident
COMPRESS      3.7%        31.2%     65.1%
CRAFTY        5.5%        24.6%     69.7%
GO            3.7%        28.3%     67.9%
GZIP          0.2%        98.0%     1.6%
M88KSIM       5.4%        91.0%     3.4%
MCF           2.5%        48.5%     48.9%
PARSER        2.8%        11.6%     85.5%
VORTEX2K      2.2%        81.9%     15.7%
VPR           2.8%        26.4%     70.7%

synchronized loads are also predictable



Application Name   Dynamic, Non-Stack,
COMPRESS           80%
CRAFTY             16%
GO                 16%
GZIP               4%
M88KSIM            57%
MCF                19%
PARSER             12%
VORTEX2K           84%
VPR                26%

silent stores are prevalent


Critical Path Prioritization

Application   Issued Insts That Are High
Name          Priority and Issued Early
COMPRESS      7.1%
CRAFTY        6.8%
GO            12.9%
GZIP          3.6%
M88KSIM       9.1%
MCF           9.9%
PARSER        9.7%
VORTEX2K      3.6%
VPR           4.7%

significant reordering
