dynamic performance tuning for speculative threads yangchun luo, venkatesan packirisamy, nikhil...

Dynamic Performance Tuning for Speculative Threads

Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre†,

Ankit Tarkas†, Wei-Chung Hsu, and Antonia

ZhaiDept. of Computer Science and Engineering†Dept. of Electronic Computer Engineering

University of Minnesota – Twin Cities

2

Motivation

Multicore processors brought forth large potentials of computing power

Exploiting these potentials demands thread-level parallelism

Intel Core i7 die photo

3

Exploiting Thread-Level Parallelism

Potentially more parallelism with speculation

dependence

Sequential

Store *p

Load *q

Store *p

Time

p != q

Thread-Level Speculation (TLS)

Traditional Parallelization

Load *q

p != q ??

Store 88

Load 20

Parallel execution

Store 88

Load 88

Speculation Failure

Time

Time

Load 88

Compiler gives up

p == q

But Unpredictable

4

Addressing UnpredictabilityState-of-the-art approach: Profiling-directed optimization Compiler analysis

Source codeProfiling run

Profiling info

Region selection

Parallel code generation

Feedback

Machine code

A typical compilation framework

Speculative code optimization

5

Problem #1

01

23

3

Time signal

allowcommit

wait

spawning thread

inter-thread synchronization

speculation failure

……load

imbalance

4

Hard to model TLS overhead statically, even with profiling

6

Problem #2

Profiling Behavior

Training Input

Actual Input

Actual Behavior

Profile-based decision may not be appropriate for actual input

7

Problem #3

Timephase1 phase2

Met

ric

phase3

Behavior of the same code changes over time

8

Problem #4

Load *p=0x88

Prefetching

Extra misses

Pollution

Load 0x22 miss

Load 0x22 hit

Load 0x22 Load 0x22miss miss

Sequential Parallel

P P P

How important are cache impacts?

L1$ L1$ L1$

PL1$

Load *p’=0x88

PL1$

PL1$

PL1$

Load *p=0x80

Load *p’=0x08

9

SEQ TLS-4core0

0.2

0.4

0.6

0.8

1

CacheSquashOther

Nor

mal

ized

Exe

cutio

n Ti

me

Impact on Cache Performance

60%

load *p=0x88

Cache: p == p’Prefetched (+)

load *p’=0x88

Cache impact is difficult to model statically

A major loop from benchmark art (scanner.c:594)

2.5X

10

Our proposalProposed Performance Tuning Framework

Compiler Analysis Runtime Analysis

Source code

Profiling info

Region selection

Parallel code generation

Machine code

Speculative code optimization

Accurate Timing & Cache Impacts

Adaptations for inputs and

phase changes

Performance Evaluation

Tuning PolicyCompiler Annotation

RuntimeInformation

Dynamic Execution

Performance Profile Table

11

Parallelization Decisions at Runtime

…

Evaluate performance

Thread spawning End of a parallel execution

spawn

parallelize

spawnspawn

serialize

…

spawn

Parallel execution Sequential execution

Parallel Execution

New execution model facilitates dynamic tuning

Performance Profile Table

12

Evaluating Performance ImpactRuntime Analysis

Simple metric: cost of speculative failure

Comprehensive evaluation: sequential performance prediction

SEQ TLS-4core0

0.2

0.4

0.6

0.8

1

CacheSquashOther

Nor

mal

ized

Exe

cutio

n Ti

me


Tuning Policy

RuntimeInformation

Dynamic ExecutionPerformance Profile Table

13

Sequential Performance Prediction

PredictP P

Programmable Cycle Classifier

Information from ROB

Hardware Counters

iFetch

tick

CacheBusy

ExeStall

SpeculativeParallel

Predicted sequential

Busy

iFetch

ExeStall

Cache

TLSoverhead

Performance Impact

14

Cache Behavior: Scenario #1

TLS

Count the miss in squashed execution

load p(unmodified)

miss

load p hit

Predicted

load p hit

SEQ

load p miss

T0

T0

T0 T0

Squash

15

Cache Behavior: Scenario #2

TLS

load p miss

load p miss

Predicted

load p miss*2

SEQ

load p missstore p

invalidate p

store p

Not count a miss if tag matched invalid status

store p

T0 T0 T0

T0

16

Tune the performance

Exhaustive search

Prioritized search

Runtime Analysis


Tuning Policy

RuntimeInformation

Dynamic ExecutionPerformance Profile Table

17

How to prioritize the searchSearch order: Inside-Out Compiler suggestion Outside-In

Performance summary metric: Speedup or not? Improvement in cycles

Search order

Perf

orm

ance

Su

mm

ary

met

ric

Spee

dup

or n

ot?

Inside-Out Compilersuggestion

Impr

ove.

in c

ycle

sMore

Aggressive

Quantitative+StaticHintQuantitative

Simple

Suboptimal

Evaluation Infrastructure

Benchmarks· SPEC2000 written in C, -O3 optimization

Underlying architecture· 4-core, chip-multiprocessor (CMP)· speculation supported by coherence

Simulator· Superscalar with detailed memory model· simulates communication latency· models bandwidth and contention

Detailed, cycle-accurate simulation

C

C

P

C

P

Interconnect

C

P

C

P

18

19

Comparing Tuning Policies

ammp artbzip

2cra

fty

equake gapgcc gzip mcf

mesaparse

r

perlbmk

twolf

vorte

xvp

r-p vpr-r

G.M.

0

0.5

1

1.5

2

2.5

3

Simple Quantitative Quantitative+StaticHint

Spee

dup

wrt

. SEQ

1.17x

1.23x

1.37x

Parallel Code Overhead

20

Profile-Based vs. Dynamic

ammp artbzip

2cra

fty

equake gapgcc gzip mcf

mesaparse

r

perlbmk

twolf

vorte

xvp

r-p vpr-r

G.M.

0

0.5

1

1.5

2

2.5

3

ProfileBased Quant+StaticHintProfileBased/ProfileBasedSEQ (Quant+StaticHint)/dynamicSEQ

Spee

dup

wrt

. SEQ

Dynamic outperformed static by ≈10% (1.37x/1.25x)

1.25x

Potentially go up to ≈15% (1.46x/1.27x)

1.27x

1.46x

21

Related works: Region Selection

Runtime Loop Detection

[Tubella HPCA’98]

[Marcuello ICS’98]

Compiler Heuristics

[Vijaykumar Micro’98]

Statically Dynamically

Saving Power from Useless Re-spawning

[Renau ICS’05]

Simple Profiling Pass

[Liu PPoPP’06] (POSH)

Extensive Profiling

[Wang LCPC’05]

Balanced Min-Cut

[Johnson PLDI’04]

Profile-Based Search

[Johnson PPoPP’07]

Lack of adaptability Lack of high-levelinformation

Dynamic Performance Tuning[Luo ISCA’09] (this work)

22

Conclusions

Deciding how to extract speculative threads at runtimeQuantitatively summarize performance profileCompiler hints avoid suboptimal decisions

Compiler and hardware work together. 37% speedup over sequential execution ≈10% speedup over profile-based decisions

Dynamic performance tuning can improve TLS efficiency

Configurable HPMs enable efficient dynamic optimizations

dynamic performance tuning for speculative threads yangchun luo, venkatesan packirisamy, nikhil...

Documents