automatic tuning of scientific applications apan qasem ken kennedy rice university houston, tx apan...

25
Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX

Upload: eliana-raphael

Post on 30-Mar-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

Automatic Tuning of Scientific ApplicationsAutomatic Tuning of

Scientific Applications

Apan Qasem Ken KennedyRice University

Houston, TX

Apan Qasem Ken KennedyRice University

Houston, TX

Page 2: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 2

Recap from Last YearRecap from Last Year

• A framework for automatic tuning of applications– Fine grain control of transformations– Feedback beyond whole program execution

time– Parameterized Search Engine– Target: Whole Applications– Search Space

• Multi-loop transformations: e.g Loop Fusion• Numerical Parameters

Page 3: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 3

Recap from Last YearRecap from Last Year

• Experiments with Direct Search– Direct Search able to find suitable tile sizes

and unroll factors by exploring only a small fraction of the search space

• 95% of best performance obtained by exploring 5% of the search space

– Search space pruning needed for making search more efficient

• Wandered into regions containing mostly bad values

• Direct search required more than 30 program evaluations in many cases

Page 4: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 4

Today’s talkToday’s talk

Search Space Pruning

Page 5: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 5

Search Space PruningSearch Space Pruning

• Key Idea:

Search for architecture-dependent model parameters rather than

transformation parameters

• Fundamentally different way of looking at the optimization search space

• Implemented for loop fusion and tiling [Qasem and Kennedy, ICS06]

Page 6: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

6

Register Set

F0 … … … F2L

L1 Cache

Tile Size

Fusion Config.

(L +

1)

dim

en

sio

ns

NL x

2L p

oin

tsT01 … … … T0N

T11 … … … T1N

T21 … … … T2N

T31 … … … T3N

Gray Code Representation

Cost Models

Effective Register

Set

Effective Cache

Capacity

Cost Model

Estimates of architectural parameters

Search SpaceArchitecturalParameters

(L +

1)

dim

en

sio

ns

(N- p

)L x

(2L-q

) poin

ts

New Search Space has only two dimensions!

Page 7: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 7

Our ApproachOur Approach

• Build a combined cost model for fusion and tiling to capture the interaction between the two transformations– Use reuse analysis to estimate trade-offs

• Expose architecture-dependent parameters within the model for tuning through empirical search– Pick T such that

• Working Set Effective Cache Capacity

– Search for suitable Effective Cache Capacity

Page 8: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 8

Tuning ParametersTuning Parameters

• Use a tolerance term to determine how much of a resource we can use at each tuning step

Effective Register Set = T x Register Set Size [0 < T 1]

Effective Cache Capacity = E(a, s, T) [0.01 T 0.20]

Miss Rate

Page 9: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 9

Search StrategySearch Strategy

• Start off conservatively with a low tolerance value and increase tolerance at each step

• Each tuning parameter constitutes a single search dimension

• Search is sequential and orthogonal – stop when performance starts to worsen– use reference values for other dimensions when

searching a particular dimension

Page 10: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 10

Benefits of Pruning StrategyBenefits of Pruning Strategy

• Reduce the size of the exploration search space– Single parameter captures the effects of

multiple transformations• Effective cache capacity for fusion and tiling

choices• Register pressure for fusion and loop unrolling

– Search Space does not grow with program size

• One parameter for all tiled loops in the application

• Correct for inaccuracies in model

Page 11: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 11

Performance Across Performance Across ArchitecturesArchitectures

0.90

0.95

1.00

1.05

1.10

1.15

1.20

1.25

1.30

Mean Speedup over baeline

MIPS Itanium Alpha PowerPCPentiumIII

GeoMean

Platforms

model-based native

Page 12: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 12

Performance Comparison Performance Comparison with Direct Searchwith Direct Search

1.0

1.1

1.2

1.3

1.4

1.5

advect3d erle liv18 mgrid

model-based direct

Page 13: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 13

Tuning Time Comparison Tuning Time Comparison with Direct Searchwith Direct Search

0

5

10

15

20

25

30

35

advect3d erle liv18 mgrid

model-based direct

Page 14: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 14

Conclusions and Future WorkConclusions and Future Work

• Approach of tuning for architectural parameters can significantly reduce the optimization search space, while incurring only a small performance penalty

• Extend pruning strategy to cover more– transformations

• Unroll-and-jam• Array Padding

– architectural parameters• TLB

Page 15: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

QuestionsQuestions

Page 16: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

Extra Slides Begin HereExtra Slides Begin Here

Page 17: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 17

Performance Performance Improvement ComparisonImprovement Comparison

80

85

90

95

100

105

0 30 60 90 120 150

Number of Iterations

Fraction of Best Speedup

Exhaustive Search Direct Search Random Search

Page 18: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 18

Tuning Time ComparisonTuning Time Comparison

0

2

4

6

8

10

12

advect3d lud mm vpenta mgrid swim

Benchmarks

Fraction of Tuning Time

DS 30

DS 60

DS 90

DS 120

Page 19: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 19

Feed

back

Next Iteration Parameters

Framework OverviewFramework Overview

Performance Measurement

Tools

TransformedSource

Native Compiler

Architectural Specs

ParameterizedSearch Engine

LoopToolSource Binary

User

Tuning Level

Search Space

Page 20: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

AutoTuning Workshop 06 Rice University 20

Why Direct Search?Why Direct Search?

• Search decision based solely on function evaluations– No modeling of the search space required

• Provides approximate solutions at each stage of the calculation– Can stop the search at any point when constrained

by tuning time

• Flexible– Can tune step sizes in different dimensions

• Parallelizable• Relatively easy to implement

Page 21: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

21

LA: do j = 1, N do i = 1, M b(i,j) = a(i,j) + a(i,j-1) enddo enddo

LB: do j = 1, N do i = 1, M c(i,j) = b(i,j) + d(j) enddo enddo

outer loop reuse of a()

cross-loopreuse of b()

(a) code before transformations

inner loop reuse of d()

Page 22: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

22

LAB: do j = 1, N do i = 1, M b(i,j) = a(i,j) + a(i,j-1) c(i,j) = b(i,j) + d(j) enddo enddo

lost reuse of a()

saved loads of b()

(b) code after two-level fusion

increased potential for conflict misses

Page 23: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

23

do i = 1, M, T do j = 1, N do ii = i, i + T - 1 b(ii,j) = a(ii,j)+ a(ii,j-1) c(ii,j) = b(ii,j) + d(j) enddo enddoenddo

reduced reuse of d()

How do we pick T?

Not too difficult if caches are fully associative

Can use models to estimate effective cache size for set-associative caches

Model unlikely to be totally accurate - Need a way to correct for inaccuracies

regained reuse of a()

Page 24: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

24

Register Set

T01 … … … T0N

T11 … … … T1N

T21 … … … T2N

L1 Cache

Tile Size

Fusion Config.

(L +

1)

dim

en

sio

ns

(N- p

)L x

(2L-q

) poin

ts

Cost Models

F0 … … … F2L

T31 … … … T3N

Page 25: Automatic Tuning of Scientific Applications Apan Qasem Ken Kennedy Rice University Houston, TX Apan Qasem Ken Kennedy Rice University Houston, TX

25

Register Set

T01 … … … T0N

T11 … … … T1N

T21 … … … T2N

L1 Cache

Tile Size

Fusion Config.

(L +

1)

dim

en

sio

ns

(N- p

)L x

(2L-q

) poin

ts

2 dimensions

F0 … … … F2L

T31 … … … T3N

Effective Register

Set

Effective Cache

Capacity

Cost Model

Estimates of machine parameters