Spiral: an empirical search system for program generation and optimization. David Padua, Department of Computer Science, University of Illinois at Urbana-Champaign

Post on 02-Jan-2016

Spiral: an empirical search system for program generation and optimization

David Padua
Department of Computer Science
University of Illinois at Urbana-Champaign

2

Program optimization today

• The optimization phase of a compiler applies a series of transformations to achieve its objectives.

• The compiler uses the outcome of program analysis to determine which transformations are correctness-preserving.

• Compiler transformation and analysis techniques are reasonably well-understood.

• Since many of the compiler optimization problems have “exponential complexity”, heuristics are needed to drive the application of transformations.

3

Optimization drivers

• Developing driving heuristics is laborious.

• One reason for this is the lack of methodologies and tools to build optimization drivers.

• As a result, although there is much in common among compilers, their optimization phases are usually re-implemented from scratch.

4

Optimization drivers (Cont.)

• A consequence: machines and languages that are not widely popular usually lack good compilers (as do some popular systems).
– DSP, network processor, and embedded system programming is often done in assembly language.
– Evaluation of new architectural features requiring compiler involvement is not always meaningful.
– Languages such as APL, MATLAB, LISP, … suffer from chronic low performance.
– New languages are difficult to introduce (although compilers are only a part of the problem).

5

A methodology based on the notion of search space

• Program transformations often have several possible target versions.
– Loop unrolling: how many times
– Loop tiling: size of the tile
– Loop interchanging: order of loop headers
– Register allocation: which registers are stored in memory to give room for new values

• The process of optimization can be seen as a search in the space of possible program versions.

6

Empirical search

Iterative compilation

• Perhaps the simplest application of the search space model is empirical search where several versions are generated and executed on the target machine. The fastest version is selected.

T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and H.A.G. Wijshoff. Iterative compilation in program optimization. In Proc. CPC2000, pages 35-44, 2000.
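The generate, run, and select loop described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the three hand-written summation routines stand in for compiler-generated versions with different unrolling degrees, and the names `time_version` and `empirical_search` are hypothetical, not part of any system mentioned in the talk.

```python
import time

def sum_plain(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_unroll2(xs):
    total = 0
    n = len(xs)
    i = 0
    while i + 1 < n:
        total += xs[i] + xs[i + 1]
        i += 2
    if i < n:  # leftover element when n is odd
        total += xs[i]
    return total

def sum_unroll4(xs):
    total = 0
    n = len(xs)
    i = 0
    while i + 3 < n:
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
        i += 4
    while i < n:  # leftover elements
        total += xs[i]
        i += 1
    return total

def time_version(fn, data, repeats=5):
    # Best-of-N wall-clock time on the machine running this script.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

def empirical_search(versions, data):
    # Run every candidate version and keep the fastest one.
    timings = {name: time_version(fn, data) for name, fn in versions.items()}
    best_name = min(timings, key=timings.get)
    return best_name, timings

versions = {"unroll1": sum_plain, "unroll2": sum_unroll2, "unroll4": sum_unroll4}
data = list(range(10_000))
best_name, timings = empirical_search(versions, data)
print(best_name, timings[best_name])
```

Which version wins depends on the machine and the interpreter, which is exactly why the selection is made by measurement rather than by a fixed rule.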

7

Empirical search and traditional compilers

• Searching is not a new approach and compilers have applied it in the past, but using architectural prediction models instead of actual runs:
– KAP searched for the best loop header order.
– SGI’s MIPS-pro and IBM PowerPC compilers select the best degree of unrolling.

8

Limitations of empirical search

• Empirical search is conceptually simple and portable.

• However,
– the search space tends to be too large, especially when several transformations are combined.
– it is not clear how to apply this method when program behavior is a function of the input data set.

• Need heuristics/search strategies.

• Availability of performance “formulas” could help evaluate transformations across input data sets and facilitate search.

9

Compilers and Library Generators

[Figure: diagram relating compilers and library generators, with the labels Source Program, Algorithm, Internal representation, Program Transformation, and Program Generation.]

10

Empirical search in program/library generators

• Examples:
– FFTW [M. Frigo, S. Johnson]
– Spiral (FFT/signal processing) [J. Moura (CMU), M. Veloso (CMU), J. Johnson (Drexel), …]
– ATLAS (linear algebra) [R. Whaley, A. Petitet, J. Dongarra]
– PHiPAC [J. Demmel et al.]

11

12

SPIRAL

• The approach:
– Mathematical formulation of signal processing algorithms
– Automatically generate algorithm versions
– A generalization of the well-known FFTW
– Use compiler techniques to translate formulas into implementations
– Adapt to the target platform by searching for the optimal version

13

14

Fast DSP Algorithms As Matrix Factorizations

• Computing $y = F_4 x$ is carried out as:

$t_1 = A_4 x$ (permutation)

$t_2 = A_3 t_1$ (two $F_2$’s)

$t_3 = A_2 t_2$ (diagonal scaling)

$y = A_1 t_3$ (two $F_2$’s)

• The cost is reduced because $A_1$, $A_2$, $A_3$ and $A_4$ are structured sparse matrices.
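The four stages can be written out directly to show how little work each sparse factor does. This is a minimal sketch: it assumes the transform is defined with the root of unity `W = 1j` (use `-1j` for the opposite sign convention), and the function names are illustrative only.

```python
W = 1j  # assumed primitive 4th root of unity; the sign convention is an assumption

def f4_direct(x):
    # Dense definition: y[j] = sum_k W^(j*k) * x[k], 16 multiply-adds.
    return [sum(W ** (j * k) * x[k] for k in range(4)) for j in range(4)]

def f4_factored(x):
    # t1 = A4 x: permutation (even-indexed inputs first, then odd-indexed).
    t1 = [x[0], x[2], x[1], x[3]]
    # t2 = A3 t1: two F2 butterflies on adjacent pairs.
    t2 = [t1[0] + t1[1], t1[0] - t1[1], t1[2] + t1[3], t1[2] - t1[3]]
    # t3 = A2 t2: diagonal scaling, only the last entry is nontrivial.
    t3 = [t2[0], t2[1], t2[2], W * t2[3]]
    # y = A1 t3: two F2 butterflies at stride 2.
    return [t3[0] + t3[2], t3[1] + t3[3], t3[0] - t3[2], t3[1] - t3[3]]

x = [1, 2, 3, 4]
print(f4_factored(x))
```

The factored version performs eight additions or subtractions and one multiplication by `W`, while the direct definition evaluates a full 4 × 4 product.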

15

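Tensor Product Formulation of Cooley-Tukey

Theorem

$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r$$

where $T^{rs}_s$ is a diagonal matrix and $L^{rs}_r$ is a stride permutation.

Example

$$F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2 =
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & \omega_4 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Here $\omega_4$ is the primitive fourth root of unity used to define $F_4$ ($i$ or $-i$, depending on the sign convention).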

16

Formulas for Matrix Factorizations

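$$F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2$$

$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r \qquad \text{(R1)}$$

$$F_n = \prod_{i=1}^{k} \left(I_{n_{i-}} \otimes F_{n_i} \otimes I_{n_{i+}}\right)\left(I_{n_{i-}} \otimes T^{n_i n_{i+}}_{n_{i+}}\right) \; \prod_{i=k}^{1} \left(I_{n_{i-}} \otimes L^{n_i n_{i+}}_{n_i}\right) \qquad \text{(R2)}$$

where $n = n_1 \cdots n_k$, $n_{i-} = n_1 \cdots n_{i-1}$, $n_{i+} = n_{i+1} \cdots n_k$.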
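Rule R1 can be checked numerically by building each factor as a dense matrix and multiplying them out. The sketch below uses only the standard library; the sign convention $\omega_n = e^{2\pi i/n}$, the index layout chosen for the stride permutation and the twiddle matrix, and all function names are assumptions made for illustration.

```python
import cmath

def omega(n, k):
    # Assumed convention: omega_n = exp(2*pi*i/n).
    return cmath.exp(2j * cmath.pi * k / n)

def dft(n):
    return [[omega(n, j * k) for k in range(n)] for j in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def kron(a, b):
    # Kronecker (tensor) product of two dense matrices.
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def twiddle(r, s):
    # Diagonal matrix with entry omega_{rs}^(a*c) at position a*s + c.
    n = r * s
    d = [omega(n, a * c) for a in range(r) for c in range(s)]
    return [[d[i] if i == j else 0 for j in range(n)] for i in range(n)]

def stride_perm(r, s):
    # Permutation that gathers the inputs at stride r into r blocks of length s.
    n = r * s
    p = [[0] * n for _ in range(n)]
    for a in range(r):
        for b in range(s):
            p[a * s + b][b * r + a] = 1
    return p

def cooley_tukey(r, s):
    # (F_r (x) I_s) T (I_r (x) F_s) L, as in rule R1.
    m = matmul(kron(dft(r), identity(s)), twiddle(r, s))
    m = matmul(m, kron(identity(r), dft(s)))
    return matmul(m, stride_perm(r, s))

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

print(max_abs_diff(cooley_tukey(2, 4), dft(8)))
```

Dense matrices are used here only to verify the identity; a generated implementation would apply each sparse factor in place instead of forming it.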

17

Factorization Trees

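[Figure: three factorization trees for $F_8$. Two trees apply rule R1 at the root and again to an $F_4$ subtree, with $F_2$ leaves; the third applies rule R2 at the root, with three $F_2$ leaves.]

Different computation order

Different data access pattern

Different performance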

18

Walsh-Hadamard Transform

19

Optimal Factorization Trees

• Depend on the platform

• Difficult to deduce

• Can be found by empirical search
– The search space is very large
– Different search algorithms: random, DP, GA, hill-climbing, exhaustive
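As an illustration of the DP option, the sketch below searches binary factorization trees (rule R1 only) for $F_{2^n}$ under the usual dynamic-programming assumption that the best tree for a size is built from the best trees for smaller sizes. The tree encoding and the `toy_cost` function are hypothetical; in an empirical search system the cost would come from timing generated code on the target machine.

```python
def size_exp(tree):
    # A leaf is the integer 1 (meaning F_2); an internal node is a (left, right) pair.
    return tree if isinstance(tree, int) else size_exp(tree[0]) + size_exp(tree[1])

def toy_cost(tree):
    # Hypothetical stand-in for a measured run time: larger left subtrees cost more.
    if isinstance(tree, int):
        return 1.0
    left, right = tree
    return toy_cost(left) + toy_cost(right) + 0.5 * size_exp(left)

def dp_search(max_exp, measure):
    best = {1: 1}  # F_2 is always a leaf
    for n in range(2, max_exp + 1):
        # Only n - 1 candidates per size: every split n = a + (n - a),
        # each combined from the best subtrees found so far.
        candidates = [(best[a], best[n - a]) for a in range(1, n)]
        best[n] = min(candidates, key=measure)
    return best

best = dp_search(4, toy_cost)
print(best[4])
```

With this toy cost the search settles on right-expanded trees. DP evaluates only a handful of candidates per size, at the price of possibly missing the global optimum when the best subtree depends on its context.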

20

21

22

Size of Search Space

N          # of formulas     N          # of formulas
$2^1$      1                 $2^9$      20793
$2^2$      1                 $2^{10}$   103049
$2^3$      3                 $2^{11}$   518859
$2^4$      11                $2^{12}$   2646723
$2^5$      45                $2^{13}$   13648869
$2^6$      197               $2^{14}$   71039373
$2^7$      903               $2^{15}$   372693519
$2^8$      4279              $2^{16}$   1968801519
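The growth in this table can be reproduced by counting factorization trees. The sketch below assumes that a formula for $F_{2^n}$ corresponds to a tree whose leaves are $F_2$ and whose internal nodes split the exponent into two or more ordered parts (two parts for R1, more for R2); that correspondence is an interpretation of the rules, not something stated on the slide.

```python
def count_formulas(max_exp):
    f = [0] * (max_exp + 1)  # f[n]: number of trees for F_{2^n}
    s = [0] * (max_exp + 1)  # s[n]: ordered sequences of trees whose exponents sum to n
    s[0] = 1                 # the empty sequence
    for n in range(1, max_exp + 1):
        if n == 1:
            f[n] = 1         # F_2 itself, a single leaf
        else:
            # First child has exponent j, followed by a non-empty sequence of trees.
            f[n] = sum(f[j] * s[n - j] for j in range(1, n))
        s[n] = sum(f[j] * s[n - j] for j in range(1, n + 1))
    return f[1:]

print(count_formulas(9))
```

Under this assumption the recurrence yields the leading entries of the table (1, 1, 3, 11, 45, 197, ...), and the count grows by roughly a factor of five or more with each doubling of the transform size.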

23

24

25

More Search Choices

• Programming:
– Loop unrolling
– Memory allocation
– In-lining

• Platform choices:
– Compiler optimization options

26

The SPIRAL System

[Figure: block diagram of the SPIRAL system. Components: Formula Generator, SPL Compiler, Performance Evaluation, Search Engine. Labels on the connections: DSP Transform, SPL Program, C/FORTRAN Programs, Target machine, DSP Library.]

27

Spiral

• Spiral does the factorization at installation time and generates one library routine for each size.

• FFTW generates only codelets (input size ≤ 64) and performs the factorization at run time.

28

A Simple SPL Program

The program consists of comments, definitions, a directive, and a formula:

    ; This is a simple SPL program
    (define A (matrix (1 2) (2 1)))
    (define B (diagonal (3 3)))
    #subname simple
    (tensor (I 2) (compose A B))
    ;; This is an invisible comment
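To make the meaning of the formula concrete, the sketch below computes the matrix it denotes, reading `compose` as a matrix product and `tensor` as a Kronecker product. The helper names are illustrative and are not part of SPL or its compiler.

```python
def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def kron(a, b):
    # Kronecker (tensor) product of two dense matrices.
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]

def diagonal(entries):
    n = len(entries)
    return [[entries[i] if i == j else 0 for j in range(n)] for i in range(n)]

A = [[1, 2], [2, 1]]             # (define A (matrix (1 2) (2 1)))
B = diagonal([3, 3])             # (define B (diagonal (3 3)))
I2 = diagonal([1, 1])            # (I 2)
simple = kron(I2, matmul(A, B))  # (tensor (I 2) (compose A B))

for row in simple:
    print(row)
```

The result is a 4 × 4 block-diagonal matrix with two copies of the product of A and B on the diagonal, so the generated routine only has to apply the same 2 × 2 computation to two halves of the input.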

29

Templates

    (template (F n)
      [ n >= 1 ]
      ( do i=0,n-1
          y(i)=0
          do j=0,n-1
            y(i)=y(i)+W(n,i*j)*x(j)
          end
        end ))

A template has three parts: the pattern (F n), the condition [ n >= 1 ], and the I-code (the do loop nest).

30

SPL Compiler

[Figure: the SPL compiler pipeline. Phases: Parsing, Intermediate Code Generation, Intermediate Code Restructuring, Optimization, Target Code Generation. Inputs: SPL Formula, Template Definition. Intermediate products: Abstract Syntax Tree, Template Table, I-Code. Output: FORTRAN, C.]

31

Intermediate Code Restructuring

• Loop unrolling
– Degree of unrolling can be controlled globally or case by case

• Scalar function evaluation
– Replace scalar functions with constant value or array access

• Type conversion
– Type of input data: real or complex
– Type of arithmetic: real or complex
– Same SPL formula, different C/Fortran programs
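Loop unrolling and scalar function evaluation can be illustrated with a small generator that turns the double loop of a direct DFT into straight-line code with the twiddle factors already evaluated. This is a sketch in Python rather than the compiler's I-code; `W` is assumed to be $e^{2\pi i k/n}$, and every name here is hypothetical.

```python
import cmath

def W(n, k):
    # Assumed scalar function: W(n, k) = exp(2*pi*i*k/n).
    return cmath.exp(2j * cmath.pi * k / n)

def dft_loop(n, x):
    # Loop form, mirroring the structure of a direct DFT double loop.
    y = [0j] * n
    for i in range(n):
        for j in range(n):
            y[i] = y[i] + W(n, i * j) * x[j]
    return y

def unrolled_dft_source(n):
    # Fully unroll both loops and evaluate W at generation time,
    # so the emitted code contains only constants and array accesses.
    lines = [f"def dft{n}(x):", f"    y = [0j] * {n}"]
    for i in range(n):
        terms = []
        for j in range(n):
            w = W(n, i * j)
            terms.append(f"complex({w.real!r}, {w.imag!r}) * x[{j}]")
        lines.append(f"    y[{i}] = " + " + ".join(terms))
    lines.append("    return y")
    return "\n".join(lines)

source = unrolled_dft_source(4)
namespace = {}
exec(source, namespace)  # compile the generated straight-line routine
dft4 = namespace["dft4"]

print(source)
```

A real restructuring pass would also drop multiplications by 1 and fold the near-zero rounding residue in the constants; the sketch leaves them in to keep the generator short.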

32

33

Optimizations

• Formula Generator
– High-level scheduling
– Loop transformation

• SPL Compiler
– High-level optimizations: constant folding, copy propagation, CSE, dead code elimination

• C/Fortran Compiler
– Low-level optimizations: instruction scheduling, register allocation

34

Basic Optimizations (FFT, N=$2^5$, SPARC, f77 -fast -O5)

35

Basic Optimizations (FFT, N=$2^5$, MIPS, f77 -O3)

36

Basic Optimizations (FFT, N=$2^5$, PII, g77 -O6 -malign-double)

37

Performance Evaluation

• Evaluating the performance of the code generated by the SPL compiler

• Platforms: SPARC, MIPS, PII

• Search strategy: dynamic programming

38

Pseudo MFlops

• Estimation of the # of FP operations:
– FFT (radix-2): $5n\log_2 n - 10n + 16$

$$\text{Pseudo MFlops} = \frac{\text{\# of FP operations in the algorithm}}{\text{Execution time } (\mu\text{s})}$$
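The metric is simple to compute once a run time is available. The sketch below assumes the radix-2 operation count $5n\log_2 n - 10n + 16$ and an execution time given in microseconds, so that operations per microsecond equal millions of operations per second; the function names are illustrative.

```python
import math

def radix2_flops(n):
    # Assumed nominal operation count for a radix-2 FFT of size n (n a power of two).
    return 5 * n * math.log2(n) - 10 * n + 16

def pseudo_mflops(n, micros):
    # Nominal operations divided by the measured time in microseconds.
    return radix2_flops(n) / micros

print(pseudo_mflops(1024, 50.0))
```

Because the numerator is a fixed nominal count rather than the number of operations a particular implementation executes, the metric ranks implementations of the same size by speed alone.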

39

FFT Performance (N=$2^1$ to $2^6$)

[Figure: performance plots for SPARC, MIPS, and PII.]

40

FFT Performance (N=$2^7$ to $2^{20}$)

[Figure: performance plots for SPARC, MIPS, and PII.]

41

Important Questions

• What lessons can be learned from this work?

• Can this approach be used in other domains?

42