Advanced Analysis in SUIF2 and Future Work Monica Lam Stanford University http://suif.stanford.edu/


Page 1: Advanced Analysis in SUIF2 and Future Work Monica Lam Stanford University

Advanced Analysis in SUIF2 and Future Work

Monica Lam
Stanford University
http://suif.stanford.edu/

Page 2

Interprocedural, High-Level Transforms for Locality and Parallelism

Program transformation for computational kernels: a new technique based on affine partitioning

Interprocedural analysis framework, designed to maximize code reuse: flow sensitivity; context sensitivity

Interprocedural program analysis: pointer alias analysis (Steensgaard's algorithm); scalar/scalar dependence, privatization, reduction recognition

Parallel code generation: define new IR nodes for parallel code

Page 3

Loop Transforms: Cholesky factorization example

DO 1 J = 0, N
   I0 = MAX ( -M, -J )
   DO 2 I = I0, -1
      DO 3 JJ = I0 - I, -1
         DO 3 L = 0, NMAT
3           A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
      DO 2 L = 0, NMAT
2        A(L,I,J) = A(L,I,J) * A(L,0,I+J)
   DO 4 L = 0, NMAT
4     EPSS(L) = EPS * A(L,0,J)
   DO 5 JJ = I0, -1
      DO 5 L = 0, NMAT
5        A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
   DO 1 L = 0, NMAT
1     A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

DO 6 I = 0, NRHS
   DO 7 K = 0, N
      DO 8 L = 0, NMAT
8        B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 7 JJ = 1, MIN (M, N-K)
         DO 7 L = 0, NMAT
7           B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
   DO 6 K = N, 0, -1
      DO 9 L = 0, NMAT
9        B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 6 JJ = 1, MIN (M, K)
         DO 6 L = 0, NMAT
6           B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

Page 4

Results for Optimizing Perfect Nests

Speedup on a Digital Turbolaser with eight 300 MHz 21164 processors

[Chart: speedup (0-8) vs. number of processors (1-8); curves for Unimodular + Blocking and Decomposition + Barrier Elimination]

Page 5

Optimizing Arbitrary Loop Nesting Using Affine Partitions

[Figure: the same Cholesky code as on the previous slide, annotated with the affine partitions found for arrays A, B, and EPSS; each array is partitioned along the L dimension]

Page 6

Results with Affine Partitioning + Blocking

[Chart: speedup (0-8) vs. number of processors (1-8); curves for Unimodular + Blocking, Decomposition + Barrier Elimination, and Affine Partitioning]

Page 7

New Transform Theory

Domain: arbitrary loop nesting, each instruction optimized separately

Unifies: permutation, skewing, reversal, fusion, fission, statement reordering

Supports blocking across all loop nests

Optimal: maximum degree of parallelism with minimum degree of synchronization

Minimizes communication by aligning the computation and pipelining

More powerful, and simpler software engineering

Page 8

A Simple Example

FOR i = 1 TO n DO
   FOR j = 1 TO n DO
      A[i,j] = A[i,j] + B[i-1,j];   (S1)
      B[i,j] = A[i,j-1] * B[i,j];   (S2)

[Figure: the (i, j) iteration space with dependences between instances of S1 and S2]

Page 9

Best Parallelization Scheme

SPMD code: let p be the processor's ID number

if (1-n <= p <= n) then
   if (1 <= p) then
      B[p,1] = A[p,0] * B[p,1];                       (S2)
   for i1 = max(1,1+p) to min(n,n-1+p) do
      A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];         (S1)
      B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];       (S2)
   if (p <= 0) then
      A[n+p,n] = A[n+p,n] + B[n+p-1,n];               (S1)

The solution can be expressed as affine partitions:
S1: execute iteration (i, j) on processor i-j.
S2: execute iteration (i, j) on processor i-j+1.
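As a sanity check, the scheme above can be simulated directly. The sketch below (Python with NumPy, not part of SUIF; arrays carry an extra row/column 0 for the boundary reads) runs the "processors" one after another in an arbitrary order and checks that the result matches the sequential loop nest, which is only possible because the partition is communication-free.

```python
import numpy as np

def sequential(A, B, n):
    A, B = A.copy(), B.copy()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i, j] = A[i, j] + B[i - 1, j]          # S1
            B[i, j] = A[i, j - 1] * B[i, j]          # S2
    return A, B

def spmd(A, B, n):
    A, B = A.copy(), B.copy()
    # Processors may run in any order: all flow dependences stay
    # within a single processor under this affine partition.
    for p in range(1 - n, n + 1):
        if p >= 1:
            B[p, 1] = A[p, 0] * B[p, 1]                              # S2
        for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
            A[i1, i1 - p] = A[i1, i1 - p] + B[i1 - 1, i1 - p]        # S1
            B[i1, i1 - p + 1] = A[i1, i1 - p] * B[i1, i1 - p + 1]    # S2
        if p <= 0:
            A[n + p, n] = A[n + p, n] + B[n + p - 1, n]              # S1
    return A, B

n = 5
rng = np.random.default_rng(0)
A0, B0 = rng.random((n + 1, n + 1)), rng.random((n + 1, n + 1))
As, Bs = sequential(A0, B0, n)
Ap, Bp = spmd(A0, B0, n)
assert np.allclose(As, Ap) and np.allclose(Bs, Bp)
```

The check works because S2(i-1, j) lands on processor (i-1)-j+1 = i-j, the same processor as its consumer S1(i, j), and S1(i, j-1) lands on i-(j-1) = i-j+1, the same as its consumer S2(i, j).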

Page 10

Maximum Parallelism & No Communication

Let Fxj be an access to array x in statement j,
ij be an iteration index for statement j,
Bj ij ≥ 0 represent the loop bound constraints for statement j.

Find Cj, which maps an instance of statement j to a processor, such that

   ∀ ij, ik : Bj ij ≥ 0, Bk ik ≥ 0, Fxj(ij) = Fxk(ik)  ⇒  Cj(ij) = Ck(ik)

with the objective of maximizing the rank of Cj.

[Figure: F1(i1) and F2(i2) map loop iterations to array elements; C1(i1) and C2(i2) map loop iterations to processor IDs]

Page 11

Algorithm

∀ ij, ik : Bj ij ≥ 0, Bk ik ≥ 0, Fxj(ij) = Fxk(ik)  ⇒  Cj(ij) = Ck(ik)

Rewrite the partition constraints as systems of linear equations:
   use the affine form of Farkas' lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers;
   use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers and obtain systems of linear equations AC = 0.
Find solutions using linear algebra techniques:
   the null space of the matrix A is a solution for C with maximum rank.
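The final step can be illustrated numerically. The sketch below (Python with NumPy; not the SUIF linear algebra package itself) computes a null-space basis via SVD for a toy constraint matrix A, so every column c of the basis satisfies Ac = 0:

```python
import numpy as np

def null_space(A, tol=1e-10):
    # Basis for {x : A x = 0}: the rows of V^T whose singular values
    # are (numerically) zero span the null space of A.
    A = np.atleast_2d(np.asarray(A, dtype=float))
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:].T          # shape (n, n - rank)

A = np.array([[1.0, -1.0, 0.0]])   # toy constraint: c1 - c2 = 0
N = null_space(A)
assert N.shape == (3, 2)           # rank(A) = 1, so a 2-dimensional null space
assert np.allclose(A @ N, 0)
```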

Page 12

Pipelining: Alternating Direction Integration Example

Requires transposing data:
   DO J = 1 to N (parallel)
      DO I = 1 to N
         A(I,J) = f(A(I,J), A(I-1,J))
   DO J = 1 to N
      DO I = 1 to N (parallel)
         A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:
   DO J = 1 to N (parallel)
      DO I = 1 to N
         A(I,J) = f(A(I,J), A(I-1,J))
   DO J = 1 to N (pipelined)
      DO I = 1 to N
         A(I,J) = g(A(I,J), A(I,J-1))

Page 13

Finding the Maximum Degree of Pipelining

Let Fxj be an access to array x in statement j,
ij be an iteration index for statement j,
Bj ij ≥ 0 represent the loop bound constraints for statement j.

Find Tj, which maps an instance of statement j to a time stage, such that

   ∀ ij, ik : Bj ij ≥ 0, Bk ik ≥ 0, ij ≺ ik (lexicographically), Fxj(ij) = Fxk(ik)  ⇒  Tj(ij) ≼ Tk(ik)

with the objective of maximizing the rank of Tj.

[Figure: F1(i1) and F2(i2) map loop iterations to array elements; T1(i1) and T2(i2) map loop iterations to time stages]

Page 14

Key Insight

Choice in the time mapping => (pipelined) parallelism.
Degrees of parallelism = rank(T) - 1.

Page 15

Putting it All Together

Find maximum outer-loop parallelism with minimum synchronization:
   Divide the program into strongly connected components.
   Apply the processor mapping algorithm (no communication) to the program.
   If no parallelism is found, apply the time mapping algorithm to find pipelining.
   If no pipelining is found (i.e. an outer sequential loop), repeat the process on the inner loops.

Minimize communication:
   Use a greedy method to order communicating pairs.
   Try to find communication-free, or neighborhood-only, communication by solving similar equations.
   Aggregate computations over consecutive data to improve spatial locality.

Page 16

Current Status

Completed: mathematics package
   Integrated Omega: Pugh's Presburger arithmetic
   Linear algebra package: Farkas' lemma, Gaussian elimination, finding null spaces
   Can find communication-free partitions

In progress:
   Rest of affine partitioning
   Code generation

Page 17

Interprocedural Analysis

Two major design choices in program analysis:

Across procedures:
   No interprocedural analysis
   Interprocedural: context-insensitive
   Interprocedural: context-sensitive

Within a procedure:
   Flow-insensitive
   Flow-sensitive: interval/region based
   Flow-sensitive: iterative over the flow graph
   Flow-sensitive: SSA based

Page 18

Efficient Context-Sensitive Analysis

Bottom-up:
   A region/interval: a procedure or a loop
   An edge: a call, or code in an inner scope
   Summarize each region (with a transfer function)
   Find strongly connected components (SCCs)
   Traverse the SCCs bottom-up
   Iterate to a fixed point for recursive functions

Top-down:
   Propagate values top-down
   Iterate to a fixed point for recursive functions

[Figure: region hierarchy with call edges, inner loops, and an SCC]
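The bottom-up traversal can be sketched with Tarjan's SCC algorithm, which emits components in callee-first order — exactly the order a summary-based analysis wants, with iteration needed only inside a multi-node (recursive) SCC. A minimal sketch (Python; the call graph is a hypothetical example, not from the slides):

```python
def sccs_postorder(callgraph):
    # Tarjan's algorithm; SCCs come out in reverse topological order,
    # i.e. every SCC appears before any SCC that calls into it.
    index, low, onstack, stack = {}, {}, set(), []
    order, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); onstack.add(v)
        for w in callgraph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in onstack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop(); onstack.discard(w); scc.append(w)
                if w == v:
                    break
            order.append(scc)

    for v in callgraph:
        if v not in index:
            strongconnect(v)
    return order

# main calls f and g; f and g are mutually recursive; both reach leaf.
cg = {'main': ['f', 'g'], 'f': ['g', 'leaf'], 'g': ['f'], 'leaf': []}
order = sccs_postorder(cg)
# → [['leaf'], ['g', 'f'], ['main']]: summarize leaf first, then iterate
#   the {f, g} SCC to a fixed point, then summarize main.
```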

Page 19

Interprocedural Framework Architecture

[Figure: framework architecture. A Driver on top; Compound Handlers (bottom-up, top-down, linear traversal) and Primitive Handlers (procedure calls and returns; regions & statements; basic blocks) supplying analysis-specific semantics, e.g. array summaries or pointer aliases; all operating over shared Data Structures (call graph and SCCs; regions and control flow graphs)]

Page 20

Interprocedural Framework Architecture

Interprocedural analysis data structures, e.g. call graphs, SSA form, regions or intervals.

Handlers: orthogonal sets of handlers for different groups of constructs.
   Primitive: the user specifies the analysis-specific semantics of primitives.
   Compound: handle compound statements and calls.
   The user chooses between handlers of different strengths,
      e.g. no interprocedural analysis versus context-sensitive,
      e.g. flow-insensitive versus flow-sensitive (CFG).
   All the handlers are registered in a visitor.

Driver:
   Invoked by the user's request for information (demand driven).
   Builds prepass data structures.
   Invokes the right set of handlers in the right order (e.g. bottom-up traversal of the call graph).
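To make the handler/driver split concrete, here is a deliberately toy sketch (Python; every name is invented for illustration and is not the SUIF2 API) of a bottom-up driver running a mod/ref-style analysis through user-supplied handlers:

```python
class Analysis:
    """Analysis-specific handlers the driver calls (hypothetical API)."""
    def initial(self): ...
    def handle_primitive(self, stmt, state): ...
    def handle_call(self, callee_summary, state): ...
    def summarize(self, state): ...

class ModRef(Analysis):
    """Toy bottom-up mod analysis: a summary is the set of globals written."""
    def initial(self):
        return set()
    def handle_primitive(self, stmt, state):
        if stmt[0] == "store":            # stmt = ("store", global_name)
            state.add(stmt[1])
    def handle_call(self, callee_summary, state):
        state |= callee_summary           # apply the callee's transfer function
    def summarize(self, state):
        return frozenset(state)

def bottom_up(procs, analysis, order):
    # 'order' lists procedures callee-first (e.g. from an SCC post-order),
    # so every summary exists before any caller needs it.
    summaries = {}
    for p in order:
        state = analysis.initial()
        for stmt in procs[p]:
            if stmt[0] == "call":
                analysis.handle_call(summaries[stmt[1]], state)
            else:
                analysis.handle_primitive(stmt, state)
        summaries[p] = analysis.summarize(state)
    return summaries

procs = {
    "leaf": [("store", "g1")],
    "mid":  [("call", "leaf"), ("store", "g2")],
    "main": [("call", "mid")],
}
summaries = bottom_up(procs, ModRef(), ["leaf", "mid", "main"])
assert summaries["main"] == {"g1", "g2"}
```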

Page 21

Pointer Alias Analysis

Steensgaard's pointer alias analysis (completed):
   Flow-insensitive and context-insensitive, type-inference based analysis
   Very efficient: near linear-time analysis
   Very inaccurate
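The unification flavor of the algorithm can be sketched with a union-find over abstract locations: merging two locations also merges whatever they point to, which is both the source of the near-linear running time and of the imprecision. A minimal sketch (Python; heavily simplified — no fields, loads, stores, or function pointers):

```python
class Steensgaard:
    """Sketch of Steensgaard-style unification-based points-to analysis."""

    def __init__(self):
        self.parent = {}   # union-find parents over abstract locations
        self.pts = {}      # class representative -> the one location it points to
        self.fresh = 0

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:              # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def join(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        ta, tb = self.pts.pop(ra, None), self.pts.pop(rb, None)
        self.parent[ra] = rb
        if ta is not None and tb is not None:
            self.pts[rb] = self.join(ta, tb)       # unify the pointees too
        elif ta is not None or tb is not None:
            self.pts[rb] = ta if ta is not None else tb
        return rb

    def target(self, x):
        r = self.find(x)
        if r not in self.pts:                      # unknown pointee: fresh location
            self.fresh += 1
            self.pts[r] = ('loc', self.fresh)
        return self.find(self.pts[r])

    def address(self, p, x):   # p = &x
        self.join(self.target(p), x)

    def assign(self, p, q):    # p = q
        self.join(self.target(p), self.target(q))

    def may_alias(self, p, q):
        return self.target(p) == self.target(q)

s = Steensgaard()
s.address('p', 'x')            # p = &x
s.address('q', 'x')            # q = &x
s.address('r', 'y')            # r = &y
assert s.may_alias('p', 'q')
assert not s.may_alias('p', 'r')
s.assign('q', 'r')             # q = r: unifies x's and y's classes
assert s.may_alias('p', 'r')   # now spuriously aliased: safe but imprecise
```

The last assertion shows the characteristic coarseness the slide calls "very inaccurate": one assignment collapses the points-to classes of x and y for every pointer in the program.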

Page 22

Parallelization Analysis

Scalar analysis:
   Mod/ref, reduction recognition: bottom-up, flow-insensitive
   Liveness for privatization: bottom-up and top-down, flow-sensitive

Region-based array analysis:
   May-write, must-write, read, upwards-exposed read: bottom-up
   Array liveness for privatization: bottom-up and top-down
   Uses our interprocedural framework + Omega

Symbolic analysis:
   Find linear relationships between scalar variables to improve array analysis

Page 23

Parallel Code Generation

Loop bound generation: uses Omega, based on the affine mappings.

Outlining and cloning primitives.

Special IR nodes to represent parallelization primitives:
   Allow a succinct, high-level description of the parallelization decision
   For communication to and from users
   Reduction and private variables and primitives
   Synchronization and parallelization primitives

[Figure: SUIF → SUIF+par IR → SUIF]

Page 24

Status

Completed:
   Call graphs, SCCs
   Steensgaard's pointer alias analysis
   Integration of the garbage collector with SUIF

In progress:
   Interprocedural analysis framework
   Array summaries
   Scalar dependence analysis
   Parallel code generation

To be done:
   Scalar symbolic analysis

Page 25

Future work: Basic compiler research

A flexible and integrated platform for new optimizations

Combinations of pointer, OO, and parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications

Interaction of garbage collection and exception handling with back-end optimizations

Embedded compilers with application-specific additions at the source language and architectural level

Page 26

As a Useful Compiler for High-Performance Computers

Basic ingredients of a state-of-the-art parallelizing compiler:
   Requires experimentation, tuning, refinement
   First implementation of affine partitioning
   Interprocedural parallelization requires many analyses working together

Missing functions:
   Automatic data distribution

User interaction needed for parallelizing large code regions:
   SUIF Explorer: a prototype interactive parallelizer in SUIF1
   Requires tools: algorithms to guide performance tuning, program slices, visualization tools

New techniques:
   Extend affine mapping to sparse codes (with permutation index arrays)
   Fortran 90 front end
   Debugging support

Page 27

New-Generation Productivity Tool

Apply high-level program analysis to increase programmers' productivity.

Many existing analyses:
   High-level, interprocedural side-effect analysis with pointers and arrays

New analyses:
   Flow- and context-sensitive pointer alias analysis
   Interprocedural, control-path-based analysis

Examples of tools:
   Find bugs in programs
   Prove or disprove user invariants
   Generate test cases
   Interactive, demand-driven analysis to aid in program debugging
   Can also be applied to Verilog/VHDL to improve hardware verification

Page 28

Finally ...

The system has to be actively maintained and supported to be useful.

Page 29

The End