Advanced Analysis in SUIF2 and Future Work
Monica Lam, Stanford University, http://suif.stanford.edu/
Interprocedural, High-Level Transforms for Locality and Parallelism
- Program transformation for computational kernels: a new technique based on affine partitioning
- Interprocedural analysis framework (to maximize code reuse): flow sensitivity, context sensitivity
- Interprocedural program analysis: pointer alias analysis (Steensgaard's algorithm); scalar/scalar dependence, privatization, reduction recognition
- Parallel code generation: define new IR nodes for parallel code
Loop Transforms: Cholesky factorization example
      DO 1 J = 0, N
        I0 = MAX ( -M, -J )
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
            DO 3 L = 0, NMAT
    3         A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
    2       A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
    4     EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
          DO 5 L = 0, NMAT
    5       A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
    1     A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
    8       B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 JJ = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
    7         B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
    9       B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 JJ = 1, MIN (M, K)
            DO 6 L = 0, NMAT
    6         B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
Results for Optimizing Perfect Nests
Speedup on a Digital TurboLaser with eight 300 MHz 21164 processors
[Chart: speedup versus number of processors (1-8) for Unimodular + Blocking and for Decomposition + Barrier Elimination.]
Optimizing Arbitrary Loop Nesting Using Affine Partitions
[Slide repeats the Cholesky kernel above, annotated with the affine partitions that map the computations on arrays A, B, and EPSS onto the L dimension.]
Results with Affine Partitioning + Blocking
[Chart: speedup versus number of processors (1-8) for Unimodular + Blocking, Decomposition + Barrier Elimination, and Affine Partitioning.]
New Transform Theory
- Domain: arbitrary loop nesting; each instruction optimized separately
- Unifies permutation, skewing, reversal, fusion, fission, and statement reordering
- Supports blocking across all loop nests
- Optimal: maximum degree of parallelism with minimum degree of synchronization
- Minimizes communication by aligning the computation and pipelining
- More powerful, and simpler software engineering
A Simple Example
FOR i = 1 TO n DO
FOR j = 1 TO n DO
A[i,j] = A[i,j]+B[i-1,j]; (S1)
B[i,j] = A[i,j-1]*B[i,j]; (S2)
[Figure: iteration space over (i, j) showing the dependences between S1 and S2.]
Best Parallelization Scheme
SPMD code: Let p be the processor’s ID number
if (1-n <= p <= n) then
if (1 <= p) then
B[p,1] = A[p,0] * B[p,1]; (S2)
for i1 = max(1,1+p) to min(n,n-1+p) do
A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]; (S1)
B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1]; (S2)
if (p <= 0) then
A[n+p,n] = A[n+p,n] + B[n+p-1,n]; (S1)
The solution can be expressed as affine partitions. S1: execute iteration (i, j) on processor i-j. S2: execute iteration (i, j) on processor i-j+1.
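These partitions can be verified mechanically. A small sketch (plain Python, written for this transcript rather than taken from SUIF) that enumerates the example's flow dependences and checks that producer and consumer land on the same processor:

```python
# Affine partitions from the slide: S1 (i,j) -> i-j, S2 (i,j) -> i-j+1.
def p_s1(i, j):
    return i - j

def p_s2(i, j):
    return i - j + 1

n = 6  # small, arbitrary test size

# Flow dependences of the example loop nest:
#   S2(i,j) reads A[i,j-1], which S1(i,j-1) writes
#   S1(i,j) reads B[i-1,j], which S2(i-1,j) writes
ok = True
for i in range(1, n + 1):
    for j in range(1, n + 1):
        if j > 1:
            ok &= p_s1(i, j - 1) == p_s2(i, j)
        if i > 1:
            ok &= p_s2(i - 1, j) == p_s1(i, j)
print("communication-free:", ok)  # True: every dependence stays on one processor
```

Since no dependence crosses processors, the SPMD code above needs no communication or synchronization between the guarded statements.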
Maximum Parallelism & No Communication
Let F_xj be an access to array x in statement j,
i_j be an iteration index for statement j, and
B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

    for all i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
        F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.
[Figure: accesses F1(i1), F2(i2) map loop iterations to array elements; C1(i1), C2(i2) map the same iterations to processor IDs.]
Algorithm
    for all i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
        F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations:
  - use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers, yielding systems of linear equations A C = 0
- Find solutions using linear algebra techniques:
  - the null space of matrix A is a solution for C with maximum rank
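The null-space step can be made concrete. A minimal exact-arithmetic sketch (pure Python; this is not SUIF's linear algebra package, and the matrix below is a made-up example):

```python
from fractions import Fraction

def null_space(A):
    """Return a basis of the null space of A, using exact rational arithmetic."""
    m, n = len(A), len(A[0])
    M = [[Fraction(x) for x in row] for row in A]
    pivots, r = [], 0                      # pivot columns, current rank
    for c in range(n):
        piv = next((i for i in range(r, m) if M[i][c] != 0), None)
        if piv is None:
            continue                       # no pivot in this column
        M[r], M[piv] = M[piv], M[r]
        M[r] = [x / M[r][c] for x in M[r]]
        for i in range(m):                 # eliminate column c everywhere else
            if i != r and M[i][c] != 0:
                M[i] = [a - M[i][c] * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in range(n):                  # one basis vector per free column
        if free in pivots:
            continue
        v = [Fraction(0)] * n
        v[free] = Fraction(1)
        for row, c in zip(M, pivots):
            v[c] = -row[free]
        basis.append(v)
    return basis

# Toy constraint system A C = 0 with A = [[1, -1, 0]]:
B = null_space([[1, -1, 0]])
print(len(B))  # 2 independent solutions -> a rank-2 mapping exists
```

The basis vectors are exactly the candidate rows of C; the larger the null space, the higher the rank of the processor mapping that satisfies the constraints.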
Pipelining: Alternating Direction Integration Example
Requires transposing data:

    DO J = 1 to N (parallel)
      DO I = 1 to N
        A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N
      DO I = 1 to N (parallel)
        A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:

    DO J = 1 to N (parallel)
      DO I = 1 to N
        A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N (pipelined)
      DO I = 1 to N
        A(I,J) = g(A(I,J), A(I,J-1))
Finding the Maximum Degree of Pipelining
Let F_xj be an access to array x in statement j,
i_j be an iteration index for statement j, and
B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

    for all i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
        (i_j ≺ i_k) ∧ (F_xj(i_j) = F_xk(i_k))  ⇒  T_j(i_j) ≤ T_k(i_k)

where ≺ denotes lexicographic order, with the objective of maximizing the rank of T_j.
[Figure: accesses F1(i1), F2(i2) map loop iterations to array elements; T1(i1), T2(i2) map the same iterations to time stages.]
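The time-mapping constraint can be checked by brute force for a candidate mapping. A tiny sketch (plain Python; the mapping T(i, j) = j is a hypothetical choice for a dependence that runs along j, as in the second ADI nest):

```python
# Candidate time mapping: T(i, j) = j, one stage per J value.
def T(i, j):
    return j

n = 5
ok = True
for i in range(1, n + 1):
    for j in range(2, n + 1):
        # A(i,j) reads A(i,j-1): producer (i,j-1) precedes consumer (i,j)
        # lexicographically, so its time stage must not be later.
        ok &= T(i, j - 1) <= T(i, j)
print("valid time mapping:", ok)  # True
```

Here rank(T) = 1, so by the key insight on the next slide this nest has 1 - 1 = 0 degrees of fully independent parallelism along the checked dependence; parallelism comes from pipelining the stages.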
Key Insight
Choice in time mapping ⇒ (pipelined) parallelism. Degrees of parallelism = rank(T) - 1.
Putting it All Together
- Find maximum outer-loop parallelism with minimum synchronization:
  - Divide into strongly connected components
  - Apply the processor mapping algorithm (no communication) to the program
  - If no parallelism is found, apply the time mapping algorithm to find pipelining
  - If no pipelining is found (an outer sequential loop), repeat the process on the inner loops
- Minimize communication:
  - Use a greedy method to order communicating pairs
  - Try to find communication-free, or neighborhood-only, communication by solving similar equations
  - Aggregate computations on consecutive data to improve spatial locality
Current Status
Completed: mathematics package
- Integrated Omega: Pugh's Presburger arithmetic
- Linear algebra package: Farkas' lemma, Gaussian elimination, finding the null space
- Can find communication-free partitions
In progress:
- Rest of affine partitioning
- Code generation
Interprocedural Analysis
Two major design choices in program analysis:
- Across procedures:
  - No interprocedural analysis
  - Interprocedural, context-insensitive
  - Interprocedural, context-sensitive
- Within a procedure:
  - Flow-insensitive
  - Flow-sensitive: interval/region based
  - Flow-sensitive: iterative over the flow graph
  - Flow-sensitive: SSA based
Efficient Context-Sensitive Analysis
Bottom-up:
- A region/interval: a procedure or a loop
- An edge: a call, or code in an inner scope
- Summarize each region with a transfer function
- Find strongly connected components (SCCs)
- Traverse the SCCs bottom-up
- Iterate to a fixed point for recursive functions
Top-down:
- Propagate values top-down
- Iterate to a fixed point for recursive functions
[Figure: call graph with call edges, inner loops, and an SCC.]
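The bottom-up phase can be sketched concretely (plain Python with a hypothetical call graph; Tarjan's algorithm conveniently emits SCCs callees-first, so summarizing in emission order handles each callee before its callers):

```python
calls = {                      # hypothetical call graph; 'b' and 'c' are
    "main": ["a", "b"],        # mutually recursive, so they form one SCC
    "a": [],
    "b": ["c"],
    "c": ["b"],
}

def sccs(graph):
    """Tarjan's algorithm; returns SCCs in reverse topological order."""
    index, low, on_stack, stack = {}, {}, set(), []
    result, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            result.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return result

# Summarize in emission order: leaves first, recursive group {'b','c'}
# together (iterate to a fixed point inside it), 'main' last.
order = sccs(calls)
print([sorted(c) for c in order])
```

A real driver would run the transfer-function summarization over each component in this order, iterating within a component until the summaries of the recursive functions stop changing.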
Interprocedural Framework Architecture
[Figure: framework architecture. Primitive handlers (e.g. array summaries, pointer aliases) cover procedure calls and returns, regions and statements, and basic blocks; compound handlers cover bottom-up, top-down, and linear traversals; a driver sits on top of the shared data structures (call graph with SCCs, regions, control flow graphs).]
Interprocedural Framework Architecture
- Interprocedural analysis data structures, e.g. call graphs, SSA form, regions or intervals
- Handlers: orthogonal sets of handlers for different groups of constructs
  - Primitive: the user specifies the analysis-specific semantics of primitives
  - Compound: handles compound statements and calls
  - The user chooses between handlers of different strengths, e.g. no interprocedural analysis versus context-sensitive, or flow-insensitive versus flow-sensitive (CFG-based)
  - All the handlers are registered in a visitor
- Driver
  - Invoked by the user's request for information (demand driven)
  - Builds prepass data structures
  - Invokes the right set of handlers in the right order (e.g. bottom-up traversal of the call graph)
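The handler-registration idea can be sketched as follows (plain Python; the class and handler names are invented for illustration, not SUIF's API):

```python
class Visitor:
    """Registry of user-supplied primitive handlers, keyed by construct kind."""
    def __init__(self):
        self.handlers = {}

    def register(self, kind, fn):
        self.handlers[kind] = fn

    def analyze(self, region):
        # Linear traversal of one region; a real driver would instead pick
        # a bottom-up or top-down order from the compound handlers.
        results = []
        for kind, payload in region:
            if kind in self.handlers:   # unhandled kinds are simply skipped
                results.append(self.handlers[kind](payload))
        return results

v = Visitor()
v.register("call", lambda callee: f"summary({callee})")
v.register("store", lambda var: f"mod({var})")

region = [("store", "x"), ("call", "f"), ("load", "y")]  # 'load' unhandled
print(v.analyze(region))  # ['mod(x)', 'summary(f)']
```

Because the handlers are orthogonal, swapping in a stronger set (say, context-sensitive call handlers) changes the analysis without touching the traversal machinery.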
Pointer Alias Analysis
Steensgaard's pointer alias analysis (completed):
- Flow-insensitive and context-insensitive, type-inference based
- Very efficient: near linear-time analysis
- Very imprecise
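The flavor of the algorithm can be shown in a few lines (plain Python; a deliberately simplified union-find model, not Steensgaard's full type system): each pointer gets one abstract target, and assignments unify targets, which is what makes the analysis near linear but imprecise.

```python
parent = {}

def find(x):
    """Union-find root of x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def unify(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

target = {}  # pointer variable -> its single abstract pointee node

def assign_addr(p, v):          # models: p = &v
    if p in target:
        unify(target[p], v)     # p already points somewhere: merge targets
    else:
        target[p] = v

def assign(p, q):               # models: p = q (both pointers)
    for r in (p, q):
        target.setdefault(r, f"tgt({r})")
    unify(target[p], target[q])

# p = &a; q = &b; p = q  ==>  the one target of p/q merges a and b
assign_addr("p", "a")
assign_addr("q", "b")
assign("p", "q")
print(find("a") == find("b"))  # True: a and b land in the same alias class
```

The imprecision is visible directly: after `p = q`, the analysis concludes p may point to b and q may point to a, even though neither ever does.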
Parallelization Analysis
- Scalar analysis
  - Mod/ref, reduction recognition: bottom-up, flow-insensitive
  - Liveness for privatization: bottom-up and top-down, flow-sensitive
- Region-based array analysis
  - May-write, must-write, read, upwards-exposed read: bottom-up
  - Array liveness for privatization: bottom-up and top-down
  - Uses our interprocedural framework + Omega
- Symbolic analysis
  - Find linear relationships between scalar variables to improve the array analysis
Parallel Code Generation
- Loop bound generation: uses Omega, based on the affine mappings
- Outlining and cloning primitives
- Special IR nodes to represent parallelization primitives
  - Allow a succinct, high-level description of parallelization decisions
  - For communication to and from users
  - Reduction and private variables and primitives
  - Synchronization and parallelization primitives
[Figure: SUIF IR extended with parallelization nodes (SUIF+par IR), alongside plain SUIF.]
Status
Completed: call graphs and SCCs; Steensgaard's pointer alias analysis; integration of the garbage collector with SUIF
In progress: interprocedural analysis framework; array summaries; scalar dependence analysis; parallel code generation
To be done: scalar symbolic analysis
Future work: Basic compiler research
A flexible and integrated platform for new optimizations
Combinations of pointers, OO, parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications
Interaction between garbage collection, exception handling with back end optimizations
Embedded compilers with application-specific additions at the source language and architectural level
As a Useful Compiler for High-Performance Computers
- Basic ingredients of a state-of-the-art parallelizing compiler; requires experimentation, tuning, refinement
  - First implementation of affine partitioning
  - Interprocedural parallelization requires many analyses working together
- Missing functions
  - Automatic data distribution
  - User interaction needed for parallelizing large code regions
    - SUIF Explorer: a prototype interactive parallelizer in SUIF1
    - Requires tools: algorithms to guide performance tuning, program slices, visualization tools
- New techniques
  - Extend affine mapping to sparse codes (with permutation index arrays)
  - Fortran 90 front end
  - Debugging support
New-Generation Productivity Tool
Apply high-level program analysis to increase programmers’ productivity
- Many existing analyses: high-level, interprocedural side-effect analysis with pointers and arrays
- New analyses
  - Flow- and context-sensitive pointer alias analysis
  - Interprocedural control-path-based analysis
- Examples of tools
  - Find bugs in programs
  - Prove or disprove user invariants
  - Generate test cases
  - Interactive demand-driven analysis to aid in program debugging
- Can also apply to Verilog/VHDL to improve hardware verification
Finally ...
The system has to be actively maintained and supported to be useful.
The End