Advanced Analysis in SUIF2 and Future Work
Monica Lam, Stanford University, http://suif.stanford.edu/
Interprocedural, High-Level Transforms for Locality and Parallelism
- Program transformation for computational kernels: a new technique based on affine partitioning
- Interprocedural analysis framework (to maximize code reuse): flow sensitivity, context sensitivity
- Interprocedural program analysis: pointer alias analysis (Steensgaard's algorithm); scalar/scalar dependence, privatization, reduction recognition
- Parallel code generation: define new IR nodes for parallel code
Loop Transforms: Cholesky factorization example
      DO 1 J = 0, N
        I0 = MAX ( -M, -J )
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
            DO 3 L = 0, NMAT
    3         A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
    2       A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
    4     EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
          DO 5 L = 0, NMAT
    5       A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
    1     A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
    8       B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 JJ = 1, MIN (M, N-K)
            DO 7 L = 0, NMAT
    7         B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
    9       B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 JJ = 1, MIN (M, K)
            DO 6 L = 0, NMAT
    6         B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
Results for Optimizing Perfect Nests
Speedup on a Digital TurboLaser with eight 300 MHz 21164 processors
[Chart: speedup versus number of processors (1-8) for Unimodular + Blocking and for Decomposition + Barrier Elimination.]
Optimizing Arbitrary Loop Nesting Using Affine Partitions
[Slide repeats the Cholesky kernel above, annotated with the affine partitions that map the computations on arrays A, B, and EPSS onto the L dimension.]
Results with Affine Partitioning + Blocking
[Chart: speedup versus number of processors (1-8) for Unimodular + Blocking, Decomposition + Barrier Elimination, and Affine Partitioning.]
New Transform Theory
- Domain: arbitrary loop nesting; each instruction optimized separately
- Unifies permutation, skewing, reversal, fusion, fission, and statement reordering
- Supports blocking across all loop nests
- Optimal: maximum degree of parallelism with minimum degree of synchronization
- Minimizes communication by aligning the computation and pipelining
- More powerful, and simpler software engineering
A Simple Example
FOR i = 1 TO n DO
FOR j = 1 TO n DO
A[i,j] = A[i,j]+B[i-1,j]; (S1)
B[i,j] = A[i,j-1]*B[i,j]; (S2)
[Figure: iteration space over (i, j) showing the dependences between S1 and S2.]
Best Parallelization Scheme
SPMD code: Let p be the processor’s ID number
if (1-n <= p <= n) then
if (1 <= p) then
B[p,1] = A[p,0] * B[p,1]; (S2)
for i1 = max(1,1+p) to min(n,n-1+p) do
A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]; (S1)
B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1]; (S2)
if (p <= 0) then
A[n+p,n] = A[n+p,n] + B[n+p-1,n]; (S1)
The solution can be expressed as affine partitions. S1: execute iteration (i, j) on processor i-j. S2: execute iteration (i, j) on processor i-j+1.
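These partitions can be verified mechanically. A small sketch (plain Python, written for this transcript rather than taken from SUIF) that enumerates the example's flow dependences and checks that producer and consumer land on the same processor:

```python
# Affine partitions from the slide: S1 (i,j) -> i-j, S2 (i,j) -> i-j+1.
def p_s1(i, j):
    return i - j

def p_s2(i, j):
    return i - j + 1

n = 6  # small, arbitrary test size

# Flow dependences of the example loop nest:
#   S2(i,j) reads A[i,j-1], which S1(i,j-1) writes
#   S1(i,j) reads B[i-1,j], which S2(i-1,j) writes
ok = True
for i in range(1, n + 1):
    for j in range(1, n + 1):
        if j > 1:
            ok &= p_s1(i, j - 1) == p_s2(i, j)
        if i > 1:
            ok &= p_s2(i - 1, j) == p_s1(i, j)
print("communication-free:", ok)  # True: every dependence stays on one processor
```

Since no dependence crosses processors, the SPMD code above needs no communication or synchronization between the guarded statements.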
Maximum Parallelism & No Communication
Let F_xj be an access to array x in statement j,
i_j be an iteration index for statement j, and
B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find C_j, which maps an instance of statement j to a processor, such that

    for all i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
        F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

with the objective of maximizing the rank of C_j.
[Figure: accesses F1(i1), F2(i2) map loop iterations to array elements; C1(i1), C2(i2) map the same iterations to processor IDs.]
Algorithm
    for all i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
        F_xj(i_j) = F_xk(i_k)  ⇒  C_j(i_j) = C_k(i_k)

- Rewrite the partition constraints as systems of linear equations:
  - use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in C and the Farkas multipliers
  - use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers, yielding systems of linear equations A C = 0
- Find solutions using linear algebra techniques:
  - the null space of matrix A is a solution for C with maximum rank
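The null-space step can be made concrete. A minimal exact-arithmetic sketch (pure Python; this is not SUIF's linear algebra package, and the matrix below is a made-up example):

```python
from fractions import Fraction

def null_space(A):
    """Return a basis of the null space of A, using exact rational arithmetic."""
    m, n = len(A), len(A[0])
    M = [[Fraction(x) for x in row] for row in A]
    pivots, r = [], 0                      # pivot columns, current rank
    for c in range(n):
        piv = next((i for i in range(r, m) if M[i][c] != 0), None)
        if piv is None:
            continue                       # no pivot in this column
        M[r], M[piv] = M[piv], M[r]
        M[r] = [x / M[r][c] for x in M[r]]
        for i in range(m):                 # eliminate column c everywhere else
            if i != r and M[i][c] != 0:
                M[i] = [a - M[i][c] * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in range(n):                  # one basis vector per free column
        if free in pivots:
            continue
        v = [Fraction(0)] * n
        v[free] = Fraction(1)
        for row, c in zip(M, pivots):
            v[c] = -row[free]
        basis.append(v)
    return basis

# Toy constraint system A C = 0 with A = [[1, -1, 0]]:
B = null_space([[1, -1, 0]])
print(len(B))  # 2 independent solutions -> a rank-2 mapping exists
```

The basis vectors are exactly the candidate rows of C; the larger the null space, the higher the rank of the processor mapping that satisfies the constraints.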
Pipelining: Alternating Direction Integration Example
Requires transposing data:

    DO J = 1 to N (parallel)
      DO I = 1 to N
        A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N
      DO I = 1 to N (parallel)
        A(I,J) = g(A(I,J), A(I,J-1))

Moves only boundary data:

    DO J = 1 to N (parallel)
      DO I = 1 to N
        A(I,J) = f(A(I,J), A(I-1,J))
    DO J = 1 to N (pipelined)
      DO I = 1 to N
        A(I,J) = g(A(I,J), A(I,J-1))
Finding the Maximum Degree of Pipelining
Let F_xj be an access to array x in statement j,
i_j be an iteration index for statement j, and
B_j i_j ≥ 0 represent the loop bound constraints for statement j.

Find T_j, which maps an instance of statement j to a time stage, such that

    for all i_j, i_k with B_j i_j ≥ 0 and B_k i_k ≥ 0:
        (i_j ≺ i_k) ∧ (F_xj(i_j) = F_xk(i_k))  ⇒  T_j(i_j) ≤ T_k(i_k)

where ≺ denotes lexicographic order, with the objective of maximizing the rank of T_j.
[Figure: accesses F1(i1), F2(i2) map loop iterations to array elements; T1(i1), T2(i2) map the same iterations to time stages.]
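The time-mapping constraint can be checked by brute force for a candidate mapping. A tiny sketch (plain Python; the mapping T(i, j) = j is a hypothetical choice for a dependence that runs along j, as in the second ADI nest):

```python
# Candidate time mapping: T(i, j) = j, one stage per J value.
def T(i, j):
    return j

n = 5
ok = True
for i in range(1, n + 1):
    for j in range(2, n + 1):
        # A(i,j) reads A(i,j-1): producer (i,j-1) precedes consumer (i,j)
        # lexicographically, so its time stage must not be later.
        ok &= T(i, j - 1) <= T(i, j)
print("valid time mapping:", ok)  # True
```

Here rank(T) = 1, so by the key insight on the next slide this nest has 1 - 1 = 0 degrees of fully independent parallelism along the checked dependence; parallelism comes from pipelining the stages.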
Key Insight
Choice in time mapping ⇒ (pipelined) parallelism. Degrees of parallelism = rank(T) - 1.
Putting it All Together
- Find maximum outer-loop parallelism with minimum synchronization:
  - Divide into strongly connected components
  - Apply the processor mapping algorithm (no communication) to the program
  - If no parallelism is found, apply the time mapping algorithm to find pipelining
  - If no pipelining is found (an outer sequential loop), repeat the process on the inner loops
- Minimize communication:
  - Use a greedy method to order communicating pairs
  - Try to find communication-free, or neighborhood-only, communication by solving similar equations
  - Aggregate computations on consecutive data to improve spatial locality
Current Status
Completed: mathematics package
- Integrated Omega: Pugh's Presburger arithmetic
- Linear algebra package: Farkas' lemma, Gaussian elimination, finding the null space
- Can find communication-free partitions
In progress:
- Rest of affine partitioning
- Code generation
Interprocedural Analysis
Two major design choices in program analysis:
- Across procedures:
  - No interprocedural analysis
  - Interprocedural, context-insensitive
  - Interprocedural, context-sensitive
- Within a procedure:
  - Flow-insensitive
  - Flow-sensitive: interval/region based
  - Flow-sensitive: iterative over the flow graph
  - Flow-sensitive: SSA based
Efficient Context-Sensitive Analysis
Bottom-up:
- A region/interval: a procedure or a loop
- An edge: a call, or code in an inner scope
- Summarize each region with a transfer function
- Find strongly connected components (SCCs)
- Traverse the SCCs bottom-up
- Iterate to a fixed point for recursive functions
Top-down:
- Propagate values top-down
- Iterate to a fixed point for recursive functions
[Figure: call graph with call edges, inner loops, and an SCC.]
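The bottom-up phase can be sketched concretely (plain Python with a hypothetical call graph; Tarjan's algorithm conveniently emits SCCs callees-first, so summarizing in emission order handles each callee before its callers):

```python
calls = {                      # hypothetical call graph; 'b' and 'c' are
    "main": ["a", "b"],        # mutually recursive, so they form one SCC
    "a": [],
    "b": ["c"],
    "c": ["b"],
}

def sccs(graph):
    """Tarjan's algorithm; returns SCCs in reverse topological order."""
    index, low, on_stack, stack = {}, {}, set(), []
    result, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            result.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return result

# Summarize in emission order: leaves first, recursive group {'b','c'}
# together (iterate to a fixed point inside it), 'main' last.
order = sccs(calls)
print([sorted(c) for c in order])
```

A real driver would run the transfer-function summarization over each component in this order, iterating within a component until the summaries of the recursive functions stop changing.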
Interprocedural Framework Architecture
[Figure: framework architecture. Primitive handlers (e.g. array summaries, pointer aliases) cover procedure calls and returns, regions and statements, and basic blocks; compound handlers cover bottom-up, top-down, and linear traversals; a driver sits on top of the shared data structures (call graph with SCCs, regions, control flow graphs).]
Interprocedural Framework Architecture
- Interprocedural analysis data structures, e.g. call graphs, SSA form, regions or intervals
- Handlers: orthogonal sets of handlers for different groups of constructs
  - Primitive: the user specifies the analysis-specific semantics of primitives
  - Compound: handles compound statements and calls
  - The user chooses between handlers of different strengths, e.g. no interprocedural analysis versus context-sensitive, or flow-insensitive versus flow-sensitive (CFG-based)
  - All the handlers are registered in a visitor
- Driver
  - Invoked by the user's request for information (demand driven)
  - Builds prepass data structures
  - Invokes the right set of handlers in the right order (e.g. bottom-up traversal of the call graph)
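The handler-registration idea can be sketched as follows (plain Python; the class and handler names are invented for illustration, not SUIF's API):

```python
class Visitor:
    """Registry of user-supplied primitive handlers, keyed by construct kind."""
    def __init__(self):
        self.handlers = {}

    def register(self, kind, fn):
        self.handlers[kind] = fn

    def analyze(self, region):
        # Linear traversal of one region; a real driver would instead pick
        # a bottom-up or top-down order from the compound handlers.
        results = []
        for kind, payload in region:
            if kind in self.handlers:   # unhandled kinds are simply skipped
                results.append(self.handlers[kind](payload))
        return results

v = Visitor()
v.register("call", lambda callee: f"summary({callee})")
v.register("store", lambda var: f"mod({var})")

region = [("store", "x"), ("call", "f"), ("load", "y")]  # 'load' unhandled
print(v.analyze(region))  # ['mod(x)', 'summary(f)']
```

Because the handlers are orthogonal, swapping in a stronger set (say, context-sensitive call handlers) changes the analysis without touching the traversal machinery.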
Pointer Alias Analysis
Steensgaard's pointer alias analysis (completed):
- Flow-insensitive and context-insensitive, type-inference based
- Very efficient: near linear-time analysis
- Very imprecise
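The flavor of the algorithm can be shown in a few lines (plain Python; a deliberately simplified union-find model, not Steensgaard's full type system): each pointer gets one abstract target, and assignments unify targets, which is what makes the analysis near linear but imprecise.

```python
parent = {}

def find(x):
    """Union-find root of x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def unify(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

target = {}  # pointer variable -> its single abstract pointee node

def assign_addr(p, v):          # models: p = &v
    if p in target:
        unify(target[p], v)     # p already points somewhere: merge targets
    else:
        target[p] = v

def assign(p, q):               # models: p = q (both pointers)
    for r in (p, q):
        target.setdefault(r, f"tgt({r})")
    unify(target[p], target[q])

# p = &a; q = &b; p = q  ==>  the one target of p/q merges a and b
assign_addr("p", "a")
assign_addr("q", "b")
assign("p", "q")
print(find("a") == find("b"))  # True: a and b land in the same alias class
```

The imprecision is visible directly: after `p = q`, the analysis concludes p may point to b and q may point to a, even though neither ever does.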
Parallelization Analysis
- Scalar analysis
  - Mod/ref, reduction recognition: bottom-up, flow-insensitive
  - Liveness for privatization: bottom-up and top-down, flow-sensitive
- Region-based array analysis
  - May-write, must-write, read, upwards-exposed read: bottom-up
  - Array liveness for privatization: bottom-up and top-down
  - Uses our interprocedural framework + Omega
- Symbolic analysis
  - Find linear relationships between scalar variables to improve the array analysis
Parallel Code Generation
- Loop bound generation: uses Omega, based on the affine mappings
- Outlining and cloning primitives
- Special IR nodes to represent parallelization primitives
  - Allow a succinct, high-level description of parallelization decisions
  - For communication to and from users
  - Reduction and private variables and primitives
  - Synchronization and parallelization primitives
[Figure: SUIF IR extended with parallelization nodes (SUIF+par IR), alongside plain SUIF.]
Status
Completed: call graphs and SCCs; Steensgaard's pointer alias analysis; integration of the garbage collector with SUIF
In progress: interprocedural analysis framework; array summaries; scalar dependence analysis; parallel code generation
To be done: scalar symbolic analysis
Future work: Basic compiler research
A flexible and integrated platform for new optimizations
Combinations of pointers, OO, parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications
Interaction between garbage collection, exception handling with back end optimizations
Embedded compilers with application-specific additions at the source language and architectural level
As a Useful Compiler for High-Performance Computers
- Basic ingredients of a state-of-the-art parallelizing compiler; requires experimentation, tuning, refinement
  - First implementation of affine partitioning
  - Interprocedural parallelization requires many analyses working together
- Missing functions
  - Automatic data distribution
  - User interaction needed for parallelizing large code regions
    - SUIF Explorer: a prototype interactive parallelizer in SUIF1
    - Requires tools: algorithms to guide performance tuning, program slices, visualization tools
- New techniques
  - Extend affine mapping to sparse codes (with permutation index arrays)
  - Fortran 90 front end
  - Debugging support
New-Generation Productivity Tool
Apply high-level program analysis to increase programmers’ productivity
- Many existing analyses: high-level, interprocedural side-effect analysis with pointers and arrays
- New analyses
  - Flow- and context-sensitive pointer alias analysis
  - Interprocedural control-path-based analysis
- Examples of tools
  - Find bugs in programs
  - Prove or disprove user invariants
  - Generate test cases
  - Interactive demand-driven analysis to aid in program debugging
- Can also apply to Verilog/VHDL to improve hardware verification
Finally ...
The system has to be actively maintained and supported to be useful.
The End