Beyond Shared Memory Loop Parallelism in the Polyhedral Model
Tomofumi Yuki
Ph.D. Dissertation
October 30, 2012
The Problem
[Figure from www.spiral.net/problem.html]
Parallel Processing
- A small niche in the past, hot topic today
- Ultimate solution: Automatic Parallelization
  - Extremely difficult problem
  - After decades of research, limited success
- Other solutions: Programming Models
  - Libraries (MPI, OpenMP, CnC, TBB, etc.)
  - Parallel languages (UPC, Chapel, X10, etc.)
  - Domain Specific Languages (stencils, etc.)
Contributions
[Diagram: the contributions (MPI Code Generation, Polyhedral X10, the AlphaZ system, MDE) built on the Polyhedral Model, 40+ years of research (linear algebra, ILP), the X10 language, and existing tools: CLooG, ISL, Omega, PLuTo]
Polyhedral State-of-the-art
- Tiling based parallelization
- Extensions to parameterized tile sizes
  - First step [Renganarayana2007]
  - Parallelization + imperfectly nested loops [Hartono2010, Kim2010]
- The PLuTo approach is now used by many people
  - Wave-front of tiles: a better strategy than maximum parallelism [Bondhugula2008]
- Many advances in the shared memory context
How far can shared memory go?
- The Memory Wall is still there
- Does it make sense for 1000 cores to share memory? [Berkeley View, Shalf 07, Kumar 05]
  - Power
  - Coherency overhead
  - False sharing
  - Hierarchy?
  - Data volume (tera- to peta-bytes)
Distributed Memory Parallelization
- Problems implicitly handled by shared memory now need explicit treatment
- Communication
  - Which processors need to send/receive?
  - Which data to send/receive?
  - How to manage communication buffers?
- Data partitioning
  - How do you allocate memory across nodes?
MPI Code Generator
- Distributed memory parallelization
  - Tiling based
  - Parameterized tile sizes
  - C+MPI implementation
- Uniform dependences as the key enabler
  - Many affine dependences can be uniformized
- Shared memory performance carries over to distributed memory
  - Scales as well as PLuTo, but to multiple nodes
Related Work (Polyhedral)
- Polyhedral approaches
  - Initial idea [Amarasinghe1993]
  - Analysis for fixed-size tiling [Claßen2006]
  - Further optimization [Bondhugula2011]
- "Brute force" polyhedral analysis for handling communication
  - No hope of handling parametric tile sizes
  - Can handle arbitrary affine programs
Outline
- Introduction
- "Uniform-ness" of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work
Affine vs Uniform
- Affine Dependences: f = Ax + b
  - Examples: (i,j -> j,i), (i,j -> i,i), (i -> 0)
- Uniform Dependences: f = Ix + b
  - Examples: (i,j -> i-1,j), (i -> i-1)
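To make the distinction concrete, here is a minimal C sketch (illustrative only, not taken from the dissertation) of how these dependence classes appear in loop code:

#define N 100

void uniform_example(double A[N][N]) {
  for (int i = 1; i < N; i++)
    for (int j = 0; j < N; j++)
      A[i][j] = A[i-1][j] + 1.0;      /* uniform: (i,j -> i-1,j), constant offset */
}

void affine_example(double B[N][N], const double A[N][N],
                    double y[N], const double x[N]) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      B[i][j] = A[j][i];              /* affine: (i,j -> j,i), transposed access */
  for (int i = 0; i < N; i++)
    y[i] = x[0];                      /* affine: (i -> 0), broadcast of one value */
}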
Uniformization
[Figure: the affine dependence (i -> 0) is uniformized into the chain (i -> i-1)]
Uniformization
- Uniformization is a classic technique
  - "solved" in the 1980s
  - has been "forgotten" in the multi-core era
- Any affine dependence can be uniformized
  - by adding a dimension [Roychowdhury1988]
- Nullspace pipelining
  - simple technique for uniformization
  - many dependences are uniformized
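As an illustration of the idea (a hedged sketch, not the actual transformation output), the broadcast dependence (i -> 0) can be made uniform by propagating the value through a new variable, which is the essence of pipelining:

#define N 100

/* Before: every iteration reads x[0] directly, an affine dependence (i -> 0). */
void before(double y[N], const double x[N]) {
  for (int i = 1; i < N; i++)
    y[i] = y[i-1] + x[0];
}

/* After pipelining: x[0] is carried along in p, so every dependence
   is uniform, (i -> i-1). */
void after(double y[N], const double x[N]) {
  double p[N];
  p[0] = x[0];
  for (int i = 1; i < N; i++) {
    p[i] = p[i-1];                    /* propagate the broadcast value */
    y[i] = y[i-1] + p[i];
  }
}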
Uniformization and Tiling
- Uniformization does not influence tilability
PolyBench [Pouchet2010]
- Collection of 30 polyhedral kernels
- Proposed by Pouchet as a benchmark for polyhedral compilation
- Goal: a benchmark small enough that individual results are reported; no averages
- Kernels from:
  - data mining
  - linear algebra kernels, solvers
  - dynamic programming
  - stencil computations
Uniform-ness of PolyBench
- 5 of them are "incorrect" and are excluded
- Embedding: match dimensions of statements
- Phase Detection: separate the program into phases
  - Output of one phase is used as input to another

Stage                   Fully Uniform Programs
Uniform at Start        8/25 (32%)
After Embedding         13/25 (52%)
After Pipelining        21/25 (84%)
After Phase Detection   24/25 (96%)
Outline
- Introduction
- Uniform-ness of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work
Basic Strategy: Tiling
- We focus on tilable programs
Dependences in Tilable Space
- All dependences are in the non-positive direction
Wave-front Parallelization
- All tiles with the same color (i.e., on the same wave-front) can run in parallel
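A minimal sketch of the wave-front schedule for 2D tiles (the loop structure and names are assumptions for illustration): all tiles with ti1 + ti2 == t are independent and may execute concurrently.

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void wavefront(int T1, int T2) {        /* T1, T2: number of tiles per dimension */
  for (int t = 0; t <= (T1 - 1) + (T2 - 1); t++) {
    /* every tile on this anti-diagonal may run in parallel */
    for (int ti1 = MAX(0, t - (T2 - 1)); ti1 <= MIN(t, T1 - 1); ti1++) {
      int ti2 = t - ti1;
      /* execute tile (ti1, ti2) */
      (void)ti2;
    }
  }
}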
Assumptions
- Uniform in at least one of the dimensions
  - The uniform dimension is made outermost
- Tilable space is fully permutable
- One-dimensional processor allocation
- Large enough tile sizes
  - Dependences do not span multiple tiles
- Then, communication is extremely simplified
Processor Allocation
- The outermost tile loop is distributed
[Figure: tiled iteration space (i1, i2); tiles are assigned to processors P0–P3 along i1]
Values to be Communicated
- Faces of the tiles (may be thicker than 1)
[Figure: tiled iteration space (i1, i2) over processors P0–P3; the tile faces crossing processor boundaries are the values communicated]
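A hypothetical sketch of what "communicating a face" amounts to: copy the boundary rows of a tile into a contiguous buffer before handing it to MPI (the function name, array layout, and face thickness are assumptions for illustration).

/* Pack the last `thickness` rows of tile (ti1, ti2) of an n x n array
   into a contiguous send buffer. */
void pack_face(int n, const double A[n][n], int ti1, int ti2,
               int ts1, int ts2, int thickness, double *buf) {
  int k = 0;
  for (int r = 0; r < thickness; r++) {
    int i1 = ti1 * ts1 + ts1 - thickness + r;   /* rows at the tile boundary */
    for (int c = 0; c < ts2; c++)
      buf[k++] = A[i1][ti2 * ts2 + c];
  }
}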
Naïve Placement of Send and Receive Codes
- Receiver is the consumer tile of the values
[Figure: tiled iteration space (i1, i2) over processors P0–P3; S marks sending tiles, R the corresponding receiving (consumer) tiles]
Problems in Naïve Placement
- Receiver is in the next wave-front time
[Figure: tiled iteration space (i1, i2) over processors P0–P3 with wave-front times t=0..3; each send (S) has its matching receive (R) in the next wave-front time]
Problems in Naïve Placement
- Receiver is in the next wave-front time
- Number of communications "in-flight" = amount of parallelism
- MPI_Send will deadlock
  - May not return control if the system buffer is full
- Asynchronous communication is required
  - Must manage your own buffers
  - Required buffer count = amount of parallelism
    - i.e., number of virtual processors
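A small sketch (illustrative, not the generated code) of the issue: with the naïve placement, the receive matching a value sent at wave-front time t is only posted at time t+1, so MPI_Send is relied on to buffer the message; once the system buffer fills, the call may never return.

#include <mpi.h>

/* Naïve per-tile step: blocking send of this tile's face.  The matching
   receive is not posted until the next wave-front time, so if the MPI
   system buffer is full this call may block indefinitely. */
void naive_send_face(double *face, int count, int dest) {
  MPI_Send(face, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}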
Proposed Placement of Send and Receive Codes
- Receiver is one tile below the consumer
[Figure: tiled iteration space (i1, i2) over processors P0–P3; each receive (R) is placed one tile below the consumer of the corresponding send (S)]
Placement within a Tile
- Naïve placement:
  - Receive -> Compute -> Send
- Proposed placement:
  - Issue asynchronous receive (MPI_Irecv)
  - Compute
  - Issue asynchronous send (MPI_Isend)
  - Wait for values to arrive
- Overlap of computation and communication
- Only two buffers (send and receive) per physical processor
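A hedged C+MPI sketch of the proposed per-tile structure (buffer and neighbor names are illustrative; the actual generated code also packs and unpacks tile faces):

#include <mpi.h>

void process_tile(double *recv_buf, double *send_buf, int count,
                  int src, int dst, MPI_Comm comm) {
  MPI_Request rreq, sreq;
  /* 1. post the asynchronous receive for values needed by a later tile */
  MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, comm, &rreq);
  /* 2. compute this tile (uses values received in earlier steps) */
  /* 3. send this tile's face asynchronously */
  MPI_Isend(send_buf, count, MPI_DOUBLE, dst, 0, comm, &sreq);
  /* 4. wait for the incoming values; the send must also complete before
        its buffer is reused, giving two buffers per physical processor */
  MPI_Wait(&rreq, MPI_STATUS_IGNORE);
  MPI_Wait(&sreq, MPI_STATUS_IGNORE);
}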
Evaluation
- Compare performance with PLuTo
  - Shared memory version with the same strategy
- Cray: 24 cores per node, up to 96 cores
- Goal: similar scaling as PLuTo
- Tile sizes are searched with educated guesses
- PolyBench
  - 7 are too small
  - 3 cannot be tiled or have limited parallelism
  - 9 cannot be used due to PLuTo/PolyBench issues
Performance Results
- Linear extrapolation from the speed-up at 24 cores
- Broadcast cost is at most 2.5 seconds
[Figure: Summary of AlphaZ performance comparison with PLuTo — speed-up with respect to PLuTo with 1 core for correlation, covariance, 2mm, 3mm, gemm, syr2k, syrk, lu, fdtd-2d, jacobi-2d-imper, seidel-2d; series: PLuTo 24 cores, AlphaZ 24 cores, PLuTo 96 cores (extrapolated), AlphaZ 96 cores (No Bcast)]
AlphaZ System
- System for polyhedral design space exploration
- Key features not explored by other tools:
  - Memory allocation
  - Reductions
- Case studies to illustrate the importance of the unexplored design space [LCPC2012]
- Polyhedral Equational Model [WOLFHPC2012]
- MDE applied to compilers [MODELS2011]
Polyhedral X10 [PPoPP2013?]
- Work with Vijay Saraswat and Paul Feautrier
- Extension of array dataflow analysis to X10
  - supports finish/async but not clocks
- finish/async can express more than doall
  - Focus of the polyhedral model so far: doall
- Dataflow result is used to detect races
  - With polyhedral precision, we can guarantee program regions to be race-free
Conclusions
- Polyhedral compilation has lots of potential
  - Memory/reductions are not explored
  - Successes in automatic parallelization
  - Race-free guarantee
- Handling arbitrary affine programs may be overkill
  - Uniformization makes a lot of sense
  - Distributed memory parallelization made easy
  - Can handle most of PolyBench
Future Work
- Many direct extensions
  - Hybrid MPI+OpenMP with multi-level tiling
  - Partial uniformization to satisfy the pre-condition
  - Handling clocks in Polyhedral X10
- Broader applications of the polyhedral model
  - Approximations
  - Larger granularity: blocks of computations instead of statements
  - Abstract interpretation [Alias2010]
Acknowledgements
- Advisor: Sanjay Rajopadhye
- Committee members:
  - Wim Böhm
  - Michelle Strout
  - Edwin Chong
- Unofficial Co-advisor: Steven Derrien
- Members of:
  - Mélange, HPCM, CAIRN
  - Dave Wonnacott, Haverford students
Backup Slides
Uniformization and Tiling
- Tilability is preserved
D-Tiling Review [Kim2011]
- Parametric tiling for shared memory
- Uses non-polyhedral skewing of tiles
  - Required for wave-front execution of tiles
- The key equation:

    time = Σ_{i=1}^{d} ( ti_i / ts_i )

  where
  - d: number of tiled dimensions
  - ti: tile origins
  - ts: tile sizes
D-Tiling Review cont.
- The equation enables skewing of tiles
  - If either the time or one of the tile origins is unknown, it can be computed from the others
- Generated code (tix is the (d-1)-th tile origin):

for (time = start : end)
  for (ti1 = ti1LB : ti1UB)
    ...
      for (tix = tixLB : tixUB) {
        tid = f(time, ti1, ..., tix);
        // compute tile (ti1, ti2, ..., tix, tid)
      }
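A hypothetical sketch of what f computes (the function name and array-based interface are assumptions): solve the D-Tiling equation time = Σ ti_i / ts_i for the single unknown tile origin, given the wave-front time and the other tile origins.

/* ti[0..d-2]: known tile origins; ts[0..d-1]: tile sizes.
   Returns the remaining tile origin tid implied by `time`. */
int solve_for_tid(int time, int d, const int ti[], const int ts[]) {
  int partial = 0;
  for (int i = 0; i < d - 1; i++)
    partial += ti[i] / ts[i];        /* tile origins are multiples of tile sizes */
  return (time - partial) * ts[d - 1];
}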
Placement of Receive Code Using D-Tiling
- Slight modification to the use of the equation
  - Visit tiles in the next wave-front time

for (time = start : end)
  for (ti1 = ti1LB : ti1UB)
    ...
      for (tix = tixLB : tixUB) {
        tidNext = f(time+1, ti1, ..., tix);
        // receive and unpack the buffer for
        // tile (ti1, ti2, ..., tix, tidNext)
      }
Proposed Placement of Send and Receive Codes
- Receiver is one tile below the consumer
[Figure: same illustration as the earlier "Proposed Placement of Send and Receive Codes" slide]
Extensions to Schedule Independent Mapping
- Schedule Independent Mapping [Strout1998]
  - Universal Occupancy Vectors (UOVs)
  - Legal storage mapping for any legal execution
  - Uniform dependence programs only
- Universality of UOVs can be restricted
  - e.g., to tiled execution
- For tiled execution, the shortest UOV can be found without any search
LU Decomposition
[Figure: speed-up of lu with respect to PLuTo with 1 core, for 8 to 96 cores; series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
seidel-2d
[Figure: speed-up of seidel-2d with respect to PLuTo with 1 core, for 8 to 96 cores; series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
seidel-2d (no 8x8x8)
[Figure: speed-up of seidel-2d (without 8x8x8 tiles) with respect to PLuTo with 1 core, for 8 to 96 cores; series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
jacobi-2d-imper
[Figure: speed-up of jacobi-2d-imper with respect to PLuTo with 1 core, for 8 to 96 cores; series: PLuTo, AlphaZ, AlphaZ (No Bcast)]
Related Work (Non-Polyhedral)
- Global communications [Li1990]
  - Translation from shared memory programs
  - Pattern matching for global communications
- Paradigm [Banerjee1995]
  - No loop transformations
  - Finds parallel loops and inserts the necessary communications
- Tiling based [Goumas2006]
  - Perfectly nested uniform dependences
adi.c: Performance
- PLuTo does not scale because the outer loop is not tiled
[Figure: speed-up of optimized code compared to the original code for adi.c, AlphaZ vs. PLuTo; left: Xeon with 1 to 8 threads (cores), right: Cray XT6m with up to 24 threads (cores)]
UNAfold: Performance
- Complexity reduction is empirically confirmed
[Figure: execution time of UNAfold, original vs. simplified, for sequence lengths N = 200 to 1400; the log-log plot fits y = 4x + b1 for the original and y = 3x + b2 for the simplified version]
Contributions
- The AlphaZ System
  - Polyhedral compiler with full control given to the user
  - Equational view of the polyhedral model
- MPI Code Generator
  - The first code generator with parametric tiling
  - Double buffering
- Polyhedral X10
  - Extension to the polyhedral model
  - Race-free guarantee of X10 programs