![Page 1: A Domain-Specific Language and Compiler for Stencil ...blogs.bu.edu/simac3/files/2013/11/06_Ramanuham_PDE.pdf•Automatic transformation by compiler into parallel C/C++ code •Embedded](https://reader035.vdocuments.net/reader035/viewer/2022081521/5ec809876b71fc07db4337be/html5/thumbnails/1.jpg)
A Domain-Specific Language and Compiler for Stencil Computations for Different Target Architectures
J. “Ram” Ramanujam Louisiana State University
SIMAC3 Workshop, Boston Univ.
Acknowledgments
Collaborators
Albert Cohen (ENS Paris)
Franz Franchetti (CMU)
Louis-Noel Pouchet (UCLA)
P. Sadayappan (OSU)
Fabrice Rastello (ENS Lyon)
Nasko Rountev (OSU)
Sven Verdoolaege (ENS)
Paul Kelly (Imperial)
Michelle Strout (Colo. St.)
Carlo Bertolli (IBM Res.)
Sriram Krishnamoorthy (PNNL)
Uday Bondhugula (IISc)
Muthu Baskaran (Reservoir)
Tobias Grosser
Albert Hartono
Justin Holewinski
Venmugil Elango
Tom Henretty
Mahesh Ravishankar
Sanket Tavarageri
Richard Veras
Sameer Abu Asal
Rod Tohid
Funding
National Science Foundation
US Army
Why Domain-Specific Languages?
• Productivity
– High-level abstractions ease application development
• Performance
– Domain-specific semantics enable specialized optimizations
– Constraints on the specification enable more effective general-purpose transformations and tuning (tiling, fusion)
• Portability
– New architectures => changes only in the domain-specific compiler, without any change in user application code
(Embedded) DSLs for Stencils
• Benefits of high-level specification of computations
– Ease of use
  • For mathematicians/scientists creating the code
– Ease of optimization
  • Facilitate loop and data transformations by compiler
  • Automatic transformation by compiler into parallel C/C++ code
• Embedded DSL provides flexibility
– Generality of standard programming language (C, MATLAB) for non compute-intensive parts
– Automated transformation of embedded DSL code for high performance on different target architectures
• Target architectures for Stencil DSL
– Vector-SIMD (AVX, LRBNi, ..), GPU, FPGA, customized accelerators
Stencil DSL Example -- Standalone
int Nr; int Nc;
grid g [Nr][Nc];
double griddata a on g at 0,1;
pointfunction five_point_avg(p) {
  double ONE_FIFTH = 0.2;
  [1]p[0][0] = ONE_FIFTH*([0]p[-1][0] + [0]p[0][-1] + [0]p[0][0] + [0]p[0][1] + [0]p[1][0]);
}
iterate 1000 {
  stencil jacobi_2d {
    [0     ][0:Nc-1] : [1]a[0][0] = [0]a[0][0];
    [Nr-1  ][0:Nc-1] : [1]a[0][0] = [0]a[0][0];
    [0:Nr-1][0     ] : [1]a[0][0] = [0]a[0][0];
    [0:Nr-1][Nc-1  ] : [1]a[0][0] = [0]a[0][0];
    [1:Nr-2][1:Nc-2] : five_point_avg(a);
  }
  reduction max_diff max {
    [0:Nr-1][0:Nc-1] : fabs([1]a[0][0] - [0]a[0][0]);
  }
} check (max_diff < .00001) every 4 iterations
Reference data over two time steps: current (0) and next (1)
Specify computations on borders
Stencil DSL – Embedded in C
int main() {
  int Nr = 256; int Nc = 256; int T = 100;
  double *a = malloc(Nc*Nr*sizeof(double));
#pragma sdsl start time_steps:T block:8,8,8 tile:1,3,1 time:4
  int Nr; int Nc;
  grid g [Nr][Nc];
  double griddata a on g at 0,1;
  pointfunction five_point_avg(p) {
    double ONE_FIFTH = 0.2;
    [1]p[0][0] = ONE_FIFTH*([0]p[-1][0] + [0]p[0][-1] + [0]p[0][0] + [0]p[0][1] + [0]p[1][0]);
  }
  iterate 1000 {
    stencil jacobi_2d {
      [0     ][0:Nc-1] : [1]a[0][0] = [0]a[0][0];
      [Nr-1  ][0:Nc-1] : [1]a[0][0] = [0]a[0][0];
      [0:Nr-1][0     ] : [1]a[0][0] = [0]a[0][0];
      [0:Nr-1][Nc-1  ] : [1]a[0][0] = [0]a[0][0];
      [1:Nr-2][1:Nc-2] : five_point_avg(a);
    }
    reduction max_diff max {
      [0:Nr-1][0:Nc-1] : fabs([1]a[0][0] - [0]a[0][0]);
    }
  } check (max_diff < .00001) every 4 iterations
#pragma sdsl end
}
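For orientation, the computation that the SDSL program above describes can be written by hand in plain C roughly as follows. This is only a sketch of the semantics as read from the slide, not the output of the SDSL compiler. The function name `jacobi_2d_reference`, the two-buffer scheme, and the reading of `check ... every 4 iterations` as an early-exit convergence test are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hand-written equivalent of the jacobi_2d SDSL example.
 * cur and next play the roles of time steps [0] and [1]; each holds nr*nc
 * doubles in row-major order. Returns the number of iterations executed;
 * the final grid is left in cur. */
int jacobi_2d_reference(double *cur, double *next, int nr, int nc,
                        int max_iters, double tol, int check_every) {
    assert(nr >= 3 && nc >= 3 && check_every >= 1);
    int it = 0;
    while (it < max_iters) {
        double max_diff = 0.0;
        for (int i = 0; i < nr; ++i) {
            for (int j = 0; j < nc; ++j) {
                double v;
                if (i == 0 || i == nr - 1 || j == 0 || j == nc - 1) {
                    v = cur[i * nc + j];               /* border rules: copy */
                } else {
                    v = 0.2 * (cur[(i - 1) * nc + j] + cur[i * nc + j - 1] +
                               cur[i * nc + j] + cur[i * nc + j + 1] +
                               cur[(i + 1) * nc + j]); /* five_point_avg */
                }
                double d = v - cur[i * nc + j];        /* |next - current| */
                if (d < 0.0) d = -d;
                if (d > max_diff) max_diff = d;        /* max reduction */
                next[i * nc + j] = v;
            }
        }
        memcpy(cur, next, (size_t)nr * (size_t)nc * sizeof *cur);
        ++it;
        /* Assumed meaning of "check (max_diff < tol) every k iterations". */
        if (it % check_every == 0 && max_diff < tol) break;
    }
    return it;
}
```

With the slide's parameters the call would be `jacobi_2d_reference(a, tmp, Nr, Nc, 1000, .00001, 4)`, where `tmp` is a second buffer of the same size. The point of the DSL is that none of this loop structure has to be written, or rewritten per target, by the user.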
Related Work
• 20+ publications over the last few years on optimizing stencil computations
• Some stencil DSLs and stencil compilers
– Pochoir (MIT), PATUS (Basel), Mint (UCSD), Physis (Tokyo), Halide (MIT), Exastencils Project (Passau), …
• DSL Frameworks and libraries
– SEJITS (LBL); Liszt, OptiML, OptiQL (Stanford), PyOP2/OP2 (Imperial College, Oxford)
• Our focus has been complementary: developing abstraction-specific compiler transformations matched to performance-critical characteristics of target architecture
Compilation of Stencil Codes
• Large class of applications
• Sweeps through a large data set
• Each data point: computed from “neighbors”
• Multiple time iterations
– Repeated access to same data
• Pipelined parallel execution
• Example: One-dimensional Jacobi
for t = 1 to T
  for i = 1 to N
    B[i] = (A[i-1]+A[i]+A[i+1])/3
  for i = 1 to N
    A[i] = B[i]
Time-expanded form:
for t = 1 to T
  for i = 1 to N
    A[t+1,i] = (A[t,i-1]+A[t,i]+A[t,i+1])/3
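The two forms of the one-dimensional Jacobi example can be sketched in C as below. The function names are hypothetical, and the boundary treatment (only interior points `1..n-2` are updated, end points stay fixed) is an assumption, since the slide's pseudocode does not say what happens at the ends of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Two-array form: each time step writes B, then copies B back into A.
 * A[0] and A[n-1] are left unchanged (assumed boundary treatment). */
void jacobi_1d_copy(double *A, double *B, int n, int T) {
    assert(n >= 3 && T >= 0);
    for (int t = 0; t < T; ++t) {
        for (int i = 1; i < n - 1; ++i)
            B[i] = (A[i - 1] + A[i] + A[i + 1]) / 3.0;
        for (int i = 1; i < n - 1; ++i)
            A[i] = B[i];
    }
}

/* Time-expanded form: At holds T+1 rows of n values, row t at At + t*n.
 * Row 0 must be initialised by the caller; row T holds the result. */
void jacobi_1d_expanded(double *At, int n, int T) {
    assert(n >= 3 && T >= 0);
    for (int t = 0; t < T; ++t) {
        const double *cur = At + (size_t)t * (size_t)n;
        double *nxt = At + (size_t)(t + 1) * (size_t)n;
        nxt[0] = cur[0];          /* carry the fixed end points forward */
        nxt[n - 1] = cur[n - 1];
        for (int i = 1; i < n - 1; ++i)
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0;
    }
}
```

The time-expanded form is convenient for reasoning about tiling because every value is written exactly once, so only flow dependences between `(t, i)` points remain.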
Motivation
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t+1,i]=(A[t,i-1]+A[t,i]+A[t,i+1])/3
[Figure: iteration space with axes t and i]
Time Tiling (with 1-D array code)
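• Time tiling causes pipelined execution
• Solution: adjust tiling to re-enable concurrent execution in a row of tiles
Without time tiling: cache misses on the order of TN; concurrency in each t
With time tiling: cache misses on the order of TN/B; no concurrency in a row of tiles
[Figure: two iteration-space diagrams with axes t and i]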
Motivation
[Figure: tiled iteration space with axes t and i]
“Sequentializing” dependences between tiles
Example
[Figure: tiled iteration space, axis t]
Tile region from the tile on the left (across the “backface”) that needs to be finished before this tile can start
Overlapped Tiling
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t+1,i]=(A[t,i-1]+A[t,i]+A[t,i+1])/3
[Figure: overlapped tiles in the iteration space with axes t and i]
Split Tiling
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t+1,i]=(A[t,i-1]+A[t,i]+A[t,i+1])/3
[Figure: split tiles in the iteration space with axes t and i]
Split Tiling
[Figure: split tiles in the iteration space, axis t]
Phase 1: All of the green shaded regions can be executed concurrently (first)
Example: Split Tiling
[Figure: split tiles in the iteration space, axis t]
Phase 2: Then, all of the orange shaded regions can be executed concurrently (next)
Stencils on Vector-SIMD Processors
• Fundamental source of inefficiency with stencil codes on current short-vector SIMD ISAs (e.g. SSE, AVX, …)
– Concurrent operations on contiguous elements
– Each data element is reused in different “slots” of a vector register
– Redundant loads or shuffle ops needed
• Compiler transformations based on matching computational characteristics of stencils to vector-SIMD architecture characteristics
for (i=0; i<H; ++i)
  for (j=0; j<W; ++j)
    c[i][j]+=b[i][j]+b[i][j+1];
Inefficiency: Each element of b is loaded twice
[Figure: data in memory (rows a to l and m to x) and vector registers VR0 to VR4 with slots 0 to 3; c[i][j] holds a b c d, while b is loaded once as m n o p and again as n o p q]
Data Layout Transformation
for (i = 0; i < N; ++i) a[i]=b[i-1]+b[i]+b[i+1];
• (a) 1D vector in memory
• (b) 2D logical view of the same data
• (c) Transposed 2D array moves interacting elements into the same slot of different vectors
• (d) New 1D layout after transformation
• Boundaries need special handling
(a) original layout (24 elements, vector slots 0 to 3):
a b c d e f g h i j k l m n o p q r s t u v w x
(b) dimension lifted:
a b c d e f
g h i j k l
m n o p q r
s t u v w x
(c) transposed:
a g m s
b h n t
c i o u
d j p v
e k q w
f l r x
(d) transformed layout:
a g m s b h n t c i o u d j p v e k q w f l r x
[Figure: panels (a) to (d) with dimension labels V, N, M]
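The index mapping behind the dimension-lifted transpose can be sketched in a few lines of C. The function names `dlt_forward` and `dlt_inverse` are hypothetical, the element type is `char` only so that the slide's letters can be used directly, and the sketch assumes the array length is a multiple of the vector length `V`. The boundary handling mentioned on the slide is not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Dimension-lifted transpose: view src as V rows of n/V elements,
 * transpose that 2D view, and store it contiguously in dst.
 * Element r*(n/V) + c of src moves to position c*V + r of dst. */
void dlt_forward(const char *src, char *dst, size_t n, size_t V) {
    assert(V > 0 && n % V == 0);
    size_t cols = n / V;
    for (size_t r = 0; r < V; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[c * V + r] = src[r * cols + c];
}

/* Inverse mapping, back to the original 1D layout. */
void dlt_inverse(const char *src, char *dst, size_t n, size_t V) {
    assert(V > 0 && n % V == 0);
    size_t cols = n / V;
    for (size_t r = 0; r < V; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[c * V + r];
}
```

Applied to the 24 letters `a` to `x` with `V = 4`, `dlt_forward` produces the order `a g m s b h n t c i o u d j p v e k q w f l r x` shown in panel (d). In that layout the neighbours of the vector `(a g m s)` are all in the adjacent vector `(b h n t)`, slot for slot, so `b[i-1]`, `b[i]` and `b[i+1]` can be fetched with aligned vector loads and no shuffles.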
Standard Tiling with DLT
[Figure: Tile 1 to Tile 4 with tile dependences, axes t and i. (a) Standard tiling, linear view. (b) Standard tiling, DLT view (t=1)]
• Standard tiling cannot be used with the layout transform
• Inter-tile dependences prevent vectorization
Split Tiling
• Divide iteration space into upright and inverted tiles
• For each tt timesteps, where tt = time tile size:
– Execute upright tiles in parallel
– Execute inverted tiles in parallel
• Upright tile size increases with time tile size
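The two-phase schedule in the bullets above can be sketched for the 1-D Jacobi loop as follows. This is an illustrative sketch rather than the code the compiler generates: it uses a time-expanded array so that only flow dependences matter, assumes tile slopes of one point per time step (which fits the 3-point stencil in this form), and requires the upright base width `w` to be at least `2*tt`. The names `update_row` and `jacobi_split_tiled` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute row t+1 from row t for i in [lo, hi], clipped to the interior 1..n-2. */
void update_row(double *a, int n, int t, int lo, int hi) {
    if (lo < 1) lo = 1;
    if (hi > n - 2) hi = n - 2;
    const double *cur = a + (size_t)t * (size_t)n;
    double *nxt = a + (size_t)(t + 1) * (size_t)n;
    for (int i = lo; i <= hi; ++i)
        nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0;
}

/* Split-tiled sweep over a time-expanded array a[(T+1) x n].
 * The caller must initialise row 0 and columns 0 and n-1 of every row.
 * tt is the time tile size, w the base width of an upright tile. */
void jacobi_split_tiled(double *a, int n, int T, int tt, int w) {
    assert(n >= 3 && T >= 0 && tt >= 1 && w >= 2 * tt);
    int ntiles = (n + w - 1) / w;
    for (int t0 = 0; t0 < T; t0 += tt) {
        int h = (T - t0 < tt) ? (T - t0) : tt;
        /* Phase 1: upright tiles shrink by one point per side per step.
         * Different k never touch each other's data, so this loop over k
         * could run in parallel. */
        for (int k = 0; k < ntiles; ++k)
            for (int s = 0; s < h; ++s)
                update_row(a, n, t0 + s, k * w + s, (k + 1) * w - 1 - s);
        /* Phase 2: inverted tiles fill the gaps between upright tiles,
         * growing by one point per side per step; again independent in k. */
        for (int k = 0; k <= ntiles; ++k)
            for (int s = 1; s < h; ++s)
                update_row(a, n, t0 + s, k * w - s, k * w + s - 1);
    }
}
```

The requirement `w >= 2*tt` is one way to see why the upright tile size has to grow with the time tile size: an upright tile loses two points per time step, so a narrower base would vanish before the time tile is finished.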
Split Tiling: DLT View
• Tiles at t = 0
– Orange upright tiles
– Green inverted tiles
• Tiles in same vector slot
– Compute multiple tiles in parallel
– Some inverted tiles split DLT boundary
Example parameters: N = 40, Vector Length = 2, Upright Tile Base = 6, Inverted Tile Base = 4
Back-Slicing Analysis
• Need to find geometric properties of split tiles
– Slopes of tile in each dimension d
– Offset of each statement w.r.t. tile start, tile end
for (t=0;t<100;++t) {
  for (i=1;i<999;++i)
f1: a1[i] = 0.33*(a0[i-1]+
                  a0[i  ]+
                  a0[i+1]);
  for (i=1;i<999;++i)
f2: a0[i] = a1[i];
}
[Figure: offsets for f1]
Dependence Summary Graph (DSG)
• Vertices represent statements
• Edges represent dependence summaries for each dimension
– <𝛿L, 𝛿U>: max/min spatial components of flow and anti dependences
– 𝛿T: time distance between statements
Computing Slopes
• Compute cycle ratios 𝜌L(C), 𝜌U(C) for each cycle C of the DSG
$$\rho_L(C) = \frac{\sum_{C} \delta_L}{\sum_{C} \delta_T} = \frac{2}{1} = 2$$
$$\rho_U(C) = \frac{\sum_{C} \delta_U}{\sum_{C} \delta_T} = \frac{-2}{1} = -2$$
Computing Slopes
• For each dimension d of the stencil…
– Lower bound slope 𝛼d is maximum cycle ratio
– Upper bound slope βd is minimum cycle ratio
$$\alpha_d = \max_{\forall C \in \mathrm{DSG}} \rho_L(C) = 2$$
$$\beta_d = \min_{\forall C \in \mathrm{DSG}} \rho_U(C) = -2$$
𝛼1 = 2, β1 = -2
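Once the cycles of the DSG have been enumerated and their dependence components summed, the slope computation above is a max and a min over ratios. The sketch below assumes that enumeration has already been done. The `cycle_sums` type and the `tile_slopes` function are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Summed dependence components around one cycle C of the DSG
 * for a single spatial dimension d (hypothetical type). */
typedef struct {
    double sum_dL; /* sum of delta_L over the edges of C */
    double sum_dU; /* sum of delta_U over the edges of C */
    double sum_dT; /* sum of delta_T over the edges of C, must be > 0 */
} cycle_sums;

/* alpha = max over cycles of rho_L(C) = sum_dL / sum_dT,
 * beta  = min over cycles of rho_U(C) = sum_dU / sum_dT. */
void tile_slopes(const cycle_sums *cycles, size_t ncycles,
                 double *alpha, double *beta) {
    assert(ncycles > 0);
    for (size_t c = 0; c < ncycles; ++c) {
        assert(cycles[c].sum_dT > 0.0);
        double rho_l = cycles[c].sum_dL / cycles[c].sum_dT;
        double rho_u = cycles[c].sum_dU / cycles[c].sum_dT;
        if (c == 0 || rho_l > *alpha) *alpha = rho_l;
        if (c == 0 || rho_u < *beta) *beta = rho_u;
    }
}
```

For the single cycle of the running example, with sums 2, -2 and 1, this gives `alpha = 2` and `beta = -2`, matching the values on the slide.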
Computing Offsets
• Build a system of validity constraints using loop bounds of upright tile code
• Results in a system of linear inequalities
[Figure: offsets for f1]
for (tt=...){
for (ii=...){
for (t=...){
for (i=ii+oLF1+αL*(t-tt);
i<ii+TU+oUF1+βU*(t-tt);
++i)
f1: a1[i] = 0.33*(a0[i-1]+
a0[i ]+
a0[i+1]);
for (i=ii+oLF2+αL*(t-tt);
i<ii+TU+oUF2+βU*(t-tt);
++i)
f2: a0[i] = a1[i];
}}}
Computing Offsets
• For any pair of dependent statements, given a region over which the target statement is executed, the source statement should be executed over a region large enough to satisfy the dependence
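Lower Bound Constraints
$$\mathit{ii} + o_L^{f1} + \alpha \cdot t \le \mathit{ii} + o_L^{f2} + \alpha \cdot t - 1$$
$$\mathit{ii} + o_L^{f2} + \alpha \cdot (t-1) \le \mathit{ii} + o_L^{f1} + \alpha \cdot t - 1$$
Upper Bound Constraints
$$\mathit{ii} + \mathit{TU} + o_U^{f1} + \beta \cdot t \ge \mathit{ii} + \mathit{TU} + o_U^{f2} + \beta \cdot t + 1$$
$$\mathit{ii} + \mathit{TU} + o_U^{f2} + \beta \cdot (t-1) \ge \mathit{ii} + \mathit{TU} + o_U^{f1} + \beta \cdot t + 1$$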
Computing Offsets
• Simplify to a system of difference constraints
• Solve with Bellman-Ford algorithm
Lower Bound Constraints
$$o_L^{f1} - o_L^{f2} \le -1$$
$$o_L^{f2} - o_L^{f1} \le \alpha - 1$$
Upper Bound Constraints
$$o_U^{f2} - o_U^{f1} \le -1$$
$$o_U^{f1} - o_U^{f2} \le -\beta - 1$$
Bellman-Ford solution:
Lower Bound Offsets
$$o_L^{f1} = -1, \quad o_L^{f2} = 0$$
Upper Bound Offsets
$$o_U^{f1} = 1, \quad o_U^{f2} = 0$$
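A system of difference constraints of this shape can be solved with a textbook Bellman-Ford pass over the constraint graph, as sketched below. The `diff_constraint` type and the function name are hypothetical, and integer offsets are assumed. Each constraint `x[u] - x[v] <= w` becomes an edge from `v` to `u` of weight `w`, and starting all values at 0 plays the role of a virtual source connected to every variable.

```c
#include <assert.h>

/* One difference constraint: x[u] - x[v] <= w. */
typedef struct { int u, v, w; } diff_constraint;

/* Solve difference constraints with Bellman-Ford.
 * Returns 1 and fills x[0..nvars-1] with one feasible solution,
 * or returns 0 if the constraints contain a negative cycle (infeasible). */
int solve_difference_constraints(const diff_constraint *cs, int ncs,
                                 int nvars, int *x) {
    assert(nvars > 0 && ncs >= 0);
    for (int i = 0; i < nvars; ++i) x[i] = 0; /* virtual source, weight 0 */
    for (int pass = 0; pass < nvars; ++pass) {
        int changed = 0;
        for (int k = 0; k < ncs; ++k) {
            /* Relax edge v -> u of weight w. */
            if (x[cs[k].v] + cs[k].w < x[cs[k].u]) {
                x[cs[k].u] = x[cs[k].v] + cs[k].w;
                changed = 1;
            }
        }
        if (!changed) return 1; /* converged: x satisfies every constraint */
    }
    return 0; /* still relaxing after nvars passes: negative cycle */
}
```

With 𝛼 = 2 and β = -2 the four constraints have right-hand sides -1, 1, -1 and 1. The sketch returns lower offsets (-1, 0) and upper offsets (0, -1) for (f1, f2). Only differences are constrained, so any solution may be shifted by a constant; shifting the upper offsets so that f2's offset is 0 gives (1, 0), the values on the slide.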
Stencils on Multicore CPU: Performance
Intel Sandy Bridge
Stencils on GPUs
• Vector-SIMD alignment problems non-existent
• Different optimization challenges: limited forms of synchronization, avoidance of thread divergence
• Overlapped tiling
– Redundantly compute neighboring cells to avoid inter-thread-block sync, lower communication, and avoid thread divergence
[Figure: logical computation vs. actual computation at time t and at time t+1, marking elements needed at time t+1 and useless computation]
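The idea of overlapped tiling can be emulated on a CPU in plain C, with each iteration of the tile loop standing in for one thread block. This is a sketch under stated assumptions, not GPU code and not the generated kernel: it uses the 1-D 3-point Jacobi with fixed end points, a halo of `h` cells per side for `h` fused time steps, and hypothetical function names. Every tile recomputes its whole halo at every step, including cells whose values are later discarded, which corresponds to the "useless computation" in the figure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One time step on a buffer of n cells; the two end cells are copied. */
void jacobi_step(const double *in, double *out, int n) {
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

/* Advance in[0..n-1] by h time steps into out[0..n-1] using tiles of width w.
 * Each tile loads its cells plus a halo of h cells per side, iterates
 * locally with no communication, and writes back only its own cells. */
void jacobi_overlapped(const double *in, double *out, int n, int h, int w) {
    assert(n >= 2 && h >= 1 && w >= 1);
    size_t cap = (size_t)w + 2 * (size_t)h;
    double *cur = malloc(cap * sizeof *cur);
    double *nxt = malloc(cap * sizeof *nxt);
    assert(cur != NULL && nxt != NULL);
    for (int lo = 0; lo < n; lo += w) {            /* independent tiles */
        int hi = (lo + w - 1 < n - 1) ? lo + w - 1 : n - 1;
        int glo = (lo - h > 0) ? lo - h : 0;       /* halo, clipped to grid */
        int ghi = (hi + h < n - 1) ? hi + h : n - 1;
        int len = ghi - glo + 1;
        memcpy(cur, in + glo, (size_t)len * sizeof *cur);
        for (int s = 0; s < h; ++s) {
            /* Stale values creep inward one cell per step from each cut
             * edge, so after h steps only the halo cells are invalid. */
            jacobi_step(cur, nxt, len);
            double *tmp = cur; cur = nxt; nxt = tmp;
        }
        memcpy(out + lo, cur + (lo - glo), (size_t)(hi - lo + 1) * sizeof *out);
    }
    free(cur);
    free(nxt);
}
```

The trade-off is visible in the buffer sizes: each tile of `w` useful cells touches up to `w + 2*h` cells per step, so the redundant work grows with the number of fused time steps while the need for synchronization between tiles disappears.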
Stencils on GPU: Performance
Nvidia GTX 580
Multi-Target Code Generation from SDSL
[Figure: Matlab/eSDSL and C/eSDSL inputs feed Multi-target Optimization and Code Generation, which targets Multicore CPU, GPU, and FPGA]
Summary and Ongoing Work
• Overlapped and split tiling to recover concurrency (without startup overhead) in tiled execution of stencil computations
• Stencil computations suffer from stream-alignment conflict for vector-SIMD ISAs
– Data Layout Transformation to avoid the conflict
– Split Tiling to enable concurrency along with DLT
• Overlapped tiling and split tiling on GPUs
• Performance improvement over state-of-the-art for 1D and 2D benchmarks
– Further work to improve performance on 3D stencils
• Multi-target compiler for Stencil DSL in progress
• Recent work on related fusion and tiling for unstructured meshes (with Michelle Strout and Paul Kelly)
Thank You