![Page 1: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/1.jpg)
Triolet C++/Python –
productive programming in
heterogeneous parallel systems
Wen-mei Hwu
Sanders AMD Chair, ECE and CS
University of Illinois, Urbana-Champaign
CTO, MulticoreWare
![Page 2: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/2.jpg)
Agenda
• Performance portability of imperative parallel
programming
– OpenCL
• Algorithm selection, scalability, and efficiency of
intentional parallel programming
– Triolet C++
– Triolet Python
ACS Productivity Workshop, 2014
![Page 3: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/3.jpg)
Essential work in writing efficient
parallel code
Planning how to execute an algorithm
• Distribute computation across
– cores,
– hardware threads, and
– vector processing elements
• Distribute data across
– discrete GPUs or
– clusters
• Orchestrate communication for
– reductions,
– variable-size list creation,
– stencils, etc.
• Rearrange data for locality
• Fuse or split loops
Implementing the plan
• Map loop iterations onto hardware
• Allocate memory
• Partition data
• Insert data movement code
• Reduction trees
• Array packing
• Boundary cell communication
![Page 10: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/10.jpg)
Current State of Programming
Heterogeneous Systems
Hardware targets: CPU, Multicore, Xeon Phi, GPU, FPGA
Programming models in use:
• C/FORTRAN
• + Directives (OpenMP/TBB)
• + SIMD Intrinsics
• OpenCL
• Verilog/VHDL
Programming heterogeneous systems requires
too many versions of code!
![Page 11: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/11.jpg)
Productivity in Programming
Heterogeneous Systems
Hardware targets: CPU, Multicore, Xeon Phi, GPU, FPGA
Programming models:
• C/FORTRAN
• + Directives (OpenMP/TBB)
• + SIMD Intrinsics
• OpenCL
• Verilog/VHDL
Tools shown alongside OpenCL: MxPA, FOpenCL
Step #1: Keep only one of the versions and use
portability tools to generate the others.
![Page 12: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/12.jpg)
MxPA: Overview
• Optimizes scheduling of work-item execution for locality
and vectorization for the hardware
• Locality-centric compile-time scheduling selects between:
• BFO (Breadth-First Order): issue each memory access for all work-
items first
• DFO (Depth-First Order): issue all memory accesses for a single
work-item first
• Kernel fusion run-time scheduling reduces memory traffic
for common data flow between kernels
• Dynamic vectorization maintains vector execution in spite
of apparent control flow divergence
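As a rough illustration of the kernel fusion bullet, the Python sketch below contrasts two element-wise kernels run back to back, which materialize an intermediate array, with a single fused loop that produces the same result in one pass. The function names are hypothetical and this is not MxPA's scheduler; it only shows why fusing kernels that share a data flow can reduce memory traffic.

```python
def scale(xs, a):
    """First kernel: writes a full intermediate array."""
    return [a * x for x in xs]

def offset(xs, b):
    """Second kernel: reads the intermediate array back."""
    return [x + b for x in xs]

def unfused(xs, a, b):
    tmp = scale(xs, a)       # intermediate result is stored
    return offset(tmp, b)    # and then loaded again

def fused(xs, a, b):
    # One pass: each input element is loaded once, each output stored once.
    return [a * x + b for x in xs]

data = [1.0, 2.0, 3.0]
print(unfused(data, 2.0, 1.0))
print(fused(data, 2.0, 1.0))
```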
![Page 13: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/13.jpg)
MxPA: Locality-centric Scheduling
[Figure: Dependency in executing OpenCL kernels. The work-items wi0, wi1, ..., wiLS-1 of a work-group each run iterations i0 ... in-1, then a barrier, then iterations j0 ... jm-1.]
An example OpenCL code:
CLKernel() {
  // e.g. SOA
  for (i = 0..N) {
    .. = A[c0*i + wid];
  }
  barrier();
  // e.g. AOS
  for (j = 0..M) {
    .. = B[c1*wid + j];
  }
}
MXPA-translated code:
CLKernel_MXPA() {
  // BFO
  for (i = 0..N) {
    for (wid = 0..LS) {
      .. = A[c0*i + wid];
    }
  }
  // DFO
  for (wid = 0..LS) {
    for (j = 0..M) {
      .. = B[c1*wid + j];
    }
  }
}
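To make the effect of the two loop orders concrete, here is a small Python sketch that lists the addresses one serialized work-group would touch under each order. The helper names, the stride values, and the "unit-stride fraction" metric are assumptions made for illustration; MxPA's actual selection heuristic is not described on the slide.

```python
def schedule(order, n_iters, local_size, addr):
    """Addresses touched by one serialized work-group, in execution order."""
    if order == "BFO":   # iteration-major: every work-item per iteration
        return [addr(i, wid) for i in range(n_iters) for wid in range(local_size)]
    if order == "DFO":   # work-item-major: every iteration per work-item
        return [addr(i, wid) for wid in range(local_size) for i in range(n_iters)]
    raise ValueError(order)

def unit_stride_fraction(addrs):
    """Fraction of consecutive accesses that land on adjacent elements."""
    steps = [b - a for a, b in zip(addrs, addrs[1:])]
    return sum(1 for s in steps if s == 1) / len(steps)

LS, N = 4, 3                            # illustrative work-group size and trip count
soa = lambda i, wid: LS * i + wid       # A[c0*i + wid] with c0 = LS
aos = lambda j, wid: N * wid + j        # B[c1*wid + j] with c1 = N

for name, addr in (("SOA", soa), ("AOS", aos)):
    for order in ("BFO", "DFO"):
        seq = schedule(order, N, LS, addr)
        print(name, order, seq, unit_stride_fraction(seq))
```

Under these assumptions the SOA-style access is contiguous only in breadth-first order and the AOS-style access only in depth-first order, which matches the pairing shown in the translated code.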
![Page 14: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/14.jpg)
MxPA: Dynamic Vectorization of
Control Divergent Loops
Effective for sparse methods
and graph algorithms.
[Figure: work items plotted against loop iterations. Labels: divide into sub-blocks, vectorization phase, subvectorization phase.]
![Page 15: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/15.jpg)
MxPA Results: Comparison to Industry
[Figure: two bar charts of normalized speedup per benchmark (higher is better), one comparing LC with AMD and one comparing LC with Intel. Legend: + locality improves, = locality stays the same, - locality gets worse.]
![Page 16: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/16.jpg)
MxPA Results: Comparison to Industry
Metric LC/AMD LC/Intel
Speedup 3.20x 1.62x
L1 Data Cache Misses 0.12x 0.36x
Data TLB Misses 0.23x 0.33x
LLC Misses 0.91x 0.97x
![Page 17: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/17.jpg)
MxPA Case Study: MOCFE-Bone
Input Configurations: Nodes = 1, Groups = 25, Angles = 128, MeshScale=10 (Elements=103)
[Figure: bar chart of normalized execution time by program version: Base (single-thread FORTRAN), OpenCL (Fermi), OpenCL (single-thread MxPA), OpenCL (multithreaded MxPA). Legend: Fixed, Transformed.]
![Page 18: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/18.jpg)
HIGH-LEVEL INTERFACE
![Page 19: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/19.jpg)
Who does the hard work in
parallelization?
• General-purpose language + parallelizing compiler
– Requires a very intelligent compiler
– Limited success outside of regular array algorithms
• Delite - Domain-specific language + domain-specific
compiler
– Simplify compiler’s job with language restrictions and extensions
– Requires customizing a compiler for each domain
• Triolet - Parallel library + optimizing compiler
– Library makes parallelization decisions
– Uses a library-aware compiler with a rich set of transformations
– Extensible—just add library functions
![Page 20: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/20.jpg)
[Diagram elements: Triolet; Triolet Compiler; MxPA; C/FORTRAN + Directives (OpenMP/TBB) + SIMD Intrinsics; OpenCL; Verilog/VHDL; hardware targets CPU, Multicore, Xeon Phi, GPU, FPGA.]
Step #2: Use a higher-level algorithm representation.
![Page 21: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/21.jpg)
[Diagram elements: Triolet; Triolet Compiler; C/FORTRAN + Directives (OpenMP/TBB) + SIMD Intrinsics; OpenCL; Verilog/VHDL; hardware targets CPU, Multicore, Xeon Phi, GPU, FPGA.]
![Page 22: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/22.jpg)
Triolet C++
• Goal – design and implement a simple, user-
friendly interface for communicating the intended
data access and computation patterns to a
library-aware C++ compiler
• Technical approach
– Data objects allow changes to logical data
organization and content without touching storage
– Computations based on map and reduce
– Aggressive algorithm selection and auto-tuning
through hardware specific library implementation
– Compiler code synthesis technology for target
hardware built on MxPA
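The first technical point, data objects that change the logical organization without touching storage, can be pictured with a small Python sketch. The class names `MatrixView` and `TransposedView` are hypothetical and are not part of Triolet; the sketch only assumes a flat row-major buffer and shows that reshaping and transposing can be expressed as index arithmetic over the same storage.

```python
class MatrixView:
    """2D logical view over a flat row-major buffer; no data is copied."""
    def __init__(self, data, rows, cols):
        if len(data) != rows * cols:
            raise ValueError("buffer size does not match dimensions")
        self.data, self.rows, self.cols = data, rows, cols

    def __getitem__(self, idx):
        r, c = idx
        return self.data[r * self.cols + c]

class TransposedView:
    """Swaps the logical indices of another view; storage is untouched."""
    def __init__(self, base):
        self.base, self.rows, self.cols = base, base.cols, base.rows

    def __getitem__(self, idx):
        r, c = idx
        return self.base[c, r]

buf = [1, 2, 3, 4, 5, 6]
m = MatrixView(buf, 2, 3)
t = TransposedView(m)
buf[5] = 60                  # a write to storage is visible through both views
print(m[1, 2], t[2, 1])
```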
![Page 23: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/23.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
}
Ask the compiler to treat the C++ input array as a 2D
matrix object whose dimensions are given by
arguments x_size and y_size.
![Page 24: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/24.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
}
Treat the C++ kernel array as a small 1D vector in
preparation for dot product.
![Page 25: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/25.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
auto stencil_co = make_stencil_transform<9,9>(input_co);
}
Conceptually form a 2D x_size by y_size matrix
whose elements are the 9x9 neighbor stencils around
the original input_co elements.
![Page 26: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/26.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
auto stencil_co = make_stencil_transform<9,9>(input_co);
auto const_co = make_const_transform(kernel_co, x_size*y_size);
}
Conceptually replicate kernel_co into an x_size by y_size
stencil matrix
![Page 27: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/27.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
auto stencil_co = make_stencil_transform<9,9>(input_co);
auto const_co = make_const_transform(kernel_co, x_size * y_size);
auto zip_co = make_zip_transform(stencil_co, const_co);
}
Conceptually form an x_size by y_size matrix where each
element is a tuple of one stencil_co element and one
const_co element.
![Page 28: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/28.jpg)
Convolution Example
(5x5 Zipped Matrix)
zip(filters, stencils)
![Page 29: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/29.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
auto stencil_co = make_stencil_transform<9,9>(input_co);
auto const_co = make_const_transform(kernel_co, x_size * y_size);
auto zip_co = make_zip_transform(stencil_co, const_co);
auto map_co = make_map_transform(zip_co,map_operation<mul_op>());
}
Perform pair-wise multiplication on all zipped elements.
![Page 30: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/30.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
auto stencil_co = make_stencil_transform<9,9>(input_co);
auto const_co = make_const_transform(kernel_co, x_size * y_size);
auto zip_co = make_zip_transform(stencil_co, const_co);
auto map_co = make_map_transform(zip_co,map_operation<mul_op>());
auto reduce_co = make_map_transform(map_co,reduce_operation<add_op>());
}
Perform vector reduction on all map_co elements; this
finishes the convolution.
![Page 31: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/31.jpg)
Triolet/C++ Convolution Code
std::vector<int> 2dconvolution(const std::vector<int>& input,
int x_size, int y_size,
const std::vector<float>& kernel)
{
auto input_co = make_matrix<int, 2>(input, x_size, y_size);
auto kernel_co = make_small_vector<int>(kernel);
auto stencil_co = make_stencil_transform<9,9>(input_co);
auto const_co = make_const_transform(kernel_co, x_size * y_size);
auto zip_co = make_zip_transform(stencil_co, const_co);
auto map_co = make_map_transform(zip_co,map_operation<mul_op>());
auto reduce_co = make_map_transform(map_co,reduce_operation<add_op>());
std::vector<int> output = Evaluate<std::vector, int>(reduce_co);
}
The compiler performs actual code synthesis
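For readers who want to trace the data flow of the pipeline (stencil, const, zip, map with multiply, reduce with add), here is a plain Python sketch. It is not the Triolet API: the function names are hypothetical stand-ins, zero padding at the image border is an assumption the slides do not state, and the kernel is applied without flipping, as in the element-wise multiply shown on the slides.

```python
from functools import reduce
from operator import add, mul

def stencil_transform(image, width, height, k):
    """Yield, per pixel, its k*k neighbourhood (zero-padded at the borders)."""
    h = k // 2
    for y in range(height):
        for x in range(width):
            yield [image[yy * width + xx]
                   if 0 <= xx < width and 0 <= yy < height else 0
                   for yy in range(y - h, y + h + 1)
                   for xx in range(x - h, x + h + 1)]

def const_transform(value, count):
    """Conceptually replicate one value `count` times without copying it."""
    return (value for _ in range(count))

def convolution2d(image, width, height, kernel, k):
    stencils = stencil_transform(image, width, height, k)
    kernels = const_transform(kernel, width * height)
    zipped = zip(stencils, kernels)                    # zip step
    products = (map(mul, s, w) for s, w in zipped)     # map with multiply
    return [reduce(add, p) for p in products]          # reduce with add

image = [1, 2, 3,
         4, 5, 6,
         7, 8, 9]
identity = [0, 0, 0, 0, 1, 0, 0, 0, 0]
print(convolution2d(image, 3, 3, identity, 3))
```

Every stage except the final list is lazy, so nothing is materialized until the last line of `convolution2d`, loosely mirroring how the slides defer all work until `Evaluate` is called.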
![Page 32: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/32.jpg)
Convolution Performance
• All versions (except Naive) are multithreaded and
vectorized
– 8x performance difference (Intel vs. Ours)
Triolet/C++ with TBB/SSE,
optimized with Tangram
Run time measured on
Intel Core i7-3820
(4-core, hyperthreaded)
![Page 33: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/33.jpg)
Triolet C++ equalize_frames
Example
[Bar chart: histo_equalizer (large dataset = 4K), in frame/sec. Baseline (CPU): 39.1; Tuned CPU: 121.5; Tuned GPU: 260.2.]
• Baseline is parallel code taken from DARPA
PERFECT benchmark suite
• Tuned code outperforms baseline on both
architectures
CPU time measured on Intel Xeon E5520 (4-core, hyperthreaded)
GPU time measured on Tesla C2050
Triolet code optimized with Tangram
![Page 34: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/34.jpg)
10x10 Heterogeneous Architecture
[Diagram: micro-engines, each with its own I-Cache, attached to a Shared L1 Data Cache and Local Memory. Engine labels: Basic RISC CPU, BnB, GenPM, DLT, FFT/Sort, <tbd>, µengine #5, µengine #6. Group labels: 1st Set, 2nd Set (in process). Other labels: Architecture, Application, Compiler, GPM.]
• 1st set of micro-engines evaluation nearly complete
• Developing a set of complementary micro-engines
• Composite evaluation: 5x5
• Global Power Management: power & performance models for core and link frequency changes + opportunity assessment
• Compilation: Mapreduce, micro-engine, and robust vectorization and locality management
![Page 35: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/35.jpg)
Compiled Conv2D on BnB
DARPA PERFECT - 10x10:
Systematic Heterogeneity
June 24-25, 2014
[Bar chart: insts, Cycles, and Energy (nJ) for Base RISC vs. BnB.]
~10.5x improvement across the board
1024x1024 image, HMC memory system
![Page 39: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/39.jpg)
Nonuniform FT (real part)
Programming in Triolet Python
ys = [sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)]
Inner loop: sum(x * cos(r*k) for (x, k) in zip(xs, ks))
Outer loop: for r in par(rs)
• “map and reduce” style programming: no new paradigm to learn
• Parallel details are implicit: easy to use
• Automated data partitioning, MPI rank generation, MPI messaging, OpenMP, etc.
• Race-free, type-safe: no crashes or nondeterminism (with standard caveat about operator associativity)
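The one-line comprehension can be mimicked in standard Python to show its structure: an inner reduction over zipped (x, k) pairs and an outer map over rs. The sketch below is not Triolet; `nuft_real_parallel` is a hypothetical stand-in that uses a thread pool where Triolet uses `par`, and in CPython it demonstrates the shape of the computation rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import cos

def nuft_real(xs, ks, rs):
    """Sequential reference: one output per r, reduced over (x, k) pairs."""
    return [sum(x * cos(r * k) for (x, k) in zip(xs, ks)) for r in rs]

def nuft_real_parallel(xs, ks, rs, workers=4):
    """Same computation with the outer loop mapped over a thread pool."""
    def one_output(r):
        # Inner loop: an independent reduction, so outputs never race.
        return sum(x * cos(r * k) for (x, k) in zip(xs, ks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_output, rs))   # map preserves input order

xs, ks, rs = [1.0, 2.0], [0.0, 1.0], [0.0, 1.0, 2.0]
print(nuft_real(xs, ks, rs))
print(nuft_real_parallel(xs, ks, rs))
```

Because each output element depends only on its own r, the outer loop needs no synchronization, which is the property the race-free bullet relies on.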
![Page 40: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/40.jpg)
Productivity and Performance
Triolet:
ys = [sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)]
C with MPI+OpenMP:
Setup:
  int mpi_nprocs, mpi_id;
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);
  const int rank0 = mpi_id == 0;
  MPI_Bcast(&size_x, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast(&size_k, 1, MPI_INT, 0, MPI_COMM_WORLD);
  const int chunk_size_x = ceildiv(size_x, mpi_nprocs);
  const int padded_size_x = chunk_size_x * mpi_nprocs;
  float *ks, *xs;
  if (rank0) {
    ks = input_ks;
    xs = input_xs;
  } else {
    ks = malloc(size_k * sizeof(float));
    xs = malloc(size_k * sizeof(float));
  }
  float *rs_chunk = malloc(chunk_size_x * sizeof(float));
  float *ys_chunk = malloc(chunk_size_x * sizeof(float));
  if (rank0) {
    int nworkers = mpi_nprocs-1;
    MPI_Request *reqs = malloc(nworkers * 3 * sizeof(MPI_Request));
    int w;
    for (w = 0; w < nworkers; w++) {
      int worker_id = w+1;
      MPI_Isend(ks, size_k, MPI_FLOAT, worker_id,
                0, MPI_COMM_WORLD, &reqs[w]);
      MPI_Isend(xs, size_k, MPI_FLOAT, worker_id,
                0, MPI_COMM_WORLD, &reqs[nworkers+w]);
      MPI_Isend(rs + worker_id*chunk_size_x, chunk_size_x, MPI_FLOAT, worker_id,
                0, MPI_COMM_WORLD, &reqs[2*nworkers+w]);
    }
    memcpy(rs_chunk, rs, chunk_size_x * sizeof(float));
    MPI_Waitall(nworkers*3, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
  } else {
    MPI_Recv(ks, size_k, MPI_FLOAT, 0,
             0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(xs, size_k, MPI_FLOAT, 0,
             0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(rs_chunk, chunk_size_x, MPI_FLOAT, 0,
             0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  ----------------------------------------
  {
    int i;
#pragma omp parallel for schedule(static)
    for (i = 0; i < chunk_size_x; i++) {
      float s = 0;
      int j;
      for (j = 0; j < size_k; j++)
        s += xs[j] * cosf(rs_chunk[i] * ks[j]);
      ys_chunk[i] = s;
    }
  }
  ----------------------------------------
Cleanup:
  MPI_Gather(ys_chunk, chunk_size_x, MPI_FLOAT, ys, chunk_size_x, MPI_FLOAT,
             0, MPI_COMM_WORLD);
  free(rs_chunk);
  free(ys_chunk);
  if (rank0) {
    free(rs);
  } else {
    free(xs);
    free(ks);
  }
• Library functions factor out data decomposition,
parallelism, and communication
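The three concerns named in this bullet (data decomposition, parallelism, communication) can be separated into small reusable functions. The Python sketch below is a single-process simulation with hypothetical helper names; it uses no MPI and only imitates the scatter, per-rank compute, and gather structure that a library could hide from the user.

```python
from math import cos

def ceildiv(a, b):
    """Ceiling division for non-negative integers."""
    return -(-a // b)

def scatter(values, nprocs, pad=0.0):
    """Split `values` into nprocs equal chunks, padding the tail."""
    chunk = ceildiv(len(values), nprocs)
    padded = list(values) + [pad] * (chunk * nprocs - len(values))
    return [padded[p * chunk:(p + 1) * chunk] for p in range(nprocs)]

def kernel(rs_chunk, xs, ks):
    """Per-rank work: the same loop nest every rank runs on its own chunk."""
    return [sum(x * cos(r * k) for x, k in zip(xs, ks)) for r in rs_chunk]

def gather(chunks, n):
    """Concatenate per-rank results and drop the padding."""
    return [y for chunk in chunks for y in chunk][:n]

def distributed_nuft(xs, ks, rs, nprocs):
    chunks = scatter(rs, nprocs)                     # data decomposition
    partial = [kernel(c, xs, ks) for c in chunks]    # one entry per simulated rank
    return gather(partial, len(rs))                  # results collected in rank order

xs, ks = [1.0, 2.0], [0.0, 1.0]
rs = [0.0, 0.5, 1.0, 1.5, 2.0]
print(distributed_nuft(xs, ks, rs, nprocs=4))
```

Padding every chunk to the same length keeps the per-rank buffers uniform, at the cost of a little wasted work on the last rank.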
![Page 41: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/41.jpg)
Productivity and Performance
• Library functions factor out data decomposition,
parallelism, and communication
ys = [sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)]
128-way Speedup
(16 cores × 8 nodes)
Triolet: 99
C with MPI+OpenMP: 115
![Page 42: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/42.jpg)
Cluster-Parallel Performance and Scalability
• Triolet delivers large
speedup over
sequential C
– On par with manually
parallelized C
– Except in CUTCP;
needs better GC
policy for large arrays
• Similar high-level
interfaces incur
additional overhead
– Message passing
– Array split/merge
– Run time variability
[Figure: four panels (MRI-Q, TPACF, SGEMM, CUTCP). x-axis: number of cores; y-axis: speedup over sequential C code.]
Rodrigues, et al PPoPP 2014
![Page 43: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/43.jpg)
Triolet Pattern Domain Coverage
Example Benchmark | Extra Patterns | Vectorizable
Discrete Wavelet Transform | deinterleave, regions | pixels
2D Convolutions | regions | pixels
Histogram Equalization | histogram, scan, lut | pixels, bins
System Solver | triangular loop | matrix rows
Inner Product | - | channels
Outer Product | triangular loop | channels
Interpolation 1 | - | range coords
Interpolation 2 | lut | range coords
Back Projection | lut |
Debayer | interleave, regions | pixels
Image Registration | - | pixels
Change Detection | - | -
Sort | sort | array elements
FFT 1D | fft |
FFT 2D | fft |
• Benchmarks suitable for HOF
library
– Each loop has one output
– Outer parallelizable map in
many benchmarks
– Small set of computation
patterns used repeatedly
• Loops: map, reduce
• Data merging: zip, outer
product
• array reshaping
• 8 of 15 benchmarks already
ported to Triolet Python
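As one example of how the listed patterns compose, the histogram equalization row names three extra patterns: histogram, scan, and lut. The Python sketch below chains them in that order. It is a simplified formulation written for illustration, with hypothetical function names, and is not the PERFECT benchmark code or Triolet library code.

```python
from itertools import accumulate

def histogram(pixels, bins):
    """Histogram pattern: count occurrences of each pixel value."""
    counts = [0] * bins
    for p in pixels:
        counts[p] += 1
    return counts

def equalize(pixels, bins=256):
    if not pixels:
        return []
    hist = histogram(pixels, bins)
    cdf = list(accumulate(hist))                      # scan (prefix sum)
    total = cdf[-1]
    lut = [(c * (bins - 1)) // total for c in cdf]    # lookup table from the scan
    return [lut[p] for p in pixels]                   # map each pixel through the lut

print(equalize([0, 0, 1, 3], bins=4))
```

The final step is a plain map over pixels, which is why the table lists pixels as vectorizable for this benchmark.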
![Page 44: Triolet C++/Python productive programming in …impact.crhc.illinois.edu/Shared/PR/DoD-7-16-2014_final.pdf2014/07/16 · Triolet C++/Python –productive programming in heterogeneous](https://reader034.vdocuments.net/reader034/viewer/2022050506/5f981e1c0cb87e0cbb62f509/html5/thumbnails/44.jpg)
Conclusion
• Near-term impact
– MxPA locality-centric scheduling, kernel fusion scheduling, and
dynamic vectorization make OpenCL kernel performance portability a
reality – ready for industry impact
– MulticoreWare MxPA product with several customers including
Movidius and Samsung
• Medium term impact
– Triolet C++ brings intentional programming into C++, giving
CUDA/OpenCL/OpenMP/OpenACC developers a much more
productive, maintainable, portable new option
– Immediate commercial opportunity in mobile and server SOCs
• Long-term outlook
– Triolet Python further brings intentional programming into
heterogeneous computing server clusters and distributed computing
– Triolet Java?