Harnessing OpenCL in modern coprocessors
DESCRIPTION
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract: Today's HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them. As a PhD student, I am now on a brief research visit to the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
TRANSCRIPT
Harnessing OpenCL in modern coprocessors
Unai [email protected]
06 Aug 2014
Intelligent Systems Group
University of the Basque Country UPV/EHU
Outline
• Previous work
• Work @ UniMan: Relational Join
  1. Motivation
  2. Algorithm
  3. Results
  4. Conclusions
About Myself
• PhD Student @ Intelligent Systems Group: 2011 – Now
• Research interest: Efficient use of modern coprocessors
  • Performance modeling
  • Code acceleration
• Development of parallel implementations
  • Molecular Dynamics simulation code (MSc thesis)
  • Kernel Density Estimation (Under review)
  • Relational Join (Work @ UniMan)
Kernel Density Estimation
• Estimate the Probability Density Function of a population
• Our use case: Climate models
• Challenge: large volumes of data
[Figure: two panels, "Histogram" and "KDE"]
Kernel Density Estimation
• 1st: Algorithmic rework
• 2nd: Parallel implementation: multi/many core processors
  • Compared to R+MKL and CUDA implementations
Naive approach
  for each evaluation_point e
    for each sample s
      d = distance(e, s)
      e.density += density(d)
Our approach
  B = computeBoundingBox()
  for each sample s
    b = fitBoundingBox(B, s)
    for each evaluation_point e in b
      d = distance(e, s)
      e.density += density(d)
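The two loops above can be compared in a minimal Python sketch. It assumes a 1D, regularly spaced evaluation grid and an Epanechnikov kernel, which is zero beyond one bandwidth `h`; neither choice comes from the talk, and the function names are illustrative only. Under these assumptions the bounded version visits far fewer pairs and still matches the naive result.

```python
import math

def epanechnikov(u):
    """Compact-support kernel (an assumption): zero whenever |u| >= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde_naive(grid, samples, h):
    """Naive approach: visit every (evaluation point, sample) pair."""
    density = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            density[i] += epanechnikov((e - s) / h)
    return [d / (len(samples) * h) for d in density]

def kde_bounded(grid, samples, h):
    """Bounded approach: for each sample, visit only grid points within h."""
    x0 = grid[0]
    step = grid[1] - grid[0]          # regular grid spacing (assumption)
    density = [0.0] * len(grid)
    for s in samples:
        # Index range of the grid points inside [s - h, s + h].
        lo = max(0, math.ceil((s - h - x0) / step))
        hi = min(len(grid) - 1, math.floor((s + h - x0) / step))
        for i in range(lo, hi + 1):
            density[i] += epanechnikov((grid[i] - s) / h)
    return [d / (len(samples) * h) for d in density]
```

With a kernel of unbounded support (for example a Gaussian), the bounded version would be an approximation rather than an exact match.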
Work @ UniMan
Join
Do sunblock sales correlate with weather?
[Figure: tables Sales and Weather combined by Join-Date(Sales, Weather)]
Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM, 2013.
Join
• Join is an everyday operation
Join
Goal: Develop a parallel implementation of relational join targeting today's heterogeneous systems
Heterogeneous systems
• Performance depends on the nature of the application
• Multi-core: 16 cores, 250 GFLOP/s
• Many-core: 61 cores, 1 TFLOP/s
• GPU: 2880 cores, 1.3 TFLOP/s
[Figure: axis running from "Complex control flow" to "Number crunching"]
Heterogeneous systems
• Wide variety of programming environments in HPC
  • OpenMP, CUDA, MPI, TBB, …
• Our choice: OpenCL
  • Write once
  • Compile (NVIDIA SDK, Intel SDK, AMD SDK)
  • Run many
Heterogeneous systems
• Cross-platform portability != Performance portability
  • OpenCL: Abstraction layer
• Solution 1: per-device hand-made tuning
  • Not portable at all
• Solution 2: auto tuning
  • Rely on performance models
Previous work
• Collection of performance modeling proposals for latest GPUs and Intel Xeon Phi
• Comprehensive analysis of the literature since ~2007
• Organized as:
  • Execution time estimation
  • Bottleneck highlighting
  • Power consumption estimation
  • Simulators
Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation Techniques for Accelerator-based Computing." IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216
Types of Join
Table A: 100, 103, 104
Table B: 100, 102
Inner:        (100, 100)
Left Outer:   (100, 100), (103, -), (104, -)
Right Outer:  (100, 100), (-, 102)
Full Outer:   (100, 100), (103, -), (104, -), (-, 102)
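The four join types can be illustrated with a small Python sketch over key-only tables. The function name `join` and the `how` parameter are illustrative choices, not part of the talk; `None` plays the role of the "-" in the slide. It uses a quadratic nested loop for clarity only.

```python
def join(a, b, how="inner"):
    """Return (key_a, key_b) pairs; None marks a side with no match."""
    # Matching pairs are common to every join type.
    rows = [(x, y) for x in a for y in b if x == y]
    if how in ("left", "full"):
        # Keep keys of A that have no partner in B.
        rows += [(x, None) for x in a if x not in b]
    if how in ("right", "full"):
        # Keep keys of B that have no partner in A.
        rows += [(None, y) for y in b if y not in a]
    return rows

table_a = [100, 103, 104]
table_b = [100, 102]
full = join(table_a, table_b, "full")
```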
Algorithm
• Biggest debate: Sort or Hash?
Hash-join
• Procedure: 1. Hash smaller table 2. Scan larger table
• Complexity: O(n + m)
• Limitation: Extensive use of atomics prevents efficient parallelization
Sort-join
• Procedure: 1. Sort keys 2. Scan interleaved
• Complexity: O(n·log(n))
• Limitation: Sorting increases complexity
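As a rough illustration of the hash-join procedure (hash the smaller table, then scan the larger one), here is a sequential Python sketch. The function name and the key-only tables are assumptions for illustration. In a parallel version the build phase is where concurrent insertions into shared buckets would need atomics.

```python
from collections import defaultdict

def hash_join(small, large):
    """Inner join on keys: build a hash table on `small`, probe with `large`."""
    buckets = defaultdict(list)
    for key in small:                 # build phase: O(m)
        buckets[key].append(key)
    result = []
    for key in large:                 # probe phase: O(n) plus output size
        for match in buckets.get(key, []):
            result.append((key, match))
    return result
```

Output order follows the scan order of the larger table, so the result is not sorted by key.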
Algorithm
• Step 1: Sort keys in both tables
• Radix sort: speed/scalability sweet spot
Table A: 100, 104, 103, 103, 102, 100; sorted: 100, 100, 102, 103, 103, 104
Table B: 100, 102, 101, 102; sorted: 100, 101, 102, 102
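For reference, a sequential least-significant-digit radix sort can be sketched as follows. The 32-bit key width and 8-bit digit are assumptions, not values from the talk, and the actual OpenCL implementation is not shown in the slides. Each pass is a stable counting sort on one digit.

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # Histogram of the current digit.
        counts = [0] * (mask + 1)
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into output offsets.
        total = 0
        for d in range(len(counts)):
            counts[d], total = total, total + counts[d]
        # Stable scatter into the output buffer.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[counts[d]] = k
            counts[d] += 1
        keys = out
    return keys
```

The histogram, prefix sum, and scatter phases are the parts that data-parallel radix sorts typically distribute across threads.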
Algorithm
• Step 2: Merge
• Add non-matching keys for outer joins
Table A: 100, 100, 102, 103, 103, 104
Table B: 100, 101, 102, 102
Result – Inner Join: (100, 100), (100, 100), (102, 102), (102, 102)
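The merge step for an inner join can be sketched sequentially as below. It assumes both key lists are already sorted; the function name is illustrative. Runs of equal keys on both sides produce the cross product of matches, which is how the slide's example yields four result rows.

```python
def merge_join(a, b):
    """Inner join of two sorted key lists, duplicates included."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1                    # key only in A: skipped by an inner join
        elif a[i] > b[j]:
            j += 1                    # key only in B: skipped by an inner join
        else:
            key = a[i]
            # Find the end of the run of equal keys in each table.
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1
            # Every copy in A pairs with every copy in B.
            out.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out
```

For outer joins, the two skip branches are where the non-matching keys would be emitted instead of dropped.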
Implementation
• Steps:
  1) Develop a naive OpenCL implementation
  2) Optimize per device type
  3) Add a cost model for load balancing and partitioning
• Experimental setup:
  • M1: 4 (x2 SMT) Cores Xeon + Xeon Phi + 384 Cores GPU
  • M2: 12 (x2 SMT) Cores Xeon + Xeon Phi + 2496 Cores GPU
• Baseline: ModernGPU (CUDA)
Results
Per-device tuning
• Optimizations:
  • Thread scheduling
  • Memory management
• Overheads:
  • Compilation
  • Memory allocation
Optimizations
• Per device thread scheduling
[Figure: the threads of an OpenCL kernel are arranged into groups and mapped onto OpenCL devices: a four core CPU (units 0 to 3) and a 61 core Xeon Phi (units 0 to 60)]
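One plausible reading of this slide is that the number of thread groups is matched to the number of cores of each device. The Python sketch below illustrates that idea only; `partition_work` is a hypothetical helper, and the talk does not state the actual scheduling policy.

```python
def partition_work(n_items, n_units):
    """Split n_items work-items into n_units contiguous, near-equal groups."""
    base, extra = divmod(n_items, n_units)
    groups, start = [], 0
    for unit in range(n_units):
        # The first `extra` groups take one additional item.
        size = base + (1 if unit < extra else 0)
        groups.append((start, start + size))
        start += size
    return groups

cpu_groups = partition_work(1000, 4)    # four core CPU
phi_groups = partition_work(1000, 61)   # 61 core Xeon Phi
```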
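Optimizations
• Per device memory management
OpenCL Device Memory Hierarchy
Memory:  Private   Local          Global
Scope:   Thread    Thread-group   Any thread
[Figure: physical mapping per device. Private memory maps to registers; local memory maps to on-chip RAM on one device and to plain RAM on the others; global memory maps to RAM]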
Overheads
• Compilation
  • Online compilation: X% of runtime (without I/O)
• Memory allocation
  • Intel SDK: Y% of Merge Step in Xeon Phi
OpenCL program compilation:
  • Host code: Offline (gcc)
  • Device code: Online (SDK)
Results
Future work
1) Finish tuning per device code
2) Test join in FPGA
3) Revisit partitioning strategy
4) Support multi-device execution
  • Develop a cost model that characterizes Join
  • Split the workload at runtime among existing devices
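As a rough idea of what a runtime workload split could look like, the sketch below divides rows in proportion to a per-device throughput weight. The weights stand in for the output of a cost model that does not exist yet; the device names, integer weights, and the function `split_workload` are all hypothetical.

```python
def split_workload(n_rows, throughput):
    """Assign rows to devices in proportion to integer throughput weights."""
    total = sum(throughput.values())
    # Floor of each proportional share.
    shares = {dev: n_rows * t // total for dev, t in throughput.items()}
    # Rows lost to rounding go to the fastest devices first.
    leftover = n_rows - sum(shares.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for dev in fastest_first[:leftover]:
        shares[dev] += 1
    return shares

shares = split_workload(10, {"cpu": 1, "gpu": 2})
```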
Conclusions
• Performance: device specific code
• Performance portability:
  a) Platform specific code
  b) Parameterizable code
• High OpenCL SDK dependence
  • Only portable debugging tool: printf
• …but still the only portable framework
  • Future: OpenACC / OpenMP 4.0 ?