Harnessing OpenCL in modern coprocessors

Unai Lopez-Novoa

06 Aug 2014

Intelligent Systems Group

University of the Basque Country UPV/EHU

Outline

• Previous work

• Work @ UniMan: Relational Join
  1. Motivation
  2. Algorithm
  3. Results
  4. Conclusions

About Myself

• PhD Student @ Intelligent Systems Group: 2011 – Now

• Research interest: Efficient use of modern coprocessors
  • Performance modeling
  • Code acceleration

• Development of parallel implementations
  • Molecular Dynamics simulation code (MSc thesis)
  • Kernel Density Estimation (Under review)
  • Relational Join (Work @ UniMan)

Kernel Density Estimation

• Estimate the Probability Density Function of a population

• Our use case: Climate models

• Challenge: large volumes of data

[Figure: two panels, "Histogram" and "KDE"]

Kernel Density Estimation

• 1st: Algorithmic rework

• 2nd: Parallel implementation: multi/many core processors
  • Compared to R+MKL and CUDA implementations

Naive approach

    for each evaluation_point e
        for each sample s
            d = distance(e, s)
            e.density += density(d)

Our approach

    B = computeBoundingBox()
    for each sample s
        b = fitBoundingBox(B, s)
        for each evaluation_point e in b
            d = distance(e, s)
            e.density += density(d)
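The difference between the two loops can be sketched in Python. This is a minimal 1-D illustration, not the implementation from the talk: the Epanechnikov kernel, the uniform evaluation grid, and the names `kde_naive` and `kde_bounded` are all assumptions chosen so that the bounding box has an exact meaning (the kernel is zero outside a window of half-width `h` around each sample).

```python
import math

def epanechnikov(u):
    """Compact-support kernel: zero whenever |u| >= 1 (assumed kernel)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde_naive(grid, samples, h):
    """Naive approach: visit every (evaluation point, sample) pair."""
    density = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            density[i] += epanechnikov((e - s) / h)
    return [d / (len(samples) * h) for d in density]

def kde_bounded(grid, samples, h):
    """Bounding-box approach: for each sample, visit only nearby grid points.

    Assumes `grid` is sorted ascending and uniformly spaced.
    """
    x0 = grid[0]
    step = grid[1] - grid[0]
    density = [0.0] * len(grid)
    for s in samples:
        # Index range of grid points that can fall inside [s - h, s + h].
        lo = max(0, math.floor((s - h - x0) / step))
        hi = min(len(grid) - 1, math.ceil((s + h - x0) / step))
        for i in range(lo, hi + 1):
            density[i] += epanechnikov((grid[i] - s) / h)
    return [d / (len(samples) * h) for d in density]

grid = [i * 0.1 for i in range(101)]   # evaluation points on [0, 10]
samples = [1.0, 2.5, 2.7, 9.95]
naive = kde_naive(grid, samples, h=0.5)
bounded = kde_bounded(grid, samples, h=0.5)
```

Both functions produce the same estimate; the second one skips the pairs whose contribution is known to be zero, which is where the savings come from when the grid is much larger than the kernel window.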

Work @ UniMan

Join

Do sunblock sales correlate with weather?

[Figure: tables "Sales" and "Weather" combined by Join-Date(Sales, Weather)]

Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM, 2013.

Join

• Join is an everyday operation

Join

Goal: Develop a parallel implementation of relational join targeting today's heterogeneous systems

Heterogeneous systems

• Performance depends on the nature of the application

• Multi-core: 16 cores, 250 GFLOP/s

• Many-core: 61 cores, 1 TFLOP/s

• GPU: 2880 cores, 1.3 TFLOP/s

[Figure: spectrum from "Complex control flow" to "Number crunching"]

• Wide variety of programming environments in HPC
  • OpenMP, CUDA, MPI, TBB, …

• Our choice: OpenCL

[Figure: Write once, Compile (NVIDIA SDK, Intel SDK, AMD SDK), Run many]


• Cross-platform portability != Performance portability
  • OpenCL: Abstraction layer

• Solution 1: per-device hand tuning
  • Not portable at all

• Solution 2: auto tuning
  • Rely on performance models
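Model-driven auto tuning can be sketched as picking, from a set of candidate configurations, the one a performance model predicts to be cheapest. The sketch below is only an illustration under stated assumptions: `autotune`, `toy_cost`, and every constant in the toy model are hypothetical and do not come from the talk or from any OpenCL SDK.

```python
import math

def autotune(candidates, cost):
    """Return the candidate configuration with the lowest predicted cost."""
    return min(candidates, key=cost)

def toy_cost(group_size, n_items=1 << 20, cores=61, per_group_overhead=50.0):
    """Hypothetical performance model for one kernel launch.

    Work-groups run in 'waves' of `cores` groups at a time; each group pays
    a fixed scheduling overhead plus one unit of time per work item.
    """
    groups = math.ceil(n_items / group_size)
    waves = math.ceil(groups / cores)
    return waves * (group_size + per_group_overhead)

candidates = [64, 128, 256, 512, 1024]
best = autotune(candidates, toy_cost)
```

The point of the design is that the search needs no runs on the device: only the model changes from one device to another, while the kernel stays parameterized by `group_size`.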

Previous work

• Collection of performance modeling proposals for the latest GPUs and Intel Xeon Phi

• Comprehensive analysis of the literature since ~2007

• Organized as:
  • Execution time estimation
  • Bottleneck highlighting
  • Power consumption estimation
  • Simulators

Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation Techniques for Accelerator-based Computing." IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216

Types of Join

Table A: 100, 103, 104
Table B: 100, 102

Inner        Left Outer   Right Outer  Full Outer
A    B       A    B       A    B       A    B
100  100     100  100     100  100     100  100
             103  -       -    102     103  -
             104  -                    104  -
                                       -    102
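The four join types can be reproduced on the keys of Table A and Table B with a short Python sketch. The function name `join` and the use of `None` for the "-" entries are assumptions made for illustration; the nested scan is chosen for readability, not speed.

```python
def join(a_keys, b_keys, how="inner"):
    """Return (a, b) key pairs; None marks a missing side ('-' on the slide)."""
    rows = []
    for a in a_keys:
        matches = [b for b in b_keys if b == a]
        if matches:
            rows.extend((a, b) for b in matches)
        elif how in ("left", "full"):
            # Keep unmatched keys of Table A.
            rows.append((a, None))
    if how in ("right", "full"):
        # Keep unmatched keys of Table B.
        a_set = set(a_keys)
        rows.extend((None, b) for b in b_keys if b not in a_set)
    return rows

table_a = [100, 103, 104]
table_b = [100, 102]
inner = join(table_a, table_b, "inner")
left = join(table_a, table_b, "left")
right = join(table_a, table_b, "right")
full = join(table_a, table_b, "full")
```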

Algorithm

• Biggest debate: Sort or Hash?

Hash-join
• Procedure: 1. Hash smaller table 2. Scan larger table
• Complexity: O(n + m)
• Limitation: Extensive use of atomics prevents efficient parallelization

Sort-join
• Procedure: 1. Sort keys 2. Scan interleaved
• Complexity: O(n·log(n))
• Limitation: Sorting increases complexity
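The hash-join procedure (hash the smaller table, then scan the larger one) can be sketched sequentially in Python. This is an illustrative sketch, not the code evaluated in the talk; `hash_join` is a hypothetical name. In a parallel version, the insertions into the shared buckets are the step that would need atomic operations, which is the limitation noted for hash-join.

```python
from collections import defaultdict

def hash_join(left, right):
    """Inner join of two key lists; returns (left_key, right_key) pairs."""
    # Step 1: hash (build) the smaller table.
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    buckets = defaultdict(list)
    for key in small:
        buckets[key].append(key)

    # Step 2: scan (probe) the larger table.
    swapped = small is not left
    pairs = []
    for key in large:
        for match in buckets.get(key, ()):
            pairs.append((key, match) if swapped else (match, key))
    return pairs

pairs = hash_join([100, 104, 103, 103, 102, 100], [100, 102, 101, 102])
```

Each table is traversed once, which is where the O(n + m) cost comes from, assuming constant-time bucket lookups.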

Algorithm

• Step 1: Sort keys in both tables
  • Radix sort: speed/scalability sweet spot

Table A: 100, 104, 103, 103, 102, 100  →  Sort  →  100, 100, 102, 103, 103, 104
Table B: 100, 102, 101, 102  →  Sort  →  100, 101, 102, 102
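A least-significant-digit radix sort, the kind of sort named in Step 1, can be sketched as repeated histogram, prefix-sum, and scatter passes. The sketch is sequential and assumes non-negative integer keys; `radix_sort` and `bits_per_pass` are hypothetical names, and the talk does not state which radix sort variant was used.

```python
def radix_sort(keys, bits_per_pass=4):
    """LSD radix sort for non-negative integers; returns a new sorted list."""
    if not keys:
        return []
    radix = 1 << bits_per_pass
    mask = radix - 1
    out = list(keys)
    max_key = max(out)
    shift = 0
    while (max_key >> shift) > 0:
        # Histogram of the current digit.
        counts = [0] * radix
        for k in out:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum: first output slot of each digit value.
        offsets = [0] * radix
        for d in range(1, radix):
            offsets[d] = offsets[d - 1] + counts[d - 1]
        # Stable scatter into the output buffer.
        nxt = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            nxt[offsets[d]] = k
            offsets[d] += 1
        out = nxt
        shift += bits_per_pass
    return out

sorted_a = radix_sort([100, 104, 103, 103, 102, 100])
sorted_b = radix_sort([100, 102, 101, 102])
```

The three phases of each pass (histogram, prefix sum, scatter) are the parts that parallel implementations typically distribute across threads.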

Algorithm

• Step 2: Merge
  • Add non-matching keys for outer joins

Table A: 100, 100, 102, 103, 103, 104
Table B: 100, 101, 102, 102

Result – Inner Join:
100 100
100 100
102 102
102 102
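The merge step for an inner join can be sketched as a two-pointer scan over the sorted key lists, emitting one pair for every combination of equal keys. This is a sequential sketch with a hypothetical name (`merge_join`); it covers only the inner join and leaves out the extra rows that outer joins add.

```python
def merge_join(a_sorted, b_sorted):
    """Inner join of two ascending key lists via an interleaved scan."""
    pairs = []
    i = j = 0
    while i < len(a_sorted) and j < len(b_sorted):
        if a_sorted[i] < b_sorted[j]:
            i += 1
        elif a_sorted[i] > b_sorted[j]:
            j += 1
        else:
            key = a_sorted[i]
            # Find the run of equal keys in each table.
            i_end = i
            while i_end < len(a_sorted) and a_sorted[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b_sorted) and b_sorted[j_end] == key:
                j_end += 1
            # Every key in A's run matches every key in B's run.
            pairs.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return pairs

result = merge_join([100, 100, 102, 103, 103, 104], [100, 101, 102, 102])
```

With the keys shown for Table A and Table B, key 100 appears twice in A and once in B, and key 102 once in A and twice in B, which gives the four result rows.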

Implementation

• Steps:
  1) Develop a naive OpenCL implementation
  2) Optimize per device type
  3) Add a cost model for load balancing and partitioning

• Experimental setup:
  • M1: 4-core (x2 SMT) Xeon + Xeon Phi + 384-core GPU
  • M2: 12-core (x2 SMT) Xeon + Xeon Phi + 2496-core GPU

• Baseline: ModernGPU (CUDA)

Results

Per-device tuning

• Optimizations:
  • Thread scheduling
  • Memory management

• Overheads:
  • Compilation
  • Memory allocation

Optimizations

• Per device thread scheduling

[Figure: the threads of an OpenCL kernel are organized into groups and mapped onto OpenCL devices: groups 0 to 3 on a four-core CPU, groups 0 to 60 on a 61-core Xeon Phi]
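One way to read the thread-scheduling figure is that the same work is split into as many groups as the device has cores. The sketch below assumes exactly that (one group per core, contiguous chunks); `partition` is a hypothetical helper, not an OpenCL call, and the item count is arbitrary.

```python
def partition(n_items, n_groups):
    """Split range(n_items) into n_groups contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_groups)
    chunks = []
    start = 0
    for g in range(n_groups):
        # The first `extra` groups take one additional item.
        size = base + (1 if g < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

cpu_groups = partition(1000, 4)    # four-core CPU: groups 0..3
phi_groups = partition(1000, 61)   # 61-core Xeon Phi: groups 0..60
```

Only the group count changes between devices; the work assigned to each item is the same, which keeps the kernel itself device independent.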

Optimizations

• Per device memory management

OpenCL Device Memory Hierarchy

Memory:  Private     Local          Global
Scope:   Thread      Thread-group   Any thread

Physical memory, one row per device in the figure:
         Registers   On-chip RAM    RAM
         Registers   RAM            RAM
         Registers   RAM            RAM

Overheads

• Compilation
  • Online compilation: X% of runtime (without I/O)

• Memory allocation
  • Intel SDK: Y% of Merge Step on Xeon Phi

OpenCL Program
• Host code: compiled offline (gcc)
• Device code: compiled online (SDK)

Results

Future work

1) Finish tuning per-device code

2) Test join on an FPGA

3) Revisit partitioning strategy

4) Support multi-device execution
  • Develop a cost model that characterizes Join
  • Split the workload at runtime among existing devices
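Splitting a workload among devices according to a cost model can be sketched as a proportional partition. Everything here is an assumption for illustration: `split_workload` is a hypothetical helper, and the per-device throughput estimates simply reuse the peak GFLOP/s figures quoted earlier in the talk as stand-ins for what a real join cost model would predict.

```python
def split_workload(n_rows, throughput):
    """Divide n_rows among devices in proportion to estimated throughput."""
    total = sum(throughput.values())
    shares = {dev: int(n_rows * t / total) for dev, t in throughput.items()}
    # Rows lost to rounding down go to the fastest device.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += n_rows - sum(shares.values())
    return shares

# Stand-in estimates (GFLOP/s); a real model would be join specific.
estimates = {"multi-core": 250, "many-core": 1000, "gpu": 1300}
shares = split_workload(1_000_000, estimates)
```

A runtime system would recompute the estimates per input (table sizes, key distribution) before each split instead of using fixed constants.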

Conclusions

• Performance: device specific code

• Performance portability:
  a) Platform specific code
  b) Parameterizable code

• High OpenCL SDK dependence
  • Only portable debugging tool: printf

• …but still the only portable framework
  • Future: OpenACC / OpenMP 4.0 ?
