clblast: a tuned opencl blas library - iwocl · cedric nugteren, tomtom clblast: tuned opencl blas...
TRANSCRIPT
![Page 1: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/1.jpg)
CLBlast: A Tuned BLAS Library
Cedric Nugteren
May 16, 2018
http://github.com/cnugteren/clblast
http://cnugteren.github.io/clblast
![Page 2: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/2.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 2 out of 46
CLBlast?
![Page 3: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/3.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 3 out of 46
CLBlast?
![Page 4: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/4.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 4 out of 46
CLBlast?
![Page 5: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/5.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 5 out of 46
What do we use a BLAS library for?
![Page 6: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/6.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 6 out of 46
What do we use a BLAS library for?
BLAS: “Basic Linear Algebra Subpgrams”
![Page 7: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/7.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 7 out of 46
But why a new BLAS Library?
● NVIDIA’s cuBLAS is great, or is it?
![Page 8: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/8.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 8 out of 46
But why a new BLAS Library?
● NVIDIA’s cuBLAS is great, or is it?
![Page 9: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/9.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 9 out of 46
But why a new BLAS Library?
● NVIDIA’s cuBLAS is great, or is it?
– Not portable, not customisable, not open-source, ...
![Page 10: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/10.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 10 out of 46
But why a new BLAS Library?
● NVIDIA’s cuBLAS is great, or is it?
– Not portable, not customisable, not open-source, ...
● Is AMD’s clBLAS great?
– Not performance portable,
not well engineered, …
– Discontinued, superseeded
by incomplete ROCblas
![Page 11: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/11.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 11 out of 46
But why a new BLAS Library?
● NVIDIA’s cuBLAS is great, or is it?
– Not portable, not customisable, not open-source, ...
● Is AMD’s clBLAS great?
– Not performance portable,
not well engineered, …
– Discontinued, superseeded
by incomplete ROCblas
● Don’t vendors ship their
hardware with their own
libraries?
– Not a portable solution, and
actually not always true…
![Page 12: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/12.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 12 out of 46
Why is BLAS Important for ?
● HDMap making → Deep-learning
● Deep-learning → Fast BLAS libraries
![Page 13: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/13.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 13 out of 46
Introducing CLBlast
● CLBlast: Modern C++11 OpenCL BLAS library
● Implements all BLAS routines for all precisions (S, D, C, Z)
● Accelerates all kinds of applications:
– Fluid dynamics, quantum chemistry, linear algebra, finance, etc.
– Some extra focus on deep learning
![Page 14: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/14.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 14 out of 46
Introducing CLBlast
● CLBlast: Modern C++11 OpenCL BLAS library
● Implements all BLAS routines for all precisions (S, D, C, Z)
● Accelerates all kinds of applications:
– Fluid dynamics, quantum chemistry, linear algebra, finance, etc.
– Some extra focus on deep learning
● Already integrated into various projects:
– JOCLBlast (Java bindings)
– PyCLBlast (Python bindings)
– ArrayFire (GPU accelerated library and applications)
– OpenCL fork of Caffe (github.com/dividiti/ck-caffe)
– OpenCL fork of TF (github.com/hughperkins/tensorflow-cl)
![Page 15: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/15.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 15 out of 46
Introducing CLBlast
CI and
extensive testing
activity
community
![Page 16: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/16.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 16 out of 46
But… is it fast?
● All kernels are generic and tunable thanks to integration of the
CLTune auto-tuner (presented at MCSoC ‘15 and GTC ‘16)
![Page 17: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/17.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 17 out of 46
But… is it fast?
● All kernels are generic and tunable thanks to integration of the
CLTune auto-tuner (presented at MCSoC ‘15 and GTC ‘16)
![Page 18: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/18.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 18 out of 46
But… is it fast?
● All kernels are generic and tunable thanks to integration of the
CLTune auto-tuner (presented at MCSoC ‘15 and GTC ‘16)
● Tuned out-of-the-box for 50+ common devices– For new devices: run the auto-tuner when installing CLBlast
![Page 19: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/19.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 19 out of 46
CLBlast Benchmark Results
● Higher is better
● More results at http://cnugteren.github.io/clblast
AXPY
regular
(in GB/s)
AXPY
odd
(in GB/s)
GEMV
odd
(in GB/s)
GEMM
odd
(in GFLOPS)
GEMV
regular
(in GB/s)
GEMM
regular
(in GFLOPS)
![Page 20: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/20.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 20 out of 46
CLBlast on AMD Radeon HD7970
● On-par or better than clBLAS (especially for odd-sized GEMM)
![Page 21: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/21.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 21 out of 46
CLBlast on NVIDIA Titan X Pascal
● Much better than clBLAS, reasonably close to cuBLAS
![Page 22: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/22.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 22 out of 46
CLBlast on NVIDIA GeForce GTX 750Ti
● Much better than clBLAS, reasonably close to cuBLAS
![Page 23: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/23.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 23 out of 46
CLBlast on ARM Mali T628
● GEMM much better for CLBlast
![Page 24: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/24.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 24 out of 46
CLBlast on Intel Core i5-6200U CPU
● On-par or better than clBLAS
![Page 25: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/25.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 25 out of 46
CLBlast on Intel Skylake ULT GT2 GPU
● On-par or better than clBLAS (especially for GEMM)
![Page 26: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/26.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 26 out of 46
CLBlast on Intel Skylake ULT GT2 GPU
● On-par or better than clBLAS (especially for GEMM)
Intel GEMM update April ‘18
Library GFLOPS
clBLAS ~25
CLBlast v123 ~125
Intel’s GEMM snippet with suvgroup shuffling
~236
![Page 27: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/27.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 27 out of 46
CLBlast on Intel Skylake ULT GT2 GPU
● On-par or better than clBLAS (especially for GEMM)
Intel GEMM update April ‘18
Library GFLOPS
clBLAS ~25
CLBlast v123 ~125
CLBlast v124 with2D register tiling
~180
CLBlast v124 withIntel suvgroup shuffling
~230
Intel’s GEMM snippet with suvgroup shuffling
~236
![Page 28: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/28.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 28 out of 46
CLBlast for Deep Learning
● What can we do for the deep-learning community?
– Problem-specific tuning
– Half-precision floating-point (FP16)
– Batched routines
![Page 29: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/29.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 29 out of 46
Tuning Only for a Single Size?
● Default GEMM tuning:
– 1024x1024 matrices
● Deep-learning:
– Various but fixed matrix sizes (dependent on network layout)
– Typically smaller and/or rectangular matrices
![Page 30: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/30.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 30 out of 46
Tuning Only for a Single Size?
● Default GEMM tuning:
– 1024x1024 matrices
● Deep-learning:
– Various but fixed matrix sizes (dependent on network layout)
– Typically smaller and/or rectangular matrices
● Potential for optimal performance in CLBlast:
– Tuning for a custom size possible
– C++ API to change parameters at run-time
![Page 31: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/31.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 31 out of 46
Problem-Specific Tuning
● SGEMM tuning
for Radeon
M370X GPU
![Page 32: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/32.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 32 out of 46
Problem-Specific Tuning
● SGEMM tuning
for Radeon
M370X GPU
● Best on the
diagonal
● >100% due to
random tuning
![Page 33: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/33.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 33 out of 46
Problem-Specific Tuning
● SGEMM tuning
for Radeon
M370X GPU
● Best on the
diagonal
● >100% due to
random tuning
● Gain of ~2x for
some cases
default
![Page 34: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/34.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 34 out of 46
Half-precision floating-point (FP16)
● Double-precision (FP64) not needed for deep-learning
● Even FP32 is too much → introducing half-precision FP16
● Implemented in low-power devices (ARM Mali, Intel GPUs) and
deep-learning specific GPUs (Tesla P100, V100)
![Page 35: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/35.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 35 out of 46
Half-precision floating-point (FP16)
● Double-precision (FP64) not needed for deep-learning
● Even FP32 is too much → introducing half-precision FP16
● Implemented in low-power devices (ARM Mali, Intel GPUs) and
deep-learning specific GPUs (Tesla P100, V100)
● Potential for 2x savings in:
bandwidth, storage, compute, energy
![Page 36: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/36.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 36 out of 46
Half-precision floating-point (FP16)
● Double-precision (FP64) not needed for deep-learning
● Even FP32 is too much → introducing half-precision FP16
● Implemented in low-power devices (ARM Mali, Intel GPUs) and
deep-learning specific GPUs (Tesla P100, V100)
● Potential for 2x savings in:
bandwidth, storage, compute, energy
● Current FP16 support for GPUs:
– cuBLAS: HGEMM only
– clBLAS: no FP16 at all
– CLBlast: all routines!
![Page 37: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/37.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 37 out of 46
Half-precision FP16 on Intel Skylake GPU
● FP16 ~1.9x faster across the board!
FP32
FP16
clBLAS
![Page 38: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/38.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 38 out of 46
Half-precision FP16 on ARM Mali T628
● FP16 2+ times faster across the board!
FP32
FP16
clBLAS
![Page 39: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/39.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 39 out of 46
Batching BLAS routines
● Small-sized GEMM is super slow
– Not enough work-groups
– Not enough threads
![Page 40: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/40.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 40 out of 46
Batching BLAS routines
● Small-sized GEMM is super slow
– Not enough work-groups
– Not enough threads
● Let’s make it fast again:
– Combine multiple small GEMM operations into a single kernel
– Use offsets to indicate where the next matrices start
![Page 41: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/41.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 41 out of 46
Batched GEMM on GeForce GTX 750Ti
● SGEMM 128x128x128:
– Regular: ~60 GFLOPS
– Batched: ~20 GFLOPS (1 GEMM) up to ~500 GFLOPS (4K)!
![Page 42: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/42.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 42 out of 46
Batched GEMM on GeForce GTX 750Ti
● Significant benefits for larger sizes as well
– mostly beneficial in the range n=64 till 512
8 GEMMs
64 GEMMs
![Page 43: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/43.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 43 out of 46
Batched strided GEMM on GeForce GTX 750Ti
● Strided (left) versus regular (right):
– More assumptions, smaller overhead
– Better performance for smaller batches
![Page 44: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/44.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 44 out of 46
What’s next?
● More features for deep learning (convolution as GEMM, ...)
● Auto-tuning using learned models:
– Similar to the ‘ISAAC’ library
– Cooperation with:
● Rafael Ballester-Ripoll (University of Zürich)
– Based on tensor trains
● Flavio Vella (dividiti)
– Matrix-size aware tuning
● Suggestions?
![Page 45: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/45.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 45 out of 46
Conclusion
● CLBlast: a modern C++11 OpenCL BLAS library
● Performance portable thanks to generic kernels and auto-tuning
● Has important features to accelerate deep-learning:
– Problem-size specific tuning:
● Up to 2x in an example experiment
– Half-precision FP16 support:
● Up to 2x benefit in speed and memory savings
– Batched GEMM routines:
● Order of magnitude benefit depending on the use-case
![Page 46: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/46.jpg)
CLBlast: Tuned OpenCL BLASCedric Nugteren, TomTom Slide 46 out of 46
CLBlast
● Time to checkout the album…. Or clone the repository!
![Page 47: CLBlast: A Tuned OpenCL BLAS Library - IWOCL · Cedric Nugteren, TomTom CLBlast: Tuned OpenCL BLAS Slide 35 out of 46 Half-precision floating-point (FP16) Double-precision (FP64)](https://reader034.vdocuments.net/reader034/viewer/2022050204/5f57afedeae5ca7e39428428/html5/thumbnails/47.jpg)
CLBlast: A Tuned BLAS Library
Cedric Nugteren
May 16, 2018
http://github.com/cnugteren/clblast
http://cnugteren.github.io/clblast