TRANSCRIPT
Write faster code
- First ask: is the program I/O bound or compute bound? Profile the code to find out (80/20 rule) and pick the low-hanging fruit.
- I/O bound → get an SSD.
- Compute bound →
  - Approximate calculations: compute a different quantity, or use lower precision.
  - Data layout: refactor how data sits in memory.
  - Compute flow: process-level or loop-level parallelism (pthreads, OpenMP), or modify the algorithm.
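The "profile first" step can be sketched with Python's built-in cProfile and pstats modules (the function names here are illustrative, not from the talk):

```python
import cProfile
import io
import pstats

def slow_part():
    # Deliberately heavy loop: the hotspot the profiler should surface.
    return sum(i * i for i in range(200_000))

def fast_part():
    return sum(range(1_000))

def main():
    for _ in range(5):
        slow_part()
    fast_part()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Rank functions by cumulative time: per the 80/20 rule, a few
# entries at the top account for most of the runtime.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The top of the report names the candidates worth optimizing; everything below it is usually not worth the effort.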
// parallel vectors
float metadata[N]; float metadata[N];
Data layout matters
- Interleaved: x1 y1 f1 … f1 f1 x2 y2 f2 … f2 f2 — data not contiguous in memory; memory jumps when accessing the data lead to slow distance calculations.
- Contiguous: f1 f1 f1 f1 … f2 f2 f2 f2 … — data is contiguous.
[Plot: slowdown factor (1x to 2.8x) vs. jump size in bytes (0 to 8192) when accessing non-contiguous data.]
Language choice
Considerations: prototyping vs. shipping, readability, time, existing software, memory, power, speed, security, hardware, dependencies.
A simple benchmark
An algorithm that is
• well-understood
• not domain-specific
• computationally intensive
Computing the Cholesky decomposition of A,
A = L Lᵀ,
simplifies the process of solving Ax = b.
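As a reference point for the benchmark, an unoptimized pure-Python Cholesky could look like this (a sketch, not necessarily the talk's benchmarked code):

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L Lᵀ, for a symmetric
    positive-definite matrix A given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the partial rows i and j of L.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
print(L)  # [[2.0, 0.0], [1.0, 1.4142135623730951]]
```

The triple-nested loop is O(n³) interpreted Python, which is why this baseline sits orders of magnitude above the library-backed variants in the plots that follow.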
[Diagram: Python acceleration options plotted by speed vs. effort — pure Python, Cython, NumPy/SciPy, Numba, and BLAS/LAPACK on the CPU side; PyCUDA/Scikit-CUDA and Numba-CUDA on the GPU side — ranging from "any algorithm" to "standard algorithms" only.]
Python Cholesky implementations
[Plot: execution time (ms, 0.0001 to 10,000,000) vs. matrix size (32 to 4096) for Python, NumPy, Numba, np.linalg, sp.linalg, sp.linalg.lapack, and skcuda.]
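The library-backed variants in this comparison reduce to one-line calls (a sketch; the skcuda variant is omitted here since it needs a GPU):

```python
import numpy as np
import scipy.linalg

# Build a symmetric positive-definite test matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((256, 256))
A = M @ M.T + 256 * np.eye(256)

L_np = np.linalg.cholesky(A)                 # np.linalg
L_sp = scipy.linalg.cholesky(A, lower=True)  # sp.linalg
# Direct LAPACK binding, as in the sp.linalg.lapack series:
L_lp, info = scipy.linalg.lapack.dpotrf(A, lower=1)

print(np.allclose(L_np @ L_np.T, A))      # True
print(np.allclose(np.tril(L_lp), L_np))   # True
```

All three dispatch to compiled LAPACK under the hood; the spread between them in the plot comes mostly from per-call Python overhead, which matters at small matrix sizes.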
[Diagram: C++ acceleration options plotted by speed vs. effort — plain C++, SIMD, compiler options, and BLAS/LAPACK on the CPU side; C++ CUDA and CUBLAS/CUDNN on the GPU side — ranging from "any algorithm" to "standard algorithms" only.]
C++ Cholesky implementations (M = L Lᵀ)
[Plot: execution time (ms, 0.01 to 100,000) vs. matrix size (64 to 4096) for CPP-O3, CPP-Fast-Math, BLAS (n=1), and AVX.]
C++ Cholesky implementations (M = L Lᵀ)
[Plot: execution time (ms, 0.01 to 100,000) vs. matrix size (64 to 4096) for CPP-O3, CPP-Fast-Math, BLAS (n=1), AVX, Eigen, LAPACK (n=1), LAPACK, and CUDA.]
Using CUDA from Python vs C++
[Plot: execution time (ms, log scale, 0.1 to 1000) vs. matrix size (4 to 4096) for CUDA, CUDA-compute, skcuda, and skcuda-compute.]
Using domain knowledge
SIMD implementation
• 4x less storage
• 8-12x faster feature computation
• 64x faster feature matching
Image courtesy of scikit-cuda docs
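The 4x storage figure is the kind of win you get by dropping float32 features to 8-bit integers once domain knowledge says the extra precision is unneeded (a hypothetical illustration, not the talk's actual feature format):

```python
import numpy as np

n_features, dim = 10_000, 128
# Hypothetical descriptors, known from the domain to lie in [0, 1).
feats32 = np.random.default_rng(1).random((n_features, dim), dtype=np.float32)

# Quantize to 8 bits: 4x less storage than float32.
feats8 = np.round(feats32 * 255).astype(np.uint8)

print(feats32.nbytes // feats8.nbytes)  # 4
# Reconstruction error is bounded by half a quantization step:
err = np.abs(feats8.astype(np.float32) / 255 - feats32).max()
print(err <= 0.5 / 255 + 1e-6)  # True
```

Smaller elements also mean more of them fit in one SIMD register, which is where the faster feature computation and matching come from.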
C++ optimization cycle
1. Find hotspots (80/20 rule).
2. Select candidates for optimization.
3. Apply: libraries (Eigen, BLAS, LAPACK, CUDA, OpenMP); loop unrolling (watch for code bloat); the correct instructions (AVX/SSE/Arm NEON); domain knowledge; approximations.
Then profile again and repeat.
The end of general-purpose hardware
Images from company web pages/press releases
Thank you for listening