© 2018 The MathWorks, Inc.
GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB
Girish Venkataramani, Arvind Jayaraman, Jaya Shankar
GPUs and CUDA programming
[Chart: performance (vertical axis, "faster") against ease of programming (horizontal axis, "easier"; expressivity, algorithmic, …) for MATLAB, Python, CUDA, OpenCL, and C/C++]
GPUs are “hardware on steroids”, but programming them is hard
GPU Coder
Consider an example: saxpy
Scalarized MATLAB
Vectorized MATLAB
Automatic compilation from an expressive language to a high-performance language
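The slide's code listings are not preserved in this transcript. As a rough illustration only, saxpy computes y = a*x + y, and the Python sketch below writes it in an element-by-element (scalarized) style and in a whole-array (vectorized) style. The function names are hypothetical, and the comprehension merely stands in for MATLAB's array operators.

```python
def saxpy_scalarized(a, x, y):
    """Element-by-element form: one explicit loop iteration per index."""
    out = [0.0] * len(x)
    for k in range(len(x)):
        out[k] = a * x[k] + y[k]
    return out


def saxpy_vectorized(a, x, y):
    """Whole-array form: the operation is stated once for all elements."""
    return [a * xk + yk for xk, yk in zip(x, y)]
```

Both forms describe the same computation; every iteration is independent of the others, which is what makes saxpy a natural candidate for a GPU kernel.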
Deep Learning applications
TensorRT is great for inference, but applications require more than inference
TensorRT
Traffic Sign Recognition
Deep learning
GPU Coder is a new technology, released in September 2017
Accelerated implementation of parallel algorithms on GPUs
Neural Networks: deep learning, machine learning
Image Processing and Computer Vision: image filtering, feature detection/extraction
Signal Processing and Communications: FFT, filtering, cross correlation
5x faster than TensorFlow
2x faster than mxnet
60x faster than CPUs for stereo disparity
20x faster than CPUs for FFTs
Example: Lidar semantic segmentation
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
GPU Coder automatically extracts parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSOLVER, cuDNN, TensorRT)
Infer CUDA kernels from MATLAB loops
Vendor library replacement
From a loop to a CUDA kernel
Extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops) 2. Vectorized MATLAB 3. Composite functions

MATLAB loop:
for k = 1:n
    t = A(k) .* X(k);
    C(k) = t + Y(k);
end

Is this loop parallel? Yes
– Dependence analysis to understand the iteration space
Compute kernel size: f(n)
Classify kernel variables (input, output, local)
– Ins: A, X, Y, n
– Outs: C
– Local: t, k
Create kernel from loop body

Generated kernel (pseudocode):
static __global__ mykernel(A, X, Y, C, n)
{
    int k = getThreadIndex(n);
    int t = A[k] * X[k];
    C[k] = t + Y[k];
}

Kernel launch:
{
    …
    mykernel<<< f(n) >>>(A, X, Y, C, n);
    …
}
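The loop-to-kernel mapping can be mimicked on a CPU to see how the pieces fit together. The Python sketch below is a serial stand-in, not generated code: `BLOCK_SIZE`, `kernel_size` (playing the role of f(n)) and `launch` are hypothetical names, and the block size of 4 is chosen only to keep the example small.

```python
import math

BLOCK_SIZE = 4  # hypothetical threads-per-block value, chosen only for this sketch


def kernel_size(n):
    """Number of blocks needed so that blocks * BLOCK_SIZE covers n iterations."""
    return math.ceil(n / BLOCK_SIZE)


def mykernel(block, thread, A, X, Y, C, n):
    """Body of one simulated GPU thread: one loop iteration per thread."""
    k = block * BLOCK_SIZE + thread  # global thread index
    if k < n:                        # surplus threads in the last block do nothing
        t = A[k] * X[k]              # local variable
        C[k] = t + Y[k]              # output variable


def launch(A, X, Y, n):
    """Serial stand-in for a kernel launch: visit every (block, thread) pair."""
    C = [0] * n
    for block in range(kernel_size(n)):
        for thread in range(BLOCK_SIZE):
            mykernel(block, thread, A, X, Y, C, n)
    return C
```

The bounds check inside the kernel body matters whenever n is not a multiple of the block size, because the launch then creates more threads than there are loop iterations.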
From a loop-nesting to CUDA kernels
Perfect loops:
for i = 1:p
    for j = 1:m
        for k = 1:n
            …(inner loop)…
        end
    end
end

Imperfect loops:
for i = 1:p
    …(outer prologue code)…
    for j = 1:m
        for k = 1:n
            …(inner loop)…
        end
        …(outer epilogue code)…
    end
end
Fundamental requirement: Loops need to be contiguous for parallelization
From a loop-nesting to CUDA kernels
Imperfect loops:
for i = 1:p
    …(outer prologue code)…
    for j = 1:m
        for k = 1:n
            …(inner loop)…
        end
        …(outer epilogue code)…
    end
end

Loop perfectization (if possible)

Perfect loops:
for i = 1:p
    for j = 1:m
        for k = 1:n
            … (outer prologue code) …
            …(inner loop)…
            if k == n
                … (outer epilogue code) …
            end
        end
    end
end

Fundamental requirement: Loops need to be contiguous for parallelization
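To make loop perfectization concrete, the Python sketch below records the order in which prologue, inner-loop and epilogue code run in an imperfect nest and in a perfectized one. It is an illustration under assumptions: the function names are hypothetical, the prologue guard (`j == 0 and k == 0`) is added here so that the prologue still runs once per outer iteration, and the two versions only agree when m and n are at least 1.

```python
def imperfect(p, m, n):
    """Imperfect nest: prologue and epilogue sit between loop levels."""
    trace = []
    for i in range(p):
        trace.append(("prologue", i))
        for j in range(m):
            for k in range(n):
                trace.append(("inner", i, j, k))
            trace.append(("epilogue", i, j))
    return trace


def perfectized(p, m, n):
    """Perfect nest: all code lives in the innermost body, behind guards."""
    trace = []
    for i in range(p):
        for j in range(m):
            for k in range(n):
                if j == 0 and k == 0:      # first inner iteration for this i
                    trace.append(("prologue", i))
                trace.append(("inner", i, j, k))
                if k == n - 1:             # last k iteration for this (i, j)
                    trace.append(("epilogue", i, j))
    return trace
```

Once every statement sits in the innermost body, the three loops are contiguous and can be considered together as one iteration space.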
From a loop-nesting to CUDA kernels
Find parallel loops: dependence analysis
Partition loop nesting: heuristic may favor larger iteration space

Example 1 (candidate partitions: (M x N) or (K x P)):
for a = 1:M
    for b = 1:N
        …(outer prologue code)…
        for c = 1:K
            for d = 1:P
                …(inner loop)…
            end
        end
    end
end

Example 2 (candidate partitions: (P) or (MxN + KxL)):
for i = 1:P
    for a = 1:M
        for b = 1:N
            …(inner loop)…
        end
    end
    for x = 1:K
        for y = 1:L
            …(inner loop)…
        end
    end
end

Fundamental requirement: Loops need to be contiguous for parallelization
From a loop-nesting to CUDA kernels
Find parallel loops: dependence analysis
Partition loop nesting: heuristic may favor larger iteration space
Create kernel for each partition: use process from single loop conversion
Fundamental requirement: Loops need to be contiguous for parallelization
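The slides say only that the partitioning heuristic "may favor" the larger iteration space; the actual heuristic is not described. One simple reading, sketched below in Python with hypothetical names and made-up loop extents, scores each candidate partition by the total number of parallel iterations across its kernels and picks the largest.

```python
from math import prod


def iteration_space(partition):
    """Total parallel iterations of a partition.

    A partition is a list of kernels; each kernel is a tuple of loop extents.
    """
    return sum(prod(extents) for extents in partition)


def pick_partition(candidates):
    """Return the name of the candidate with the largest iteration space."""
    return max(candidates, key=lambda name: iteration_space(candidates[name]))


# Example 1: parallelize the outer pair (M x N) or the inner pair (K x P).
M, N, K, P = 4, 8, 64, 64
example1 = {"M x N": [(M, N)], "K x P": [(K, P)]}

# Example 2: parallelize the outer loop (P) or the two inner nests (MxN + KxL).
P2, M2, N2, K2, L2 = 16, 32, 32, 32, 32
example2 = {"P": [(P2,)], "MxN + KxL": [(M2, N2), (K2, L2)]}
```

With these extents, `pick_partition(example1)` selects the inner pair and `pick_partition(example2)` selects the two inner nests; other extents would tip the choice the other way.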
From vectorized MATLAB to CUDA kernels
output(:, 1) = (input(:, 1) - x_im) .* factor;

Assume the following sizes:
'output': M x 3
'input': M x 3
'x_im': M x 1
'factor': M x 1

Scalarization: reduce to for-loops
for i = 1:M
    diff(i) = input(i, 1) - x_im(i);
end
for i = 1:M
    output(i, 1) = diff(i) * factor(i);
end

Loop fusion: create larger parallel loops (and hence CUDA kernels)
for i = 1:M
    diff(i) = input(i, 1) - x_im(i);
    output(i, 1) = diff(i) * factor(i);
end
From vectorized MATLAB to CUDA kernels
output(:, 1) = (input(:, 1) - x_im) .* factor;

Scalarization: reduce to for-loops
Loop fusion: create larger parallel loops (and hence CUDA kernels)
Scalar replacement: reduce temp matrices to temp scalars
for i = 1:M
    tmp = input(i, 1) - x_im(i);
    output(i, 1) = tmp * factor(i);
end
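The three transforms can be checked against each other in a few lines. The Python sketch below mirrors the slide's stages with hypothetical function names; each function returns what would be the first column of `output`, and `inp` stands in for the M x 3 `input` matrix.

```python
def scalarized(inp, x_im, factor):
    """After scalarization: two loops and a temporary array."""
    M = len(x_im)
    diff = [0.0] * M
    for i in range(M):
        diff[i] = inp[i][0] - x_im[i]
    out = [0.0] * M
    for i in range(M):
        out[i] = diff[i] * factor[i]
    return out


def fused(inp, x_im, factor):
    """After loop fusion: one loop, still using the temporary array."""
    M = len(x_im)
    diff = [0.0] * M
    out = [0.0] * M
    for i in range(M):
        diff[i] = inp[i][0] - x_im[i]
        out[i] = diff[i] * factor[i]
    return out


def scalar_replaced(inp, x_im, factor):
    """After scalar replacement: the temporary array becomes a scalar."""
    M = len(x_im)
    out = [0.0] * M
    for i in range(M):
        tmp = inp[i][0] - x_im[i]
        out[i] = tmp * factor[i]
    return out
```

All three produce the same values; the final form has a single parallel loop and no M-element temporary, so it maps to one kernel with the temporary held in a per-thread local variable.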
From composite functions to optimized CUDA
Core math
• Matrix multiply (cuBLAS)
• Linear algebra (cuSOLVER)
• FFT functions (cuFFT)
• Convolution
• …
Image processing, Computer vision
• imfilter
• imresize
• imerode
• imdilate
• bwlabel
• imwarp
• …
Neural Networks
• Deep learning inference (cuDNN, TensorRT)
• …
Over 300 MATLAB functions are optimized for CUDA code generation
Optimizing CPU-GPU data movement is a challenge

MATLAB code:
A = …
…
for i = 1:N
    … A(i)
end
…
imfilter
…

Generated kernels:
A = …
…
kernel1<<<…>>>( )
…
imfilter_kernel(…)
…

With memcpy calls inserted:
A = …
…
cudaMemcpyHtoD(gA, A);
kernel1<<<…>>>(gA)
cudaMemcpyDtoH(…)
…
cudaMemcpyHtoD(…)
imfilter_kernel(…)
cudaMemcpyDtoH(…)
…

Where is the ideal placement of memcpy?
[Diagram: naïve placement versus optimized placement of memcpy around kernels K1, K2, K3]
GPU Coder optimizes memcpy placement

A(:) = ….
C(:) = ….
for i = 1:N
    ….
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (some_condition)
        gC = kernel3(gA, gB);
    end
    ….
end
…. = C;

Assume gA, gB and gC are mapped to GPU memory

Callouts on the code:
cudaMemcpy *definitely* needed
cudaMemcpy *not* needed
cudaMemcpy *may be* needed

Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track memory location with a status flag per variable
• Use-Def to determine where to insert memcpy

Generated (pseudo) code:
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
    if (A_isDirtyOnCpu)
        cudaMemcpy(gA, A);
        A_isDirtyOnCpu = false;
    end
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (some_condition)
        gC = kernel3(gA, gB);
        C_isDirtyOnGpu = true;
    end
    …
end
…
if (C_isDirtyOnGpu)
    cudaMemcpy(C, gC);
    C_isDirtyOnGpu = false;
end
… = C;
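The effect of the status flags can be seen by counting copies. The Python sketch below simulates the generated pseudocode without any GPU; `run` and `naive` are hypothetical names, and the naive baseline assumes one host-to-device and one device-to-host copy per kernel launch, which is an assumption made here for comparison rather than something stated on the slide.

```python
def run(n_iters, condition):
    """Simulate the flag-guarded loop and count the copies it performs."""
    copies = {"HtoD": 0, "DtoH": 0}
    a_dirty_on_cpu = True       # A(:) = ... was written on the CPU
    c_dirty_on_gpu = False

    for i in range(n_iters):
        if a_dirty_on_cpu:      # copy A only when the CPU copy is newer
            copies["HtoD"] += 1
            a_dirty_on_cpu = False
        # kernel1 and kernel2 read and write GPU-resident data: no copies
        if condition(i):
            c_dirty_on_gpu = True   # kernel3 wrote gC on the GPU

    if c_dirty_on_gpu:          # "... = C" reads C on the CPU
        copies["DtoH"] += 1
        c_dirty_on_gpu = False
    return copies


def naive(n_iters, condition):
    """Baseline: copy in before and copy out after every kernel launch."""
    launches = 0
    for i in range(n_iters):
        launches += 2           # kernel1 and kernel2
        if condition(i):
            launches += 1       # kernel3
    return {"HtoD": launches, "DtoH": launches}
```

For ten iterations with the condition true on every other one, the flag-guarded version performs one copy in each direction, while the baseline performs 25 each way. When the condition is never true, the copy of C back to the CPU is skipped entirely, which is the "may be needed" case.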
GPU memory hierarchy is deep
Memory | Visibility | Heuristics/notes | GPU Coder support
Global memory | CPU + GPU | Share data between CPU and GPU | Yes
Local memory/registers | per GPU thread | Thread local data | Yes
Shared memory | per GPU block | Shared data between threads | Yes
Texture memory | CPU + GPU | Shared read-only data with 2D alignment | Future
Constant memory | GPU | Read-only constants | Yes
GPU Coder automatically maps data to shared memory
[Figure: each pixel of the output image is the dot product of the kh x kw conv. kernel with a window of the rows x cols input image]
GPU Coder automatically creates shared memory for many MATLAB image processing functions: imfilter, imerode, imdilate, conv2, …
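The figure's dot-product view also suggests why shared memory helps: neighbouring output pixels read overlapping input windows. The Python sketch below is a plain illustration with hypothetical names, not GPU Coder's tiling scheme (which the talk does not spell out). `conv_valid` applies the kernel without flipping it and only where the window fits inside the image, and `loads_per_tile` compares input reads for one tile of outputs with and without staging the tile plus its halo once.

```python
def conv_valid(image, kernel):
    """Each output pixel is the dot product of the kernel with one window."""
    rows, cols = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - kh + 1):
        out_row = []
        for c in range(cols - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


def loads_per_tile(tile_h, tile_w, kh, kw):
    """Input reads for one tile of outputs: (no reuse, tile plus halo staged once)."""
    without_reuse = tile_h * tile_w * kh * kw
    with_staging = (tile_h + kh - 1) * (tile_w + kw - 1)
    return without_reuse, with_staging
```

For a 16 x 16 tile and a 3 x 3 kernel, reading every window separately touches 2304 input values, whereas staging the tile and its halo touches 324; the gap grows with the kernel size.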
GPU Coder runs a host of compiler transforms to generate CUDA
[Diagram: compiler pipeline]
Front-end
Control-flow graph intermediate representation (CFG-IR)
Traditional compiler optimizations
MATLAB library function mapping
Parallel loop creation
CUDA kernel creation
cudaMemcpy minimization
Shared memory mapping
CUDA code emission
Loop optimizations: scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement
CUDA kernel optimizations
Demo: Stereo disparity
Left camera Right camera
Disparity map
Stereo disparity
Easily re-target to Jetson and Drive platforms
Cross-compile for NVIDIA boards
– Jetson boards
– DrivePX2
Two small changes
1. Change build-type to ‘lib’
2. Select cross-compile toolchain
Deep learning workflow in MATLAB
[Diagram: DNN design + training, where "Train in MATLAB" or "Model importer" produces a Trained DNN]
Design in MATLAB
• Manage large image sets
• Automate data labeling
• Easy access to models
Training in MATLAB
• Acceleration with GPUs
• Scale to clusters
Deep learning workflow in MATLAB
[Diagram: DNN design + training ("Train in MATLAB" or "Model importer" produces a Trained DNN), followed by Application design (Application logic)]
Deep learning workflow in MATLAB
[Diagram: DNN design + training ("Train in MATLAB" or "Model importer" produces a Trained DNN), followed by Application design (Application logic), followed by Standalone Deployment (C++/CUDA + TensorRT, C++/CUDA + cuDNN)]
Traffic sign detection and recognition
Object detection DNN → Strongest bounding box → Classifier DNN
GPU Coder allows for choice in deployment (cuDNN, TensorRT)
Application logic
C++/CUDA + cuDNN
C++/CUDA + TensorRT
C++/CUDA + TensorRT (int8)
Performance summary (VGG-16) on TitanXP
[Bar chart (axis from 0 to 400) comparing MATLAB (cuDNN fp32), GPU Coder (cuDNN fp32), GPU Coder (TensorRT fp32), and GPU Coder (TensorRT int8)]
MATLAB and GPU Coder support state-of-the-art classification networks
GoogLeNet, ResNet50
Alexnet vs Squeezenet
Network | # Layers | Size | Frame-rate (GPU Coder)
Alexnet | 25 | 227 MB | 500 Fps
ResNet50 | 177 | 96 MB | 160 Fps
GoogLeNet | 144 | 27 MB | 190 Fps
Squeezenet | 68 | 5 MB | 615 Fps
Alexnet Inference on NVIDIA Titan Xp
[Chart: frames per second versus batch size for MATLAB GPU Coder + TensorRT 3.0.1 (int8), MATLAB GPU Coder + TensorRT 3.0.1, MATLAB GPU Coder + cuDNN, mxNet (1.1.0), and TensorFlow (1.6.0)]
Testing platform
CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
GPU: Pascal Titan Xp
cuDNN: v7
VGG-16 Inference on NVIDIA Titan Xp
[Chart: frames per second versus batch size for MATLAB GPU Coder + TensorRT 3.0.1 (int8), MATLAB GPU Coder + TensorRT 3.0.1, MATLAB GPU Coder + cuDNN, mxNet (1.1.0), and TensorFlow (1.6.0)]
Testing platform
CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
GPU: Pascal Titan Xp
cuDNN: v7
LiDAR Processing for Autonomous Vehicles
Design deep learning-based LiDAR algorithm in MATLAB
• Automate ground-truth labeling
• Pre-process LiDAR data for training
• GPU accelerated training
High Performance Inference
Why Use LiDAR and Deep Learning for Autonomous Vehicles?
Why use LiDAR?
– Provides accurate 3-D structure of scene
– Required sensor for autonomous driving
Why use deep learning?
– Best accuracy
– High-performance inference with GPU Coder
LiDAR Semantic Segmentation (labels: Cars, Ground)
Classifying Point Cloud Clusters
What Does LiDAR Data Look Like?
Lidar processing application design is easy in MATLAB
[Diagram: DNN design + training (Data prep, labeling; Training) produces a Trained DNN, followed by Application design (Application logic), followed by Standalone Deployment (C++/CUDA + TensorRT, C++/CUDA + cuDNN)]
Data preparation and labeling of Lidar is a challenge
DNN design + training
Accessing lidar data
Lidar pre-processing
Labeling lidar data
Training
Trained DNN
Access and Visualize LiDAR Data
Access Stored LiDAR Data
• Velodyne file I/O (pcap)
• Individual point clouds (.pcd, .ply)
Visualize LiDAR Data
• Streaming LiDAR player
• Static point cloud display
• Point cloud differences
LiDAR Preprocessing
ROI + Remove Ground
• Fit plane using RANSAC
Cluster
• Segment clusters using Euclidean distance
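The talk names the two preprocessing techniques but does not show an implementation. The Python sketch below is a minimal, standard-library illustration of both ideas, with hypothetical function names and default parameters: `fit_plane_ransac` repeatedly fits a plane through three random points and keeps the plane with the most points within a distance threshold, and `euclidean_clusters` grows clusters by linking points that lie within a given radius of each other.

```python
import random
from collections import deque


def fit_plane_ransac(points, threshold=0.1, iterations=200, seed=0):
    """Return indices of the points within `threshold` of the best plane found."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        p, q, r = rng.sample(points, 3)
        u = [q[i] - p[i] for i in range(3)]
        v = [r[i] - p[i] for i in range(3)]
        # Plane normal is the cross product (q - p) x (r - p).
        normal = [u[1] * v[2] - u[2] * v[1],
                  u[2] * v[0] - u[0] * v[2],
                  u[0] * v[1] - u[1] * v[0]]
        length = sum(c * c for c in normal) ** 0.5
        if length < 1e-12:          # collinear sample: no unique plane
            continue
        inliers = [
            idx for idx, pt in enumerate(points)
            if abs(sum(normal[i] * (pt[i] - p[i]) for i in range(3))) / length
            <= threshold
        ]
        if len(inliers) > len(best):
            best = inliers
    return best


def euclidean_clusters(points, radius):
    """Group point indices; each point is within `radius` of another in its cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        queue = deque([start])
        cluster = [start]
        while queue:
            cur = queue.popleft()
            near = [j for j in unvisited
                    if sum((points[cur][i] - points[j][i]) ** 2 for i in range(3))
                    <= radius ** 2]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                cluster.append(j)
        clusters.append(sorted(cluster))
    return clusters
```

A typical use would be to call `fit_plane_ransac` on the region of interest, drop the returned inliers as ground, and pass the remaining points to `euclidean_clusters`. The neighbour search here is quadratic in the number of points, which is fine for a sketch but not for full LiDAR frames.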
Ground Truth Labeling of LiDAR Data
Organize Data for Training
Raw Point Cloud Data
Ground Truth Labels Transformed to Label Mask
Project to 2D
Create Network Architecture
Easy API to create network structure
Deployment using GPU Coder
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Check Out Deep Learning in MATLAB and GPU Coder
GPU Coder: https://www.mathworks.com/products/gpu-coder.html
Deep learning in MATLAB: https://www.mathworks.com/solutions/deep-learning.html
Deep learning On-Ramp (free, self-paced, online training): https://matlabacademy.mathworks.com/R2017b/portal.html?course=deeplearning