cnn inference with cudnn - nvidia...2 agenda introduction to cudnn cudnn best practices: memory...
TRANSCRIPT
![Page 1: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/1.jpg)
Chris Hebert, Sven Middelberg, March 21, 2019
CNN INFERENCE WITH cuDNN
![Page 2: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/2.jpg)
2
AGENDA
Introduction to cuDNN
cuDNN Best Practices:
Memory Management Done Right
Choosing the Right Convolution Algorithm & Tensor Layout
Tensor Cores: Low Precision Inference at Speed of Light
From Research to Production: It just works ... or not?!
Summary
![Page 3: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/3.jpg)
3
cuDNN ?
cuDNN Provides a set of common network operations
Convolution
Activation
Tensor Ops – Add, multiply etc
Highly optimized for respective HW architectures
cuDNN is the backend for most DL frameworks that target NVIDIA Hardware
![Page 4: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/4.jpg)
4
cudnnHandle_t ctx;cudnnCreate(&ctx);
cudnnTensorDescriptor_t in_desc; cudnnCreateTensorDescriptor(&in_desc);cudnnSetTensorDescriptor(in_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1920, 1080, 4);// Same for out_desc
cudnnFilterDescriptor_t filter_desc;cudnnCreateFilterDescriptor(&filter_desc);cudnnSetFilter4dDescriptor(filter_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 8, 4, 3, 3);
cudnnConvolutionDescriptor_t conv_desc;cudnnCreateConvolutionDescriptor(&conv_desc);cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 2, 2, 1, 1, CUDNN_CONVOLUTION, CUDNN_DATA_FLOAT);cudnnConvolutionFwdAlgo_t conv_alg = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
size_t workspace_size; void* workspace;cudnnGetConvolutionForwardWorkspaceSize(ctx, in_desc, filter_desc, conv_desc, out_desc, conv_alg, &workspace_size);cudaMalloc(&workspace, workspace_size);
// Allocate and initialize device data ...
float alpha = 1.0f; float beta = 0.0f;cudnnStatus_t status = cudnnConvolutionForward(ctx, &alpha, in_desc, in_dev, filter_desc, weights_dev, conv_desc, conv_alg, workspace, workspace_size,
&beta, out_desc, out_dev);
Context Management Input and Output Tensors Convolution Filter Convolution Descriptor & Algorithm Convolution Workspace Convolution Computation
HELLO CONVOLUTION
![Page 5: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/5.jpg)
5
FULLY-CONNECTED LAYERS
cuDNN has no native support for fully-connected layers
Forward pass of fully-connected layers is basically a matrix vector multiply
cuBLAS
a0
a2
a1
b0
b1
w2
w0
w1
v2
v1
v0
𝐯𝐯𝟎𝟎 𝐯𝐯𝟏𝟏 𝐯𝐯𝟐𝟐𝐰𝐰𝟎𝟎 𝐰𝐰𝟏𝟏 𝐰𝐰𝟐𝟐
∗𝐚𝐚𝟎𝟎𝐚𝐚𝟏𝟏𝐚𝐚𝟐𝟐
= 𝐛𝐛𝟎𝟎𝐛𝐛𝟏𝟏
![Page 6: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/6.jpg)
6
cuDNN PROFILING
cudaEvent_t start, stop;cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);// Do something on GPUcudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;cudaEventElapsedTime(&ms, start, stop);
Profiling Tools
NVIDIA Nsight Systems / NVVP
nvprof
NVIDIA Nsight ComputeCustom CUDA kernel profiling
GPU timing using CUDA events:
![Page 7: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/7.jpg)
7
TensorRT cuDNN
TensorRT & cuDNN
![Page 8: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/8.jpg)
8
TensorRT cuDNN
High: Uses cuDNN internally Level of abstraction Low: Basic DL primitives
Low: Only custom layers must be implemented manually
Programming effort High: Network must be assembled manually
TensorRT & cuDNN
![Page 9: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/9.jpg)
9
TensorRT cuDNN
High: Uses cuDNN internally Level of abstraction Low: Basic DL primitives
Low: Only custom layers must be implemented manually
Programming effort High: Network must be assembled manually
High: Allows to import models from many training frameworks
Compatibility supportfor other APIs
None: Models must be imported manually
TensorRT & cuDNN
![Page 10: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/10.jpg)
10
TensorRT cuDNN
High: Uses cuDNN internally Level of abstraction Low: Basic DL primitives
Low: Only custom layers must be implemented manually
Programming effort High: Network must be assembled manually
High: Allows to import models from many training frameworks
Compatibility supportfor other APIs
None: Models must be imported manually
Low: Fully-automatic inference optimization
Optimization effort High: Needs extensive profiling and optimization
TensorRT & cuDNN
![Page 11: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/11.jpg)
11
TensorRT cuDNN
High: Uses cuDNN internally Level of abstraction Low: Basic DL primitives
Low: Only custom layers must be implemented manually
Programming effort High: Network must be assembled manually
High: Allows to import models from many training frameworks
Compatibility supportfor other APIs
None: Models must be imported manually
Low: Fully-automatic inference optimization
Optimization effort High: Needs extensive profiling and optimization
Good Performance Potentially better(depending on effort)
TensorRT & cuDNN
![Page 12: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/12.jpg)
12
TensorRT cuDNN
High: Uses cuDNN internally Level of abstraction Low: Basic DL primitives
Low: Only custom layers must be implemented manually
Programming effort High: Network must be assembled manually
High: Allows to import models from many training frameworks
Compatibility supportfor other APIs
None: Models must be imported manually
Low: Fully-automatic inference optimization
Optimization effort High: Needs extensive profiling and optimization
Good Performance Potentially better(depending on effort)
Low: Re-optimization needed for any change in architecture or input dimensions
Mutability High: Usually requires very few optimizations after reasonable changes in input / architecture
TensorRT & cuDNN
![Page 13: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/13.jpg)
13
AGENDA
Introduction to cuDNN
cuDNN Best Practices:
Memory Management Done Right
Choosing the Right Convolution Algorithm & Tensor Layout
Tensor Cores: Low Precision Inference at Speed of Light
From Research to Production: It just works ... or not?!
Summary
![Page 14: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/14.jpg)
14
AUTOENCODER IN CAFFEDeep Image Matting, Chris Hebert, GTC‘18
160ms
![Page 15: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/15.jpg)
15
AUTOENCODER IN CAFFEDeep Image Matting, Chris Hebert, GTC‘18
160ms
![Page 16: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/16.jpg)
16
CUDNN DEVICE MEMORY MANAGEMENT
Don‘ts
Be wasteful with allocations
Interleaving allocations with kernel execution
Memcpy layer weights on-the-fly,one H2D-copy per layer per forward pass
Copy synchroneously from pageable host memory
Dos
Maximize reuse
Allocate once at startup, use every forward pass
Initialize layer weights once, copy only the input and output tensors per forward pass
Copy asynchroneously from page-locked host memory, maximize overlap and resource utilization
Minimize Footprint - Maximize Performance
![Page 17: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/17.jpg)
17
INITIALIZING WEIGHTS AT STARTUPDeep Image Matting, Chris Hebert, GTC‘18
Overall weights data transferred for each inference is ~89 MB ⇒ 8.7 ms @10Gb/s
Allocate and initialize once at startup, reuse each inference!
25 MB 12.5 MB9 MB
![Page 18: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/18.jpg)
18
Combined size of all output tensors is 141.8 MB
Only two tensors used at a time: input & output tensor
Allocate two times the maximum tensor size: 50 MB
TENSOR DEVICE MEMORYDeep Image Matting, Chris Hebert, GTC‘18
0
5
10
15
20
25
30
Tens
or S
ize
[MB]
Tensor Size [MB]
A 25mb
B 25mb
![Page 19: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/19.jpg)
19
MINIMIZING MEMORY FOOTPRINT“Ping-Pong” Tensor Memory
A 25mb
B 25mb
Memory Pool2x Largest Tensor
Conv1_1
Conv1_2
Conv1_3
Pool1
A
B
A
BDoesn‘t work for cached tensors, e.g., skip links!
![Page 20: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/20.jpg)
20
MINIMIZING MEMORY FOOTPRINT
Size of a convolution workspace varies, depending on multiple parameters:
• input and output tensor dimensions
• Precisions
• Convolution algorithm
• ...
But: workspace can be shared among layers
Allocate maximum workspace size!
Workspace Memory
![Page 21: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/21.jpg)
21
OPTIMIZING RESOURCE UTILIZATION
Use page-locked host memory for H2D and D2H memcpys:
cudaMallocHost(…) instead of malloc(…)
Use asynchronous memcpys:
cudaMemcpyAsync(…, stream) instead of cudaMemcpy(…)
Use multiple CUDA streams to overlap memcpys and kernel executions:
Maximum Inference Throughput by Asynchronous Copies
0 1 2 3 4 5 6 70 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
H2D copy of input tensor Forward pass kernels
D2H copy of output tensor
![Page 22: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/22.jpg)
22
AUTOENCODER IN CAFFEDeep Image Matting, Chris Hebert, GTC‘18
160ms
![Page 23: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/23.jpg)
23
AUTOENCODER IN cuDNNDeep Image Matting, Chris Hebert, GTC‘18
17msStart-up time (one time overhead) 41ms
![Page 24: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/24.jpg)
24
AGENDA
Introduction to cuDNN
cuDNN Best Practices:
Memory Management Done Right
Choosing the Right Convolution Algorithm & Tensor Layout
Tensor Cores: Low Precision Inference at Speed of Light
From Research to Production: It just works ... or not?!
Summary
![Page 25: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/25.jpg)
25
CONVOLUTION ALGORITHMS128x128x128x128 convolution, FP32, NCHW, Quadro GV100
CUDNN_CONVOLUTION_FWD_ALGO_ …3 x 3 11 x 11
Performance Workspace Performance Workspace
CUDNN_CONVOLUTION_FWD_ALGO_GEMM
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
CUDNN_CONVOLUTION_FWD_ALGO_FFT
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
![Page 26: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/26.jpg)
26
CONVOLUTION ALGORITHMS128x128x128x128 convolution, FP32, NCHW, Quadro GV100
CUDNN_CONVOLUTION_FWD_ALGO_ …3 x 3 11 x 11
Performance Workspace Performance Workspace
CUDNN_CONVOLUTION_FWD_ALGO_GEMM 0.76 ms 72 MB
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM 0.62 ms None
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM 0.47 ms 0.01 MB
CUDNN_CONVOLUTION_FWD_ALGO_FFT 45.3 ms 8322 MB
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING 3.69 ms 70 MB
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD 0.26 ms 1.56 MB
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED 2.73 ms 578 MB
![Page 27: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/27.jpg)
27
CONVOLUTION ALGORITHMS128x128x128x128 convolution, FP32, NCHW, Quadro GV100
CUDNN_CONVOLUTION_FWD_ALGO_ …3 x 3 11 x 11
Performance Workspace Performance Workspace
CUDNN_CONVOLUTION_FWD_ALGO_GEMM 0.76 ms 72 MB 8.47 ms 968 MB
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM 0.62 ms None 6.82 ms None
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM 0.47 ms 0.01 MB 6.58 ms 0.01 MB
CUDNN_CONVOLUTION_FWD_ALGO_FFT 45.3 ms 8322 MB 44.7 ms 8328 MB
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING 3.69 ms 70 MB 5.13 ms 70 MB
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD 0.26 ms 1.56 MB Unsupported
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED 2.73 ms 578 MB Unsupported
![Page 28: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/28.jpg)
28
CONVOLUTION ALGORITHMS
Choice depends on memory and performance requirements
Some algorithms not supported for certain convolution configurations
Chosing the most suitable and supported algorithm, layer by layer, is a tedious job
cudnnGetConvolutionForwardAlgorithm_v7(...)
Based on heuristics: List of algorithms sorted by expected runtime
cudnnFindConvolutionForwardAlgorithm(...)
More accurate results based on exhaustive experiments
![Page 29: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/29.jpg)
29
TENSOR LAYOUTNCHW vs NHWC
Input Tensor Size Output Tensor Size Filter Size NCHW NHWC
32 x 32 x 64 16 x 16 x 128 3 x 3 0.05 ms 0.06 ms
128 x 128 x 128 128 x 128 x 128 3 x 3 0.26 ms 0.50 ms
512 x 512 x 32 256 x 256 x 64 5 x 5 0.56 ms 1.06 ms
1920 x 1080 x 3 1920 x 1080 x 32 5 x 5 0.97 ms 1.36 ms
16 x 16 x 128 8 x 8 x 256 7 x 7 0.40 ms 0.41 ms
1920 x 1080 x 4 960 x 540 x 32 9 x 9 1.22 ms 1.34 ms
128 x 128 x 128 128 x 128 x 128 11 x 11 5.13 ms 5.72 ms
Convolution algorithm selected using cudnnFindConvolutionForwardAlgorithm(...)
![Page 30: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/30.jpg)
30
TENSOR LAYOUT
Usually NCHW is faster than NHWC for tensor data
Might be reasonable to pre-convert NHWC input tensor to NCHW and back after the inference to achieve optimal throughput
⇒ Profile!
One exception ...
![Page 31: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/31.jpg)
31
AGENDA
Introduction to cuDNN
cuDNN Best Practices:
Memory Management Done Right
Choosing the Right Convolution Algorithm & Tensor Layout
Tensor Cores: Low Precision Inference at Speed of Light
From Research to Production: It just works ... or not?!
Summary
![Page 32: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/32.jpg)
32
LOW PRECISION INFERENCE
In most cases FP16 / half provides more than adequate precision for image processing
Volta and Turing have hardware for FAST fp16 – TRUE_HALF_CONFIG
On Pascal and below, store in fp16 but process in fp32 – PSEUDO_HALF_CONFIG
Given an FP32 model, simply converting the weights to FP16 often retains decent quality
For best results ⇒ Retrain with FP16 precision
![Page 33: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/33.jpg)
33
LOW PRECISION INFERENCEDeep Image Matting, Chris Hebert, GTC‘18
0
0.5
1
1.5
2
320x320 640x640 960x960 1280x1280
Spee
dup
Input SizeFP32 FP16
![Page 34: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/34.jpg)
34
LOW PRECISION INFERENCEDeep Image Matting, Chris Hebert, GTC‘18
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%
320x320 640x640 960x960 1280x1280
Mem
ory
Dem
ands
Input SizeFP32 FP16
![Page 35: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/35.jpg)
35
TENSOR CORES ON VOLTA & TURING
Tensor Cores perform FP16 matrix multiply accumulate (HMMA)
Turing also supports INT8 and INT4
Only two algorithms supported:
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
Number of Input and output channels must be multiple of eight!
Convolution math type must be set to CUDNN_TENSOR_OP_MATH
And it will be much MUCH faster!
![Page 36: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/36.jpg)
36
LOW PRECISION INFERENCEDeep Image Matting, Chris Hebert, GTC‘18
0
1
2
3
4
5
320x320 640x640 960x960 1280x1280
Spee
dup
Input SizeFP32 FP16 FP16 + TC
![Page 37: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/37.jpg)
37
LOW PRECISION INFERENCEDeep Image Matting, Chris Hebert, GTC‘18
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%
320x320 640x640 960x960 1280x1280
Mem
ory
Dem
ands
Input SizeFP32 FP16 FP16 + TC
![Page 38: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/38.jpg)
38
TENSOR CORES ON VOLTA AND TURINGNCHW vs NHWC
Input Tensor Size Output Tensor Size Filter Size NCHW NHWC
32 x 32 x 64 16 x 16 x 128 3 x 3 0.05 ms 0.04 ms
128 x 128 x 128 128 x 128 x 128 3 x 3 0.11 ms 0.08 ms
512 x 512 x 32 256 x 256 x 64 5 x 5 0.25 ms 0.15 ms
1920 x 1080 x 8 1920 x 1080 x 32 5 x 5 3.00 ms 2.31 ms
16 x 16 x 128 8 x 8 x 256 7 x 7 0.26 ms 0.23 ms
128 x 128 x 128 128 x 128 x 128 7 x 7 0.37 ms 0.34 ms
800 x 800 x 8 400 x 400 x 8 9 x 9 1.20 ms 2.53 ms
Convolution algorithm selected using cudnnFindConvolutionForwardAlgorithm(...)
![Page 39: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/39.jpg)
39
CHANNEL PADDING
Some tensors might don‘t have a channel count that is a multiple of eight, e.g., three-channel RGB input tensors
Cannot use Tensor Core acceleration
512x512x3 → 512x512x32 convolution, 3x3 kernel, FP16, NHWC, GV100: 0.84 ms
512x512x8 → 512x512x32 convolution, 3x3 kernel, FP16, NHWC, GV100: 0.18 ms
Pad input tensor, zero-pad filter weights
![Page 40: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/40.jpg)
40
Tensor Core convolution performance varies with channel count
Both of these 3x3 convolutions uses input and output tensors that have the same size:
2048 x 2048 x 8 → 2048 x 2048 x 8 ⇒ 2.01 ms 24 MB
1024 x 1024 x 32 → 1024 x 1024 x 32 ⇒ 0.6 ms 6 MB
Fold 2x2xN slices into 1x1x4N slices
Increases receptive field
Requires re-training
CHANNEL FOLDING
![Page 41: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/41.jpg)
41
AGENDA
Introduction to cuDNN
cuDNN Best Practices:
Memory Management Done Right
Choosing the Right Convolution Algorithm & Tensor Layout
Tensor Cores: Low Precision Inference at Speed of Light
From Research to Production: It just works ... or not?!
Summary
![Page 42: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/42.jpg)
42
Research To ProductionWhat do we need?
(e.g.) TensorFlow, PyTorch
Network Definition
Network Parameters
![Page 43: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/43.jpg)
43
Research To ProductionWhat do we need?
(e.g.) TensorFlow, PyTorch
Network Definition
Network Parameters
A few ways to do this.
![Page 44: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/44.jpg)
44
Research To Production
Most checkpoint/model file formats based on Protocol Buffers from Google
Check em out, they’re awesome.
Message format defined in the .proto file
Compiled with the Protocol Buffer Compiler
Manipulate the contents with the Protocol Buffer API
Good tutorials for this at
https://developers.google.com/protocol-buffers/docs/cpptutorial
C++ Parse The Protocol Buffers
![Page 45: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/45.jpg)
45
Research To Production
PyTorch example (simplified)
Load the model in Python, open output file for writing
Write the params and arch straight from Python
……head_0_spade_0_mlp_shared_0,Weights, 176198144,2342400, 128,183,5,5head_0_spade_0_mlp_shared_0,biases, 176197632,512, 1,1,1,128……
Tensor Name Offset,Size Shape
![Page 46: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/46.jpg)
46
Research To Production
PyTorch example (simplified)
Load the model in Python, open output file for writing
Write the model architecture straight from Python
input_file_path = <Path to Pytorch checkpoint>
ckpnt = torch.load(input_file_path,map_location="cpu")
out_path = “netG_params.txt".format(item_key)
with open(out_path,"w") as f:
![Page 47: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/47.jpg)
47
Research To Production
PyTorch example (simplified)
Iterate the model, find the weights and biases
Write the model architecture straight from Python
input_file_path = <Path to Pytorch checkpoint>
ckpnt = torch.load(input_file_path,map_location="cpu")
out_path = “netG_params.txt".format(item_key)
with open(out_path,"w") as f:
for model_key in ckpnt :if model_key.find("weight") > 0 or model_key.find("bias") > 0:
cur_var = Variable(ckpnt[model_key])var_size = cur_var.size() size_len = len(var_size)
![Page 48: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/48.jpg)
48
Research To Production
PyTorch example (simplified)
Replace ‘.’ with ‘_’ (personal preference)
Write the model architecture straight from Python
tensor_name = model_key.replace(".","_")tensor_type = “”if model_key.find("weight") > 0:
tensor_type = "Weights"tensor_name = tensor_name.replace("_weight","")
if model_key.find("bias") > 0:tensor_type = "biases"tensor_name = tensor_name.replace("_bias","")
tensor_dims = ""tensor_total_size = 1
![Page 49: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/49.jpg)
49
Research To Production
PyTorch example (simplified)
Record the tensor shape in a consistent manner
Write the model architecture straight from Python
if size_len < 4:size_len_delta = 4-size_lenfor s in range(size_len_delta):
tensor_dims += ",1"
for s in range(size_len):tensor_dims += ",{}".format(var_size[s])tensor_total_size *= var_size[s]
tensor_total_file_size = tensor_total_size * 4
tensor_size_data = ",{},{}".format(tensor_offset,tensor_total_file_size)
![Page 50: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/50.jpg)
50
Research To Production
PyTorch example (simplified)
Write the name, offset, size and shape to the text file.
Write the model architecture straight from Python
tensor_total_file_size = tensor_total_size * 4
tensor_size_data = ",{},{}".format(tensor_offset,tensor_total_file_size)tensor_offset += tensor_total_file_sizetensor_name += ",{}".format(tensor_type)f.write(tensor_name)f.write(tensor_size_data)f.write(tensor_dims)f.write("\n")
![Page 51: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/51.jpg)
51
Research To Production
PyTorch example (simplified)
Load the model in Python, open output file for writing
Write the params and arch straight from Python
……head_0_spade_0_mlp_shared_0,Weights, 176198144,2342400, 128,183,5,5head_0_spade_0_mlp_shared_0,biases, 176197632,512, 1,1,1,128……
Tensor Name Offset,Size Shape
![Page 52: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/52.jpg)
52
Research To Production
PyTorch example (simplified)
Load the model in Python, open output file for writing
Write the params and arch straight from Python
……head_0_spade_0_mlp_shared_0,Weights, 176198144,2342400, 128,183,5,5head_0_spade_0_mlp_shared_0,biases, 176197632,512, 1,1,1,128……
Tensor Name Offset,Size Shape
out,in,filter H/W
![Page 53: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/53.jpg)
53
Research To Production
PyTorch example (simplified)
Similar loop as before but extract the weights from the .data member and write to a single file
Write the model params straight from Python
(iterate model as before)
data = ckpt[model_key]cur_var = Variable(data)var_size = cur_var.size()np_data = data.cpu().numpy()f.write(np_data.tobytes())
![Page 54: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/54.jpg)
54
Research To Production
Scientists express their models in an algebraically correct manner.
That’s their job.
But algebraically correct does not necessarily mean performant.
Engineers need to identify when an algorithm can be restructured for performance.
That’s our job.
Scientist vs Engineer
![Page 55: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/55.jpg)
55
Research To Production
Example 1. Wx+b when you ONLY want the bias.
Scientist vs Engineer
![Page 56: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/56.jpg)
56
Research To Production
Example 1. Wx+b when you ONLY want the bias.
Scientist vs Engineer
![Page 57: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/57.jpg)
57
Research To Production
Example 1. Wx+b when you ONLY want the bias.
In this particular case:
Write custom kernel to write the bias values.
And if possible fuse with previous and/or next step.
Scientist vs Engineer
![Page 58: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/58.jpg)
58
Research To Production
Example 2. Downsample called many times on the same data.
Scientist vs Engineer
Layer Block (e.g. Resnet)
Sub Layer 0 – (e.g. Normalization)
Sub Layer 1 –
Sub Layer S – (ResNet Shortcut)
Sub Layer
Downsample original Input
Do Layer Specific Stuff
![Page 59: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/59.jpg)
59
Research To Production
Example 2. Downsample called many times on the same data.
Scientist vs Engineer
Layer Block (e.g. Resnet)
Sub Layer 0 – (e.g. Normalization)
Sub Layer 1 –
Sub Layer S – (ResNet Shortcut)
Sub Layer
Downsample original Input
Do Layer Specific Stuff
This is the same operation on the
same data 3 times
![Page 60: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/60.jpg)
60
Research To Production
Example 2. Re order the operations….
Scientist vs Engineer
Layer Block (e.g. Resnet)
Sub Layer 0 – (e.g. Normalization)
Sub Layer 1 –
Sub Layer S – (ResNet Shortcut)
Sub Layer
Do Layer Specific Stuff
This is the same operation on the
same data 3 times
Downsample original Input
![Page 61: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/61.jpg)
61
Research To ProductionPorting to TensorRT
TensorRT
ONNX UFF C++ API
TensorFlow PyTorch Caffe etc.
3rd Party Utils Trt Python tools
Weights and Architecture
A bit of elbow grease
![Page 62: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/62.jpg)
62
Research To Production
• Most common architectures will import directly from TensorFlow/PyTorch etc
• Most common operations are already supported in TensorRT
• Convolution/Cross Correlation
• Activation
• Sigmoid, Relu, Clipped Relu, TanH, ELU
• Batch Norm
• Spatial, Spatial_persistent, Per Activation
• Pooling
• Max, Average
UFF, ONNX or API …. Which to use….
![Page 63: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/63.jpg)
63
Research To Production
• Sometimes it’s not that easy
• Sometimes some graph surgery is required.
• Edit the graph to strip out e.g. pre/post processing at either end of the graph
• TensorRT provides a plug-in interface for custom layers
• Name custom layers as per the incoming model (e.g. LeakyRelu)
• From TrT 5.1 : The IPlugInV2 interface supports optimization.
• There is a simpler option
UFF, ONNX or API …. Which to use….
![Page 64: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/64.jpg)
64
Research To Production
• Converts TensorFlow graph into 1 or more TensorRT ‘blocks’
• Adds these blocks back onto TensorFlow graph
• Inference of these blocks performed with TensorRT
• The rest use TensorFlow
• Workflow:
• Load TensorFlow graph
• Prepare for inference (freeze layers, convert variables to constants etc)
• Call trt.create_inference_graph(input_graph_def, outputs, max_batch_size,max_workspace_size,precision_mode)
Porting to TensorRT Using TfTrt
![Page 65: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/65.jpg)
65
Research To ProductionPorting to TensorRT Using TfTrt
TensorRT friendly
Custom Stuff
TensorRT friendly
TensorRT friendly
Custom Stuff
TensorFlow
TfTrt
TensorRT block
Custom Stuff
TensorRT block
TensorRT block
Custom Stuff
Still TensorFlow
TensorRT ‘blocks’ optimized and added back
onto the TensorFlow Graph
![Page 66: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/66.jpg)
66
Research To Production
Important takeaways from this
You don’t need generate a single monolithic graph with TensorRT
Generate graph snippets from TensorRT interleaved with custom CUDA
You can do this with the TensorRT API
Run them in whatever sequence you need at run time.
Allows you to create inference solutions with dynamic runtime behavior.
Keep all data on the GPU whenever possible.
Porting ‘Funky’ networks to TensorRT
![Page 67: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/67.jpg)
67
Research To Production
Here’s a debugging tip…
When it doesn’t ……. Just work.
Original Pytorch/TfWorking
TensorRT or cuDNNNot working
![Page 68: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/68.jpg)
68
Research To Production
Here’s a debugging tip…
When it doesn’t ……. Just work.
Original Pytorch/TfWorking
TensorRT or cuDNNNot working
Dump Layer Outputs to Disk
![Page 69: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/69.jpg)
69
Research To Production
Here’s a debugging tip…
When it doesn’t ……. Just work.
Original Pytorch/TfWorking
TensorRT or cuDNNNot working
Compare tensors with Python script.
Narrow down the problem to here.
![Page 70: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/70.jpg)
70
AGENDA
Introduction to cuDNN
cuDNN Best Practices:
Memory Management Done Right
Choosing the Right Convolution Algorithm & Tensor Layout
Tensor Cores: Low Precision Inference at Speed of Light
From Research to Production: It just works ... or not?!
Summary
![Page 71: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/71.jpg)
71
SUMMARY
Common DL frameworks often far from optimized for inference on GPUs
Use cuDNN (or TensorRT) if you care about performance & memory!
Memory Management matters!
Lower your precision if possible!
Use hardware-specific optimizations, e.g. Tensor Cores on Volta & Turing!
You can never profile too much!
Download, documentation, discussion board, …
www.developer.nvidia.com/cudnn
![Page 72: CNN INFERENCE WITH cuDNN - Nvidia...2 AGENDA Introduction to cuDNN cuDNN Best Practices: Memory Management Done Right Choosing the Right Convolution Algorithm & Tensor Layout Tensor](https://reader034.vdocuments.net/reader034/viewer/2022042710/5f5cd76ad87b743c7b57f576/html5/thumbnails/72.jpg)