
Slide 1: Deploying Deep Neural Networks to Embedded GPUs and CPUs
Abhijit Bhattacharjee
© 2015 The MathWorks, Inc.

Slide 2: Deep Learning Workflow in MATLAB
Deep Neural Network Design + Training → Trained DNN → Application Design (application logic) → Standalone Deployment

Slide 3: Deep Neural Network Design and Training
▪ Design in MATLAB
  – Manage large data sets
  – Automate data labeling
  – Easy access to models
▪ Training in MATLAB
  – Acceleration with GPUs
  – Scale to clusters
Three paths lead to a trained DNN: train in MATLAB, bring a model in through the model importer, or start from a reference model and apply transfer learning (a hedged sketch of the transfer-learning path follows).
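A minimal transfer-learning sketch, not taken from the slides: it assumes Deep Learning Toolbox, a folder of labeled images, and an illustrative class count of 5. The layer names 'fc1000' and 'ClassificationLayer_fc1000' are ResNet-50's published names; the data path and everything else are placeholders.

    net    = resnet50;                        % reference model
    lgraph = layerGraph(net);
    % Swap the classifier head for the new task (5 example classes)
    lgraph = replaceLayer(lgraph, 'fc1000', ...
                 fullyConnectedLayer(5, 'Name', 'fc_new'));
    lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', ...
                 classificationLayer('Name', 'out'));
    % Labeled images organized in per-class subfolders (path is illustrative)
    imds   = imageDatastore('data', 'IncludeSubfolders', true, ...
                            'LabelSource', 'foldernames');
    auimds = augmentedImageDatastore([224 224], imds);  % resize for ResNet-50
    opts   = trainingOptions('sgdm', 'MaxEpochs', 3, ...
                             'ExecutionEnvironment', 'gpu');  % GPU acceleration
    trainedDNN = trainNetwork(auimds, lgraph, opts);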

Slide 4: Application Design
The application logic wraps the trained DNN with pre-processing and post-processing stages.

Slides 5–6: Multi-Platform Deep Learning Deployment
The same application logic deploys across embedded and mobile targets (NVIDIA Jetson, Raspberry Pi, BeagleBone), desktops, and data centers.

Slide 7: Algorithm Design to Embedded Deployment Workflow (Conventional Approach)
1. Functional test on desktop GPU (high-level language, deep learning framework, large and complex software stack)
2. Deployment unit test on desktop GPU (C/C++, low-level APIs, application-specific libraries)
3. Deployment integration test on embedded GPU (C/C++, target-optimized libraries)
4. Real-time test on embedded GPU (optimize for memory & speed)
Challenges:
• Integrating multiple libraries and packages
• Verifying and maintaining multiple implementations
• Algorithm & vendor lock-in

Slide 8: Solution: Use MATLAB Coder & GPU Coder for Deep Learning Deployment
The application logic goes through GPU Coder to the NVIDIA TensorRT & cuDNN libraries, and through MATLAB Coder to the Intel MKL-DNN library and the ARM Compute Library.

Slide 10: Deep Learning Deployment Workflows
▪ Integrated application deployment: pre-processing + trained DNN + post-processing, compiled with codegen into portable target code
▪ Inference engine deployment: the trained DNN alone, compiled with cnncodegen into portable target code

Slide 12: Workflow for Inference Engine Deployment
1. Generate code for the trained model: >> cnncodegen(net, 'targetlib', 'arm-compute')
2. Copy the generated code onto the target board
3. Write a main function by hand to call the inference engine
4. Build the executable and test it: >> make -C ./ …
A hedged sketch of step 1 follows.
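This sketch is not verbatim from the slides: the choice of squeezenet is illustrative (any trained network object works), and the output location follows cnncodegen's defaults as I understand them.

    % Generate an ARM Compute Library inference engine for a trained DNN
    net = squeezenet;                          % illustrative pretrained model
    cnncodegen(net, 'targetlib', 'arm-compute');
    % Copy the generated C++ (emitted under the codegen folder) to the
    % board, add a hand-written main() that constructs the generated
    % network class and calls predict(), then build on the target with make.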

Slide 14: Deep Learning Inference Deployment: Pedestrian Detection
Demo targeting the ARM Compute Library through MATLAB Coder (includes ARM Cortex-A support).

Slide 15: Deep Learning Inference Deployment: Blood Smear Segmentation
Demo using the same coder-to-library flow.

Slide 16: Deep Learning Inference Deployment: Defect Classification & Detection
Demo following the same flow.

Slide 18: How is the performance?

Slide 19: Performance of Generated Code
▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Titan V GPU
▪ CNN inference (ResNet-50) on Jetson TX2
▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Intel Xeon CPU

Slide 20: Single Image Inference on Titan V using cuDNN
[Benchmark chart. Configuration: Intel® Xeon® CPU 3.6 GHz; CUDA 10.0/10.1, cuDNN 7.5.0; frameworks: TensorFlow 1.13.1, MXNet 1.4.1, PyTorch 1.1.]

Slide 21: TensorRT Accelerates Inference Performance on Titan V
[Benchmark chart: single image inference with ResNet-50 on Titan V, GPU Coder vs. TensorFlow, each with cuDNN, TensorRT (FP32), and TensorRT (INT8). Configuration: Intel® Xeon® CPU 3.6 GHz; CUDA 10.0/10.1, cuDNN 7.5.0, TensorRT 5.1.2; TensorFlow 1.13.1.]

Slide 23: Single Image Inference on Jetson TX2
[Benchmark chart: ResNet-50, GPU Coder + TensorRT vs. TensorFlow + TensorRT. Configuration: CUDA 10.0/10.1, cuDNN 7.5.0, TensorRT 5.1.2; TensorFlow 1.13.1.]

Slide 24: CPU Performance
[Benchmark chart: single image inference with SqueezeNet on a Linux CPU, comparing MATLAB, MATLAB Coder, TensorFlow, MXNet, and PyTorch. Configuration: Intel® Xeon® CPU 3.6 GHz; TensorFlow 1.13.1, MXNet 1.4.1, PyTorch 1.1.]

Slide 25: Brief Summary
DNN libraries are great for inference, and MATLAB Coder and GPU Coder generate code that takes advantage of:
▪ NVIDIA® CUDA libraries, including TensorRT & cuDNN
▪ Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)
▪ ARM® Compute Library for mobile platforms

Slide 26: Brief Summary (continued)
But applications require more than just inference.

Slide 27: Deep Learning Workflows: Integrated Application Deployment
Pre-processing + trained DNN + post-processing, compiled with codegen into portable target code (a hedged sketch follows).
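A minimal sketch of the codegen call, not from the slides: myApp is an illustrative entry-point function chaining pre-processing, inference, and post-processing, and the 224x224x3 input size is an assumption.

    cfg = coder.gpuConfig('exe');                          % standalone executable
    cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
    cfg.GenerateExampleMain = 'GenerateCodeAndCompile';    % emit an example main()
    codegen -config cfg myApp -args {ones(224,224,3,'single')} -report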

Slide 29: Lane and Object Detection using YOLO v2
Pipeline: lane detection and object detection (AlexNet-based YOLO v2) feed a post-processing stage that keeps the strongest bounding box.
Workflow:
1) Test in MATLAB on CPU
2) Generate code and test on desktop GPU
3) Generate code and test on Jetson AGX Xavier GPU

Slide 30: (1) Test in MATLAB on CPU
Run the full pipeline (lane detection, AlexNet-based YOLO v2 object detection, strongest-bounding-box post-processing) in MATLAB as the functional baseline.

Slide 31: (2) Generate Code and Test on Desktop GPU
The DNN stages compile to cuDNN/TensorRT-optimized code and the surrounding pipeline logic to CUDA-optimized code (a hedged sketch follows).
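A minimal sketch of this step, assuming the pipeline is wrapped in an illustrative entry-point function laneAndObjectDetector; 227x227x3 matches AlexNet's input size, and testFrame is a placeholder image.

    cfg = coder.gpuConfig('mex');                          % CUDA MEX for desktop test
    cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
    codegen -config cfg laneAndObjectDetector -args {ones(227,227,3,'single')}
    out = laneAndObjectDetector_mex(single(testFrame));    % compare with CPU baseline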

Slide 32: (3) Generate Code and Test on Jetson AGX Xavier GPU
The same pipeline is cross-compiled and run on the Jetson AGX Xavier (a hedged sketch follows).
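A minimal sketch of board deployment, assuming the GPU Coder Support Package for NVIDIA GPUs; the hostname and credentials are placeholders, and laneAndObjectDetector remains the illustrative entry point.

    hwobj = jetson('xavier-hostname', 'ubuntu', 'ubuntu');  % connect to the board
    cfg = coder.gpuConfig('exe');
    cfg.Hardware = coder.hardware('NVIDIA Jetson');
    cfg.Hardware.BuildDir = '~/remoteBuildDir';             % remote build location
    cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
    codegen -config cfg laneAndObjectDetector -args {ones(227,227,3,'single')}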

Slide 33: Lane and Object Detection using YOLO v2: Results
1) Runs in MATLAB on the CPU
2) Generated code runs 7x faster on the desktop GPU
3) Generated code runs on the Jetson AGX Xavier GPU

Slide 39: Accessing Hardware
▪ Access peripherals from MATLAB
▪ Processor-in-the-loop verification
▪ Deploy a standalone application

Slides 40–41: Deploy to Target Hardware via Apps and Command Line

Slide 42: How do MATLAB Coder and GPU Coder achieve these results?

Slide 43: Coders Apply Various Optimizations
▪ CUDA kernel lowering: parallel loop creation, CUDA kernel creation, cudaMemcpy minimization, shared memory mapping, CUDA code emission
▪ Traditional compiler optimizations: MATLAB library function mapping; loop optimizations (scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement)

Slide 44: Coders Apply Various Optimizations (continued)
On top of kernel lowering and the traditional compiler optimizations, performance comes from three sources: optimized libraries, network optimization, and coding patterns.

Slide 45: Generated Code Calls Optimized Libraries (performance source 1 of 3)
The pre- and post-processing stages map onto the cuFFT, cuBLAS, cuSolver, and Thrust libraries, while the DNN maps onto the NVIDIA TensorRT & cuDNN libraries, the Intel MKL-DNN library, or the ARM Compute Library, depending on the target (a hedged sketch of selecting the target library follows).
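A minimal sketch of how the target library is selected at code-generation time; myApp and the input size are illustrative, and only the DeepLearningConfig string changes per target.

    cfg = coder.config('lib');                                 % MATLAB Coder, CPU target
    cfg.TargetLang = 'C++';
    cfg.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');  % Intel MKL-DNN
    codegen -config cfg myApp -args {ones(224,224,3,'single')}
    % Swap 'mkldnn' for 'arm-compute' (MATLAB Coder), or use
    % coder.gpuConfig with 'cudnn'/'tensorrt' (GPU Coder), to target
    % the other libraries on this slide.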

Slide 46: Deep Learning Network Optimization (performance source 2 of 3)
▪ Layer fusion for optimized computation: in the example network the convolution chains fuse into FusedConv operations and the batch norm + add pair into a BatchNormAdd operation, leaving the max-pool layers in place.
▪ Buffer minimization for optimized memory: the intermediate buffers (a through e) are analyzed for liveness so that later layers reuse earlier ones (the slide marks buffer a and buffer b as reused), shrinking the memory footprint.

Slide 47: Coding Patterns: Stencil Kernels (performance source 3 of 3)
▪ Automatically applied for image processing functions (e.g. imfilter, imerode, imdilate, conv2, …)
▪ Manually applied using gpucoder.stencilKernel() (a hedged sketch follows)
[Diagram: each output pixel is the dot product of a kh-by-kw convolution kernel with the matching input-image neighborhood, swept over all rows and cols.]
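A minimal stencil-kernel sketch, following the documented gpucoder.stencilKernel pattern: a 3x3 mean filter whose helper function sees one neighborhood per output pixel. The function names are illustrative.

    function B = meanFilter3x3(A) %#codegen
        % Apply my_mean over every 3x3 neighborhood of A; in MATLAB this
        % simply executes, and under GPU Coder it lowers to a tiled CUDA
        % kernel that stages neighborhoods through shared memory.
        B = gpucoder.stencilKernel(@my_mean, A, [3 3], 'same');
    end

    function out = my_mean(window)
        out = mean(window(:));   % window is the kh-by-kw input neighborhood
    end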

Slide 48: Coding Patterns: Matrix-Matrix Kernels (performance source 3 of 3, continued)
▪ Automatically applied for many MATLAB functions (e.g. matchFeatures, SAD, SSD, pdist, …)
▪ Manually applied using gpucoder.matrixMatrixKernel() (a hedged sketch follows)
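A minimal matrix-matrix-kernel sketch: sum of absolute differences. Per the gpucoder.matrixMatrixKernel contract, the helper receives one row of A and one column of B and returns the corresponding element of C; the function names are illustrative.

    function C = sadMatrix(A, B) %#codegen
        C = gpucoder.matrixMatrixKernel(@rowColSAD, A, B);
    end

    function c = rowColSAD(a, b)
        c = sum(abs(a(:) - b(:)));   % one element of the output matrix
    end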

Slides 49–50: Deep Learning Workflow in MATLAB (recap)
Deep Neural Network Design + Training (train in MATLAB, import through the model importer, or transfer-learn from a reference model) → Trained DNN → Application Design (application logic with pre/post-processing) → Standalone Deployment through the Coders to the TensorRT and cuDNN libraries, the MKL-DNN library, and the ARM Compute Library.

Slide 51: Call to Action
▪ Visit the Deep Learning Booth!
▪ Related upcoming talks:
  – AI Techniques for Signals, Time-series, and Text Data
  – Sensor Fusion and Tracking for Autonomous Systems
  – Deploying Deep Neural Networks to Embedded GPUs and CPUs