
Page 1:

Micah Villmow

Senior TensorRT Software Engineer

S8822 – OPTIMIZING NMT WITH TENSORRT

Page 2:

More than 100x faster: is it really possible?

Page 3:

Neural Machine Translation Unit

DOUGLAS ADAMS – BABEL FISH

Page 4:

OVER 100X FASTER, IS IT REALLY POSSIBLE?

Over 200 years

Page 5:

NVIDIA TENSORRT: Programmable Inference Accelerator

developer.nvidia.com/tensorrt

TensorRT (Optimizer + Runtime) sits between the frameworks and the GPU platforms: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100.

Page 6:

TENSORRT LAYERS

Built-in Layer Support:

• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution

Custom Layer API: the deployed application calls into the TensorRT Runtime, which dispatches built-in layers itself and hands custom layers to user code running directly on the CUDA Runtime.
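To give a flavor of how built-in layers are composed through the C++ builder API, here is a minimal sketch; `addFcTanh` is a hypothetical helper, and the network, input tensor, and weights are assumed to come from the surrounding application. Custom layers would instead be attached through the plugin interface.

```cpp
// Hypothetical sketch: composing two built-in TensorRT layers in C++.
#include <NvInfer.h>
using namespace nvinfer1;

// Assumes `network` is an INetworkDefinition*, `input` an ITensor*, and
// `w`, `b` are Weights already loaded from the trained model.
ITensor* addFcTanh(INetworkDefinition* network, ITensor* input,
                   int nbOutputs, Weights w, Weights b)
{
    // Fully-connected layer followed by a tanh activation, two of the
    // built-in layer types listed above.
    IFullyConnectedLayer* fc =
        network->addFullyConnected(*input, nbOutputs, w, b);
    IActivationLayer* act =
        network->addActivation(*fc->getOutput(0), ActivationType::kTANH);
    return act->getOutput(0);
}
```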

Page 7:

TENSORRT OPTIMIZATIONS

• Kernel Auto-Tuning
• Layer & Tensor Fusion
• Dynamic Tensor Memory
• Weights & Activation Precision Calibration

40x Faster CNNs on V100 vs. CPU-Only, Under 7 ms Latency (ResNet50)

[Chart: inference throughput (images/sec, bars) and latency (ms) on ResNet50]
CPU-Only: 140 images/sec, 14 ms
V100 + TensorFlow: 305 images/sec, 6.67 ms
V100 + TensorRT: 5,700 images/sec, 6.83 ms

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4@2.6GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4@2.6GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX512.

Page 8:

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve it?
• What perf is possible?

Page 9:

ACRONYMS AND DEFINITIONS

NMT: Neural Machine Translation

OpenNMT: open-source NMT project for academia and industry

Token: the minimum representation used for encoding (symbol, word, character, subword)

Sequence: a number of tokens wrapped by special start- and end-of-sequence tokens

Beam Search: a directed, partial breadth-first tree-search algorithm

TopK: a partial sort yielding the N smallest or largest elements

Unk: a special token that represents unknown translations
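Since TopK recurs throughout the decoder, a tiny standalone illustration of the operation may help; it uses std::partial_sort purely to pin down the semantics, and is not the GPU implementation.

```cpp
// Standalone illustration of TopK as a partial sort: keep the k largest
// probabilities and their original indices.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    std::vector<float> prob = {0.1f, 0.9f, 0.3f, 0.8f, 0.05f};
    const int k = 2;

    std::vector<int> idx(prob.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ..., N-1

    // Reorder indices so the first k refer to the k largest probabilities.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return prob[a] > prob[b]; });

    for (int i = 0; i < k; ++i)
        std::printf("index %d, prob %.2f\n", idx[i], prob[idx[i]]);
    return 0;  // prints: index 1, prob 0.90 / index 3, prob 0.80
}
```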

Page 10:

OPENNMT INFERENCE

Encoder: Input → Input Setup → Encoder RNN
Decoder: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output
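The control flow implied by this diagram is a single encoder pass followed by a decoder loop. The skeleton below is a hypothetical sketch of that loop; all types and function names are illustrative, not the sample's actual API.

```cpp
// Minimal control-flow sketch of the pipeline above (types hypothetical).
#include <vector>

struct EncoderOut {};                     // encoder hidden/cell state + context
struct BeamState { bool done = false; };  // per-sentence beam bookkeeping

EncoderOut runEncoder(const std::vector<int>& tokens) { return {}; }
bool decodeStep(const EncoderOut&, BeamState& b) { return (b.done = true); }

std::vector<int> translate(const std::vector<int>& srcTokens)
{
    EncoderOut enc = runEncoder(srcTokens);  // encoder runs once per batch
    BeamState beams;
    while (!beams.done)
        decodeStep(enc, beams);  // RNN, attention, projection, TopK, and
                                 // beam search run on every iteration
    return {};                   // backtracked best hypothesis (see Page 32)
}
```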

Page 11:

DECODER EXAMPLE

Each decoder iteration runs: Input Embedding → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output Embedding.

Iteration 0: the start token <S> goes in; the beams hold the candidates This, The, He, What, The.

Iteration 1: each beam is extended by its next token (is, house, ran, time, cow), giving This is, The house, He ran, What time, The cow.

Page 12:

TRAINING VS INFERENCE

Training graph: Input → Input Setup → Encoder RNN, then Decoder RNN → Attention Model → Projection → Output.

Inference graph: the same encoder and decoder, with TopK and Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) added between the projection and the output.

Page 13:

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve it?
• What perf is possible?

Page 14:

INFERENCE TIME IS BEAM SEARCH TIME

• Wu et al., 2016, 'Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation', arXiv:1609.08144
• Sharan Narang, June 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
• Rui Zhao, December 2017, 'Why does inference run 20x slower than training?', https://github.com/tensorflow/nmt/issues/204
• David Levinthal, Ph.D., January 2018, 'Evaluating RNN performance across hardware platforms'

Page 15:

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve it?
• What perf is possible?

Page 16:

PERF ANALYSIS

Page 17:

KERNEL ANALYSIS

Page 18:

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve it?
• What perf is possible?

Page 19:

ENCODER

Encoder: Input → Input Setup → Encoder RNN

Page 20:

Input: "Hello." "This is a test." "Bye."

Tokenization: "Hello ." / "This is a test ." / "Bye ."

A Gather produces the zero-padded Encoder Input matrix of token IDs:

42 23  0  0  0 0
73  3  8 19 23 0
98 23  0  0  0 0

A PrefixSumPlugin fills the Sequence Length Buffer: 2, 5, 2.

Setup also emits the Decoder Start Tokens and a Constant Zero State buffer.
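Below is a hypothetical host-side sketch of this setup step: mapping variable-length token sequences into one zero-padded ID matrix plus a sequence-length buffer, matching the 3x6 example above. (The talk does this on the GPU with a Gather plus a PrefixSumPlugin inside the TensorRT graph.)

```cpp
// Build a padded encoder-input matrix and per-row sequence lengths.
#include <cstdio>
#include <vector>

int main()
{
    std::vector<std::vector<int>> tokens = {
        {42, 23},            // "Hello ."
        {73, 3, 8, 19, 23},  // "This is a test ."
        {98, 23},            // "Bye ."
    };

    const size_t maxLen = 6;  // padded width of the encoder input
    std::vector<int> encoderInput(tokens.size() * maxLen, 0);  // 0 = pad
    std::vector<int> seqLen(tokens.size());

    for (size_t row = 0; row < tokens.size(); ++row) {
        seqLen[row] = static_cast<int>(tokens[row].size());
        for (size_t col = 0; col < tokens[row].size(); ++col)
            encoderInput[row * maxLen + col] = tokens[row][col];
    }

    for (size_t row = 0; row < tokens.size(); ++row)
        std::printf("len=%d first=%d\n", seqLen[row],
                    encoderInput[row * maxLen]);  // 2/42, 5/73, 2/98
    return 0;
}
```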

Page 21:

Encoder

An Embedding Plugin maps the Encoder Input IDs to embedding values:

42 23  0  0  0 0        .1   .35  0   0  0   0
73  3  8 19 23 0   →    .123 .93  1.4 1  .01 0
98 23  0  0  0 0        .42  .20  0   0  0   0

A PackedRNN consumes the embeddings together with the Sequence Lengths (2, 5, 2) and the trained hidden and cell states, and produces the Encoder Hidden State, Encoder Cell State, and Context Vector.
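In graph-building terms, a packed (variable-length) encoder RNN can be expressed with TensorRT's IRNNv2Layer, which accepts per-example sequence lengths. This is a hedged sketch against the v5-era API, not necessarily what this sample shipped; all tensor arguments are assumed to exist.

```cpp
// Hedged sketch: a variable-length ("packed") LSTM encoder via IRNNv2Layer.
#include <NvInfer.h>
using namespace nvinfer1;

ITensor* buildEncoder(INetworkDefinition* network, ITensor* embedded,
                      ITensor* seqLens, ITensor* h0, ITensor* c0,
                      int hiddenSize, int maxSeqLen)
{
    IRNNv2Layer* rnn = network->addRNNv2(*embedded, /*layerCount=*/1,
                                         hiddenSize, maxSeqLen,
                                         RNNOperation::kLSTM);
    rnn->setSequenceLengths(*seqLens);  // padded rows stop early (2, 5, 2)
    rnn->setHiddenState(*h0);           // trained initial hidden state
    rnn->setCellState(*c0);             // trained initial cell state
    // Per-gate weights omitted for brevity
    // (setWeightsForGate / setBiasForGate).
    return rnn->getOutput(0);           // per-step outputs: the context vectors
}
```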

Page 22:

DECODER

Decoder: Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output

Page 23:

Decoder, 1st Iteration

The Decoder Input holds the start-of-sentence token <S> for every batch entry (Batch0 ... BatchN). An Embedding Plugin feeds the RNN, which is seeded with the Encoder Hidden State and Encoder Cell State, and emits the Decoder Hidden State, Decoder Cell State, and Decoder Output (e.g. Batch0: .124, ..., BatchN: .912).

Page 24:

Decoder, 2nd+ Iteration

From the second iteration on, the RNN consumes the previous hidden and cell states and a Decoder Input of one token per beam:

Batch | Beam0 | Beam1 | Beam2 | Beam3 | Beam4
0     | こ    | ん    | に    | ち    | は
N     | さ    | よ    | う    | な    | ら

and produces the next hidden and cell states plus a Decoder Output score per beam:

Batch | Beam0 | Beam1 | Beam2 | Beam3 | Beam4
0     | .18   | .32   | .85   | .39   | .75
N     | .79   | .27   | .81   | .93   | .73

Page 25:

Global Attention Model

Inputs: the Context Vector (from the encoder), the Sequence Length Buffer, the Decoder Output (one value per batch and beam, as above), and FullyConnected weights.

Dataflow: BatchedGemm (decoder output against the context vector) → RaggedSoftmax (masked by the sequence lengths) → BatchedGemm (attention weights against the context vector) → Concat with the decoder output → FullyConnected → TanH.
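The same dataflow can be written against TensorRT's layer API; addMatrixMultiply and addRaggedSoftMax are the v5-era entry points, so treat this as a hedged sketch rather than the sample's exact code. All tensor and weight arguments are assumed to be defined elsewhere.

```cpp
// Hedged sketch of the attention dataflow above using TensorRT layers.
#include <NvInfer.h>
using namespace nvinfer1;

ITensor* buildAttention(INetworkDefinition* net, ITensor* decOut,
                        ITensor* context, ITensor* seqLens,
                        int hiddenSize, Weights attnW, Weights attnB)
{
    // Alignment scores: decoder output x encoder context (transposed).
    ITensor* scores =
        net->addMatrixMultiply(*decOut, MatrixOperation::kNONE,
                               *context, MatrixOperation::kTRANSPOSE)
            ->getOutput(0);

    // Ragged softmax masks padded source positions per sequence length.
    ITensor* attn = net->addRaggedSoftMax(*scores, *seqLens)->getOutput(0);

    // Weighted sum of context vectors.
    ITensor* summary =
        net->addMatrixMultiply(*attn, MatrixOperation::kNONE,
                               *context, MatrixOperation::kNONE)
            ->getOutput(0);

    // Concat with the decoder output, then FullyConnected + tanh.
    ITensor* both[] = {summary, decOut};
    ITensor* cat = net->addConcatenation(both, 2)->getOutput(0);
    ITensor* fc = net->addFullyConnected(*cat, hiddenSize, attnW, attnB)
                      ->getOutput(0);
    return net->addActivation(*fc, ActivationType::kTANH)->getOutput(0);
}
```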

Page 26:

Projection

The Attention Output (one hidden vector per batch and beam):

Batch | Beam0      | Beam1     | Beam2    | Beam3     | Beam4
0     | [.9,…,.1]  | [0,…,.3]  | [.1,…,0] | [.6,…,.8] | [.3,…,.2]
N     | [.4,…,.9]  | [.5,…,.2] | [0,…,.7] | [0,…,2]   | [.1,…,.9]

is projected onto the vocabulary with a FullyConnected layer (using the projection weights), followed by Softmax and Log, yielding the Projection Output: per-beam log-probabilities over the vocabulary.
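As a hedged sketch, the projection stage maps onto three TensorRT layers; vocabSize, projW, and projB are assumed, and UnaryOperation::kLOG turns the softmax output into log-probabilities.

```cpp
// Hedged sketch: the projection stage as TensorRT layers.
#include <NvInfer.h>
using namespace nvinfer1;

ITensor* buildProjection(INetworkDefinition* net, ITensor* attnOut,
                         int vocabSize, Weights projW, Weights projB)
{
    ITensor* logits =
        net->addFullyConnected(*attnOut, vocabSize, projW, projB)
            ->getOutput(0);
    ITensor* probs = net->addSoftMax(*logits)->getOutput(0);
    return net->addUnary(*probs, UnaryOperation::kLOG)->getOutput(0);
}
```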

Page 27:

TopK Part 1

A TopK runs intra-beam: for each batch entry and each beam, it keeps the best candidate tokens from the Projection Output as (index, probability) pairs, e.g.

Batch | Beam0   | Beam1    | Beam2   | Beam3    | Beam4
Index | [1,3]   | [2,4]    | [9,0]   | [5,0]    | [7,6]
Prob  | [.9,.8] | [.99,.5] | [.3,.8] | [.1,.93] | [.85,.99]

A Gather then flattens these per-beam survivors for the inter-beam pass.
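In graph terms, the intra-beam pass is one addTopK call. A hedged sketch follows; the axis mask depends on the actual tensor layout, so `1u << vocabAxis` is illustrative only.

```cpp
// Hedged sketch of the intra-beam TopK as a TensorRT layer.
#include <NvInfer.h>
using namespace nvinfer1;

void addIntraBeamTopK(INetworkDefinition* net, ITensor* logProbs,
                      int k, int vocabAxis,
                      ITensor** values, ITensor** indices)
{
    ITopKLayer* topk = net->addTopK(*logProbs, TopKOperation::kMAX,
                                    k, 1u << vocabAxis);
    *values  = topk->getOutput(0);  // best k log-probabilities per beam
    *indices = topk->getOutput(1);  // their vocabulary indices per beam
}
```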

Page 28:

TopK Part 2

The Gather Output concatenates the per-beam survivors into one list:

Prob [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99]

An inter-beam TopK then selects the overall best:

Indices [2,9,7,0,8]
Prob    [.99,.99,.93,.9,.85]

A Beam Mapping Plugin translates those flat indices back into intra-beam coordinates:

Output: Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7
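The mapping the Beam Mapping Plugin performs is just integer division and indexing: flatIdx / k names the source beam, and flatIdx % k picks that beam's intra-beam survivor (k = 2 here). A standalone illustration, reproducing the slide's numbers:

```cpp
// Inter-beam TopK over the flattened survivors, then map flat indices
// back to (beam, vocabulary index) pairs.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    // Flattened intra-beam survivors (5 beams x k=2), as on the slide.
    std::vector<float> prob = {.9f, .8f, .99f, .55f, .3f,
                               .8f, .1f, .93f, .85f, .99f};
    std::vector<int> vocabIdx = {1, 3, 2, 4, 9, 0, 5, 0, 7, 6};
    const int k = 2, beamWidth = 5;

    std::vector<int> flat(prob.size());
    std::iota(flat.begin(), flat.end(), 0);
    std::partial_sort(flat.begin(), flat.begin() + beamWidth, flat.end(),
                      [&](int a, int b) { return prob[a] > prob[b]; });

    for (int i = 0; i < beamWidth; ++i) {
        int f = flat[i];
        std::printf("Beam%d,Idx%d (prob %.2f)\n",
                    f / k, vocabIdx[f], prob[f]);  // e.g. Beam1,Idx2 (0.99)
    }
    return 0;
}
```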

Page 29:

Beam Search – Beam Shuffle

Given the selected parents (Beam1, Beam4, Beam3, Beam0, Beam4), the Beam Shuffle Plugin reorders the per-beam state so each surviving hypothesis carries its parent's state forward:

Beam0 State, Beam1 State, Beam2 State, Beam3 State, Beam4 State
→ Beam1 State+1, Beam4 State+1, Beam3 State+1, Beam0 State+1, Beam4 State+1

Note that Beam4's state is duplicated and Beam2's is dropped.
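At its core the shuffle is a gather of state rows by parent beam index. On the GPU this is a batched gather; the plain C++ sketch below only shows the reindexing.

```cpp
// Gather state rows by parent index (Beam2 dropped, Beam4 duplicated).
#include <cstdio>
#include <vector>

int main()
{
    std::vector<std::vector<float>> state = {  // one state row per beam
        {0.f}, {1.f}, {2.f}, {3.f}, {4.f}};
    std::vector<int> parent = {1, 4, 3, 0, 4};  // from the inter-beam TopK

    std::vector<std::vector<float>> next(parent.size());
    for (size_t b = 0; b < parent.size(); ++b)
        next[b] = state[parent[b]];  // carry the parent's state forward

    for (size_t b = 0; b < next.size(); ++b)
        std::printf("new beam %zu <- old beam %d\n", b, parent[b]);
    return 0;
}
```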

Page 30:

Beam Search – Beam Scoring

The Beam Scoring Plugin takes the shuffled states (Beam0..Beam4 State → State+1) and the selected tokens, and per beam performs:

• EOS detection
• Sentence probability update
• Backtrack-state storage
• Sequence-length increment
• End-of-beam/batch heuristic

It emits a Batch Finished Bitmap [0001100011…010] marking completed sentences.
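A hypothetical host-side reference for what this plugin computes per step is sketched below; all names are illustrative, and the real work happens in a fused GPU kernel.

```cpp
// Per-step beam scoring: detect EOS, accumulate log-probabilities, record
// backtrack pointers, bump sequence lengths, and mark finished beams.
#include <vector>

struct BeamScoringState {
    std::vector<float> score;     // accumulated log-probability per beam
    std::vector<int>   length;    // decoded length per beam
    std::vector<bool>  finished;  // the "finished bitmap"
    std::vector<std::vector<int>> backtrack;  // parent per step, per beam
};

void scoreStep(BeamScoringState& s, const std::vector<int>& token,
               const std::vector<int>& parent,
               const std::vector<float>& logProb, int eosToken, int maxLen)
{
    for (size_t b = 0; b < token.size(); ++b) {
        if (s.finished[b]) continue;          // frozen once complete
        s.score[b] += logProb[b];             // sentence probability update
        s.backtrack[b].push_back(parent[b]);  // state for later backtracking
        ++s.length[b];                        // sequence-length increment
        if (token[b] == eosToken || s.length[b] >= maxLen)
            s.finished[b] = true;             // EOS / end-of-beam heuristic
    }
}
```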

Page 31:

Beam Search – Batch Reduction

The finished bitmap is summed with a reduce operation, and a single 32-bit value is transferred to the host as the new batch size. An Encoder/State Reduction Plugin then gathers the still-active rows of the encoder output and beam state, so the subsequent TopK, Gather, and decoder iterations run on a smaller batch.
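A standalone sketch of the reduction: count the unfinished sentences (on the GPU, a sum over the bitmap sent to the host as 32 bits) and compact their rows.

```cpp
// Batch reduction: compute the new batch size and gather alive rows.
#include <cstdio>
#include <vector>

int main()
{
    std::vector<bool>  finished = {false, true, false, true};
    std::vector<float> encOut   = {.1f, .2f, .3f, .4f};  // one row per sentence

    int newBatch = 0;
    for (size_t i = 0; i < finished.size(); ++i)
        if (!finished[i]) encOut[newBatch++] = encOut[i];  // keep alive rows
    encOut.resize(newBatch);

    std::printf("new batch size = %d\n", newBatch);  // prints 2
    return 0;
}
```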

Page 32:

Output

Each iteration asks: all done? If no, the selected tokens (Beam1,Idx2 | Beam4,Idx6 | Beam3,Idx0 | Beam0,Idx1 | Beam4,Idx7) become the next Decoder Input. If yes, the Beam State is copied device-to-host, and the host backtracks through it to emit the translations:

こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Bye.")
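The host-side backtrack walks parent pointers from the best final beam back to the start token. A hypothetical sketch, with illustrative names:

```cpp
// Recover the output sequence by hopping parent pointers backwards.
#include <algorithm>
#include <vector>

std::vector<int> backtrack(const std::vector<std::vector<int>>& parentAtStep,
                           const std::vector<std::vector<int>>& tokenAtStep,
                           int bestFinalBeam)
{
    std::vector<int> out;
    int beam = bestFinalBeam;
    for (int step = static_cast<int>(tokenAtStep.size()) - 1; step >= 0; --step) {
        out.push_back(tokenAtStep[step][beam]);  // token chosen at this step
        beam = parentAtStep[step][beam];         // hop to the parent beam
    }
    std::reverse(out.begin(), out.end());        // collected back-to-front
    return out;
}
```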

Page 33:

TENSORRT ANALYSIS

Page 34:

TENSORRT KERNEL ANALYSIS

Page 35:

Agenda

• What is NMT?
• What is the current state?
• What are the problems?
• How did we solve it?
• What perf is possible?

Page 36:

RESULTS

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

[Chart: inference throughput (sentences/sec, bars) and latency (ms) on OpenNMT 692M]
CPU-Only + Torch: 280 ms
V100 + Torch: 425 sentences/sec, 153 ms
V100 + TensorRT: 550 sentences/sec, 117 ms

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4@2.6GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4@2.6GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4@2.6GHz, 3.5GHz Turbo (Broadwell), HT On.

Page 37:

SUMMARY

• Showed that TopK no longer dominates sequence inference time.
• Showed that RNN inference is compute bound, not memory bound.
• TensorRT accelerates sequence inference.
• Over two orders of magnitude higher throughput than CPU.
• Latency reduced by more than half versus CPU.

developer.nvidia.com/tensorrt
PRODUCT PAGE

Page 38:

LEARN MORE

developer.nvidia.com/tensorrt: PRODUCT PAGE
docs.nvidia.com/deeplearning/sdk: DOCUMENTATION
nvidia.com/dli: TRAINING

Page 39:

Q&A

Page 40: S8822 OPTIMIZING NMT WITH TENSORRT - NVIDIA · Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB,