
Page 1: An Analysis of Convolution for Inference

An Analysis of Convolution for Inference

24 June 2016

Scott Gray Nervana Systems

MAKING MACHINES SMARTER.™

Page 2: An Analysis of Convolution for Inference

Proprietary and confidential. Do not distribute.

Direct Convolution

• Compute with in-place slicing + GEMM

• Data layout considerations: C, H, W, N

• Minimize slicing logic

• Maximize contiguous access

• Leverage filter overlap
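The bullets above can be sketched in NumPy. This is a minimal illustration, not the Neon kernel: it poses direct convolution as per-output-position slicing plus a GEMM, with the CHWN layout keeping the batch dimension N contiguous. The output-size formula used here, floor((W - S + 2*pad)/stride) + 1, is equivalent to the ceiling form on the next slide.

```python
import numpy as np

def direct_conv_chwn(inp, filt, pad=0, stride=1):
    """Direct convolution posed as slicing + GEMM.

    inp:  (C, H, W, N) -- channels outermost, batch innermost (CHWN)
    filt: (C, R, S, K) -- R x S filter taps, K output feature maps
    Returns out: (K, P, Q, N).
    """
    C, H, W, N = inp.shape
    Cf, R, S, K = filt.shape
    assert C == Cf
    P = (H - R + 2 * pad) // stride + 1
    Q = (W - S + 2 * pad) // stride + 1
    padded = np.pad(inp, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((K, P, Q, N))
    for p in range(P):
        for q in range(Q):
            # Slice out the C x R x S input patch; N stays contiguous
            # in memory, which is the point of the CHWN layout.
            patch = padded[:, p*stride:p*stride+R, q*stride:q*stride+S, :]
            # GEMM: (K, C*R*S) x (C*R*S, N) -> (K, N)
            out[:, p, q, :] = filt.reshape(C*R*S, K).T @ patch.reshape(C*R*S, N)
    return out
```

Note how filter overlap shows up here: adjacent (p, q) patches share most of their input, which a real kernel exploits by reusing loaded data rather than re-slicing.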

Page 3: An Analysis of Convolution for Inference

Small N direct convolution: Without Superblocking

fprop: Q = ⌈(W - S + 1 + 2*pad) / stride⌉, wi = sk + qj*stride - pad

Fig from V. Dumoulin, https://github.com/vdumoulin/conv_arithmetic
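The fprop geometry above is easy to check in a few lines. Function and parameter names here are illustrative, not from the slides:

```python
import math

def out_size(W, S, pad, stride):
    # Q = ceil((W - S + 1 + 2*pad) / stride)
    return math.ceil((W - S + 1 + 2 * pad) / stride)

def input_index(s, q, pad, stride):
    # wi = sk + qj*stride - pad; this can fall outside [0, W) at the
    # borders, which is exactly where the slicing logic must zero-pad
    # or skip taps.
    return s + q * stride - pad

# 5-wide input, 3-wide filter, pad=1, stride=1 -> 5 outputs
assert out_size(5, 3, 1, 1) == 5
assert input_index(0, 0, 1, 1) == -1  # lands in the padded region
```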

Page 4: An Analysis of Convolution for Inference

Small N direct convolution: With Superblocking

fprop: Q = ⌈(W - S + 1 + 2*pad) / stride⌉, wi = sk + qj*stride - pad

Page 5: An Analysis of Convolution for Inference

Small N direct convolution: Bprop for deconv

bprop: pad' = S - pad - 1, wi = ⌊(qj - pad' + sk) / stride⌋
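A small sketch of the bprop (deconv) index mapping, under the slide's definitions; names are illustrative. The floor in the formula pairs in practice with a divisibility check: with stride > 1, filter taps whose numerator is not a multiple of the stride do not contribute.

```python
def bprop_indices(q, S, pad, stride):
    """For output position q, yield (s, w) pairs where gradient element w
    contributes through filter tap s. pad' = S - pad - 1 per the slide."""
    pad_p = S - pad - 1
    pairs = []
    for s in range(S):
        num = q - pad_p + s
        if num % stride == 0:      # fractional positions are skipped
            pairs.append((s, num // stride))
    return pairs
```

Out-of-range w values (negative or >= W) fall in the padded region and are dropped by the caller, exactly as in fprop.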

Page 6: An Analysis of Convolution for Inference

Small N direct convolution: Dilated Filters

Dilated

S' = (S-1)*rate + 1, Q = ⌈(W - S' + 1 + 2*pad) / stride⌉

wi = sk * rate + qj * stride - pad

Fig from F. Yu, V. Koltun http://arxiv.org/abs/1511.07122v3
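Dilation only changes the effective filter span and the tap spacing; the formulas above can be sketched directly (names here are illustrative):

```python
import math

def dilated_geometry(W, S, pad, stride, rate):
    S_eff = (S - 1) * rate + 1                       # S' = (S-1)*rate + 1
    Q = math.ceil((W - S_eff + 1 + 2 * pad) / stride)
    return S_eff, Q

def dilated_input_index(s, q, pad, stride, rate):
    return s * rate + q * stride - pad               # wi = sk*rate + qj*stride - pad

# A 3-tap filter at rate 2 spans 5 input samples; on a 7-wide input
# with no padding, that leaves 3 output positions.
assert dilated_geometry(7, 3, 0, 1, 2) == (5, 3)
```

At rate = 1 this degenerates to the plain fprop formulas on the earlier slides.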

Page 7: An Analysis of Convolution for Inference

Convolution with Algorithmic Speedups

• FFT and Winograd have the same basic computational flow

• FFT tiles typically need to be much bigger

• Winograd history: Toom and Cook, then Lavin

Page 8: An Analysis of Convolution for Inference

Winograd: input transform

Input Feature Map

• Input transform: 4x4 tiles, stride 2

• 2D Winograd is a nested product of 1D transforms

• Transforms can be simplified to remove zeros
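A minimal sketch of the input transform, assuming the F(2x2,3x3) variant implied by the 4x4 tile / stride-2 figure (matrices from Lavin and Gray):

```python
import numpy as np

# B^T for F(2x2,3x3): applied as a nested product of 1D transforms.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)

def input_transform(d):
    """d: 4x4 input tile -> 4x4 transformed tile V = B^T d B."""
    return BT @ d @ BT.T
```

"Simplified to remove zeros" means the matrix products are expanded into the few adds/subs that survive, e.g. V[0,0] = d[0,0] - d[0,2] - d[2,0] + d[2,2]; no multiplies are needed for this transform at all.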

Page 9: An Analysis of Convolution for Inference

Winograd: filter transform

• Filter transform

• Same as input but with different coefficients

• Transform each feature map independently
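The filter transform has the same nested-product shape, just with the G coefficients; each 3x3 filter (one per input/output feature-map pair) is transformed on its own. Again a sketch for F(2x2,3x3):

```python
import numpy as np

# G for F(2x2,3x3): takes a 3x3 filter to a 4x4 transformed filter.
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def filter_transform(g):
    """g: 3x3 filter -> 4x4 transformed filter U = G g G^T."""
    return G @ g @ G.T
```

For inference the filters are fixed, so this transform can be done once offline and amortized over every input.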

Page 10: An Analysis of Convolution for Inference

Winograd: batched GEMM

• Point-wise multiplication

• Posed as a batched GEMM operation
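The point-wise multiply also accumulates over input channels C, and that accumulation is what turns it into GEMM: one independent (K x C)(C x T) product per transform coordinate, 16 of them for the 4x4 transforms. Shapes below are an illustrative layout, not Neon's:

```python
import numpy as np

def batched_gemm(U, V):
    """U: (4, 4, K, C) transformed filters; V: (4, 4, C, T) transformed
    input tiles (T tiles). Returns M: (4, 4, K, T)."""
    _, _, K, C = U.shape
    _, _, _, T = V.shape
    M = np.empty((4, 4, K, T))
    for i in range(4):
        for j in range(4):
            # One GEMM per transform coordinate (i, j):
            # (K, C) x (C, T) -> (K, T)
            M[i, j] = U[i, j] @ V[i, j]
    return M
```

Because each of the 16 GEMMs is a full dense matrix product, the scheme inherits GEMM's arithmetic intensity, which is where the speedup over direct convolution comes from.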

Page 11: An Analysis of Convolution for Inference

Winograd: output transform

Output Feature Map

• Output transform

• Same structure as the input and filter transforms

• Transform back to pixel space to obtain the 2x2 output tile
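Putting the four stages together for a single tile makes the whole flow checkable against direct convolution. A self-contained F(2x2,3x3) sketch (matrices per Lavin and Gray; single channel for clarity):

```python
import numpy as np

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """4x4 input tile d, 3x3 filter g -> 2x2 output tile."""
    V = BT @ d @ BT.T          # input transform
    U = G @ g @ G.T            # filter transform
    M = U * V                  # point-wise multiply
    return AT @ M @ AT.T       # output transform: back to pixel space

# Check against direct (valid) correlation on the same tile.
rng = np.random.default_rng(0)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_2x2_3x3(d, g), direct)
```

The count is 16 multiplies for a 2x2 output versus 36 for direct, the 2.25x arithmetic reduction that reappears in the efficiency table later in the deck.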

Page 12: An Analysis of Convolution for Inference

Transforms for Increased Accuracy

Input transforms for 4x4:

Integer roots:

  4   0  -5   0   1   0
  0  -4  -4   1   1   0
  0   4  -4  -1   1   0
  0  -2  -1   2   1   0
  0   2  -1  -2   1   0
  0   4   0  -5   0   1

Fractional roots:

  0.87   0     -2.64   0      1   0
  0     -1.4   -2.25   0.62   1   0
  0      1.4   -2.25  -0.62   1   0
  0     -0.58  -0.39   1.5    1   0
  0      0.58  -0.39  -1.5    1   0
  0      0.87   0     -2.64   0   1

Page 13: An Analysis of Convolution for Inference

Precision

[Chart: percentage error from convolution (0-25%) vs. bit width (3-16) for Direct, 2x2 Winograd, 4x4 Winograd (fractional roots), and 4x4 Winograd (integer roots)]

Bits   Direct   2x2 Winograd   4x4 frac   4x4 int
 3     56.461      112.174      351.196    314.62
 4     23.533       46.222      274.28     432.959
 5     10.879       21.394      142.649    459.723
 6      5.245       10.34        68.062    446.271
 7      2.585        5.074       33.73     250.057
 8      1.286        2.516       16.667    123.585
 9      0.639        1.253        8.246     62.001
10      0.319        0.626        4.154     31.006
11      0.159        0.312        2.064     15.439
12      0.08         0.156        1.029      7.669
13      0.04         0.078        0.515      3.857
14      0.02         0.039        0.259      1.923
15      0.01         0.019        0.129      0.966
16      0.005        0.01         0.064      0.483

Page 14: An Analysis of Convolution for Inference

Multiplier Transistor Efficiency

Algo     bits   speedup   transistors   performance / transistor
Direct      8      1.0           3000       1.0
2x2         9      2.25          3750       1.8
4x4        12      4.0           6000       2.0

Transistor counts from Wikipedia.
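The last column is just the algorithmic speedup divided by the growth in multiplier transistors, normalized to direct convolution with 8-bit multipliers (3000 transistors, per the table):

```python
# Reproducing the table's performance-per-transistor column.
# (bits, speedup, multiplier transistors) per the slide.
rows = {"Direct": (8, 1.0, 3000),
        "2x2":    (9, 2.25, 3750),
        "4x4":    (12, 4.0, 6000)}

perf_per_transistor = {algo: speedup / (transistors / 3000.0)
                       for algo, (bits, speedup, transistors) in rows.items()}
# {'Direct': 1.0, '2x2': 1.8, '4x4': 2.0}
```

The wider multipliers (9 and 12 bits) are what buy back the precision the transforms cost, per the previous two slides.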

Page 15: An Analysis of Convolution for Inference

Logarithmic quantization

D. Miyashita, E. H. Lee, B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation"

http://arxiv.org/abs/1603.01025v2
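A minimal sketch in the spirit of that paper: values are snapped to signed powers of two, so multiplies become shifts. The clipping range here is an assumption for illustration, not the paper's configuration.

```python
import numpy as np

def log_quantize(x, min_exp=-8, max_exp=0):
    """Quantize x to sign(x) * 2^e with e an integer in [min_exp, max_exp]."""
    sign = np.sign(x)
    # Guard against log2(0); zeros come out as zero via sign(0) == 0.
    exp = np.round(np.log2(np.maximum(np.abs(x), 2.0 ** min_exp)))
    exp = np.clip(exp, min_exp, max_exp)
    return sign * 2.0 ** exp

assert log_quantize(np.array([0.3]))[0] == 0.25   # nearest power of two: 2^-2
```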

Page 16: An Analysis of Convolution for Inference


Performance: VGG fp32 on GTX 1080

VGG - Totals:

[Chart: effective TFLOPS (0-25) vs. batch size (64, 32, 16, 8, 4, 2, 1) for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT]

Page 17: An Analysis of Convolution for Inference


Peak Performance: VGG fp32 on GTX 1080

VGG - Layer 4.2:

[Chart: effective TFLOPS (0-25) vs. batch size (64, 32, 16, 8, 4, 2, 1) for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT]