TRANSCRIPT
An Analysis of Convolution for Inference
24 June 2016
Scott Gray, Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute. nervana
Direct Convolution
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap
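The slicing + GEMM scheme above can be sketched in NumPy (a minimal illustration assuming CHWN input and CRSK filter layouts; the helper name and shapes are ours, not Nervana's kernel code). Each filter tap selects a strided slice of the padded input, and that slice feeds one GEMM over the channel dimension; with CHWN layout the innermost N dimension stays contiguous:

```python
import numpy as np

def direct_conv_chwn(x, f, pad=0, stride=1):
    """Direct convolution as in-place slicing + GEMM.

    x: input   in CHWN layout, shape (C, H, W, N)
    f: filters in CRSK layout, shape (C, R, S, K)
    Returns output in KPQN layout, shape (K, P, Q, N).
    """
    C, H, W, N = x.shape
    C2, R, S, K = f.shape
    assert C == C2
    # Output sizes, matching Q = ceil((W - S + 1 + 2*pad) / stride)
    P = -(-(H - R + 1 + 2 * pad) // stride)
    Q = -(-(W - S + 1 + 2 * pad) // stride)
    xp = np.zeros((C, H + 2 * pad, W + 2 * pad, N), x.dtype)
    xp[:, pad:pad + H, pad:pad + W, :] = x
    out = np.zeros((K, P, Q, N), x.dtype)
    # Loop over filter taps; each tap contributes one GEMM over a
    # strided slice of the input (contiguous in W and N).
    for r in range(R):
        for s in range(S):
            # (C, P, Q, N) view selected by this tap
            xs = xp[:, r:r + stride * P:stride, s:s + stride * Q:stride, :]
            # GEMM: (K, C) x (C, P*Q*N) -> (K, P*Q*N)
            out += (f[:, r, s, :].T @ xs.reshape(C, -1)).reshape(K, P, Q, N)
    return out
```

Note how the filter overlap is leveraged: successive taps reuse almost the same input slice, so the loads stay in cache across GEMMs.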
Small N direct convolution: Without Superblocking
fprop: Q = ⌈(W - S + 1 + 2*pad) / stride⌉
wi = sk + qj*stride - pad
Fig from V. Dumoulin, https://github.com/vdumoulin/conv_arithmetic
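The fprop output-size and index relations can be checked numerically (a toy sketch; the W, S, pad, stride values below are illustrative, not from the slides):

```python
import math

W, S, pad, stride = 7, 3, 1, 2          # illustrative values
Q = math.ceil((W - S + 1 + 2 * pad) / stride)
print(Q)                                 # number of output columns

# For each output column qj and filter tap sk, the input column it reads:
for qj in range(Q):
    cols = [sk + qj * stride - pad for sk in range(S)]
    # indices outside [0, W) fall in the zero padding
    print(qj, cols)
```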
Small N direct convolution: With Superblocking
fprop: Q = ⌈(W - S + 1 + 2*pad) / stride⌉
wi = sk + qj*stride - pad
Small N direct convolution: Bprop for deconv
bprop: pad’ = S - pad - 1
wi = ⌊(qj - pad’ + sk) / stride⌋
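A quick numeric check of the bprop index relation (values illustrative; note that with stride > 1 a tap contributes only when stride divides qj - pad’ + sk):

```python
S, pad, stride, W = 3, 1, 2, 7
pad_b = S - pad - 1                      # pad' = S - pad - 1
Q = -(-(W - S + 1 + 2 * pad) // stride)  # fprop output size

# bprop produces a gradient the size of the fprop input (W columns);
# each output column qj gathers from grad-output columns wi
for qj in range(W):
    for sk in range(S):
        t = qj - pad_b + sk
        if t % stride == 0:              # only aligned taps contribute
            wi = t // stride             # wi = floor((qj - pad' + sk) / stride)
            if 0 <= wi < Q:
                print(qj, sk, wi)
```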
Small N direct convolution: Dilated Filters
Dilated:
S’ = (S - 1) * rate + 1
Q = ⌈(W - S’ + 1 + 2*pad) / stride⌉
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun http://arxiv.org/abs/1511.07122v3
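The dilated-filter relations can be checked the same way (a toy sketch; parameter values are illustrative):

```python
import math

W, S, pad, stride, rate = 7, 3, 2, 1, 2    # illustrative values
S_eff = (S - 1) * rate + 1                  # effective filter size S'
Q = math.ceil((W - S_eff + 1 + 2 * pad) / stride)
print(S_eff, Q)

# Input column read by output column qj and tap sk (taps are rate apart):
for qj in range(Q):
    cols = [sk * rate + qj * stride - pad for sk in range(S)]
    print(qj, cols)
```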
Convolution with Algorithmic Speedups
• FFT and Winograd have the same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin
Winograd: input transform
Input Feature Map: 4x4 tiles, stride 2
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
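A minimal NumPy sketch of the F(2x2,3x3) input transform (the Bt matrix is the standard one from Lavin's Winograd formulation, which the talk references; the tiling helper is ours). The 2D transform is the nested product of the 1D transform, applied to overlapping 4x4 tiles extracted with stride 2:

```python
import numpy as np

# F(2x2,3x3) input transform; 2D transform is V = Bt @ d @ Bt.T
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def input_transform(x):
    """Extract overlapping 4x4 tiles (stride 2) and transform each."""
    H, W = x.shape
    tiles_h, tiles_w = (H - 2) // 2, (W - 2) // 2
    V = np.empty((tiles_h, tiles_w, 4, 4), x.dtype)
    for th in range(tiles_h):
        for tw in range(tiles_w):
            d = x[2 * th:2 * th + 4, 2 * tw:2 * tw + 4]
            V[th, tw] = Bt @ d @ Bt.T   # nested 1D transforms
    return V
```

Because Bt contains only 0 and ±1, a real kernel replaces these matrix products with a handful of adds and subtracts, which is the "simplified to remove zeros" point above.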
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
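The corresponding F(2x2,3x3) filter transform, sketched the same way (G is the standard matrix from Lavin's formulation; each 3x3 filter is transformed independently to a 4x4 tile):

```python
import numpy as np

# F(2x2,3x3) filter transform: U = G @ g @ G.T for each 3x3 filter g
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def filter_transform(f):
    """f: filters of shape (K, C, 3, 3) -> transformed (K, C, 4, 4)."""
    K, C, _, _ = f.shape
    U = np.empty((K, C, 4, 4), f.dtype)
    for k in range(K):
        for c in range(C):
            U[k, c] = G @ f[k, c] @ G.T   # per-feature-map transform
    return U
```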
Winograd: batched GEMM
• Point-wise multiplication
• Posed as a batched GEMM operation
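How the point-wise multiplies become batched GEMM: at each of the 4x4 = 16 transform points, accumulating over channels is an independent (tiles x C) by (C x K) matrix multiply. A sketch with einsum (shapes assume V from an input transform over T tiles and U from the filter transform):

```python
import numpy as np

def batched_gemm(V, U):
    """V: (T, C, 4, 4) transformed input tiles,
       U: (K, C, 4, 4) transformed filters
       -> M: (T, K, 4, 4), one GEMM per transform point (i, j):
          M[:, :, i, j] = V[:, :, i, j] @ U[:, :, i, j].T"""
    return np.einsum('tcij,kcij->tkij', V, U)
```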
Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain a 2x2 output tile
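Putting the four stages together for a single tile and channel (a sketch using the standard F(2x2,3x3) matrices; the 2x2 result should match a direct 3x3 correlation on the same 4x4 tile):

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    V = Bt @ d @ Bt.T          # input transform
    U = G @ g @ G.T            # filter transform
    M = U * V                  # point-wise multiply (the GEMM step,
                               # here for a single tile and channel)
    return At @ M @ At.T       # output transform back to pixel space
```

16 multiplies produce 4 outputs of a 3x3 correlation (36 multiplies direct), hence the 2.25x arithmetic reduction.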
Transforms for Increased Accuracy
Input transforms for 4x4:

Integer roots:
 4   0  -5   0   1   0
 0  -4  -4   1   1   0
 0   4  -4  -1   1   0
 0  -2  -1   2   1   0
 0   2  -1  -2   1   0
 0   4   0  -5   0   1

Fractional roots:
 0.87   0     -2.64   0     1   0
 0     -1.4   -2.25   0.62  1   0
 0      1.4   -2.25  -0.62  1   0
 0     -0.58  -0.39   1.5   1   0
 0      0.58  -0.39  -1.5   1   0
 0      0.87   0     -2.64  0   1
Precision
Percentage error from convolution, by multiplier bit width
[Chart: percentage error vs. bit width (3-16) for Direct, 2x2 Winograd, 4x4 Winograd (fractional roots), 4x4 Winograd (integer roots); data below]

Bits   Direct   2x2 Winograd   4x4 frac   4x4 int
 3     56.461     112.174       351.196    314.62
 4     23.533      46.222       274.28     432.959
 5     10.879      21.394       142.649    459.723
 6      5.245      10.34         68.062    446.271
 7      2.585       5.074        33.73     250.057
 8      1.286       2.516        16.667    123.585
 9      0.639       1.253         8.246     62.001
10      0.319       0.626         4.154     31.006
11      0.159       0.312         2.064     15.439
12      0.08        0.156         1.029      7.669
13      0.04        0.078         0.515      3.857
14      0.02        0.039         0.259      1.923
15      0.01        0.019         0.129      0.966
16      0.005       0.01          0.064      0.483
Multiplier Transistor Efficiency
Algo     bits   speedup   transistors   performance / transistor
Direct    8      1.0         3000              1.0
2x2       9      2.25        3750              1.8
4x4      12      4.0         6000              2.0
Transistor counts from Wikipedia.
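The last column of the table is the algorithmic speedup scaled by the relative multiplier cost, normalized to the direct 8-bit multiplier; a quick check of the arithmetic:

```python
# performance / transistor, normalized to the direct 8-bit multiplier
rows = {                      # algo: (speedup, transistors)
    'Direct': (1.0, 3000),
    '2x2':    (2.25, 3750),
    '4x4':    (4.0, 6000),
}
base = rows['Direct'][1]
for algo, (speedup, transistors) in rows.items():
    print(algo, speedup * base / transistors)
```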
Logarithmic quantization
D. Miyashita, E. H. Lee, B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation", http://arxiv.org/abs/1603.01025v2
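A minimal sketch of the idea from the cited paper: quantize values to signed powers of two so multiplies become shifts. This is an illustrative re-implementation under our own parameter names (fsr for the full-scale-range exponent, bits for the exponent code width), not the authors' code:

```python
import numpy as np

def log_quantize(x, fsr=0, bits=4):
    """Quantize |x| to the nearest power of two, keeping the sign.
    fsr: largest representable exponent; bits: width of the exponent code.
    (Illustrative sketch; parameter names are ours, not from the paper.)"""
    sign = np.sign(x)
    e = np.round(np.log2(np.maximum(np.abs(x), 1e-30)))
    e = np.clip(e, fsr - 2 ** bits, fsr)    # representable exponent range
    return sign * 2.0 ** e

x = np.array([0.3, -0.7, 1.9, 0.06])
print(log_quantize(x))
```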
Performance: VGG fp32 on GTX 1080
VGG - Totals:
[Chart: effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1); series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]
Peak Performance: VGG fp32 on GTX 1080
VGG - Layer 4.2:
[Chart: effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1); series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]