TRANSCRIPT
An Analysis of Convolution for Inference
24 June 2016
Scott Gray, Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute. nervana
Direct Convolution
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap
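The slicing + GEMM scheme above can be sketched in NumPy (a minimal illustration assuming CHWN input and CRSK filter layouts; the helper name and shapes are ours, not Nervana's kernel code). Each filter tap selects a strided slice of the padded input, and that slice feeds one GEMM over the channel dimension; with CHWN layout the innermost N dimension stays contiguous:

```python
import numpy as np

def direct_conv_chwn(x, f, pad=0, stride=1):
    """Direct convolution as in-place slicing + GEMM.

    x: input   in CHWN layout, shape (C, H, W, N)
    f: filters in CRSK layout, shape (C, R, S, K)
    Returns output in KPQN layout, shape (K, P, Q, N).
    """
    C, H, W, N = x.shape
    C2, R, S, K = f.shape
    assert C == C2
    # Output sizes, matching Q = ceil((W - S + 1 + 2*pad) / stride)
    P = -(-(H - R + 1 + 2 * pad) // stride)
    Q = -(-(W - S + 1 + 2 * pad) // stride)
    xp = np.zeros((C, H + 2 * pad, W + 2 * pad, N), x.dtype)
    xp[:, pad:pad + H, pad:pad + W, :] = x
    out = np.zeros((K, P, Q, N), x.dtype)
    # Loop over filter taps; each tap contributes one GEMM over a
    # strided slice of the input (contiguous in W and N).
    for r in range(R):
        for s in range(S):
            # (C, P, Q, N) view selected by this tap
            xs = xp[:, r:r + stride * P:stride, s:s + stride * Q:stride, :]
            # GEMM: (K, C) x (C, P*Q*N) -> (K, P*Q*N)
            out += (f[:, r, s, :].T @ xs.reshape(C, -1)).reshape(K, P, Q, N)
    return out
```

Note how the filter overlap is leveraged: successive taps reuse almost the same input slice, so the loads stay in cache across GEMMs.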
Small N direct convolution: Without Superblocking
fprop: Q = ⌈(W - S + 1 + 2*pad) / stride⌉
wi = sk + qj*stride - pad
Fig from V. Dumoulin, https://github.com/vdumoulin/conv_arithmetic
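The fprop output-size and index relations can be checked numerically (a toy sketch; the W, S, pad, stride values below are illustrative, not from the slides):

```python
import math

W, S, pad, stride = 7, 3, 1, 2          # illustrative values
Q = math.ceil((W - S + 1 + 2 * pad) / stride)
print(Q)                                 # number of output columns

# For each output column qj and filter tap sk, the input column it reads:
for qj in range(Q):
    cols = [sk + qj * stride - pad for sk in range(S)]
    # indices outside [0, W) fall in the zero padding
    print(qj, cols)
```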
Small N direct convolution: With Superblocking
fprop: Q = ⌈(W - S + 1 + 2*pad) / stride⌉
wi = sk + qj*stride - pad
Small N direct convolution: Bprop for deconv
bprop: pad’ = S - pad - 1
wi = ⌊(qj - pad’ + sk) / stride⌋
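A quick numeric check of the bprop index relation (values illustrative; note that with stride > 1 a tap contributes only when stride divides qj - pad’ + sk):

```python
S, pad, stride, W = 3, 1, 2, 7
pad_b = S - pad - 1                      # pad' = S - pad - 1
Q = -(-(W - S + 1 + 2 * pad) // stride)  # fprop output size

# bprop produces a gradient the size of the fprop input (W columns);
# each output column qj gathers from grad-output columns wi
for qj in range(W):
    for sk in range(S):
        t = qj - pad_b + sk
        if t % stride == 0:              # only aligned taps contribute
            wi = t // stride             # wi = floor((qj - pad' + sk) / stride)
            if 0 <= wi < Q:
                print(qj, sk, wi)
```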
Small N direct convolution: Dilated Filters
Dilated:
S’ = (S - 1) * rate + 1
Q = ⌈(W - S’ + 1 + 2*pad) / stride⌉
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun http://arxiv.org/abs/1511.07122v3
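The dilated-filter relations can be checked the same way (a toy sketch; parameter values are illustrative):

```python
import math

W, S, pad, stride, rate = 7, 3, 2, 1, 2    # illustrative values
S_eff = (S - 1) * rate + 1                  # effective filter size S'
Q = math.ceil((W - S_eff + 1 + 2 * pad) / stride)
print(S_eff, Q)

# Input column read by output column qj and tap sk (taps are rate apart):
for qj in range(Q):
    cols = [sk * rate + qj * stride - pad for sk in range(S)]
    print(qj, cols)
```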
Convolution with Algorithmic Speedups
• FFT and Winograd have the same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin
Winograd: input transform
Input Feature Map: 4x4 tiles, stride 2
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
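A minimal NumPy sketch of the F(2x2,3x3) input transform (the Bt matrix is the standard one from Lavin's Winograd formulation, which the talk references; the tiling helper is ours). The 2D transform is the nested product of the 1D transform, applied to overlapping 4x4 tiles extracted with stride 2:

```python
import numpy as np

# F(2x2,3x3) input transform; 2D transform is V = Bt @ d @ Bt.T
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def input_transform(x):
    """Extract overlapping 4x4 tiles (stride 2) and transform each."""
    H, W = x.shape
    tiles_h, tiles_w = (H - 2) // 2, (W - 2) // 2
    V = np.empty((tiles_h, tiles_w, 4, 4), x.dtype)
    for th in range(tiles_h):
        for tw in range(tiles_w):
            d = x[2 * th:2 * th + 4, 2 * tw:2 * tw + 4]
            V[th, tw] = Bt @ d @ Bt.T   # nested 1D transforms
    return V
```

Because Bt contains only 0 and ±1, a real kernel replaces these matrix products with a handful of adds and subtracts, which is the "simplified to remove zeros" point above.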
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
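The corresponding F(2x2,3x3) filter transform, sketched the same way (G is the standard matrix from Lavin's formulation; each 3x3 filter is transformed independently to a 4x4 tile):

```python
import numpy as np

# F(2x2,3x3) filter transform: U = G @ g @ G.T for each 3x3 filter g
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def filter_transform(f):
    """f: filters of shape (K, C, 3, 3) -> transformed (K, C, 4, 4)."""
    K, C, _, _ = f.shape
    U = np.empty((K, C, 4, 4), f.dtype)
    for k in range(K):
        for c in range(C):
            U[k, c] = G @ f[k, c] @ G.T   # per-feature-map transform
    return U
```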
Winograd: batched GEMM
• Point-wise multiplication
• Posed as a batched GEMM operation
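How the point-wise multiplies become batched GEMM: at each of the 4x4 = 16 transform points, accumulating over channels is an independent (tiles x C) by (C x K) matrix multiply. A sketch with einsum (shapes assume V from an input transform over T tiles and U from the filter transform):

```python
import numpy as np

def batched_gemm(V, U):
    """V: (T, C, 4, 4) transformed input tiles,
       U: (K, C, 4, 4) transformed filters
       -> M: (T, K, 4, 4), one GEMM per transform point (i, j):
          M[:, :, i, j] = V[:, :, i, j] @ U[:, :, i, j].T"""
    return np.einsum('tcij,kcij->tkij', V, U)
```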
Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain a 2x2 output tile
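Putting the four stages together for a single tile and channel (a sketch using the standard F(2x2,3x3) matrices; the 2x2 result should match a direct 3x3 correlation on the same 4x4 tile):

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    V = Bt @ d @ Bt.T          # input transform
    U = G @ g @ G.T            # filter transform
    M = U * V                  # point-wise multiply (the GEMM step,
                               # here for a single tile and channel)
    return At @ M @ At.T       # output transform back to pixel space
```

16 multiplies produce 4 outputs of a 3x3 correlation (36 multiplies direct), hence the 2.25x arithmetic reduction.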
Transforms for Increased Accuracy
Input transforms for 4x4:

Integer roots:
 4   0  -5   0   1   0
 0  -4  -4   1   1   0
 0   4  -4  -1   1   0
 0  -2  -1   2   1   0
 0   2  -1  -2   1   0
 0   4   0  -5   0   1

Fractional roots:
 0.87   0     -2.64   0     1   0
 0     -1.4   -2.25   0.62  1   0
 0      1.4   -2.25  -0.62  1   0
 0     -0.58  -0.39   1.5   1   0
 0      0.58  -0.39  -1.5   1   0
 0      0.87   0     -2.64  0   1
Precision
Percentage error from convolution, by multiplier bit width
[Chart: percentage error vs. bit width (3-16) for Direct, 2x2 Winograd, 4x4 Winograd (fractional roots), 4x4 Winograd (integer roots); data below]

Bits   Direct   2x2 Winograd   4x4 frac   4x4 int
 3     56.461     112.174       351.196    314.62
 4     23.533      46.222       274.28     432.959
 5     10.879      21.394       142.649    459.723
 6      5.245      10.34         68.062    446.271
 7      2.585       5.074        33.73     250.057
 8      1.286       2.516        16.667    123.585
 9      0.639       1.253         8.246     62.001
10      0.319       0.626         4.154     31.006
11      0.159       0.312         2.064     15.439
12      0.08        0.156         1.029      7.669
13      0.04        0.078         0.515      3.857
14      0.02        0.039         0.259      1.923
15      0.01        0.019         0.129      0.966
16      0.005       0.01          0.064      0.483
Multiplier Transistor Efficiency
Algo     bits   speedup   transistors   performance / transistor
Direct    8      1.0         3000              1.0
2x2       9      2.25        3750              1.8
4x4      12      4.0         6000              2.0
Transistor counts from Wikipedia.
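The last column of the table is the algorithmic speedup scaled by the relative multiplier cost, normalized to the direct 8-bit multiplier; a quick check of the arithmetic:

```python
# performance / transistor, normalized to the direct 8-bit multiplier
rows = {                      # algo: (speedup, transistors)
    'Direct': (1.0, 3000),
    '2x2':    (2.25, 3750),
    '4x4':    (4.0, 6000),
}
base = rows['Direct'][1]
for algo, (speedup, transistors) in rows.items():
    print(algo, speedup * base / transistors)
```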
Logarithmic quantization
D. Miyashita, E. H. Lee, B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation", http://arxiv.org/abs/1603.01025v2
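A minimal sketch of the idea from the cited paper: quantize values to signed powers of two so multiplies become shifts. This is an illustrative re-implementation under our own parameter names (fsr for the full-scale-range exponent, bits for the exponent code width), not the authors' code:

```python
import numpy as np

def log_quantize(x, fsr=0, bits=4):
    """Quantize |x| to the nearest power of two, keeping the sign.
    fsr: largest representable exponent; bits: width of the exponent code.
    (Illustrative sketch; parameter names are ours, not from the paper.)"""
    sign = np.sign(x)
    e = np.round(np.log2(np.maximum(np.abs(x), 1e-30)))
    e = np.clip(e, fsr - 2 ** bits, fsr)    # representable exponent range
    return sign * 2.0 ** e

x = np.array([0.3, -0.7, 1.9, 0.06])
print(log_quantize(x))
```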
Performance: VGG fp32 on GTX 1080
VGG - Totals:
[Chart: effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1); series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]
Peak Performance: VGG fp32 on GTX 1080
VGG - Layer 4.2:
[Chart: effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1); series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]