TRANSCRIPT
Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Power-Efficient Machine Learning using FPGAs on POWER Systems
Ralph Wittig, Distinguished Engineer
Office of the CTO, Xilinx
Join the Conversation #OpenPOWERSummit
Super Human
Humans: ~95%***
Top-5 Accuracy, Image Classification
Image-Net Large-Scale Visual Recognition Challenge (ILSVRC*)
* http://image-net.org/challenges/LSVRC/
** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
*** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf
Page 2
Super Human
Humans: ~95%***
Top-5 Accuracy, Image Classification
Image-Net Large-Scale Visual Recognition Challenge (ILSVRC*)
CNNs far outperform non-AI methods
CNNs deliver super-human accuracy
* http://image-net.org/challenges/LSVRC/
** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
*** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf
Page 3
CNNs Explained
Page 4
The Computation
Page 5
The Computation
Page 6
Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights
Page 7
Convolution
[Figure: Input 13×13×384, Kernel Weights 3×3×384 (×256 weight sets), Output 13×13×256]
Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights
Page 8
Convolution
Continue along the row ...
Page 9
Convolution
Before moving down to the next row
Page 10
Convolution
The first output feature map is complete
Page 11
Convolution
Move on to the next output feature map by switching weights, and repeat
Page 12
Convolution
Pattern repeats as before: same input sub-volumes, different weights
Page 13
Convolution
Complete the second output feature map plane
Page 14
Convolution
Finally, after all 256 weight sets have been used, the full 13×13×256 output volume is complete
Page 15
Convolution
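The sliding-window computation walked through on the preceding slides can be written as a naive loop nest. A minimal NumPy sketch (illustrative only, not the hardware mapping; shapes taken from the slides, unit stride assumed, input pre-padded to 15×15):

```python
import numpy as np

def conv_layer(inp, weights):
    """Naive convolution: inp is (H, W, C_in), weights is (K, 3, 3, C_in).
    Each output pixel of each of the K feature maps is the dot product of a
    3x3xC_in input sub-volume with one 3x3xC_in weight set."""
    H, W, C = inp.shape
    K, kh, kw, _ = weights.shape
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for k in range(K):                      # switch weight sets per output map
        for y in range(out.shape[0]):       # move down to the next row
            for x in range(out.shape[1]):   # continue along the row
                sub = inp[y:y+kh, x:x+kw, :]        # overlapping sub-volume
                out[y, x, k] = np.sum(sub * weights[k])
    return out

inp = np.random.rand(15, 15, 384)           # padded so the output is 13x13
w = np.random.rand(256, 3, 3, 384)
print(conv_layer(inp, w).shape)             # (13, 13, 256)
```

Note how consecutive output pixels reuse overlapping input sub-volumes and the same weight set, which is exactly the data reuse the FPGA convolver exploits later in the talk.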
Fully Connected Layers
Page 16
Fully Connected Layers
[Diagram: fully connected layers fc6 → fc7 → fc8, with activations a0,i (fc6), a1,0 … a1,4095 (fc7), and a2,0 … a2,999 (fc8), connected by weights w0,i,j and w1,i,j]
a1,j = f( Σi a0,i · w0,i,j )
a2,j = f( Σi a1,i · w1,i,j )
Page 17
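The sums on the slide are just matrix–vector products. A minimal sketch (NumPy; the 4096 and 1000 layer widths come from the slide's indices, and ReLU stands in for the unspecified non-linearity f):

```python
import numpy as np

def fc_layer(a, W, f=lambda x: np.maximum(x, 0)):
    """a_out[j] = f(sum_i a[i] * W[i, j]) -- one weight per (input, output) pair."""
    return f(a @ W)

a1 = np.random.rand(4096)                 # fc7 activations a1,0 .. a1,4095
W1 = np.random.rand(4096, 1000) * 0.01    # weights w1,i,j
a2 = fc_layer(a1, W1)                     # fc8 activations a2,0 .. a2,999
print(a2.shape)                           # (1000,)
```

Unlike convolution, each weight here is used exactly once per input image, which is why the next slide shows FC layers dominating memory bandwidth rather than compute.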
CNN Properties
Compute: dominated by convolution (CONV) layers
Memory BW: dominated by fully-connected (FC) layers
[Charts: compute GOPs per layer and memory-access G reads per layer, for CaffeNet, ZF, VGG11, VGG16, and VGG19 (CONV1–CONV5, FC6–FC8)]
Source: Yu Wang, Tsinghua University, Feb 2016
Page 18
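The per-layer numbers behind such charts follow directly from the layer shapes. A small sketch (each multiply-accumulate counted as two ops; the functions and names are illustrative, not from the talk):

```python
def conv_gops(h_out, w_out, c_out, kh, kw, c_in):
    """Ops for one conv layer: every output pixel of every output map
    needs a kh x kw x c_in multiply-accumulate volume."""
    return 2 * h_out * w_out * c_out * kh * kw * c_in / 1e9

def fc_gops(n_in, n_out):
    """Ops for one fully-connected layer: one MAC per weight."""
    return 2 * n_in * n_out / 1e9

# The 13x13x384 -> 13x13x256 layer from the convolution slides:
print(conv_gops(13, 13, 256, 3, 3, 384))   # ~0.3 GOPs from only ~0.88M weights
# An fc6-style 4096 -> 4096 layer: far fewer ops, but 16.8M weight reads:
print(fc_gops(4096, 4096))                 # ~0.034 GOPs
```

This is the asymmetry on the slide: CONV layers reuse each weight across every output pixel, FC layers touch each weight only once, so CONV dominates compute while FC dominates memory reads.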
Humans vs Machines
Humans are six orders of magnitude more efficient
* IBM Watson, ca 2012
Source: Yu Wang, Tsinghua University, Feb 2016
Page 19
Cost of Computation
Source: William Dally, “High Performance Hardware for Machine Learning”, Cadence ENN Summit, 2/9/2016
Page 20
Cost of Computation
Stay in on-chip memory (1/100× power)
Use smaller multipliers (8-bit vs 32-bit: 1/16× power)
Fixed-point vs float (don’t waste bits on dynamic range)
Source: William Dally, “High Performance Hardware for Machine Learning”, Cadence ENN Summit, 2/9/2016
Page 21
Improving Machine Efficiency
Model Pruning
Right-Sizing Precision
Custom CNN Processor Architecture
Page 22
Pruning Elements, Retrain to Recover Accuracy
Train Connectivity
Prune Connections
Train Weights
[Chart: accuracy loss (0.5% to −4.5%) vs parameters pruned away (40%–100%), comparing L1/L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain]
Remove Low-Contribution Weights (Synapses)
Retrain Remaining Weights
Source: Han, et al, “Learning both Weights and Connections for Efficient Neural Networks”, NIPS 2015, http://arxiv.org/pdf/1506.02626v3.pdf
Page 23
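The prune step removes the lowest-magnitude weights, then retraining updates only the surviving connections. A minimal NumPy sketch of that mask-and-retrain idea (the prune-by-fraction threshold is an illustrative simplification; Han et al. actually set the threshold per layer from the weight distribution):

```python
import numpy as np

def prune_by_magnitude(W, frac):
    """Zero out the lowest-magnitude `frac` of weights; return pruned W and mask."""
    thresh = np.quantile(np.abs(W), frac)
    mask = np.abs(W) > thresh
    return W * mask, mask

W = np.random.randn(4096, 4096)
W_pruned, mask = prune_by_magnitude(W, 0.9)   # prune ~90%, typical for FC layers

# During retraining, gradients are masked so pruned connections stay at zero:
grad = np.random.randn(*W.shape)
W_pruned -= 0.01 * grad * mask
print(mask.mean())   # fraction of weights kept, ~0.1
```

Iterating this prune-and-retrain loop is what recovers the accuracy shown in the chart above.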
Pruning Results: AlexNet
Source: Han, et al, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, http://arxiv.org/pdf/1510.00149.pdf
9x Reduction In #Weights
Most Reduction In FC Layers
Page 24
Pruning Results: AlexNet
< 0.1% Accuracy Loss
Source: Han, et al, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, http://arxiv.org/pdf/1510.00149.pdf
Page 25
Inference with Integer Quantization
Page 26
Right-Sizing Precision
Dynamic: Variable Format Fixed-Point (Per Layer)
< 1% Accuracy Loss
Network: VGG16
Data bits:        Single-float | 16    | 16    | 8         | 8
Weight bits:      Single-float | 16    | 8     | 8         | 8 or 4
Data precision:   N/A          | 2^-2  | 2^-2  | 2^-5/2^-1 | Dynamic
Weight precision: N/A          | 2^-15 | 2^-7  | 2^-7      | Dynamic
Top-1 accuracy:   68.1%        | 68.0% | 53.0% | 28.2%     | 67.0%
Top-5 accuracy:   88.0%        | 87.9% | 76.6% | 49.7%     | 87.6%
Source: Yu Wang, Tsinghua University, Feb 2016
Page 27
Right-Sizing Precision
Fixed-Point Sufficient For Deployment (INT16, INT8)
No Significant Loss in Accuracy (< 1%)
>10x Energy Efficiency OPs/J (INT8 vs FP32)
4x Memory Energy Efficiency Tx/J (INT8 vs FP32)
Page 28
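Per-layer dynamic fixed-point amounts to choosing a power-of-two scale per layer and rounding values onto an 8- or 16-bit integer grid. A sketch (NumPy; choosing the scale to cover the layer's maximum magnitude is an assumption for illustration, as the slides don't specify the selection rule):

```python
import numpy as np

def quantize_dynamic(x, bits=8):
    """Quantize to signed fixed-point with a per-layer power-of-two scale.
    Returns integer codes, the fractional-bit count n (format Qm.n),
    and the dequantized values."""
    max_abs = np.max(np.abs(x))
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))))  # m integer bits
    n = bits - 1 - int_bits                                    # fractional bits
    q = np.clip(np.round(x * 2.0**n), -(2**(bits-1)), 2**(bits-1) - 1)
    return q.astype(np.int32), n, q / 2.0**n

x = np.random.randn(1000) * 0.3            # stand-in for one layer's weights
codes, n, x_hat = quantize_dynamic(x, bits=8)
print(n, np.max(np.abs(x - x_hat)))        # error bounded by the step size
```

Because each layer gets its own n, small-magnitude layers keep fine resolution while large-magnitude layers keep range, which is how the "Dynamic" column above recovers near-float accuracy at 8 bits.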
Improving Machine Efficiency
CNN Model
Pruned
Floating-Point Model
Pruned
Fixed-Point Model
Instructions
FPGA Based
Neural Network
Processor
Model pruning
Data/weight quantization
Compilation
Run
Modified from: Yu Wang, Tsinghua University, Feb 2016
Page 29
Xilinx Kintex® UltraScale™ KU115 (20nm)
5520 DSP Cores, up to 500 MHz
5.5 TOPs INT16 (peak)
4 GB DDR4-2400 & 38 GB/s
55 W TDP & 100 GOPs/W
Single Slot, Low-Profile Form Factor
OpenPOWER CAPI AlphaData ADM-PCIE-8K5
Page 30
FPGA Architecture
[Diagram: 2D array of CLB, DSP, and RAM blocks]
2D Array Architecture (scales with Moore’s Law)
Memory Proximate Computing (Minimize Data Moves)
Broadcast Capable Interconnect (Data Sharing/Reuse)
Page 31
FPGA Arithmetic & Memory Resources
[Diagram: weight Wij and data Dj feed a 16-bit multiplier into a 48-bit accumulator producing Oi; custom quantization (Qm.n, e.g. Q8.8, Q2.14) and custom-width memory (INT4, INT8, INT16, INT32, FP16, FP32)]
Native 16-bit multiplier (or reduced-power 8-bit)
On-chip RAMs store INT4, INT8, INT16, …
Custom Quantization Formatting (Qm.n)
Page 32
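Qm.n fixed-point maps cleanly onto this datapath: a Q8.8 × Q8.8 product is exact in Q16.16, and many such products can be summed in 48 bits without overflow. A toy Python-integer sketch of the idea (illustrative, not the actual DSP slice behavior):

```python
def to_q(x, n):
    """Encode a real number as a fixed-point integer with n fractional bits (Qm.n)."""
    return int(round(x * (1 << n)))

def from_q(q, n):
    """Decode a fixed-point integer with n fractional bits back to a float."""
    return q / (1 << n)

# Q8.8 weights times Q8.8 data: each product is exact in Q16.16,
# and the running sum fits easily in a 48-bit accumulator.
weights = [to_q(w, 8) for w in (0.5, -1.25, 3.0)]
data    = [to_q(d, 8) for d in (2.0, 4.0, -0.5)]
acc = 0
for w, d in zip(weights, data):
    acc += w * d              # accumulate in Q16.16
print(from_q(acc, 16))        # 0.5*2.0 + (-1.25)*4.0 + 3.0*(-0.5) = -5.5
```

The wide accumulator is what lets the hardware defer rounding until after the whole dot product, so only one quantization step is paid per output.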
Convolver Unit
[Diagram: data buffer and weight buffer feed, via MUXes and serial-to-parallel delay lines (n and m delays), 9 data inputs and 9 weight inputs into a 3×3 array of multipliers followed by an adder tree]
Source: Yu Wang, Tsinghua University, Feb 2016
Page 33
Convolver Unit
[Diagram repeats, annotated:]
Memory-Proximate Compute, 2D Parallel Memory, 2D Operator Array
INT16
Serial to Parallel, Ping/Pong
Serial to Parallel, Data Reuse: 8/9
Source: Yu Wang, Tsinghua University, Feb 2016
Page 34
Processing Engine (PE)
[Diagram: Processing Engine — input buffer (data, bias, weights) feeds a convolver complex; an adder tree with bias shift and data shift combines intermediate data; NL (non-linearity) and Pool stages write to the output buffer; a controller sequences the PE]
Source: Yu Wang, Tsinghua University, Feb 2016
Page 35
Processing Engine (PE)
[Diagram repeats, annotated:]
Memory Sharing, Broadcast Weights
Custom Quantization
Source: Yu Wang, Tsinghua University, Feb 2016
Page 36
Top Level
[Diagram: top level — the Power CPU and external memory (processing system) connect through DMA with compression to the programmable logic: a data & instruction bus feeds input/output buffers and a computing complex of multiple PEs, with FIFO, controller, and config bus]
Source: Yu Wang, Tsinghua University, Feb 2016
Page 37
Top Level
[Diagram repeats, annotated:]
SW-Scheduled Dataflow
Decompress Weights on the Fly
Multiple PEs: Block-Level Parallelism
Ping/Pong Buffers: Transfers Overlap with Compute
Source: Yu Wang, Tsinghua University, Feb 2016
Page 38
FPGA Neural Net Processor
Tiled Architecture (Parallelism & Scaling)
Semi Static Dataflow (Pre-scheduled Data Transfers)
Memory Reuse (Data Sharing across Convolvers)
Page 39
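The ping/pong buffering and pre-scheduled transfers above can be sketched in software terms (a toy Python sketch, not the hardware implementation; `load` stands in for the DMA transfer and `compute` for the PE):

```python
import threading, queue

def run_pipeline(tiles, load, compute):
    """Ping/pong buffering: while the PE computes on one buffer,
    DMA fills the other, so transfers overlap with compute."""
    filled = queue.Queue(maxsize=2)     # two buffers: ping and pong

    def dma():
        for t in tiles:
            filled.put(load(t))         # blocks when both buffers are full
        filled.put(None)                # end-of-stream marker

    threading.Thread(target=dma, daemon=True).start()
    results = []
    while (buf := filled.get()) is not None:
        results.append(compute(buf))    # overlaps with the next load()
    return results

out = run_pipeline(range(4), load=lambda t: t * 10, compute=lambda b: b + 1)
print(out)   # [1, 11, 21, 31]
```

The bounded two-slot queue is the software analogue of the two hardware buffers: the producer can run at most one tile ahead, exactly the overlap the slide describes.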
OpenPOWER CAPI
Shared Virtual Memory
System-Wide Memory Coherency
Low Latency Control Messages
[Diagram: POWER8 CAP unit ↔ CAPI PSL]
Peer Programming Model and Interaction Efficiency
Page 40
OpenPOWER CAPI
[Diagram: POWER8 CAP unit ↔ CAPI PSL]
Power:
• Caffe, TensorFlow, etc.
• Load CNN Model
• Call AuvizDNN Library
Xilinx FPGA:
• AuvizDNN Kernel
• Scalable & Fully Parameterized
• Plug-and-Play Library
Page 41
OpenPOWER CAPI
[Diagram: POWER8 CAP unit ↔ CAPI PSL]
14 Images/s/W (AlexNet)
Batch Size 1
Low Profile TDP
Page 42
Takeaways
FPGA: Ideal Dataflow CNN Processor
POWER/CAPI: Elevates Accelerators As Peers to CPUs
FPGA CNN Libraries
Page 43
4/11/2016
Page 44
Thank You!