TRANSCRIPT
1
Optimizing FPGA-based Accelerator Design for
Deep Convolutional Neural Network
Chen Zhang (1), Peng Li (3), Guangyu Sun (1,2), Yijin Guan (1), Bingjun Xiao (3), Jason Cong (1,2,3)
(1) Peking University; (2) PKU/UCLA Joint Research Institution; (3) University of California, Los Angeles
2
Convolutional Neural Network
[Images: automotive safety; image descriptions @ Stanford]
• Image classification & object detection
• State-of-the-art algorithm
  • Highest accuracy in the Large Scale Visual Recognition Challenge, 2012, 2013, 2014 & 2015
• Widely used in industry & academia
3
Convolutional Neural Network
[Figure: CNN pipeline — input image → convolution layer → max-pooling layer → convolution layer → max-pooling layer → classifier → output category; feature maps flow between layers; max-pooling is optional. Each convolution layer maps input feature maps to output feature maps.]
4
Convolutional Neural Network
[Same CNN pipeline figure as the previous slide]
Convolutional layers account for over 90% of the computation.
[1] A. Krizhevsky, et al. ImageNet classification with deep convolutional neural networks. NIPS 2012.
[2] J. Cong and B. Xiao. Minimizing computation in convolutional neural networks. ICANN 2014.
5
Related Work
1. CNP: An FPGA-based processor for convolutional networks. FPL 2009;
2. A massively parallel coprocessor for convolutional neural networks. ASAP 2009;
3. A programmable parallel accelerator for learning and classification. PACT 2010;
4. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. ISCA 2010;
5. Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short
paper) FCCM 2011;
6. A massively parallel digital learning processor. NIPS 2008;
7. NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision. CVPRW 2011;
8. Accelerating deep neural networks on mobile processor with embedded programmable logic.
(Poster) NIPS 2013;
9. Memory-Centric Accelerator Design for Convolutional Neural Networks. ICCD 2013.
6
Hardware Computing on FPGA
[Block diagram: processing elements PE-1 … PE-n connected through an interconnect to on-chip buffers (buffer1, buffer2); on-chip memory exchanges data with external memory over the off-chip bus]

• Challenge: BRAM limits → Solution: loop tiling
• Challenge: resource utilization → Solution: unroll & pipeline
• Challenge: bandwidth limits → Solution: data reuse
• Challenge: co-optimization, matching computation & communication → Solution: roofline model (main contribution)
7
FPGA design in Roofline Model
[Same block diagram, annotated with the three roofline quantities:]
• Computational performance — GFLOP/s
• Computation-to-communication ratio — FLOP / (memory byte access)
• Bandwidth — GB/s
8
FPGA design in Roofline Model
[Roofline plot: computational performance (GFLOP/s) on the y-axis vs. computation-to-communication (CTC) ratio (FLOP per memory byte access) on the x-axis; a design point sits below the bandwidth line]
9
Attainable Performance

[Roofline plot: attainable performance (GFLOP/s) vs. CTC ratio (FLOP per DRAM byte access); the bandwidth roof and computational roof bound design points 1, 2, 3 and points A, B. Designs 2 & 3 have higher attainable performance than design 1, though lower computational performance; side bars contrast computation time vs. communication time for each case]
10
Attainable Performance

[Same roofline plot: designs 2 & 3 attain higher performance than design 1 despite their lower computational performance]
11
Attainable Performance

[Same roofline plot: attainable performance (GFLOPS) vs. CTC ratio (FLOP per DRAM byte access), with computational roof and bandwidth roof]

Optimization target: find a design with maximized attainable performance.
12
Loop tiling

Original convolutional layer (R, C, M, N, K, S are all configuration parameters of the layer):

for (row = 0; row < R; row++) {
  for (col = 0; col < C; col++) {
    for (to = 0; to < M; to++) {
      for (ti = 0; ti < N; ti++) {
        for (i = 0; i < K; i++) {
          for (j = 0; j < K; j++) {
            output_fm[to][row][col] +=
                weights[to][ti][i][j] * input_fm[ti][S*row+i][S*col+j];
}}}}}}

[Figure: a K×K weight kernel w_ij^l applied between the input feature map and the output feature map]

After loop tiling:

for (row = 0; row < R; row += Tr) {                        /* tile loop */
  for (col = 0; col < C; col += Tc) {                      /* tile loop */
    for (to = 0; to < M; to += Tm) {                       /* tile loop */
      for (ti = 0; ti < N; ti += Tn) {                     /* tile loop */
        for (trr = row; trr < min(row+Tr, R); trr++) {     /* point loop */
          for (tcc = col; tcc < min(col+Tc, C); tcc++) {   /* point loop */
            for (too = to; too < min(to+Tm, M); too++) {   /* point loop */
              for (tii = ti; tii < min(ti+Tn, N); tii++) { /* point loop */
                for (i = 0; i < K; i++) {                  /* point loop */
                  for (j = 0; j < K; j++) {                /* point loop */
                    output_fm[too][trr][tcc] +=
                        weights[too][tii][i][j] * input_fm[tii][S*trr+i][S*tcc+j];
}}}}}}}}}}

Tile loops govern off-chip data transfer (memory access optimization); point loops govern on-chip data (computation optimization).
14
Computation Optimization

Point loops after tiling:

for (trr = row; trr < min(row+Tr, R); trr++)
  for (tcc = col; tcc < min(col+Tc, C); tcc++)
    for (too = to; too < min(to+Tm, M); too++)
      for (tii = ti; tii < min(ti+Tn, N); tii++)
        for (i = 0; i < K; i++)
          for (j = 0; j < K; j++)
            output_fm[too][trr][tcc] +=
                weights[too][tii][i][j] * input_fm[tii][S*trr+i][S*tcc+j];

After loop reordering, with the tcc loop pipelined and the too and tii loops fully unrolled into Tm × Tn parallel compute units:

for (i = 0; i < K; i++) {
  for (j = 0; j < K; j++) {
    for (trr = row; trr < min(row+Tr, R); trr++) {
      for (tcc = col; tcc < min(col+Tc, C); tcc++) {
#pragma HLS pipeline
        for (too = to; too < min(to+Tm, M); too++) {
#pragma HLS UNROLL
          for (tii = ti; tii < min(ti+Tn, N); tii++) {
#pragma HLS UNROLL
            output_fm[too][trr][tcc] +=
                weights[too][tii][i][j] * input_fm[tii][S*trr+i][S*tcc+j];
}}}}}}
16
Computational Performance
            Design 1   Design 2   Design 3   Design 4
Output (Tm)     5         10         20         30
Input  (Tn)     5         10         20         15
DSPs          125        500       2000       2250
17
Computational Performance

[Bar chart: computational performance of designs 1–4 from the table above]

• We must also consider memory access. How?
18
Memory Access Optimization
The point-loop computation is abstracted as a statement S; the question is where to place the load and store of the output feature map among the tile loops:

for (row = 0; row < R; row += Tr) {
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to += Tm) {
      for (ti = 0; ti < N; ti += Tn) {
        /* load output feature map? */
        S: foo(output_fm(to, row, col));
        /* store output feature map? */
      }
}}}

where S expands to the point loops:

for (trr = row; trr < min(row+Tr, R); trr++)
  for (tcc = col; tcc < min(col+Tc, C); tcc++)
    for (too = to; too < min(to+Tm, M); too++)
      for (tii = ti; tii < min(ti+Tn, N); tii++)
        for (i = 0; i < K; i++)
          for (j = 0; j < K; j++)
            output_fm[too][trr][tcc] +=
                weights[too][tii][i][j] * input_fm[tii][S*trr+i][S*tcc+j];
19
Memory Access Optimization
Two legal placements of the output-feature-map transfers: inside the ti loop, the output tile moves on- and off-chip once per ti step; hoisted outside the ti loop, it is loaded once and stored once per (row, col, to) tile, with the partial sums reused across all ti steps.

Variant 1 — transfers inside the ti loop:

for (row = 0; row < R; row += Tr) {
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to += Tm) {
      for (ti = 0; ti < N; ti += Tn) {
        /* load output feature map */
        S: foo(output_fm(to, row, col));
        /* store output feature map */
      }
}}}

Variant 2 — transfers hoisted out of the ti loop:

for (row = 0; row < R; row += Tr) {
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to += Tm) {
      /* load output feature map */
      for (ti = 0; ti < N; ti += Tn) {
        S: foo(output_fm(to, row, col));
      }
      /* store output feature map */
}}}
20
Computation To Communication Ratio
[Plot: computational performance (GFLOP/s) vs. computation-to-communication (CTC) ratio (FLOP per DRAM byte access) for designs 1–4 (Output Tm = 5/10/20/30, Input Tn = 5/10/20/15)]
21
Design Space
Legal solutions of Tm & Tn: 2097
Memory access methods: 3
Total legal solutions: 2097 × 3 = 6291
(for layers with N = 192 or N = 128, and # of PEs = 450)
22
Design Space Exploration

[Roofline plot: attainable performance (GFLOPS, 0–120) vs. CTC ratio (FLOP per DRAM byte access, 0–60); computational roof, bandwidth roof, and design points A, A', C, D]
23
System Overview
[Block diagram: the accelerator with two data transfer engines and FIFOs on-chip, connected over AXI4 to a memory interface controller and over AXI4-Lite to a MicroBlaze, interrupt controller, timer, and UART; off-chip DDR3 main memory through the DDR3 interface]

Tools: Vivado 2013.4, Vivado HLS 2013.4; board: VC707

          Estimated   On-board run   Difference
Layer 1   7.32 ms     7.67 ms        ~5%
Layer 2   5.11 ms     5.35 ms        ~5%
Layer 3   3.42 ms     3.79 ms        ~9%
Layer 4   2.59 ms     2.88 ms        ~10%
Layer 5   1.73 ms     1.93 ms        ~10%
Overall   20.17 ms    21.61 ms       ~7%
24
Accelerator
[Same block diagram, with the accelerator core highlighted]
25
Experimental Results: vs. CPU
[Bar charts: energy and speedup for 1 thread -O3, 16 threads -O3, and FPGA]
• Speedup: ~18× over 1 thread -O3, ~5× over 16 threads -O3
• Energy reduction: ~90× vs. 1 thread -O3, ~20× vs. 16 threads -O3

CPU: Xeon E5-2430 (32 nm), 16 cores, 2.2 GHz, gcc 4.7.2 -O3, OpenMP 3.0
FPGA: Virtex-7 485T (28 nm), 448 PEs, 100 MHz, Vivado 2013.4, Vivado HLS 2013.4
26
Experimental Results: vs. Other FPGAs
              FPL2009    ASAP2009   PACT2010   ISCA2010   ICCD2013   Our Impl.
Device        Virtex 4   Virtex 5   Virtex 5   Virtex 5   Virtex 6   Virtex 7
Frequency     125 MHz    115 MHz    125 MHz    200 MHz    150 MHz    100 MHz
Performance   5.25 GOPS  6.74 GOPS  7 GOPS     16 GOPS    17 GOPS    61.6 GOPS
27
Experimental Results: vs. Other FPGAs

[Bar chart: performance normalized by area (GFLOPS/slice) for ASAP2009, FPL2009, PACT2010, ISCA2010, ICCD2013, and our implementation; ours improves on the best prior work by ~80%]
28
Experimental Results
[Roofline plot: attainable performance (GFLOPS, 0–120) vs. CTC ratio (FLOP per DRAM byte access, 0–60), with the bandwidth roof and points A, A', C; at A, communication time is not covered by computation time; at A', communication time is fully covered]
29
Conclusions
An accelerator for convolutional neural networks

Contributions:
• Finding the best solution with the roofline model
• Accurate analytical models for computation & communication

Results:
• ~3.5× better performance than other FPGA implementations
• ~80% performance/area improvement
• Validated with an on-board implementation
30
Thank You
Q & A