![Page 1: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/1.jpg)
1
Optimizing FPGA-based Accelerator Design for
Deep Convolutional Neural Network
Chen Zhang1, Peng Li3, Guangyu Sun1,2, Yijin Guan1, Bingjun Xiao3, Jason Cong1,2,3
1Peking University2PKU/UCLA Joint Research Institution3University of California, Los Angeles
![Page 2: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/2.jpg)
2
Convolutional Neural Network
Automotive SafetyImage Descriptions @ Stanford
u Image Classification & Object Detection
State-of-the-art Algorithm
• Highest Correctness in Large Scale Visual Recognition Challenge2012 & 2013 & 2014 & 2015
Widely used in industry & academy
![Page 3: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/3.jpg)
3
Convolutional Neural Network
Output feature
map
Input feature map
Convolution
layer
Max-pooling
layer
Convolution
layer
Max-pooling
layerClassifier
Max-pooling is optional
input imagefeature maps feature maps feature maps feature maps
output
category
![Page 4: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/4.jpg)
4
Convolutional Neural Network
Output feature
map
Input feature map
Convolution
layer
Max-pooling
layer
Convolution
layer
Max-pooling
layerClassifier
Max-pooling is optional
input imagefeature maps feature maps feature maps feature maps
output
category
Convolutional layers account for over 90% computation
[1] A. Krizhevsky, etc. Imagenet classification with deep convolutional neural networks. NIPS 2012.
[2] J. Cong and B. Xiao. Minimizing computation in convolutional neural networks. ICANN 2014
![Page 5: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/5.jpg)
5
Related Work
1. CNP: An FPGA-based processor for convolutional networks. FPL 2009;
2. A massively parallel coprocessor for convolutional neural networks. ASAP 2009;
3. A programmable parallel accelerator for learning and classification. PACT 2010;
4. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. ISCA 2010;
5. Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short
paper) FCCM 2011;
6. A massively parallel digital learning processor. NIPS 2008;
7. NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision. CVPRW 2011;
8. Accelerating deep neural networks on mobile processor with embedded programmable logic.
(Poster) NIPS 2013;
9. Memory-Centric Accelerator Design for Convolutional Neural Networks. ICCD 2013.
![Page 6: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/6.jpg)
6
Hardware Computing on FPGA
...
Computation Resource
PE-1 PE-2 PE-n
Inter-
connect
buffer1
On-chip memory
buffer2
External Memory
Off-chip Bus
Challenge: BRAM Limits
Solution: Loop Tiling
Challenge: Resrc. Util.
Solution: Unroll & Pipeline
Challenge: BW Limits
Solution: Data reuse
Challenge: Co-optimization,-- match ‘computation’ & ‘communication’
Solution: Roofline Model
Main
Contribution
![Page 7: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/7.jpg)
7
FPGA design in Roofline Model
...
Computation Resource
PE-1 PE-2 PE-n
Inter-
connect
buffer1
On-chip memory
buffer2
External Memory
Off-chip Bus
Computational Perf.
Bandwidth
Computation To
Communication Ratio
GFLOP/S
FLOP / (Mem. Byte Access)
Gbyte/S
![Page 8: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/8.jpg)
8
FPGA design in Roofline Model
GFLOP/S
FLOP/(Memory byte access)Com
pu
tati
onal
Per
form
ance
Computation To Communication (CTC) Ratio
Design
Bandwidth
![Page 9: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/9.jpg)
9
Computation To Communication (CTC) Ratio
Co
mp
uta
tio
nal
Per
form
ance
GFLOP/S
FLOP/(DRAM byte access)
1
2 3
Attainable Performance
Bandwidth Roof
Computational Roof
A
B
Att
ain
able
Per
form
ance
Attain. Perf. (2 & 3 > 1)
Compt. Perf. (2 & 3 < 1)
Computation
Communication
Computation
Communication
![Page 10: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/10.jpg)
10
Attainable Performance
Computation To Communication (CTC) Ratio
Att
ain
able
Per
form
ance
GFLOPS
FLOP/(DRAM byte access)
1
2 3
Computational Roof
Bandwidth Roof
Attain. Perf. (2 & 3 > 1)
Attain. Perf. (2 & 3 < 1)
![Page 11: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/11.jpg)
11
Attainable Performance
Computation To Communication (CTC) Ratio
Att
ain
able
Per
form
ance
GFLOPS
FLOP/(DRAM byte access)
1
2 3
Computational Roof
Bandwidth Roof
Attain. Perf. (2 & 3 > 1)
Attain. Perf. (2 & 3 < 1)
Optimization Target:Find a Design with Maximized Attainable Performance
![Page 12: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/12.jpg)
12
1 for(row=0; row<R; row+=Tr) {
2 for(col=0; col<C; col+=Tc) {
3 for(to=0; to<M; to+=Tm) {
4 for(ti=0; ti<N; ti+=Tn) {
5 for(trr=row; trr<min(row+Tr, R); trr++) {
6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
7 for(too=to; too<min(to+Tn, M); too++) {
8 for(tii=ti; tii<(ti+Tn, N); tii++) {
9 for(i=0; i<K; i++) {
10 for(j=0; j<K; j++) {
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
}}}}
Loop tiling
(Tile loop)
(Tile loop)
(Tile loop)
(Tile loop)
(Point loop)
(Point loop)
(Point loop)
(Point loop)
(Point loop)
(Point loop)
Off-chip Data Transfer: Memory Access Optimization
On-chip Data: Computation Optimization
1 for(row=0; row<R; row++) {
2 for(col=0; col<C; col++) {
3 for(to=0; to<M; to++) {
4 for(ti=0; ti<N; ti++) {
5 for(i=0; i<K; i++) {
6 for(j=0; j<K; j++) {
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
R, C, M, N, K, S are all configuration
parameters of the convolutional layer
𝑤𝑖𝑗1𝑙
K
Input feature mapOutput feature map
![Page 13: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/13.jpg)
13
1 for(row=0; row<R; row+=Tr) {
2 for(col=0; col<C; col+=Tc) {
3 for(to=0; to<M; to+=Tm) {
4 for(ti=0; ti<N; ti+=Tn) {
5 for(trr=row; trr<min(row+Tr, R); trr++) {
6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
7 for(too=to; too<min(to+Tn, M); too++) {
8 for(tii=ti; tii<(ti+Tn, N); tii++) {
9 for(i=0; i<K; i++) {
10 for(j=0; j<K; j++) {
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
}}}}
Loop tiling
(Tile loop)
(Tile loop)
(Tile loop)
(Tile loop)
(Point loop)
(Point loop)
(Point loop)
(Point loop)
(Point loop)
(Point loop)
Off-chip Data Transfer: Memory Access Optimization
On-chip Data: Computation Optimization
![Page 14: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/14.jpg)
14
…5 for(trr=row; trr<min(row+Tr, R); trr++)
6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++)
7 for(too=to; too<min(to+Tn, M); too++)
8 for(tii=ti; tii<(ti+Tn, N); tii++)
9 for(i=0; i<K; i++) {
10 for(j=0; j<K; j++) {
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
…
(Tile loops)
(Point loop)(Point loop)(Point loop)(Point loop)
Computation Optimization
5 for(i=0; i<K; i++) {
6 for(j=0; j<K; j++) {
7 for(trr=row; trr<min(row+Tr, R); trr++) {
8 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
#pragma HLS pipeline
9 for(too=to; too<min(to+Tm, M); too++) {
#pragma HLS UNROLL
10 for(tii=ti; tii<(ti+Tn, N); tii++) {
#pragma HLS UNROLL
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
![Page 15: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/15.jpg)
15
…5 for(trr=row; trr<min(row+Tr, R); trr++)
6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++)
7 for(too=to; too<min(to+Tn, M); too++)
8 for(tii=ti; tii<(ti+Tn, N); tii++)
9 for(i=0; i<K; i++) {
10 for(j=0; j<K; j++) {
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
…
(Tile loops)
(Point loop)(Point loop)(Point loop)(Point loop)
Computation Optimization
5 for(i=0; i<K; i++) {
6 for(j=0; j<K; j++) {
7 for(trr=row; trr<min(row+Tr, R); trr++) {
8 for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
#pragma HLS pipeline
9 for(too=to; too<min(to+Tm, M); too++) {
#pragma HLS UNROLL
10 for(tii=ti; tii<(ti+Tn, N); tii++) {
#pragma HLS UNROLL
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
Tm
Tn
![Page 16: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/16.jpg)
16
Computational Performance
Output (Tm) 5 10 20 30
Input (Tn) 5 10 20 15
Design 1 Design 2 Design 3 Design 4
DSPs 125 500 2000 2250
Tm
Tn
![Page 17: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/17.jpg)
17
0
10
20
30
40
50
60
70
Computational Performance
Output (Tm) 5 10 20 30
Input (Tn) 5 10 20 15
Design 1 Design 2 Design 3 Design 4
DSPs 125 500 2000 2250
u Consider Memory Access ! How?
![Page 18: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/18.jpg)
18
Memory Access Optimization
1 for(row=0; row<R; row+=Tr) {
2 for(col=0; col<C; col+=Tc) {
3 for(to=0; to<M; to+=Tm) {
4 for(ti=0; ti<N; ti+=Tn) {
}
}}}
S: foo(output_fm(to, row, col));
load output feature map
store output feature map
5 for(trr=row; trr<min(row+Tr, R); trr++)
6 for(tcc=col; tcc<min(tcc+Tc, C); tcc++)
7 for(too=to; too<min(to+Tn, M); too++)
8 for(tii=ti; tii<(ti+Tn, N); tii++)
9 for(i=0; i<K; i++) {
10 for(j=0; j<K; j++) {
output_fm[to][row][col] +=
weights[to][ti][i][j]*
input_fm[ti][S*row+i][S*col+j];
}}}}}}
![Page 19: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/19.jpg)
19
Memory Access Optimization
1 for(row=0; row<R; row+=Tr) {
2 for(col=0; col<C; col+=Tc) {
3 for(to=0; to<M; to+=Tm) {
4 for(ti=0; ti<N; ti+=Tn) {
}
}}}
S: foo(output_fm(to, row, col));
load output feature map
store output feature map
1 for(row=0; row<R; row+=Tr) {
2 for(col=0; col<C; col+=Tc) {
3 for(to=0; to<M; to+=Tm) {
4 for(ti=0; ti<N; ti+=Tn) {
}
}}}
S: foo(output_fm(to, row, col));
store output feature map
load output feature map
![Page 20: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/20.jpg)
20
Computation To Communication Ratio
0
10
20
30
40
50
60
70
Output (Tm) 5 10 20 30
Input (Tn) 5 10 20 15
Design 1 Design 2 Design 3 Design 4
Com
puta
tional
Per
form
ance
GFLOP/S
xx
xx
Computation To Communication (CTC) Ratio
FLOP/(DRAM byte access)
xxxx
1
2
34
![Page 21: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/21.jpg)
21
Design Space
Communication:
Legal Solutions
of Tm & Tn:
Legal Solutions:
2097
3# of memory access methods
Total Legal Solutions:
6291
N=192
N=128
# of PE = 450
![Page 22: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/22.jpg)
22
0
20
40
60
80
100
120
0 10 20 30 40 50 60
Design Space Exploration
Computational roof
CTC Ratio
Att
ainab
le p
erfo
rman
ce (
GF
LO
PS
)
Ratio of
(FLOP)/(DRAM byte access)
C
Bandwidth roof
DA
A’
![Page 23: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/23.jpg)
23
System Overview
Accelerator
Data Transfer Engine0
Data Transfer Engine1
FIFO
FIFO
MicroBlaze
AXI4Lite
AXI4
Interrupt controller
Timer
Memory Interface Controller
UART
Main Memory
DDR3 interface
On-chip
Off-chip
Vivado 2013.4
Vivado HLS 2013.4
VC707
uEstimated On-board run Difference
Layer 1 7.32 ms 7.67 ms ~5%
Layer 2 5.11 ms 5.35 ms ~5%
Layer 3 3.42 ms 3.79 ms ~9%
Layer 4 2.59 ms 2.88 ms ~10%
Layer 5 1.73 ms 1.93 ms ~10%
Overall 20.17 ms 21.61 ms ~7%
![Page 24: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/24.jpg)
24
Accelerator
Accelerator
Data Transfer Engine0
Data Transfer Engine1
FIFO
FIFO
MicroBlaze
AXI4Lite
AXI4
Interrupt controller
Timer
Memory Interface Controller
UART
Main Memory
DDR3 interface
On-chip
Off-chip
![Page 25: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/25.jpg)
25
Experimental Results: vs. CPU
0
10
20
30
40
50
60
70
80
90
100
1 thread -O3 16 threads -O3 FPGA
Energy
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
1 thread -O3 16 threads -O3 FPGA
Speedup
90x
20x
18x
5x
CPU Xeon E5-2430 (32nm) 16 cores 2.2 GHzgcc 4.7.2 –O3
OpenMP 3.0
FPGA Virtex7-485t (28nm) 448 PEs 100MHzVivado 2013.4
Vivado HLS 2013.4
![Page 26: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/26.jpg)
26
Experimental Results: vs. Other FPGAs
FPL2009 ASAP2009 PACT2010 ISCA2010 ICCD2013 Our Impl.
Virtex 4 Virtex 5 Virtex 5 Virtex 5 Virtex 6 Virtex 7
125MHz 115MHz 125MHz 200MHz 150MHz 100MHz
5.25 GOPS 6.74 GOPS 7 GOPS 16 GOPS 17 GOPS 61.6 GOPS
![Page 27: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/27.jpg)
27
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
ASAP2009 FPL2009 PACT2010 ISCA2010 ICCD2013 Our Impl.
Normalized Performance
Experimental Results: vs. Other FPGAs
~80%
GFLOPS/Slice
![Page 28: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/28.jpg)
28
Experimental Results
0
20
40
60
80
100
120
0 10 20 30 40 50 60
CTC Ratio
Att
ainab
le p
erfo
rman
ce (
GF
LO
PS
)
Ratio of
(FLOP)/(DRAM byte
access)
C
Bandwidth roof
A
A’
Comm. Time NOT Covered
Comm. Time Covered
Computation
Communication
Computation
Communication
![Page 29: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/29.jpg)
29
Conclusions
An accelerator for convolutional neural network
Contribution:
Find the best solution with roofline model
Accurate Analytical model for computation & communication
Result:
~3.5x better performance over other FPGA implementations
~80% performance/area improvement
On-board run implementation
![Page 30: Optimizing FPGA-based Accelerator Design for Deep ... 5/1_Chen-Zhang-fpga20… · Design and Implementation of an FPGA-based Real-Time Face Recognition System. (short paper) FCCM](https://reader034.vdocuments.net/reader034/viewer/2022042208/5eab3cc729d62d34c7459d12/html5/thumbnails/30.jpg)
30
Thank You
Q & A