TRANSCRIPT
DRACO: Co-Optimizing Hardware Utilization, and Performance of DNNs on Systolic Accelerator
Nandan Kumar Jha⋆, Shreyas Ravishankar†, Sparsh Mittal‡, Arvind Kaushik§,Dipan Mandal¶, Mahesh Chandra§
⋆IIT Hyderabad, †BITS Pilani Hyderabad, ‡IIT Roorkee, §NXP Semiconductors,
¶Intel Labs, India. (⋆[email protected])
IEEE Computer Society Annual Symposium on VLSI 2020
July 8, 2020
Challenge: PE underutilization in memory-bound DNNs
Memory-bound DNNs are compute-efficient but have low data reuse. They require high bandwidth; hence, most of the PEs remain underutilized on a systolic-array-based DNN accelerator designed for large compute-bound DNNs.
Figure 1: Layer-wise PE utilization (%, log scale) of MobileNet-V1 on the Eyeriss accelerator with a 16 × 16 array and row-stationary dataflow, shown for the 3×3 conv, depthwise conv, and 1×1 pointwise conv layers (L1 to L13, plus GMean).
PE utilization of the 1×1 pointwise conv is ≈80% in almost all layers, whereas that of the 3×3 depthwise conv is only ≈4% and decreases further in deeper layers (even dropping below 1%).
Need flexible DNN accelerators to efficiently support a diverse range of compute-bound and memory-bound DNNs
Previous Work
Solutions proposed in previous works are either hardware-based optimizations or dataflow modifications.
[Flexible NOC (Eyeriss-v2), Chen et al., JETCAS’19]
[Adaptive Tiling, Kung et al., ICPR’18]
[Global PE (Simba), Shao et al., MICRO’19]
[Flexible scanning, Wu et al., DATE’19]
Pitfalls: The lack of flexibility (due to fixed functionality) of hardware-based solutions and their longer development cycles could render the proposed solutions ineffective for SOTA DNNs.
Do we really need a specialized microarchitecture and/or dataflow to solve the low data reusability and PE underutilization challenges of depthwise convolution?
Solution?
Can we solve this problem on the algorithm side (while preserving the predictive performance and energy efficiency of the network)?
Diagram: Objectives of algorithmic optimization.
- Traditional workloads: performance (e.g., latency/throughput) and energy efficiency (Ops/Joule).
- DNNs: performance (e.g., latency, throughput, PE utilization), predictive performance (e.g., Top-1 accuracy), and energy efficiency (MACs/Joule).
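As a concrete illustration of this algorithm-side knob, here is a minimal PyTorch sketch (not the paper's implementation): a depthwise 3×3 + 1×1 pointwise block can be swapped for a grouped 3×3 convolution whose channels-per-group G trades extra MACs for higher data reuse. The helper names and channel sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def depthwise_separable(m, n):
    # Depthwise 3x3 (one filter per input channel, groups=m) followed by a
    # 1x1 pointwise conv: compute-cheap, but the 3x3 stage has very low reuse.
    return nn.Sequential(
        nn.Conv2d(m, m, kernel_size=3, padding=1, groups=m, bias=False),
        nn.Conv2d(m, n, kernel_size=1, bias=False),
    )

def grouped_conv3x3(m, n, channels_per_group):
    # Grouped 3x3 conv: each output channel sees G = channels_per_group inputs,
    # so both #MACs and data reuse grow with G (G=1 is depthwise-like).
    assert m % channels_per_group == 0
    return nn.Conv2d(m, n, kernel_size=3, padding=1,
                     groups=m // channels_per_group, bias=False)

x = torch.randn(1, 64, 56, 56)                                  # illustrative ifmap
print(depthwise_separable(64, 128)(x).shape)                    # [1, 128, 56, 56]
print(grouped_conv3x3(64, 128, channels_per_group=4)(x).shape)  # same output shape
```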
DRACO (Data Reuse Aware Co-Optimization)
Table: dk × dk and df × df are the spatial sizes of the kernel (2D filter) and the feature map (ifmap/ofmap). m, n, and G are the #ifmaps, #ofmaps, and #channels per group.
Metric             SConv                  DWConv             DRACO
#MACs              m × n × dk² × df²      m × dk² × df²      G × n × dk² × df²
#Param             m × n × dk²            m × dk²            G × n × dk²
Weight reuse       df²                    df²                df²
Activation reuse   n × (m/(m+n)) × dk²    (m/(m+n)) × dk²    G × (n/(m+n)) × dk²
DRACO is a sweet spot between SConv and DWConv and enables a trade-off between computational complexity and data reuse.
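The following Python sketch simply transcribes the table's counts (it is not the paper's analytical model); note that G = m recovers the SConv column, while G = 1 with n = m matches the DWConv column. The layer dimensions in the loop are illustrative only.

```python
def sconv(m, n, dk, df):
    """Standard conv: m ifmaps, n ofmaps, dk x dk kernel, df x df fmap."""
    return {"macs": m * n * dk**2 * df**2,
            "params": m * n * dk**2,
            "weight_reuse": df**2,
            "act_reuse": n * (m / (m + n)) * dk**2}

def dwconv(m, dk, df):
    """Depthwise conv: one dk x dk filter per input channel (n = m)."""
    n = m
    return {"macs": m * dk**2 * df**2,
            "params": m * dk**2,
            "weight_reuse": df**2,
            "act_reuse": (m / (m + n)) * dk**2}

def draco(G, m, n, dk, df):
    """Grouped conv with G channels per group (G=1 ~ DWConv, G=m ~ SConv)."""
    return {"macs": G * n * dk**2 * df**2,
            "params": G * n * dk**2,
            "weight_reuse": df**2,
            "act_reuse": G * (n / (m + n)) * dk**2}

# Illustrative MobileNet-like layer: 256 channels, 3x3 kernel, 14x14 fmap.
for G in (1, 4, 16, 64, 256):
    print(f"G={G:<4}", draco(G, m=256, n=256, dk=3, df=14))
```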
Experimental Results: MobileNetV1 on Eyeriss (1/2)
Figure: Latency (ms), PE utilization (%), and energy (mJ) of MobileNetV1 as a function of channels per group (G1 to G32), for systolic array sizes 16×16, 32×32, 64×64, and 128×128. PE utilization is reported for the 3×3 and 1×1 layers and their average; energy is broken down into alu, dram, g_buf, array, and rf components.
Takeaway 1: Optimum latency depends on both the PE utilization and the computational complexity of the DNN, and the effect of PE utilization on latency depends on the PE array size of the systolic array.
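One way to see why, using a rough first-order proxy of our own (not Eyeriss's or the paper's cost model): latency scales roughly with #MACs divided by the number of PEs actually kept busy. With utilization figures and a MAC split chosen only to mimic the trends in the plots, the proxy reproduces the qualitative behaviour: latency keeps rising with G on a small array that is already well utilized, while on a large array it first drops and then rises again.

```python
def latency_proxy(macs, array_dim, utilization):
    # First-order proxy: cycles ~ MACs / (active PEs).
    # Ignores dataflow, memory stalls, and array fill/drain (sketch only).
    return macs / ((array_dim ** 2) * utilization)

# Hypothetical numbers (not from the paper): 1x1 layers dominate the MACs,
# only the grouped 3x3 layers scale with G, and a small array is already
# well utilized at G=1 while a large one is not.
pw_macs, dw_macs_per_g = 500e6, 30e6
util = {16: {1: 0.70, 4: 0.80, 16: 0.80},
        64: {1: 0.30, 4: 0.55, 16: 0.75}}

for dim in (16, 64):
    for G in (1, 4, 16):
        total = pw_macs + dw_macs_per_g * G
        print(f"array={dim}x{dim}  G={G:<2}  "
              f"cycles ~ {latency_proxy(total, dim, util[dim][G]):,.0f}")
```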
Experimental Results: MobileNetV1 on Eyeriss (2/2)
Figure: Latency (ms), PE utilization (%) for the 3×3 and 1×1 layers and their average, and energy (mJ) broken down into alu, dram, g_buf, array, and rf, as a function of G (up to G16, G32, or G64 depending on the variant's channel count), for four MobileNetV1 (MV1) variants: (α = 0.5, ρ = 1), (α = 2, ρ = 1), (α = 1, ρ = 0.5), and (α = 1, ρ = 2).
[α is the width multiplier (scales the #filter-channels) and ρ is the resolution multiplier (scales the spatial size of the fmaps). Array size here is 64 × 64.]
Takeaway 2: The inference latency of DNNs with very few MACs depends only on PE utilization. Increasing PE utilization at the expense of a substantial increase in #MACs does not lower the latency, as the effect of higher PE utilization is dominated by the #MACs.
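For reference, and as a standard MobileNet property rather than a result of this work: the MAC count of the dominant 1×1 pointwise layers scales as α²·ρ², since both channel counts scale with α and the fmap area with ρ². A minimal sketch with made-up layer dimensions:

```python
def pointwise_macs(m, n, df):
    # 1x1 pointwise conv dominates MobileNetV1's overall MAC count.
    return m * n * df ** 2

base_m, base_n, base_df = 256, 256, 14            # illustrative layer sizes
base = pointwise_macs(base_m, base_n, base_df)
for alpha, rho in [(1.0, 1.0), (0.5, 1.0), (2.0, 1.0),
                   (1.0, 0.5), (1.0, 2.0), (0.5, 2.0)]:
    macs = pointwise_macs(int(alpha * base_m), int(alpha * base_n),
                          int(rho * base_df))
    print(f"alpha={alpha} rho={rho}: {macs / base:.2f}x baseline MACs")
```

This is consistent with the next slide, where MV1 (α = 0.5, ρ = 2) reaches similar PE utilization to MV1 (α = 1, ρ = 2) at roughly one quarter of the compute.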
Alternative for Latency Optimization
MobileNetV1 with α=0.5 and ρ=2: reduces the computational overhead of improving PE utilization by reducing the #filter-channels.
Table: Performance comparison of MobileNetV1 variants {α = 1, ρ = 2} and {α = 0.5, ρ = 2}
Model               Array    Metric          G1      G2      G4      G8      G16
MV1 (α=1, ρ=2)      16×16    PE util. (%)    68      77      79      79      80
                             Latency (ms)    66.5    67.7    72.1    81.1    99.2
                             Energy (mJ)     59.7    60.1    61.0    63.7    69.3
                    64×64    PE util. (%)    50      55      66      74      83
                             Latency (ms)    6.9     5.5     4.9     5.2     6.0
                             Energy (mJ)     30.6    31.1    31.9    34.3    39.3
MV1 (α=0.5, ρ=2)    16×16    PE util. (%)    68      76      79      79      80
                             Latency (ms)    17.8    18.3    20.5    25.1    34.1
                             Energy (mJ)     17.5    17.7    18.2    19.5    21.6
                    64×64    PE util. (%)    49      54      66      73      81
                             Latency (ms)    2.5     1.8     1.5     1.7     2.1
                             Energy (mJ)     10.3    10.5    10.9    12.1    14.1
At G=4, latency decreases by a factor of ≈3.5× on the 16 × 16 systolic array (72.1 ms to 20.5 ms) and by ≈3.27× on the 64 × 64 array (4.9 ms to 1.5 ms).
Predictive performance comparison
Table: Top-1 accuracy (%) on Imagenette for MobileNetV1 with different α and ρ
Model               G1      G2      G4      G8      G16     G32
MV1 (α=1, ρ=1)      84.08   84.55   84.65   83.46   83.40   79.94
MV1 (α=1, ρ=2)      84.76   84.55   84.17   84.81   83.29   82.90
MV1 (α=0.5, ρ=2)    82.61   83.54   83.70   82.71   82.29   -
Changing G affects regularization.
- Lower (higher) G implies stronger (weaker) regularization.
Changing G alters the representational power of the network.
- Higher G captures more variations of complex latent concepts.
Takeaway 3: With an appropriate G, we can achieve maximum predictive performance.
Conclusion
Increasing PE utilization does not necessarily reduce the latency.
- The effect of PE utilization on latency is dictated by the computational complexity of the DNN and the systolic array size.
Increasing data reuse at the expense of higher computation does not affect energy efficiency significantly.
- At lower G, energy consumption is almost constant; however, at higher G it increases moderately.
Predictive performance can be increased by fine-tuning G.
By fine-tuning G in DNNs with depthwise convolution, a sweet spot for optimum latency, energy efficiency, and accuracy can be achieved on a systolic DNN accelerator designed for large (compute-bound) DNNs.
Thanks for Paying Attention..!!!