TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis
Stanford University
Platform Lab Review – Feb 2017
Deep Neural Networks (NNs)
Unprecedented accuracy for challenging applications
System perspective: compute and memory intensive
o Challenging to execute efficiently on conventional systems
Many efforts for optimized NN hardware
[Figure: application domains (classification, recognition, control, prediction, optimization) and hardware platforms (multi-cores, GPUs, FPGAs, ASICs, clusters)]
Deep Neural Networks (NNs)
Convolutional layer (CONV)
o Convolutions between fmaps
Fully-Connected layer (FC)
o Matrix multiplication
[Figure: Ni ifmaps * filters = No ofmaps; classification output "Dog"]
foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)
foreach b in batch Nb
  foreach neuron x in Nx
    foreach neuron y in Ny
      // Matrix multiply
      O(y,b) += I(x,b) * W(x,y) + B(y)
𝑂 = 𝐼 × 𝑊
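The two loop nests above can be sketched in plain NumPy. This is a naive reference implementation, not the accelerator mapping; note that the bias is applied once per ofmap, rather than literally inside the innermost loop as the slide's shorthand suggests.

```python
import numpy as np

def conv_layer(I, W, B):
    """Naive CONV layer: O(v,b) += I(u,b) * W(u,v) + B(v), where each
    I(u,b) is a 2D ifmap, each W(u,v) a 2D filter, and * is 2D convolution."""
    Nb, Ni, H, Wd = I.shape          # batch, input fmaps, fmap height/width
    _, No, K, _ = W.shape            # filters are K x K
    Ho, Wo = H - K + 1, Wd - K + 1   # 'valid' convolution output size
    O = np.zeros((Nb, No, Ho, Wo))
    for b in range(Nb):              # foreach b in batch Nb
        for v in range(No):          #   foreach ofmap v in No
            for u in range(Ni):      #     foreach ifmap u in Ni
                for y in range(Ho):  # 2D convolution
                    for x in range(Wo):
                        O[b, v, y, x] += np.sum(I[b, u, y:y+K, x:x+K] * W[u, v])
            O[b, v] += B[v]          # bias added once per ofmap
    return O

def fc_layer(I, W, B):
    """FC layer as a matrix multiply: O = I x W + B."""
    return I @ W + B                 # (Nb, Nx) @ (Nx, Ny) -> (Nb, Ny)
```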
Domain-Specific NN Accelerators
Spatial architectures of PEs
High computational efficiency
o 100x performance and energy efficiency improvement
o Low-precision arithmetic, dynamic pruning, static compression, …
Memory system?
[Figure: spatial accelerator with a 4x4 PE array (each processing element contains an ALU and a register file), a shared global buffer, and main memory]
Memory Challenges for Large NNs
Large memory footprints and high bandwidth requirements
o Deep networks, large layers, complex neuron structures
o More efficient computing requires higher data bandwidth
Limit the performance scalability for future NNs
Common solutions
o Large on-chip buffers → less area available for compute elements
o Multiple DRAM channels → high dynamic power, significant leakage
Memory Challenges for Large NNs
Example: a state-of-the-art accelerator with 400 PEs needs …
o 1.5 MB SRAM buffer, > 70% of accelerator area
o Four LPDDR3 x32 chips with > 25 GBps
o 45% power in SRAM and DRAM even with low-power technologies
[Figure: power (W) and required bandwidth (GBps) as the number of PEs scales]
3D Memory + NN Acceleration
Opportunities
o High bandwidth at low access energy
o Abundant parallelism (vaults, banks)
Key questions
o What hardware architecture best exploits the high bandwidth and low energy?
o How to best schedule and partition NN computations in software?
  • Schedule: dataflow across the memory hierarchy
  • Partition: parallelism across vaults
[Figure: Micron’s Hybrid Memory Cube: DRAM dies stacked on a logic die, connected by TSVs and organized into vaults (channels) of banks]
TETRIS
NN acceleration with 3D memory
o Improves performance scalability by 4.1x over 2D
o Improves energy efficiency by 1.5x over 2D
Hardware architecture
o Rebalance resources between PEs and buffers → high performance & low energy
o In-memory accumulation → alleviate bandwidth pressure
Software techniques
o Analytical dataflow scheduling → optimize buffer use
o Hybrid partitioning → efficient parallel processing
TETRIS Hardware Architecture
TETRIS Architecture
Associate one NN engine with each vault
o PE array, local register files, and a shared global buffer
NoC + routers for accesses to remote vaults
All vaults can process NN computations in parallel
[Figure: per-vault NN engine on the logic die: a 4x4 PE array with a global buffer, connected through the memory controller to the local vault and through a router to remote vaults]
Resource Balancing
Larger PE arrays with smaller SRAM buffers
o Limited area with 3D integration
o High data bandwidth needs more PEs
o Low access energy & sequential access pattern
[Figure: normalized runtime and energy vs. the PE-count / buffer-capacity split]
196 PEs with a 133 kB buffer (1:1 area split) → better performance and energy
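The 1:1 split can be illustrated with a toy area model. The density constants below are back-solved from the 196 PEs / 133 kB / 3.5 mm² figures on these slides, so they are illustrative placeholders, not measured numbers:

```python
AREA_MM2 = 3.5                 # per-vault engine area from the slides
AREA_PER_PE = 1.75 / 196       # back-solved: a 1:1 split yields 196 PEs
SRAM_KB_PER_MM2 = 133 / 1.75   # back-solved: a 1:1 split yields 133 kB

def engine_config(pe_fraction, total=AREA_MM2):
    """Split a fixed logic-die area budget between the PE array and the
    global SRAM buffer; returns (number of PEs, buffer capacity in kB)."""
    n_pes = round(pe_fraction * total / AREA_PER_PE)   # nearest whole PE
    buffer_kb = (1 - pe_fraction) * total * SRAM_KB_PER_MM2
    return n_pes, buffer_kb
```

Sweeping `pe_fraction` reproduces the trade-off in the figure: shifting area from SRAM to PEs trades buffer capacity for compute throughput.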
In-Memory Accumulation
Small buffer → more swaps to memory → higher bandwidth demand
Move simple accumulation logic close to DRAM banks
o 2x bandwidth reduction for output data
o Carefully choose logic location to avoid impact on DRAM array layout
[Figure: baseline vs. in-memory accumulation. Baseline: the PE array reads partial Y from memory, computes Y += W * X, and writes Y back. With in-memory accumulation: the PE array sends only ΔY = W * X, and adder logic near the DRAM banks applies Y += ΔY]
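A minimal traffic model shows where the 2x output-traffic reduction comes from; the tile size and swap count passed in are hypothetical inputs, not values from the slides.

```python
def output_traffic_bytes(tile_bytes, n_swaps, in_memory_accumulation):
    """Output-stream DRAM traffic when a partially accumulated output tile
    is evicted to memory n_swaps times (small buffer -> more swaps).

    Baseline: every swap writes the partial sums out and reads them back
    to continue accumulating (2 transfers per swap). With accumulation
    logic near the DRAM banks, the PE array sends only the delta
    dY = W * X and memory applies Y += dY, so each swap is one write."""
    transfers_per_swap = 1 if in_memory_accumulation else 2
    return n_swaps * transfers_per_swap * tile_bytes
```

For any tile size and swap count, the baseline moves exactly twice as many output bytes as the in-memory-accumulation design.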
Scheduling and Partitioning for TETRIS
Dataflow Scheduling
Critical for maximizing on-chip data reuse to save energy
foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)
Mapping: execute 2D conv on PE array
• Regfiles and array interconnect
• Row stationary [Chen et al., ISCA’16]
Ordering: loop blocking and reordering
• Locality in global buffer
• Non-convex, exhaustive search
TETRIS’ small buffers → few data reuse opportunities
IW bypass, OW bypass, IO bypass
o Use the buffer for only one stream, maximizing its reuse
o Bypass the buffer for the other two, sacrificing their reuse
[Figure: OW bypass ordering: ifmaps staged in the on-chip global buffer; ofmaps and filters stream from off-chip directly to the register files]
TETRIS Bypass Ordering
[Figure: ifmaps split into chunks 0, 1, 2, …]
1. Read one ifmap chunk into the global buffer
2. Stream ofmaps and filters to the register files
3. Move ifmaps from the global buffer to the register files
4. Convolve
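The four steps can be sketched as a loop nest. Fmaps are reduced to scalars here so only the chunked ordering is visible, and `ti` (the number of ifmap chunks) is a hypothetical blocking factor:

```python
import numpy as np

def conv_ow_bypass(I, W, ti):
    """OW bypass ordering (sketch): only the ifmap stream is staged in the
    global buffer; ofmaps and filter weights bypass it and stream straight
    to the PE register files. `ti` is assumed to divide Ni evenly."""
    Nb, Ni = I.shape
    No = W.shape[1]
    chunk = Ni // ti
    O = np.zeros((Nb, No))
    for c in range(ti):                               # per ifmap chunk:
        gbuf = I[:, c * chunk:(c + 1) * chunk]        # 1. read chunk into gbuf
        for v in range(No):                           # 2. stream ofmaps + filters to regf
            w = W[c * chunk:(c + 1) * chunk, v]
            for b in range(Nb):
                regf = gbuf[b]                        # 3. move ifmaps gbuf -> regf
                O[b, v] += regf @ w                   # 4. "convolve" (dot for scalar fmaps)
    return O
```

Whatever the chunking, the result matches the unblocked loop nest; only the order of memory traffic changes.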
TETRIS Bypass Ordering
Near-optimal schedules
o With 2% difference from schedules derived from exhaustive search
Analytically derived
o Closed-form solution
o No need for time-consuming exhaustive search
min A_DRAM = 2 · (Nb · No · So) · ti + (Nb · Ni · Si + No · Ni · Sw) · tb
s.t. (Nb / tb) · (Ni / ti) · Si ≤ Sbuf
     1 ≤ tb ≤ Nb,  1 ≤ ti ≤ Ni
NN       Runtime gap (w.r.t. optimal)   Energy gap (w.r.t. optimal)
AlexNet  1.48%                          1.86%
ZFNet    1.55%                          1.83%
VGG16    0.16%                          0.20%
VGG19    0.13%                          0.16%
ResNet   2.91%                          0.78%
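The minimization above can be checked against a brute-force search over the blocking factors (tb, ti). The layer dimensions used to exercise it are hypothetical; TETRIS solves the same problem in closed form, which is the point of the slide.

```python
from itertools import product

def a_dram(Nb, Ni, No, Si, So, Sw, tb, ti):
    """DRAM traffic under OW bypass ordering (the objective above):
    ofmaps are written and read back ti times; ifmaps and filter
    weights are re-read tb times."""
    return 2 * Nb * No * So * ti + (Nb * Ni * Si + No * Ni * Sw) * tb

def best_schedule(Nb, Ni, No, Si, So, Sw, Sbuf):
    """Exhaustive reference search over (tb, ti), subject to the buffered
    ifmap chunk fitting in the global buffer. Returns (traffic, tb, ti),
    or None if no blocking fits."""
    best = None
    for tb, ti in product(range(1, Nb + 1), range(1, Ni + 1)):
        if (Nb / tb) * (Ni / ti) * Si <= Sbuf:  # ifmap chunk fits in gbuf
            cost = a_dram(Nb, Ni, No, Si, So, Sw, tb, ti)
            if best is None or cost < best[0]:
                best = (cost, tb, ti)
    return best
```

The analytical schedules in the table land within about 2% of what this exhaustive search finds, without paying its cost.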
NN Partitioning
Fmap partitioning
o Divide a fmap into tiles
o Each vault processes one tile
Minimum remote accesses
o Prior work exclusively used fmap partitioning
[Figure: each fmap tile of layer i mapped to one of vaults 0–3; layer i+1 tiled the same way]
Process NN computations in parallel in all vaults
NN Partitioning
Output partitioning
o Partition all ofmaps into groups
o Each vault processes one group
Better filter weight reuse
Fewer total memory accesses
[Figure: ofmaps of layer i divided into groups, one group per vault 0–3]
TETRIS Hybrid Partitioning
Combine fmap partitioning and output partitioning
Total energy = NoC energy + DRAM energy
o Balance between minimizing remote accesses and total DRAM accesses
Difficulties
o Design space exponential in # layers
  → Greedy algorithm reduces it to linear in # layers
o Determining total DRAM accesses requires full dataflow scheduling
  → Bypass ordering quickly estimates total DRAM accesses
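A sketch of the greedy pass, with toy per-layer NoC/DRAM cost models standing in for the real estimates (in TETRIS the DRAM term comes from the bypass-ordering estimate; the models and byte counts below are invented for illustration):

```python
def hybrid_partition(layers, noc_cost, dram_cost):
    """Greedy hybrid partitioning (sketch): visit layers in order and pick
    fmap vs. output partitioning per layer by an energy estimate
    (NoC + DRAM), making the search linear rather than exponential in the
    number of layers."""
    plan, total = [], 0.0
    for layer in layers:
        costs = {s: noc_cost(layer, s) + dram_cost(layer, s)
                 for s in ("fmap", "output")}
        best = min(costs, key=costs.get)
        plan.append(best)
        total += costs[best]
    return plan, total

# Toy cost models: fmap partitioning refetches filters in every vault
# (DRAM-heavy); output partitioning moves fmaps between vaults (NoC-heavy).
noc = lambda l, s: l["fmap_bytes"] if s == "output" else 0.0
dram = lambda l, s: l["filter_bytes"] if s == "fmap" else 0.0
layers = [{"fmap_bytes": 100, "filter_bytes": 10},   # early CONV: big fmaps
          {"fmap_bytes": 10, "filter_bytes": 100}]   # late FC: big filters
plan, total = hybrid_partition(layers, noc, dram)
```

Under these toy costs the greedy pass picks fmap partitioning for the fmap-heavy layer and output partitioning for the filter-heavy one, mirroring the intended balance.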
TETRIS Evaluation
Methodology
Five state-of-the-art NNs
o AlexNet, ZFNet, VGG16, VGG19, ResNet
o 100–300 MB total memory footprint for each NN
o Up to 152 layers in ResNet, one of the largest NNs today
2D and 3D accelerators with one or more NN engines
o 2D engine: 16 x 16 PEs, 576 kB buffer, one LPDDR3 channel
• 8.5 mm2, 51.2 Gops/sec
• Bandwidth-constrained
o 3D engine: 14 x 14 PEs, 133 kB buffer, one HMC vault
• 3.5 mm2, 39.2 Gops/sec
• Area-constrained
Single-Engine Comparison
Up to 37% performance improvement with TETRIS
o Due to higher bandwidth despite smaller PE array
[Figure: normalized runtime for AlexNet, ZFNet, VGG16, VGG19, and ResNet; large NNs benefit more]
Single-Engine Comparison
35–40% energy reduction with TETRIS
o Smaller on-chip buffer, better scheduling
[Figure: normalized energy for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down by component; savings come from SRAM, DRAM, and static energy]
Multi-Engine Comparison
4 2D engines: 34 mm2, pin constrained (4 LPDDR3 channels)
16 3D engines: 56 mm2, area constrained (16 HMC vaults)
4.1x performance gain with 2x compute density
[Figure: normalized runtime for AlexNet, ZFNet, VGG16, VGG19, and ResNet with the multi-engine configurations]
Multi-Engine Comparison
1.5x lower energy
o 1.2x from better scheduling and partitioning
4x computation only costs 2.7x power
[Figure: normalized energy for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down by component]
TETRIS Summary
A scalable and efficient NN accelerator using 3D memory
o 4.1x performance and 1.5x energy benefits over 2D baseline
Hardware features
o PE/buffer area rebalancing
o In-memory accumulation
Software features
o Analytical dataflow scheduling
o Hybrid partitioning
Thanks! Questions?