TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis
Stanford University
Platform Lab Review – Feb 2017
Deep Neural Networks (NNs)
Unprecedented accuracy for challenging applications
System perspective: compute and memory intensive
o Challenging to execute efficiently on conventional systems
Many efforts for optimized NN hardware
[Figure: application domains (classification, recognition, control, prediction, optimization) and hardware platforms (multi-cores, GPUs, FPGAs, ASICs, clusters)]
Deep Neural Networks (NNs)
Convolutional layer (CONV)
o Convolutions between fmaps
Fully-Connected layer (FC)
o Matrix multiplication
[Figure: Ni ifmaps * filters = No ofmaps; classification output "Dog"]
foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)
foreach b in batch Nb
  foreach neuron x in Nx
    foreach neuron y in Ny
      // Matrix multiply
      O(y,b) += I(x,b) * W(x,y) + B(y)
𝑂 = 𝐼 × 𝑊
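The two loop nests above can be sketched in plain NumPy. This is a naive reference implementation, not the accelerator mapping; note that the bias is applied once per ofmap, rather than literally inside the innermost loop as the slide's shorthand suggests.

```python
import numpy as np

def conv_layer(I, W, B):
    """Naive CONV layer: O(v,b) += I(u,b) * W(u,v) + B(v), where each
    I(u,b) is a 2D ifmap, each W(u,v) a 2D filter, and * is 2D convolution."""
    Nb, Ni, H, Wd = I.shape          # batch, input fmaps, fmap height/width
    _, No, K, _ = W.shape            # filters are K x K
    Ho, Wo = H - K + 1, Wd - K + 1   # 'valid' convolution output size
    O = np.zeros((Nb, No, Ho, Wo))
    for b in range(Nb):              # foreach b in batch Nb
        for v in range(No):          #   foreach ofmap v in No
            for u in range(Ni):      #     foreach ifmap u in Ni
                for y in range(Ho):  # 2D convolution
                    for x in range(Wo):
                        O[b, v, y, x] += np.sum(I[b, u, y:y+K, x:x+K] * W[u, v])
            O[b, v] += B[v]          # bias added once per ofmap
    return O

def fc_layer(I, W, B):
    """FC layer as a matrix multiply: O = I x W + B."""
    return I @ W + B                 # (Nb, Nx) @ (Nx, Ny) -> (Nb, Ny)
```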
Domain-Specific NN Accelerators
Spatial architectures of PEs
High computational efficiency
o 100x performance and energy efficiency improvement
o Low-precision arithmetic, dynamic pruning, static compression, …
Memory system?
[Figure: spatial accelerator with a 4x4 PE array (each processing element contains an ALU and a register file), a shared global buffer, and main memory]
Memory Challenges for Large NNs
Large memory footprints and high bandwidth requirements
o Deep networks, large layers, complex neuron structures
o More efficient computing requires higher data bandwidth
Limit the performance scalability for future NNs
Common solutions
o Large on-chip buffers → less area available for compute elements
o Multiple DRAM channels → high dynamic power, significant leakage
Memory Challenges for Large NNs
Example: a state-of-the-art accelerator with 400 PEs needs …
o 1.5 MB SRAM buffer, > 70% of accelerator area
o Four LPDDR3 x32 chips with > 25 GBps
o 45% power in SRAM and DRAM even with low-power technologies
[Figure: power (W) and required bandwidth (GBps) as the number of PEs scales]
3D Memory + NN Acceleration
Opportunities
o High bandwidth at low access energy
o Abundant parallelism (vaults, banks)
Key questions
o What hardware architecture best exploits the high bandwidth and low energy?
o How to best schedule and partition NN computations in software?
  • Schedule: dataflow across the memory hierarchy
  • Partition: parallelism across vaults
[Figure: Micron’s Hybrid Memory Cube: DRAM dies stacked on a logic die, connected by TSVs and organized into vaults (channels) of banks]
TETRIS
NN acceleration with 3D memory
o Improves performance scalability by 4.1x over 2D
o Improves energy efficiency by 1.5x over 2D
Hardware architecture
o Rebalance resources between PEs and buffers → high performance & low energy
o In-memory accumulation → alleviate bandwidth pressure
Software techniques
o Analytical dataflow scheduling → optimize buffer use
o Hybrid partitioning → efficient parallel processing
TETRIS Hardware Architecture
TETRIS Architecture
Associate one NN engine with each vault
o PE array, local register files, and a shared global buffer
NoC + routers for accesses to remote vaults
All vaults can process NN computations in parallel
[Figure: per-vault NN engine on the logic die: a 4x4 PE array with a global buffer, connected through the memory controller to the local vault and through a router to remote vaults]
Resource Balancing
Larger PE arrays with smaller SRAM buffers
o Limited area with 3D integration
o High data bandwidth needs more PEs
o Low access energy & sequential access pattern
[Figure: normalized runtime and energy vs. the PE-count / buffer-capacity split]
196 PEs with a 133 kB buffer (1:1 area split) → better performance and energy
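The 1:1 split can be illustrated with a toy area model. The density constants below are back-solved from the 196 PEs / 133 kB / 3.5 mm² figures on these slides, so they are illustrative placeholders, not measured numbers:

```python
AREA_MM2 = 3.5                 # per-vault engine area from the slides
AREA_PER_PE = 1.75 / 196       # back-solved: a 1:1 split yields 196 PEs
SRAM_KB_PER_MM2 = 133 / 1.75   # back-solved: a 1:1 split yields 133 kB

def engine_config(pe_fraction, total=AREA_MM2):
    """Split a fixed logic-die area budget between the PE array and the
    global SRAM buffer; returns (number of PEs, buffer capacity in kB)."""
    n_pes = round(pe_fraction * total / AREA_PER_PE)   # nearest whole PE
    buffer_kb = (1 - pe_fraction) * total * SRAM_KB_PER_MM2
    return n_pes, buffer_kb
```

Sweeping `pe_fraction` reproduces the trade-off in the figure: shifting area from SRAM to PEs trades buffer capacity for compute throughput.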
In-Memory Accumulation
Small buffer → more swaps to memory → higher bandwidth demand
Move simple accumulation logic close to DRAM banks
o 2x bandwidth reduction for output data
o Carefully choose logic location to avoid impact on DRAM array layout
[Figure: baseline vs. in-memory accumulation. Baseline: the PE array reads partial Y from memory, computes Y += W * X, and writes Y back. With in-memory accumulation: the PE array sends only ΔY = W * X, and adder logic near the DRAM banks applies Y += ΔY]
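A minimal traffic model shows where the 2x output-traffic reduction comes from; the tile size and swap count passed in are hypothetical inputs, not values from the slides.

```python
def output_traffic_bytes(tile_bytes, n_swaps, in_memory_accumulation):
    """Output-stream DRAM traffic when a partially accumulated output tile
    is evicted to memory n_swaps times (small buffer -> more swaps).

    Baseline: every swap writes the partial sums out and reads them back
    to continue accumulating (2 transfers per swap). With accumulation
    logic near the DRAM banks, the PE array sends only the delta
    dY = W * X and memory applies Y += dY, so each swap is one write."""
    transfers_per_swap = 1 if in_memory_accumulation else 2
    return n_swaps * transfers_per_swap * tile_bytes
```

For any tile size and swap count, the baseline moves exactly twice as many output bytes as the in-memory-accumulation design.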
Scheduling and Partitioning for TETRIS
Dataflow Scheduling
Critical for maximizing on-chip data reuse to save energy
foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)
Mapping: execute 2D conv on PE array
• Regfiles and array interconnect
• Row stationary [Chen et al., ISCA’16]
Ordering: loop blocking and reordering
• Locality in global buffer
• Non-convex, exhaustive search
TETRIS’ small buffers → few data reuse opportunities
IW bypass, OW bypass, IO bypass
o Use the buffer for only one stream, maximizing its reuse
o Bypass the buffer for the other two, sacrificing their reuse
[Figure: OW bypass ordering: ifmaps staged in the on-chip global buffer; ofmaps and filters stream from off-chip directly to the register files]
TETRIS Bypass Ordering
[Figure: ifmaps split into chunks 0, 1, 2, …]
1. Read one ifmap chunk into the global buffer
2. Stream ofmaps and filters to the register files
3. Move ifmaps from the global buffer to the register files
4. Convolve
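The four steps can be sketched as a loop nest. Fmaps are reduced to scalars here so only the chunked ordering is visible, and `ti` (the number of ifmap chunks) is a hypothetical blocking factor:

```python
import numpy as np

def conv_ow_bypass(I, W, ti):
    """OW bypass ordering (sketch): only the ifmap stream is staged in the
    global buffer; ofmaps and filter weights bypass it and stream straight
    to the PE register files. `ti` is assumed to divide Ni evenly."""
    Nb, Ni = I.shape
    No = W.shape[1]
    chunk = Ni // ti
    O = np.zeros((Nb, No))
    for c in range(ti):                               # per ifmap chunk:
        gbuf = I[:, c * chunk:(c + 1) * chunk]        # 1. read chunk into gbuf
        for v in range(No):                           # 2. stream ofmaps + filters to regf
            w = W[c * chunk:(c + 1) * chunk, v]
            for b in range(Nb):
                regf = gbuf[b]                        # 3. move ifmaps gbuf -> regf
                O[b, v] += regf @ w                   # 4. "convolve" (dot for scalar fmaps)
    return O
```

Whatever the chunking, the result matches the unblocked loop nest; only the order of memory traffic changes.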
TETRIS Bypass Ordering
Near-optimal schedules
o With 2% difference from schedules derived from exhaustive search
Analytically derived
o Closed-form solution
o No need for time-consuming exhaustive search
min A_DRAM = 2 · (Nb · No · So) · ti + (Nb · Ni · Si + No · Ni · Sw) · tb
s.t. (Nb / tb) · (Ni / ti) · Si ≤ Sbuf
     1 ≤ tb ≤ Nb,  1 ≤ ti ≤ Ni
NN       Runtime gap (w.r.t. optimal)   Energy gap (w.r.t. optimal)
AlexNet  1.48%                          1.86%
ZFNet    1.55%                          1.83%
VGG16    0.16%                          0.20%
VGG19    0.13%                          0.16%
ResNet   2.91%                          0.78%
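The minimization above can be checked against a brute-force search over the blocking factors (tb, ti). The layer dimensions used to exercise it are hypothetical; TETRIS solves the same problem in closed form, which is the point of the slide.

```python
from itertools import product

def a_dram(Nb, Ni, No, Si, So, Sw, tb, ti):
    """DRAM traffic under OW bypass ordering (the objective above):
    ofmaps are written and read back ti times; ifmaps and filter
    weights are re-read tb times."""
    return 2 * Nb * No * So * ti + (Nb * Ni * Si + No * Ni * Sw) * tb

def best_schedule(Nb, Ni, No, Si, So, Sw, Sbuf):
    """Exhaustive reference search over (tb, ti), subject to the buffered
    ifmap chunk fitting in the global buffer. Returns (traffic, tb, ti),
    or None if no blocking fits."""
    best = None
    for tb, ti in product(range(1, Nb + 1), range(1, Ni + 1)):
        if (Nb / tb) * (Ni / ti) * Si <= Sbuf:  # ifmap chunk fits in gbuf
            cost = a_dram(Nb, Ni, No, Si, So, Sw, tb, ti)
            if best is None or cost < best[0]:
                best = (cost, tb, ti)
    return best
```

The analytical schedules in the table land within about 2% of what this exhaustive search finds, without paying its cost.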
NN Partitioning
Fmap partitioning
o Divide a fmap into tiles
o Each vault processes one tile
Minimum remote accesses
o Prior work exclusively used fmap partitioning
[Figure: each fmap tile of layer i mapped to one of vaults 0–3; layer i+1 tiled the same way]
Process NN computations in parallel in all vaults
NN Partitioning
Output partitioning
o Partition all ofmaps into groups
o Each vault processes one group
Better filter weight reuse
Fewer total memory accesses
[Figure: ofmaps of layer i divided into groups, one group per vault 0–3]
TETRIS Hybrid Partitioning
Combine fmap partitioning and output partitioning
Total energy = NoC energy + DRAM energy
o Balance between minimizing remote accesses and total DRAM accesses
Difficulties
o Design space exponential in # layers
  → Greedy algorithm reduces it to linear in # layers
o Determining total DRAM accesses requires full dataflow scheduling
  → Bypass ordering quickly estimates total DRAM accesses
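A sketch of the greedy pass, with toy per-layer NoC/DRAM cost models standing in for the real estimates (in TETRIS the DRAM term comes from the bypass-ordering estimate; the models and byte counts below are invented for illustration):

```python
def hybrid_partition(layers, noc_cost, dram_cost):
    """Greedy hybrid partitioning (sketch): visit layers in order and pick
    fmap vs. output partitioning per layer by an energy estimate
    (NoC + DRAM), making the search linear rather than exponential in the
    number of layers."""
    plan, total = [], 0.0
    for layer in layers:
        costs = {s: noc_cost(layer, s) + dram_cost(layer, s)
                 for s in ("fmap", "output")}
        best = min(costs, key=costs.get)
        plan.append(best)
        total += costs[best]
    return plan, total

# Toy cost models: fmap partitioning refetches filters in every vault
# (DRAM-heavy); output partitioning moves fmaps between vaults (NoC-heavy).
noc = lambda l, s: l["fmap_bytes"] if s == "output" else 0.0
dram = lambda l, s: l["filter_bytes"] if s == "fmap" else 0.0
layers = [{"fmap_bytes": 100, "filter_bytes": 10},   # early CONV: big fmaps
          {"fmap_bytes": 10, "filter_bytes": 100}]   # late FC: big filters
plan, total = hybrid_partition(layers, noc, dram)
```

Under these toy costs the greedy pass picks fmap partitioning for the fmap-heavy layer and output partitioning for the filter-heavy one, mirroring the intended balance.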
TETRIS Evaluation
Methodology
Five state-of-the-art NNs
o AlexNet, ZFNet, VGG16, VGG19, ResNet
o 100–300 MB total memory footprint for each NN
o Up to 152 layers in ResNet, one of the largest NNs today
2D and 3D accelerators with one or more NN engines
o 2D engine: 16 x 16 PEs, 576 kB buffer, one LPDDR3 channel
• 8.5 mm2, 51.2 Gops/sec
• Bandwidth-constrained
o 3D engine: 14 x 14 PEs, 133 kB buffer, one HMC vault
• 3.5 mm2, 39.2 Gops/sec
• Area-constrained
Single-Engine Comparison
Up to 37% performance improvement with TETRIS
o Due to higher bandwidth despite smaller PE array
[Figure: normalized runtime for AlexNet, ZFNet, VGG16, VGG19, and ResNet; large NNs benefit more]
Single-Engine Comparison
35–40% energy reduction with TETRIS
o Smaller on-chip buffer, better scheduling
[Figure: normalized energy for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down by component; savings come from SRAM, DRAM, and static energy]
Multi-Engine Comparison
4 2D engines: 34 mm2, pin constrained (4 LPDDR3 channels)
16 3D engines: 56 mm2, area constrained (16 HMC vaults)
4.1x performance gain with 2x compute density
[Figure: normalized runtime for AlexNet, ZFNet, VGG16, VGG19, and ResNet with the multi-engine configurations]
Multi-Engine Comparison
1.5x lower energy
o 1.2x from better scheduling and partitioning
4x computation only costs 2.7x power
[Figure: normalized energy for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down by component]
TETRIS Summary
A scalable and efficient NN accelerator using 3D memory
o 4.1x performance and 1.5x energy benefits over 2D baseline
Hardware features
o PE/buffer area rebalancing
o In-memory accumulation
Software features
o Analytical dataflow scheduling
o Hybrid partitioning
Thanks! Questions?