Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks


1

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

Yu-Hsin Chen (MIT), Joel Emer (MIT / NVIDIA), Vivienne Sze (MIT)

2

Contributions of This Work

• A novel energy-efficient CNN dataflow that has been verified in a fabricated chip, Eyeriss.

• A taxonomy of CNN dataflows that classifies previous work into three categories.

• A framework that compares the energy efficiency of different dataflows under the same area and CNN setup.

[Die photo: Eyeriss [ISSCC, 2016], a reconfigurable CNN processor; 4000 µm × 4000 µm die with on-chip buffer and spatial PE array; 35 fps @ 278 mW* (* AlexNet CONV layers)]

3

Deep Convolutional Neural Networks

Modern deep CNN: up to 1000 CONV layers

[Diagram: CONV layers extract low-level features, then high-level features; FC layers map features to classes]

4

Deep Convolutional Neural Networks

[Diagram: same pipeline; the FC layers at the end number only 1 – 3 layers]

5

Deep Convolutional Neural Networks

[Diagram: CONV layers vs. FC layers]

Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption.

6

High-Dimensional CNN Convolution

[Diagram: an R×R filter is applied to an H×H input image (feature map) via element-wise multiplication and partial sum (psum) accumulation, producing one pixel of the E×E output image]

7

High-Dimensional CNN Convolution

[Diagram: sliding window processing; the R×R filter slides across the H×H input image to produce the E×E output image]

8

High-Dimensional CNN Convolution

[Diagram: many input channels (C); the filter and the input image both have C channels, which are accumulated into one output channel]

9

High-Dimensional CNN Convolution

[Diagram: many filters (M), each of size R×R×C, produce many output channels (M)]

10

High-Dimensional CNN Convolution

[Diagram: many input images (N) processed with the same M filters produce many output images (N)]
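Putting these dimensions together, the full CONV-layer computation is a seven-way loop nest over (N, M, C, E, E, R, R). The sketch below is a minimal Python reference of that math only (naive loops, stride 1 and no padding assumed; this is not the Eyeriss dataflow, just the computation it implements):

```python
import numpy as np

def conv_layer(images, filters):
    """Naive CONV layer: images [N, C, H, H], filters [M, C, R, R] -> outputs [N, M, E, E].
    Assumes stride 1 and no padding, so E = H - R + 1."""
    N, C, H, _ = images.shape
    M, _, R, _ = filters.shape
    E = H - R + 1
    outputs = np.zeros((N, M, E, E))
    for n in range(N):              # many input images (N)
        for m in range(M):          # many filters / output channels (M)
            for y in range(E):      # output rows
                for x in range(E):  # output columns (sliding window)
                    psum = 0.0
                    for c in range(C):          # many input channels (C)
                        for i in range(R):      # filter rows
                            for j in range(R):  # filter columns
                                psum += filters[m, c, i, j] * images[n, c, y + i, x + j]
                    outputs[n, m, y, x] = psum
    return outputs
```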

11

Memory Access is the Bottleneck

[Diagram: each MAC* at the ALU requires three memory reads (filter weight, image pixel, partial sum) and one memory write (updated partial sum)]

* multiply-and-accumulate

12

Memory Access is the Bottleneck

Worst Case: all memory reads and writes are DRAM accesses

[Diagram: every MAC* read and write goes to DRAM]

* multiply-and-accumulate

• Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required
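A quick check of where the 2896M figure comes from (my own arithmetic, using the per-MAC access counts from the previous slide):

```python
macs = 724e6               # AlexNet MAC count quoted on the slide
accesses_per_mac = 3 + 1   # reads: filter weight, image pixel, partial sum; write: updated psum
dram_accesses = macs * accesses_per_mac
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")  # 2896M, matching the slide
```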

13

Memory Access is the Bottleneck

Extra levels of local memory hierarchy

[Diagram: local memories (Mem) inserted between DRAM and the ALU on both the read and write paths of each MAC]

14

Memory Access is the Bottleneck

Extra levels of local memory hierarchy

Opportunities: (1) data reuse, (2) local accumulation

[Diagram: local memories (Mem) between DRAM and the ALU enable reuse on the read path and accumulation on the write path]

15

Types of Data Reuse in CNN

Convolutional Reuse
• CONV layers only (sliding window)
• Reuse: filter weights and image pixels

Image Reuse
• CONV and FC layers
• Reuse: image pixels (the same image is used by multiple filters)

Filter Reuse
• CONV and FC layers (batch size > 1)
• Reuse: filter weights (the same filter is used on multiple images)
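These reuse opportunities can be bounded directly from the loop nest shown earlier. The formulas and the example shape below are my own back-of-the-envelope illustration (the slides only name the reuse types), assuming stride 1:

```python
def max_reuse_per_datum(N, M, C, R, E):
    """Upper bounds on how many MACs each datum can participate in
    (derived from the CONV loop nest above, not stated on the slides)."""
    return {
        # convolutional reuse (E*E sliding-window positions) x filter reuse over the batch (N)
        "filter_weight": E * E * N,
        # convolutional reuse (<= R*R filter positions) x image reuse over the filters (M)
        "image_pixel": R * R * M,
        # local accumulation: C*R*R MACs accumulate into each output pixel
        "output_psum": C * R * R,
    }

# Hypothetical AlexNet-CONV1-like shape, for illustration only
print(max_reuse_per_datum(N=16, M=96, C=3, R=11, E=55))
```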

16

Memory Access is the Bottleneck

Extra levels of local memory hierarchy

Opportunities: (1) data reuse, (2) local accumulation

1) Can reduce DRAM reads of filter/image by up to 500×** (** AlexNet CONV layers)

17

Memory Access is the Bottleneck

Extra levels of local memory hierarchy

Opportunities: (1) data reuse, (2) local accumulation

1) Can reduce DRAM reads of filter/image by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

18

Memory Access is the Bottleneck

Opportunities: (1) data reuse, (2) local accumulation

1) Can reduce DRAM reads of filter/image by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

• Example: DRAM access in AlexNet can be reduced from 2896M to 61M (best case)

19

Spatial Architecture for CNN

[Diagram: DRAM feeds a Global Buffer (100 – 500 kB), which feeds a spatial array of Processing Elements (PEs); each PE contains an ALU, control, and a register file (Reg File, 0.5 – 1.0 kB)]

Local Memory Hierarchy
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)

20

Low-Cost Local Data Access

Storage hierarchy used to fetch data for a MAC at the ALU:
• PE-local Reg File (RF): 0.5 – 1.0 kB
• Inter-PE NoC: 200 – 1000 PEs
• Global Buffer: 100 – 500 kB
• DRAM

Normalized Energy Cost* per ALU access:
• RF → ALU: 1× (reference)
• PE → ALU (inter-PE NoC): 2×
• Buffer → ALU: 6×
• DRAM → ALU: 200×

* measured from a commercial 65nm process
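These relative costs suggest a simple first-order energy model: weight the access count at each storage level by its normalized cost. A minimal sketch, assuming the cost values from the table above; the access-count breakdowns are hypothetical inputs, not Eyeriss data:

```python
# Normalized per-access energy costs from the 65nm measurements above
ENERGY_COST = {"RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def dataflow_energy(accesses, macs, alu_cost=1):
    """First-order estimate: sum of (access count x level cost), plus the ALU ops themselves.
    `accesses` maps memory level -> number of accesses (hypothetical example values)."""
    mem = sum(ENERGY_COST[level] * count for level, count in accesses.items())
    return mem + alu_cost * macs

# Illustrative only: the same MAC count served mostly from local storage vs. entirely from DRAM
local_reuse = dataflow_energy({"RF": 3.0e9, "NoC": 2e8, "buffer": 1e8, "DRAM": 6e7}, macs=7.24e8)
dram_only   = dataflow_energy({"DRAM": 2.896e9}, macs=7.24e8)
print(f"local-reuse energy is {local_reuse / dram_only:.2f}x of the DRAM-only case")
```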

21

Low-Cost Local Data Access

[Same normalized energy-cost figure as the previous slide: RF 1×, PE 2×, Buffer 6×, DRAM 200×]

How to exploit data reuse and local accumulation with limited low-cost local storage?

→ A specialized processing dataflow is required!

22

Taxonomy of Existing Dataflows

• Weight Stationary (WS)

• Output Stationary (OS)

• No Local Reuse (NLR)

23

Weight Stationary (WS)

• Minimize weight read energy consumption
  − maximize convolutional and filter reuse of weights

• Examples: [Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]

[Diagram: weights (W0 – W7) stay resident in the PEs while pixels and psums stream through from the Global Buffer]

24

Output Stationary (OS)

• Minimize partial sum R/W energy consumption
  − maximize local accumulation

• Examples: [Gupta, ICML 2015] [ShiDianNao, ISCA 2015] [Peemen, ICCD 2013]

[Diagram: partial sums (P0 – P7) stay resident in the PEs while pixels and weights stream through from the Global Buffer]

25

No Local Reuse (NLR)

• Use a large global buffer as shared storage
  − reduce DRAM access energy consumption

• Examples: [DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]

[Diagram: pixels, weights, and psums all move between the Global Buffer and the PEs; nothing is kept in PE-local storage]
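The taxonomy boils down to which operand is pinned in the storage closest to the ALU. As a rough 1D, single-PE illustration (my own naming; real designs tile these loops across the array, and NLR keeps nothing in PE-local storage at all):

```python
def weight_stationary(weights, pixels, out_len):
    """WS: hold one weight in the PE register file, stream pixels and psums past it."""
    psums = [0.0] * out_len
    for w_idx, w in enumerate(weights):      # weight held stationary (in RF)
        for o in range(out_len):             # psums stream through (read-modify-write each MAC)
            psums[o] += w * pixels[o + w_idx]
    return psums

def output_stationary(weights, pixels, out_len):
    """OS: hold one partial sum in the PE register file, stream weights and pixels past it."""
    psums = [0.0] * out_len
    for o in range(out_len):                 # psum held stationary (in RF)
        for w_idx, w in enumerate(weights):  # weights and pixels stream through
            psums[o] += w * pixels[o + w_idx]
    return psums

# Both produce the same 1D outputs; only which datum stays put in the RF differs.
assert weight_stationary([1, 2, 3], [1, 1, 1, 1, 1], 3) == output_stationary([1, 2, 3], [1, 1, 1, 1, 1], 3)
```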

26

Energy Efficiency Comparison

[Bar chart: normalized energy per MAC (0 – 2) for the CNN dataflows WS, OSA, OSB, OSC (variants of OS), NLR, and Row Stationary (RS)]

Setup: • Same total area • 256 PEs • AlexNet configuration* • Batch size = 16

* AlexNet CONV layers

27

Energy-Efficient Dataflow: Row Stationary (RS)

• Maximize reuse and accumulation at the RF

• Optimize for overall energy efficiency instead of for only a certain data type

28

1D Row Convolution in PE

[Diagram: Filter row * Input Image row = Output Image row]

29

1D Row Convolution in PE

[Diagram: filter row (a b c) * input image row (a b c d e) = partial sums; the PE's Reg File holds the filter row, a sliding window of the image row, and the current partial sum]

30

1D Row Convolution in PE

[Diagram: step 1: the filter row (a b c) is multiplied with the first image window (a b c) and the first partial sum is accumulated in the Reg File]

31

1D Row Convolution in PE

[Diagram: step 2: the window slides by one pixel (b c d) and the second partial sum is accumulated in the Reg File]

32

1D Row Convolution in PE

[Diagram: step 3: the window slides again (c d e) and the third partial sum is accumulated in the Reg File]

33

1D Row Convolution in PE

• Maximize row convolutional reuse in RF
  − Keep a filter row and image sliding window in RF

• Maximize row psum accumulation in RF
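A minimal sketch of what one PE does for this 1D row convolution, with the filter row and a sliding image window held in a small register file (variable names are mine; the slides only show the a/b/c animation):

```python
def pe_1d_row_conv(filter_row, image_row):
    """One PE: convolve a single filter row with a single image row.
    The filter row and the current image window stay in the register file;
    each output psum is accumulated locally before being written out."""
    R = len(filter_row)
    rf_filter = list(filter_row)                 # filter row resident in RF
    psums = []
    for start in range(len(image_row) - R + 1):
        rf_window = image_row[start:start + R]   # sliding window held in RF
        psum = 0.0                               # psum accumulated in RF
        for w, x in zip(rf_filter, rf_window):
            psum += w * x
        psums.append(psum)
    return psums

# Example matching the slide animation: 3-tap filter over a 5-pixel row -> 3 psums
print(pe_1d_row_conv([1, 2, 3], [1, 2, 3, 4, 5]))  # [14, 20, 26]
```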

34

2D Convolution in PE Array

[Diagram: PE 1 computes Filter Row 1 * Image Row 1]

35

2D Convolution in PE Array

[Diagram: PE 1 – PE 3 compute Filter Rows 1 – 3 * Image Rows 1 – 3; their psums combine into Output Row 1]

36

2D Convolution in PE Array

[Diagram: a second PE column (PE 4 – PE 6) computes Filter Rows 1 – 3 * Image Rows 2 – 4 to produce Output Row 2]

37

2D Convolution in PE Array

[Diagram: full 3×3 mapping
  PE 1: Filter Row 1 * Image Row 1   PE 4: Filter Row 1 * Image Row 2   PE 7: Filter Row 1 * Image Row 3
  PE 2: Filter Row 2 * Image Row 2   PE 5: Filter Row 2 * Image Row 3   PE 8: Filter Row 2 * Image Row 4
  PE 3: Filter Row 3 * Image Row 3   PE 6: Filter Row 3 * Image Row 4   PE 9: Filter Row 3 * Image Row 5
  Each column of PEs produces one output row (Output Rows 1 – 3)]

38

Convolutional Reuse Maximized

[Diagram: each filter row (Rows 1 – 3) is shared by all PEs in the same array row]

Filter rows are reused across PEs horizontally

39

Convolutional Reuse Maximized

[Diagram: each image row (Rows 1 – 5) feeds all PEs lying on the same diagonal]

Image rows are reused across PEs diagonally

40

Maximize 2D Accumulation in PE Array

[Diagram: the psums from each PE column are summed to produce Output Rows 1 – 3]

Partial sums accumulate across PEs vertically
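Putting these three slides together: in the Row Stationary mapping, the PE at array position (i, j) convolves filter row i with image row i + j, and summing down each column yields output row j. A small sketch of that mapping (indices and helper names are mine), reusing the pe_1d_row_conv sketch above:

```python
def rs_2d_conv(filter_rows, image_rows):
    """Row Stationary 2D convolution on an R x E logical PE array.
    PE (i, j) runs pe_1d_row_conv(filter_rows[i], image_rows[i + j]);
    psums accumulate down each column j to form output row j."""
    R = len(filter_rows)
    E = len(image_rows) - R + 1           # number of output rows
    output = []
    for j in range(E):                    # one PE column per output row
        col_psums = None
        for i in range(R):                # PEs in the column, one per filter row
            psums = pe_1d_row_conv(filter_rows[i], image_rows[i + j])
            col_psums = psums if col_psums is None else [a + b for a, b in zip(col_psums, psums)]
        output.append(col_psums)          # vertical accumulation gives output row j
    return output
```

In this form, filter row i appears in every column (horizontal reuse), image row r feeds every PE with i + j = r (diagonal reuse), and the sum over i is the vertical psum accumulation.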

41

Dimensions Beyond 2D Convolution
1 Multiple Images   2 Multiple Filters   3 Multiple Channels

42

Filter Reuse in PE (1 Multiple Images)

[Diagram:
  Filter 1 Row 1 * Image 1, Channel 1, Row 1 = Psum 1 Row 1
  Filter 1 Row 1 * Image 2, Channel 1, Row 1 = Psum 2 Row 1
  Both computations share the same filter row]

Processing in PE: concatenate image rows

  Filter 1 Row 1 * (Image 1 & 2, Channel 1, Row 1 concatenated) = Psum 1 & 2 Row 1

43

Image Reuse in PE (2 Multiple Filters)

[Diagram:
  Filter 1 Row 1 * Image 1, Channel 1, Row 1 = Psum 1 Row 1
  Filter 2 Row 1 * Image 1, Channel 1, Row 1 = Psum 2 Row 1
  Both computations share the same image row]

Processing in PE: interleave filter rows

  (Filter 1 & 2, Row 1 interleaved) * Image 1, Channel 1, Row 1 = Psum 1 & 2 Row 1

44

Channel Accumulation in PE (3 Multiple Channels)

[Diagram:
  Filter 1, Channel 1, Row 1 * Image 1, Channel 1, Row 1 = Psum 1 Row 1
  Filter 1, Channel 2, Row 1 * Image 1, Channel 2, Row 1 = Psum 1 Row 1
  The two psum rows are added (accumulate psums): Row 1 + Row 1 = Row 1]

45

Channel Accumulation in PE (3 Multiple Channels)

Processing in PE: interleave channels, accumulate psums

[Diagram: Filter 1 (Channels 1 & 2 interleaved) * Image 1 (Channels 1 & 2 interleaved) = accumulated Psum Row 1]
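A compact way to express what one PE does with these three tricks: image rows from different images are concatenated to reuse a filter row, filter rows from different filters are interleaved to reuse an image row, and rows from different channels are interleaved and accumulated into one psum. The sketch below is my own serial rendering of that behavior (a real PE interleaves these in time over one multiplier), again reusing pe_1d_row_conv:

```python
def pe_multi_dim_row_conv(filter_rows, image_rows):
    """filter_rows[m][c]: row of filter m, channel c (interleaved filters and channels).
    image_rows[n][c]: row of image n, channel c (concatenated images, interleaved channels).
    Returns psums[m][n]: one accumulated psum row per (filter, image) pair."""
    M, C = len(filter_rows), len(filter_rows[0])
    N = len(image_rows)
    psums = [[None] * N for _ in range(M)]
    for n in range(N):                # concatenated image rows reuse each filter row
        for m in range(M):            # interleaved filter rows reuse each image row
            acc = None
            for c in range(C):        # interleaved channels accumulate into one psum row
                p = pe_1d_row_conv(filter_rows[m][c], image_rows[n][c])
                acc = p if acc is None else [a + b for a, b in zip(acc, p)]
            psums[m][n] = acc
    return psums
```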

46

CNN Convolution – The Full Picture

[Diagram: the 3×3 PE-array mapping from before (PE 1 – PE 9, Filter Rows 1 – 3 against Image Rows 1 – 5), with each PE additionally handling:
  Multiple images:   Filter 1 * Image 1 & 2 = Psum 1 & 2
  Multiple filters:  Filter 1 & 2 * Image 1 = Psum 1 & 2
  Multiple channels: Filter 1 * Image 1 = Psum (accumulated over channels)]

47

Simulation Results

• Same total hardware area

• 256 PEs

• AlexNet Configuration

• Batch size = 16

48

Dataflow Comparison: CONV Layers

RS uses 1.4× – 2.5× lower energy than other dataflows

[Bar chart: normalized energy/MAC (0 – 2) for the CNN dataflows WS, OSA, OSB, OSC, NLR, RS, broken down by ALU, RF, NoC, buffer, and DRAM]

49

Dataflow Comparison: CONV Layers

[Bar chart: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, RS, broken down by data type: pixels, weights, psums]

RS optimizes for the best overall energy efficiency

50

Dataflow Comparison: FC Layers

[Bar chart: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, RS, broken down by pixels, weights, psums]

RS uses at least 1.3× lower energy than other dataflows

51

Row Stationary: Layer Breakdown

[Bar chart: normalized energy (1 MAC = 1), from 0 to 2.0e10, for AlexNet layers L1 – L8, broken down by ALU, RF, NoC, buffer, and DRAM]

• CONV layers (L1 – L5): RF dominates; 80% of total energy
• FC layers (L6 – L8): DRAM dominates; 20% of total energy

52

Summary

• We propose a Row Stationary (RS) dataflow to exploit the low-cost local memories in a spatial architecture.

• RS optimizes for the best overall energy efficiency, while existing CNN dataflows focus only on certain data types.

• RS has higher energy efficiency than existing dataflows:
  − 1.4× – 2.5× higher in CONV layers
  − at least 1.3× higher in FC layers (batch size ≥ 16)

• We have verified RS in a fabricated CNN processor chip, Eyeriss.

[Die photo: Eyeriss, 4000 µm × 4000 µm]

53

Thank You

Learn more about Eyeriss at

http://eyeriss.mit.edu
