14.5: ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI

© 2017 IEEE International Solid-State Circuits Conference 1 of 56

ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI

Bert Moons, Roel Uytterhoeven, Wim Dehaene, Marian Verhelst

ESAT/MICAS - KU Leuven

Embedded Neural Networks

Applications: Augmented Reality, Face Recognition, Artificial Intelligence

[Figure: cloud processing on a cloud GPU (raw data, information) versus local processing.]

1-to-10 TOPS/W CNN processing is crucial for always-on embedded operation.

Always-on Neural Networks

Large-scale, highly accurate CNNs are too expensive for embedded always-on operation.

VGG-16 Recognition on LFW*

Classes:                               5760
Accuracy:                              92.5%
Complexity:                            15.4 GMACs
Model size:                            15 MB
Processing energy / frame @ 1 TOPS/W:  ~30 mJ/f
Power @ 30 fps:                        ~900 mW

AAA battery, 1200 mAh - 1.5 V: drains in 2 h

[*] Labeled Faces in the Wild data set
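
The figures on this slide can be re-derived with a few lines of arithmetic. The sketch below assumes that one MAC counts as two operations (multiply and add); that convention is not stated on the slide, and the function names are illustrative only.

```python
OPS_PER_MAC = 2  # assumption: one MAC = one multiply + one add


def energy_per_frame_j(macs_per_frame, tops_per_watt):
    """Energy per frame in joules at a given efficiency in TOPS/W."""
    ops = macs_per_frame * OPS_PER_MAC
    return ops / (tops_per_watt * 1e12)


def battery_life_h(capacity_mah, voltage_v, power_w):
    """Hours until a battery of the given capacity is drained."""
    return (capacity_mah / 1000.0) * voltage_v / power_w


energy_j = energy_per_frame_j(15.4e9, 1.0)    # VGG-16 at 1 TOPS/W
power_w = energy_j * 30                       # 30 frames per second
life_h = battery_life_h(1200, 1.5, power_w)   # 1200 mAh at 1.5 V

print(f"{energy_j * 1e3:.1f} mJ/frame, {power_w * 1e3:.0f} mW, {life_h:.2f} h")
```

Under that assumption the result is about 31 mJ per frame, roughly 0.92 W at 30 fps, and just under 2 h of battery life, in line with the rounded values on the slide.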

Presentation Outline

A. 1. Hierarchical Recognition
   2. DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling

B. 1. Hardware Implementation
   2. Results

Hierarchical recognition

Hierarchical processing enables always-on CNN-based visual recognition

Hierarchical Face Recognition

Hierarchical processing enables always-on compute

[Figure: cascade of four recognition stages linked by Y/N decisions. Number of classes, network size, FP precision and energy per frame increase from stage 1 to stage 4.]

Stage                        Duty cycle  Network  Workload    Model size  Values = 0  Precision  Accuracy
1. Face detected?            Always-on   CONV-1   6 MMACs     22 kB       5-44%       2-4b ops   94%
2. Owner detected?           ~1% on      CONV-2   12 MMACs    42 kB       8-45%       3-4b ops   96%
3. Friend detected?          ~0.1% on    CONV-3   500 MMACs   742 kB      8-47%       4-6b ops   94%
4. Large-scale recognition   ~0.01% on   CONV-4   15.4 GMACs  15 MB       5-82%       4-6b ops   92.5%

DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling

A run-time trade-off between energy and computational precision

Precision Scaling - DVAS

DVAS: Dynamic-Voltage-Accuracy-Scaling [4]

[Figure: standard 4b multiplier with inputs x3x2x1x0 and y3y2y1y0. The LSBs of both inputs are gated (x1, x0, y1, y0 forced to 0), so the output is z3z2z1z0 0000.]

As in [4] Moons, VLSI2016; Moons, JSSC2016
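
LSB gating can be illustrated with a small behavioral model. This is only a sketch for unsigned operands; the function names are hypothetical and it says nothing about how the gating is built in hardware.

```python
def gate_lsbs(value, total_bits, active_bits):
    """Force the (total_bits - active_bits) least significant bits to zero."""
    if not 0 < active_bits <= total_bits:
        raise ValueError("active_bits must be in 1..total_bits")
    mask = ((1 << active_bits) - 1) << (total_bits - active_bits)
    return value & mask


def dvas_multiply(x, y, total_bits=4, active_bits=2):
    """Multiply two unsigned operands after gating their LSBs."""
    return gate_lsbs(x, total_bits, active_bits) * gate_lsbs(y, total_bits, active_bits)


full = dvas_multiply(0b1011, 0b0110, active_bits=4)    # no gating
gated = dvas_multiply(0b1011, 0b0110, active_bits=2)   # two LSBs gated per input
print(f"{full:08b} {gated:08b}")
```

With two of four bits gated on each input, the four low-order bits of the 8b product are always zero, matching the "z3z2z1z0 0000" pattern in the figure.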

Precision Scaling - DVAFS

DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling

[Figure: subword-parallel multiplier. One multiplier computes two independent reduced-precision products: x11x01 * y11y01 = z31z21z11z01 and x10x00 * y10y00 = z30z20z10z00.]

DVAFS is a dynamic precision technique, lowering all run-time adaptable parameters: activity, frequency and supply voltage
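
A behavioral sketch of subword-parallel multiplication, assuming unsigned operands packed into one word. The helper names and the packing order (subword 0 in the low bits) are assumptions made for illustration, not details taken from the slides.

```python
def split_subwords(word, n_subwords, word_bits=16):
    """Split a packed word into n_subwords equal unsigned fields, low field first."""
    if word_bits % n_subwords:
        raise ValueError("word_bits must be divisible by n_subwords")
    sub_bits = word_bits // n_subwords
    mask = (1 << sub_bits) - 1
    return [(word >> (i * sub_bits)) & mask for i in range(n_subwords)]


def subword_parallel_multiply(x_word, y_word, n_subwords, word_bits=16):
    """Return the n_subwords independent products of matching subwords."""
    xs = split_subwords(x_word, n_subwords, word_bits)
    ys = split_subwords(y_word, n_subwords, word_bits)
    return [a * b for a, b in zip(xs, ys)]


print(subword_parallel_multiply(0x0305, 0x0204, 1))   # one 16b x 16b product
print(subword_parallel_multiply(0x0305, 0x0204, 2))   # two 8b x 8b products
print(subword_parallel_multiply(0b1110, 0b0111, 2, word_bits=4))  # 4b word as on the slide
```

The same pair of input words yields one, two or four products per call, which is why the clock frequency can drop at constant throughput when precision is reduced.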

Precision Scaling – System Level

DVAFS outperforms DVAS as it minimizes non-compute overheads at low precision

[Figure: energy/word bar charts for high precision, DVAS and DVAFS. Bar components: compute, memory, CTRL & transfer; the non-compute parts are labeled overhead.]

[Figure: relative energy / operation [-] versus precision [bits] at T = 76 GOPS. Annotated gains: 8x in DVAS, 20x in DVAFS.]
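
A first-order way to reason about these gains is the standard CMOS dynamic-energy relation, where energy per cycle scales with activity times the square of the supply voltage, and a DVAFS cycle processes N words. The sketch below is not the authors' model: the activity factor is left at a placeholder of 1.0 because the slides give no values, and the voltages are the ones annotated in the Measurement Results plots.

```python
def relative_energy_per_word(v, n_subwords=1, activity=1.0, v_ref=1.05):
    """Dynamic energy per word relative to one word per cycle at v_ref.

    Energy per cycle ~ activity * V^2; each cycle processes n_subwords words.
    """
    return activity * (v / v_ref) ** 2 / n_subwords


modes = [("1x16b", 1.05, 1), ("2x8b", 0.80, 2), ("4x4b", 0.67, 4)]
for label, v, n in modes:
    print(label, round(relative_energy_per_word(v, n), 3))
```

Voltage and subword count alone give roughly a 10x reduction at 4x4b in this model; the remaining gain on the slide would have to come from reduced switching activity and lower non-compute overhead, which this sketch does not capture.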

Precision Scaling – BB in FDSOI

• DVAFS modulates leakage-vs-dynamic balance
• Body-Bias tuning allows minimizing energy

[Figure: power @ f, split into dynamic and leakage, at BBnom and BBoptimal, for high precision and low precision.]

High precision: reduce VT, V @ constant (V - VT) and f
Low precision: increase VT, V @ constant (V - VT) and f

Processor Architecture

Exploits:

A. Parallelism and Data Reuse;
B. Network sparsity;
C. Varying precision through DVAFS.

Optimization: CNN Characteristics (A)

• Convolution operators are highly parallel
• Algorithm allows inherent data reuse

Three types of reuse supported in Envision [3]:

Convolutional reuse: one filter, one image
Image reuse: one image, multiple filters
Filter reuse: one filter, multiple images

[3] Chen, ISSCC2016

Optimization: CNN Characteristics (B, C)

• CNN weights and activations are sparse.
• Precision varies between apps, networks, layers

Sparsity (ReLU activations)

Network   Sparsity
LeNet-5   26-87%
AlexNet   5-90%
VGG       5-82%

Varying precision (non-uniform precision @ 99*% relative benchmark accuracy)

Network      Precision
LeNet-5      1-5 bits
AlexNet      4-9 bits
VGG (*95%)   4-6 bits

A 2D-SIMD DVAFS Architecture

[Figure: block diagram. Blocks: RISC CTRL with PM; DMA; IO en/de-coder; input processing; 2D-SIMD MAC array; 1D-SIMD ALU (ReLU, max-pool, MAC, ...); data memories DMA, DMB, DMC, DMD; guard memories (GRD).]

A 2D-SIMD DVAFS Architecture

[Figure: filter * image = partial sum.]

No Reuse in Scalar Solution
1 feature * 1 weight on one MAC unit (M). Feature SRAM: 1x16b. Filter SRAM: 1x16b.

Convolutional Reuse in 1D-SIMD
16 features * 1 weight on a row of MAC units with a FIFO. Feature SRAM: 16x16b / 1x16b. Filter SRAM: 1x16b.

Convolutional + Image Reuse in 2D-SIMD
16 features * 16 weights on a 2D array of MAC units with a FIFO. Feature SRAM: 16x16b / 1x16b. Filter SRAM: 16x16b.
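
The benefit of reuse can be expressed as SRAM words fetched per MAC. The sketch below rests on one reading of the bandwidth labels: that "16x16b / 1x16b" means 16 feature words on the first access and a single new word afterwards. That reading, and the function name, are assumptions rather than statements from the slides.

```python
def words_per_mac(feature_words, weight_words, mac_units):
    """SRAM words fetched per MAC operation in one step."""
    return (feature_words + weight_words) / mac_units


scalar = words_per_mac(1, 1, 1)       # 1 feature * 1 weight
simd_1d = words_per_mac(1, 1, 16)     # 16 features * 1 weight, 1 new feature word
simd_2d = words_per_mac(1, 16, 256)   # 16 features * 16 weights, 1 new feature word

print(scalar, simd_1d, simd_2d)
```

Under that reading, the fetch cost per MAC drops from 2 words in the scalar case to 0.125 in 1D-SIMD and about 0.066 in 2D-SIMD.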

A 2D-SIMD DVAFS Architecture

Cnv. + Image + Filter Reuse in 2D-SIMD DVAFS
16N features * 16N weights. Feature SRAM: 16x(Nx16b/N) / 1x(Nx16b/N). Filter SRAM: 16x(Nx16b/N).

Mode           Operands   Accumulators   MAC units
N = 1, 1x16b   16b        48b            256
N = 2, 2x8b    8b         2x24b          512
N = 4, 4x4b    4b         4x12b          1024

[Figure: 2D MAC array with FIFO and the MAC-unit datapath for each mode, including the status register (SR*). Parts of the datapath are marked unused in the N = 2 and N = 4 modes.]

*Status Register
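
The per-mode figures follow a simple pattern that can be written down directly. The sketch assumes a 16 x 16 array, two operations per MAC and the 200 MHz nominal clock listed in the comparison table; the function names are illustrative.

```python
def mac_units(n_subwords, rows=16, cols=16):
    """Effective MAC units when each physical MAC handles n_subwords subwords."""
    return rows * cols * n_subwords


def accumulator_bits(n_subwords, total_bits=48):
    """Width of each accumulator when the 48b accumulator is split n ways."""
    return total_bits // n_subwords


def peak_gops(n_subwords, f_hz=200e6, ops_per_mac=2):
    """Peak throughput in GOPS (assumption: one MAC = two operations)."""
    return mac_units(n_subwords) * ops_per_mac * f_hz / 1e9


for n in (1, 2, 4):
    print(n, mac_units(n), accumulator_bits(n), round(peak_gops(n), 1))
```

For N = 1 this gives about 102 GOPS, consistent with the "N x 102" peak GOPS entry in the comparison table.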

A 2D-SIMD DVAFS Architecture

Guard SRAM and 2D-Array from sparse operators [4]

[Figure: guard flags (GRD, 0 ... 1) held in GRD SRAM, alongside the feature SRAM, filter SRAM, FIFO and 2D MAC array.]

[4] Moons, VLSI 2016
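
Operator guarding can be sketched as a dot product that skips any multiply-accumulate whose feature or weight is flagged as zero. This is a behavioral illustration with hypothetical names; how the flags are stored and applied on the chip is not detailed in the slides.

```python
def guarded_dot(features, weights):
    """Dot product that skips MACs with a zero operand.

    Returns the accumulated sum and the number of MACs actually executed.
    """
    feature_flags = [f != 0 for f in features]   # 1 = operand is non-zero
    weight_flags = [w != 0 for w in weights]
    acc = 0
    executed = 0
    for f, w, f_ok, w_ok in zip(features, weights, feature_flags, weight_flags):
        if f_ok and w_ok:        # both operands non-zero: fetch and multiply
            acc += f * w
            executed += 1
    return acc, executed


result, macs = guarded_dot([0, 3, 0, 2], [5, 0, 7, 4])
print(result, macs)
```

The result equals the ordinary dot product, but only one of the four MACs is executed in this example.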

Flexible Memory / IO compression

[Figure: architecture block diagram with RISC CTRL, PM, DMA, IO en/de-coder, input processing, 2D-SIMD MAC array, 1D-SIMD ALU (ReLU, max-pool, MAC, ...), data memories DMA to DMD and GRD memories.]

As in [4] Moons, VLSI2016:

• C-programmable [4]
• 16b instructions [4]
• Huffman-based IO compression, up to 5.8x in AlexNet [4]
• 16 kB PM [4]
• 128 kB DM [4]
  o 3-wise parallel acc.
• 4 kB GRD SRAM [4]
  o sparsity flags
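
Why Huffman coding pays off on sparse data can be seen with a generic textbook Huffman construction. This is not the chip's encoder; the 16b raw word width and all names are assumptions made for the example.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def compression_ratio(symbols, raw_bits=16):
    """Raw size divided by Huffman-coded size (code table not counted)."""
    lengths = huffman_code_lengths(symbols)
    coded_bits = sum(lengths[s] for s in symbols)
    return raw_bits * len(symbols) / coded_bits


data = [0] * 12 + [1, 1, 2, 3]   # mostly zeros, as after a ReLU layer
lengths = huffman_code_lengths(data)
print(lengths, round(compression_ratio(data), 1))
```

The dominant zero symbol gets a 1-bit code, so the more zeros a layer produces, the higher the ratio. The ratio printed here is for this toy input only and is unrelated to the 5.8x reported for AlexNet.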

Physical Implementation

Efficiency and scalability through granular power and body-bias domains

Physical Implementation – 28 FDSOI

[Figure: architecture block diagram partitioned into three supply / body-bias domains: VMEM with BBGND, V2D with BB1, and VCTRL with BB2.]

[Figure: die photo, 1.29 mm x 1.45 mm (1.87 mm2), showing the 2D-SIMD MAC array, RISC, DMA and MEM.]

Measurement Results

Efficiencies from 0.25-to-10 TOPS/W depending on precision and network sparsity

[Figure: efficiency [TOPS/W] and voltage [V] versus throughput [GOPS] at BBnom, for the modes 1x16b, 2x8b, 4x4b and 30-60% sparse 4x3-4b.]

Annotated operating points at BBnom:

Mode                    Efficiency    Voltage
1x16b                   0.25 TOPS/W   1.05 V
2x8b                    1 TOPS/W      0.8 V
4x4b                    4 TOPS/W      0.67 V
30-60% sparse 4x3-4b    8.2 TOPS/W    0.61 V

[Figure: the same plots at BBnom and BBopt, with bar charts of leakage (L) and dynamic (D) power @ f, T.]

Body-bias tuning:

Point              BBnom                                     BBopt                                     Gain
Low efficiency     +/- 0.6 V, V = 0.85 V, 0.33 TOPS/W        +/- 1.2 V, V = 0.70 V, 0.53 TOPS/W        1.6x
Sparse 4x3-4b      +/- 0.6 V, V = 0.61 V, 8.2 TOPS/W         +/- 0.2 V, V = 0.63 V, 10 TOPS/W          1.2x

Overall range annotated on the plot: 40x.

Hierarchical Face Recognition Revisited

Hierarchical processing enables always-on compute

Stage   Duty cycle   Layers      Energy / frame   Efficiency
1       Always-on    2-4b CONV   3 uJ/f           4.2 TOPS/W
2       ~1% on       CONV        6 uJ/f           4 TOPS/W
3       ~0.1% on     CONV        500 uJ/f         1.8 TOPS/W
4       ~0.01% on    4-6b CONV   23100 uJ/f       1.3 TOPS/W

This functionality always-on at 6 uJ / frame average CONV-layer energy consumption
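
The 6 uJ / frame average can be checked as a duty-cycle-weighted sum of the per-stage energies. The sketch treats each "~x% on" label as the fraction of frames on which that stage runs, which is an interpretation of the slide; the stage names are illustrative.

```python
# (stage, energy per frame in uJ, fraction of frames on which the stage runs)
stages = [
    ("face detection", 3.0, 1.0),          # always-on
    ("owner detection", 6.0, 0.01),        # ~1% on
    ("friend detection", 500.0, 0.001),    # ~0.1% on
    ("large-scale recognition", 23100.0, 0.0001),  # ~0.01% on
]

average_uj = sum(energy * duty for _, energy, duty in stages)
print(f"{average_uj:.2f} uJ/frame")
```

The weighted sum comes to about 5.9 uJ per frame, matching the rounded 6 uJ figure; half of it is spent in the always-on first stage.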


Comparison

A. Highest scalability of Energy-vs-Computational Precision (40x)

B. Efficiencies up to 10 TOPS/W

Comparison with SotA

                        Eyeriss [3]        Moons [4]          This work
                        ISSCC '16          VLSI '16           N = 1, 2 or 4
Technology              65nm LP            40nm LP            28nm FDSOI
fnom                    200 MHz            200 MHz            200 MHz
V @ fnom                1 V                1.1 V              1 V
Peak GOPS               67                 102                N x 102
ANet CONV               278 mW @ 35 fps    76 mW @ 47 fps     44 mW @ 47 fps
VGG CONV                -                  -                  26 mW @ 1.7 fps
Power [mW] @ GOPSnom    235-332 (1.5x)     35-300 (8.5x)      7.5-300 (40x)
                        @ 46 GOPS          @ 80 GOPS          @ 76 GOPS
Min. Eff.               0.17 TOPS/W        0.27 TOPS/W        0.25 TOPS/W
Max. Eff.               0.25 TOPS/W        2.60 TOPS/W        10.0 TOPS/W
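
The power range quoted for this work follows from the efficiency range at the nominal throughput, since power equals throughput divided by efficiency. A minimal check, with an illustrative function name:

```python
def power_mw(throughput_gops, efficiency_tops_per_w):
    """Power in mW: GOPS divided by TOPS/W gives milliwatts."""
    return throughput_gops / efficiency_tops_per_w


p_max = power_mw(76, 0.25)   # lowest efficiency, 1x16b
p_min = power_mw(76, 10.0)   # highest efficiency, sparse 4x3-4b with body bias
print(p_max, p_min, p_max / p_min)
```

This gives roughly 304 mW and 7.6 mW, a 40x span, consistent with the rounded "7.5-300 (40x) @ 76 GOPS" entry.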

Comparison with SotA

[Figure: energy efficiency [TOPS/W] versus throughput [GOPS], both on logarithmic axes. Points: this work (4-bit, 8-bit, 16-bit), ID14.1, ID14.2, ID14.6, Moons [4] and Chen [3]; markers distinguish 2016 and 2017 designs.]

homes.esat.kuleuven.be/~mverhels/DLICsurvey.html

Summary

Envision: A 0.25-to-10 TOPS/W CNN processor, trading energy-vs-computational precision

• Always-on through hierarchical computing.

• An energy-efficient CNN-architecture:
  1. 2D-SIMD baseline;
  2. DVAFS-compatible;
  3. Operator guarding and IO-compression.

• Envision: a 0.25-to-10 TOPS/W @ 76 GOPS processor, varying with the required network precision.

Acknowledgement: This work was partly funded by FWO and Intel Corporation. We thank Synopsys for tool support and STMicroelectronics for silicon donation.