gap8 iot application processor a pulp/riscv based...

23
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR- SENSOR ANALYTICS Eric Flamand. CoFounder & CTO Greenwaves Technologies

Upload: others

Post on 31-Oct-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

GAP8 IOT Application Processor

A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR ANALYTICS

Eric Flamand. CoFounder & CTOGreenwaves Technologies

Page 2: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

2

?

Page 3: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Cost of transporting these data over the air ?

Even assuming distributed computing is marginaly more efficient than centralized we win big if data volume to be exchanged over the air is srinked by several order of magnitude moving from quantitative data to qualitative data!

Serial short reach link: best results around 0.5 pJ/bitLTE: between 300 and 600 uJ/bit

3

Page 4: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

So in practice ….If we want data reach sensors

Move from (raw) data to meta data (abstract/pertinent)Perform this transformation close to sensorWhile fitting in a tight power and cost budgetAnd being seamlessly integrated to the Internet over the air

4

Page 5: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Three main sources of intensive data

• Image: Raw input in the order of 100KB/s for a small sensor

• Scene classification• Posture analysis• Identification

• Voice/Sound: Raw input in the order of 10KB/s per mic

• Recognition• Identification• Signature analysis

• Vibrations: Raw input in the order of 10KB/s

• Preventive maintenance• Monitoring

Once properly processed, common denominator is:extremely compact output (single index, alarm, …)

Output is a single index

Output is a single index

Output is a single index or an alarm

Bandwidth is reduced by

several order of magnitude

5

Page 6: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

What we want to achieve

Giga/Mega Bytes per second of incoming raw data from sensors

Few (Kilo) Bytes per second of outgoing, heavily processed data @ minimum

Joule per operation

6

Page 7: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

System level view

Long range, low BW

Short range, BW

Low rate (periodic) data

SW update, commands

U Controller

L2 Memory

IOs

ULP

Parallel Processor

Battery powered

Sense Analyze and Classify,Manage radio diversity

Transmit

LTE CatMNB-IOT…..

Resolution just enough for the job, order of QVGA/VGA

Standard SW stack here (legacy)

Hooked as an a computing accelerator. This is where we want to make the difference

7

Page 8: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

General pattern for content understanding

• Extract descriptors from raw data• 2D: Corners, blobs, HOG, DOG, …• 1D: LPC coefficients, Cepstral coeffs, …

• Use descriptors to classify data among representative families• Machine learning (CNN, SVM, Boost), Bayesian, ….

Usually highly parallel

Also highly parallel

8

Page 9: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

GAP8: Ultra Low Power IoT Processor

GAP8 has a unique energy efficiency across a very large range of computing power

Architecture efficiency• Extended Risc-V ISA• Low contention shared memory 8 +1

core clustered architecture• Tight synchronization• CNN based pattern matching engine

(HWCE)

Application affinity• Dominant signal processing

part• Limited memory requirement• Limited SW legacy

Performances• up to 12GOPS• up to 0.4GOPS @ 1mW, • up to 40MOPS @ 300uW• 3 uWatt stand-by power

consumption

Leveraging open source projects• Risc-V (Berkeley)• PULP (ETHZ, UniBo)

HW features• Smart IOs• Voltage

regulator/DVFS • RTC• Secured execution

Low cost processor

• 55nm LP• 0.5MB L2• aQFN 84

Core

0

Core

1

Core

2

Core

3

Core

4

Core

5

Core

6

Core

7

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

HWCE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOsJtag

FabricCtrler

I$

L1Mic

ro D

MA

ClkDbg

Rom

PMU

DC/DC

RTC

Page 10: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

Greenwaves Technologies. Confidential 10

eR

ISC

-V

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

CNN-HWE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOs

HyperBus

eRISC-V

I$

L1

Mic

ro D

MA

ClkDbg

Rom

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

GAP8 Hierarchical Architecture

monitoring event qualification,protocol stack,system control

data analysis & classification,SW modem

Smart I/Osvoltage regulator & RTCSRAM in retentive mode

extended RISC-V extended RISC-Vefficient 8 core parallelization

HW synchronizationshared instruction cache

CNN HW engine

quasi stand-by low computing power high computing power

uWs mWs 10 to 20 mWs

primary energy consumption

primary energy consumption

Page 11: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

Greenwaves Technologies. Confidential 11

GAP8 architectural energy efficiency gains

data analysis & classification,

SW modem

extended RISC-Vefficient 8 core parallelization

HW synchronizationshared instruction cache

CNN HW engine

3-5x

1.4x

5x

1.5x

overall, in practice on

targeted algorithms,

typically 20x

eR

ISC

-V

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

CNN-HWE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOs

HyperBus

eRISC-V

I$

L1

Mic

ro D

MA

ClkDbg

Rom

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

Page 12: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

Greenwaves Technologies. Confidential 12

GAP8 Advanced Power Management

eR

ISC

-V

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

CNN-HWE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOs

HyperBus

eRISC-V

I$

L1

Mic

ro D

MA

ClkDbg

Rom

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

✔ Embedded DC/DC, low current✔ Real Time Clock 32KHz only✔ L2 Memory partially retentive

MCU sleep mode

eR

ISC

-V

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

CNN-HWE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOs

HyperBus

eRISC-V

I$

L1

Mic

ro D

MA

ClkDbg

Rom

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

✔ Embedded DC/DC, high current✔ Voltage can dynamically change✔ One clock gen active, frequency can

dynamically change✔ Systematic clock gating

MCU active modeeR

ISC

-V

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

CNN-HWE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOs

HyperBus

eRISC-V

I$

L1

Mic

ro D

MA

ClkDbg

Rom

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

eR

ISC

-V

✔ Embedded DC/DC, high current✔ Voltage can dynamically change✔ Two clock gen active, frequencies can

dynamically change✔ Systematic Clock Gating

MCU + Parallel processor active mode

uW r

ange

1 m

W r

ange

10-2

0 m

W r

ange

Ultra fast switching time from one mode to anotherUltra fast voltage and frequency change time

Highly optimized system level power consumption

Page 13: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Qualitative data from real life applications

Page 14: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

The work horse for radio, sound and vibration: FFT

Key operations for performanceComplex MultiplicationsComplex RotationsPost modified accessesVectorial operations

All these butterflies are evaluated in parallel

Radix4 Butterfly

Page 15: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

The work horse for radio, sound and vibration: FFT

FFT 256 FFT 1024 FFT 4096

1167 4842 22710

Number of Cycles running on 8 cores

FFT 256 FFT 1024 FFT 4096

11264 56320 225280

Number of operations (*,+,>>,Ld/St)

Ideal means perfect scaling, perf on N cores = Perf on 1 core / N

Page 16: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

The work horse for radio, sound and vibration: FFT

ARM FFT1024 Q15 Data are with CMSIS optimized library

X 9X 12X 180

Page 17: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Visual Localization: FFT2D + HOG

1 2 4 80

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

FFT2D 256x256. Radix 4

FFT 2D 256 (Radix4)

Ideal

Number of Cores

Cyc

les

1 2 4 80

10

20

30

40

50

60

70

Histogram Of Gradients

8x8 Cells, Blocks: [2x2] Cells, 50% Overlap. 640x480 BW Image

actual

ideal

Number of Cores

Cyc

les

Pe

r P

ixe

l

1 384 000 cycles per image 589 000 cycles per image

We need only 2 MHz per image

Page 18: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

CNN based Image Classification32x32

5x5 Conv

MaxPool 2x2:1

Linear

28x28 28x28

MaxPool 2x2:1

14x14 14x14

5x5 Conv

MaxPool 2x2:1

10x10 10x10

MaxPool 2x2:1

5x5 5x5

1 10

1 8

1

1

1 12

12

8

Cifar10

Layer0 Layer1 Layer2 Total0

100000

200000

300000

400000

500000

600000

700000

800000

229049

473663

8330

711042

1782643088

4119

65033

Cifar10

1 Core

2 Cores4 Cores

8 Cores

8 Cores + HWCECyc

les

28x28

5x5 Conv

MaxPool 2x2:1

Linear

24x24 24x14

MaxPool 2x2:1

12x12 12x12

5x5 Conv

MaxPool 2x2:1

8x8 8x8

MaxPool 2x2:1

4x4 4x4

1 10

1 32

1

1

1 64

64

32

ReLu ReLu

24x24 24x141 32

ReLu ReLu

8x8 8x81 64

Mnist

0

1000000

2000000

3000000

4000000

5000000

6000000

7000000

8000000

9000000

889282

6706656

24787

7620725

57536

749451

9744

816731

Mnist

1 Core

2 Cores

4 Cores

8 Cores

8 Cores + HWCECyc

les

1 core to 8 cores + HWCE: 10.9 speedup 1 core to 8 cores + HWCE: 9.3 speedup

CIFAR10 MNIST

Page 19: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Global Normalization

Contrastive Normalization

Convolution 3x3

ReLU

MaxPool 2x2

Convolution 3x3

ReLU

MaxPool 2x2

Convolution 3x3

ReLU

MaxPool 2x2

Linear

ReLU

Linear

128 x 128

128 x 128

1 x 128 x 128 => 32 x 126 x 126

32 x 126 x126 => 32 x 63 x 63

32 x 63 x 63 => 32 x 61 x 61

32 x 61 x 61 => 32 x 30 x 30

32 x 61 x 61 => 32 x 61 x 61

32 x 30 x 30 => 32 x 28 x 28

32 x 28 x 28 => 32 x 14 x 14

32 x 28 x 28 => 32 x 28 x 28

6272 x 64 => 64

64 => 64

64 x 13 => 13

32 x 126 x 126 => 32 x 126 x 126

Trainable Par: 421 263Neurons: 1 511 904

050

100150200250300350400450

391.3

205.8

112.869.1

33.3

CNN 13 Layers, 128x128 Input, 14 Outputs

Time @ 250MHz, ms

33ms per image

CNN based Image Classification

Page 20: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

People Counting• Filtering + Difference of Gradient + SVM-RBF• Open Space. Accuracy: approx 90%• 1 Image every 3 minutes => 10 years on a battery

0

20000000

40000000

60000000

80000000

100000000

120000000

Cycles

0100000020000003000000400000050000006000000700000080000009000000

Cycles

x 88.6 X 7.4

Page 21: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Audio Processing

FFT

N sub bands, one triangle filter per sub band, mel scale Spectru m

3N Coefficients (Coeff, Coeff', Coeff'') at each sample time

0.65 MHz on 8 Cores

Page 22: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Hierarchical Power Processing?

Sound ? Phonem ? Word ?

Core

0

Core

1

Core

2

Core

3

Core

4

Core

5

Core

6

Core

7

Logarithmic Interconnect

Shared L1 Memory

Shared Instruction CacheDbg Unit

DMA

HWCE

HW Sync

ClusterL2 Memory

LVDSUARTSPII2SI2C

// 10bGPIOsJtag

FabricCtrler

I$

L1Mic

ro D

MA

ClkDbg

Rom

PMUDC/D

C

RTC

Gradually turn on resources

Page 23: GAP8 IOT Application Processor A PULP/RISCV BASED ...soiconsortium.eu/wp-content/uploads/2017/08/FDSOI...GAP8 IOT Application Processor A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR

SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand

Conclusion• GAP8 bridges the gap between ultra low power

MCU and multi-core processor:– Smart sensing on data rich sensors achievable

within tinny power budget: uW in Idle, mW in micro controller mode, 5-20mW in number crunching mode : Few Mops to up to 12Gops

– Low cost bill of material

• GAP8 agile power management architecture combined with IOT low duty cycling is a perfect fit for FDSOI process