processor in memory cortical extensions - … processor in memory cortical extensions paul franzon...

29
1 Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351, [email protected] Team members: Lee Baker, Sumon Dey, Josh Schabel, Weifu Li Funding:

Upload: doanmien

Post on 08-Jun-2018

254 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

1

Processor in Memory Cortical Extensions

Paul Franzon

Department of Electrical and Computer EngineeringNC State University

919.515.7351, [email protected]

Team members: Lee Baker, Sumon Dey, Josh Schabel, Weifu Li

Funding:

Page 2: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

5

Server Power Costs ~$20B/year worldwide

http://perspectives.mvdirona.com/2010/09/overall-data-center-costs/

Page 3: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

6

Energy/32-bit Op

0 5 10 15 20 25 30 35

Interposer IO3 mm onchip

TSV IO16 nm 16 MB DRAM core

16 KB L1 Cache (16 B)64 KB SRAM

8 MB Array RRAMLoad Store Unit

Integer ALUFPU

Energy (pJ) / 32 bit Op

7 nm 28 nm

Page 4: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

7

SIMD Compute Tile Single Instruction Multiple Data

16 Floating Point Units

Shared Control Configurable Register File

Scratch Pad SRAM

Caches

register write back0%decode stage

0%

fetch stage0%

operand collector

8%register file

39%

load-store unit1%

execution52%

Total power

32 GFLOPS/W in 65 nm on FFT Benchmark

Page 5: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

8

Cortical / Machine Learning Algorithms

These algorithms are interconnect & memory bound – speedup can be achieved in appropriate distributed architecture

Algorithm SuccessConvolutional Networks Image & speech recognition; Under intense

application driven explorationHierarchical TemporalMemory

Anomaly detection; “Digital” pattern matching

Cogent Confabulation Credit Car Fraud Detection; Text RecognitionSparsey Video action recognitionLSTM Incremental learningLSM Spiking network with STDP

Page 6: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

9

ASIP with PIM

Execution LaneExecution Lane

Execution Lane…

PIMPIM

PIM

Control PIM

SRAMSRAM

SRAM

x16

PE

x16

PE…

Rin

g

{

Application Specific Instruction Processor Processor in Memory

DRA

M

Page 7: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

10

Sparsey Hierarchical Recurrent Spatial & temporal

associativity Sparse distributed codes “One hot” learning Static, on/off synaptic

connections Sigmoid neuron model Highly parallelizable binary

and floating point arithmetic

Neurons in competitive modules (CMs) compete to be active

Page 8: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

11

HTM Hierarchical Recurrent Sparse distributed

representations “One hot” learning Statically connected synapses

for spatial learning and inference

Dynamically connected synapses for temporal learning, inference, and predictions

Predictions lead to stability Highly divergent binary and

integer operations

Column of neurons

Distal Segments

Page 9: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

12

Technical Approach: CUDA to ASIP

Fully manual identification process of bottlenecks, ISEs, and PIMs Could use automated ISE identification as in:

Deepti Shrimal and Manoj Kumar Jalin, “Instruction Customization: A challenge in ASIP realization,” IJAC, 2014. D. Shapiro et al., “Asips for artificial neural networks,” SACI, 2011.

Page 10: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

13

GPU Implementation of Sparsey “Server class” GPU

nVidia Tesla K20C with 2496 CUDA cores

KTH Video “action recognition” benchmark

Parameter ResultRun time 8 ms / framePower* ~210 W*Hot spots Weight

summation 80%

Softmax 10%SynapseMemory

~50 MB

800 cells

20,000 cells

80,000 cells

* datasheet

Page 11: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

14

GPU Implementation of Numenta HTM “Consumer class” GPU

392 core Nvidia GeForce GT 640 KTH Video “action recognition”

benchmark

Parameter ResultRun time 225 ms / framePower ~70 W*Hot spots Get Column

Overlap33%

ColumnInhibit 10%CreateSegment 44%

Memory ~100 MB

87%}* Datasheet

Page 12: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

15

Cognitive ISA Extensions PIM = Processor in Memory

Sparsey HTMBSUM F2I

IMAX, IMIN

I2F GETBIT

FMAX, FMIN ICOMP (GT,LT,EQ,GTE,LTE)

ONEHOT PIM-NODE-BITONIC-SORT-FP

PIM-SELECT-WINNER-LD PIM-NODE-BITONIC-SORT-INT

PIM-SELECT-WINNER-LE PIM-CLUSTER-MULTI-MURGE

PIM-UPDATE-WINNER PIM-GALAXY-MULTI-MERGE

PIM-WEIGHT-SUM LDI or STI to address zero is null

Page 13: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

16

Processor In Memory Accelerates key distributed tasks in both

algorithms E.g. Sparsey weight summation implemented as PIM Learning disabled results below (per frame of video) Enabling learning adds about 15% to run time

0 1 2 3 4 5 6 7 8 9 10

GPU

SIMD

SIMDWithPIM

8 ms

8.6 ms

0.39 ms

Page 14: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

19

HTM Instruction Set Extensions Vector-to-scalar ops

Min, Max

Binary-to-scalar ops

Scalar-to-binary ops

Page 15: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

20

HTM PIMs Bitonic Sorter

64 element integer sorter for sorting column IDs in spatial pooler

32 element FP sorter for sorting Euclidean distances in temporal pooler

Scatter-gather Temporal pooling phase, creating

new distal synapses Streams data from RAM like PIM Works on data from execution

lanes’ RF port

Page 16: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

21

Results @ 65 nm, 250 MHz

~17xSpeedup!!!

~16xSpeedup!!!

GPU Baseline

32 16-wide SIMD

32 16-wide SIMD w/ PIMs

Tech. node 28 nm 65 nm 65 nm

Sparsey

Latency per Frame

8 ms 9.26 ms 526 µs

Power 200 W* 2.025 W 2.025 W***

Logic Area 561 mm2 25.5 mm2 26.7 mm2

Performanceper Power

1 85 1490

Performance per Area

1 19 319

HTM

Latency per Frame

225 ms 100 ms 12.7 ms

Power 70 W** 2.308 W 2.308 W***

Logic Area 118 mm2 25.5 mm2 29.7 mm2

Performanceper Power

1 68 537

Performance per Area

1 10 70

*Tesla K20C datasheet, 706 MHz**GeForce GT 640 datasheet, 902 MHz***Does not compensate for when SIMD is stalled and PIMs are on. Just SIMD power.

Page 17: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

22

Results @ 65 nm, 250 MHz

~537xPower

efficient

1490x power

efficient

GPU Baseline

32 16-wide SIMD

32 16-wide SIMD w/ PIMs

Tech. node 28 nm 65 nm 65 nm

Sparsey

Latency per Frame

8 ms 9.26 ms 526 µs

Power 200 W* 2.025 W 2.025 W***

Logic Area 561 mm2 25.5 mm2 26.7 mm2

Performanceper Power

1 85 1490

Performance per Area

1 19 319

HTM

Latency per Frame

225 ms 100 ms 12.7 ms

Power 70 W** 2.308 W 2.308 W***

Logic Area 118 mm2 25.5 mm2 29.7 mm2

Performanceper Power

1 68 537

Performance per Area

1 10 70

Page 18: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

23

Results @ 65 nm, 250 MHz

~70xLess

Silicon

~319xLess

silicon

GPU Baseline

32 16-wide SIMD

32 16-wide SIMD w/ PIMs

Tech. node 28 nm 65 nm 65 nm

Sparsey

Latency per Frame

8 ms 9.26 ms 526 µs

Power 200 W* 2.025 W 2.025 W***

Logic Area 561 mm2 25.5 mm2 26.7 mm2

Performanceper Power

1 85 1490

Performance per Area

1 19 319

HTM

Latency per Frame

225 ms 100 ms 12.7 ms

Power 70 W** 2.308 W 2.308 W***

Logic Area 118 mm2 25.5 mm2 29.7 mm2

Performanceper Power

1 68 537

Performance per Area

1 10 70

Page 19: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

24

3D Implementation Use 3D DRAM to

supply 4 Tbps memory Now in test Most features except

PIM in this chip

130 nm

130 nm

DiRAM4

DBI

RDL

SRAMSIMD

8 um

Page 20: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

25

32 nm Tapeout Complete accelerator with ISEs and PiMs Still in fab

Page 21: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

26

Configurable Sparsey ASICLocal Interconnection (Mesh)

PE0 PE99….

Cluster Manager

Global Interconnect

ActivePEs

Shared Memory (2.25 Kb)

CM0 CM7….

….

PE ControllerCluster Controller

Inputs(i.e., FVs, RN)

PE Output

Shared Memory (0.25 Kb)

Cell0 Cell15….

CM ControllerPE Controller

Inputs(i.e., FVs, G)

Winner CellIndex

WTA ρ dist.

Select Winners with and without learning Update Weights

𝜌𝜌𝑚𝑚𝑚𝑚 𝑖𝑖 = �𝑐𝑐=0

𝑖𝑖

𝜌𝜌(𝑐𝑐)

SpatiotemporalFamiliarity

𝐺𝐺 = �𝑥𝑥=0

𝑄𝑄−1𝑉𝑉𝑋𝑋𝑄𝑄

Page 22: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

27

Sparsey ASIC

Factor GPU Baseline ASIC

Sparsey

KTH Run time 8 ms (1) 0.06 ms

Power 200 W 22 W

Logic Area 220 mm2

Performance/Power 1 1300

Page 23: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

28

HTM Processor Core

…FIFO

.

.

.

.

.

.

.

.

32Bit

Arithmetic Unit

Memory Controller

Configuration Data

Controller

Working Data

SRAMAddr 32 bit

Inter-Core Interface

.

.

.

.

.

.

Processing Element

Processing Element

Processing Element

Processor Core

Inter-Core Port (32 Bit)

Comparator Multiplexer

AdderAccumulatorMerge Tree

Page 24: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

29

HTM ASIC

Numenta HTM GPU ASIC

KTH run time 225 ms 0.63 ms

Power 70 W 298mW

Logic Area 5.5mm2

Performance/Power 1 47,100

Page 25: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

30

Memory Usage HTM and Sparsey can work with unlabeled data

and in-line They essentially do this by remembering everything and

organize the associativity in time and space based on the learned data

Page 26: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

31

Memory Usage Fully customized 65 nm ASICs for Sparsey and HTM

SIMD with PiM Solution @ 30 fps

Small by real world standards

Algorithm Bandwidth Capacity Frame RateSparsey 530 Tb/s 800 Mb 16000 fpsNumenta HTM 0.2Tb/ps 524 Mb 1600 fps

Algorithm Bandwidth CapacitySparsey 4.8 Tbps 314 MbNumenta 0.9 Tbps 1130 Mb

Page 27: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

33

Tezzaron DiRAM4 “Dis-integrated” RAM

DRAM only tiers Sense amps, etc. on logic

only tiers 32 bit DQ / port

Family Name Ports Banks per

Port Interface Density Data Bandwidth Latency

DiRAM4-64C64™

64 64 0.6 – 1.3VCMOS I/O

64 Gb 4/4 Tb/s(R/W)

9 ns

DiRAM4-64C32™

64 64 0.6 – 1.3VCMOS I/O

32 Gb 4/4 Tb/s(R/W)

9 ns

DiRAM4-64C16™

64 64 0.6 – 1.3VCMOS I/O

16 Gb 4/4 Tb/s(R/W)

9 ns

Page 28: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

34

Custom HW for LSTM Early Performance/Power Comparisons

GPU: Single precision storage and arithmetic ASIC: Half-precision storage and arithmetic

Name GPU ASIC XTechnology 28 nm 32 nmArea (mm2) 118 54.15 2.18Speed (ms) 93 8.5 11.0Power (W) 65 27.3 2.38

Speed/power 26.18

Page 29: Processor in Memory Cortical Extensions - … Processor in Memory Cortical Extensions Paul Franzon Department of Electrical and Computer Engineering NC State University 919.515.7351,

36

Conclusions CPU + PiM extensions lead to (c.f. GPU)

10x – 100x improvement in performance 103 – 104 improvement in performance/powerWhile preserving generality

ASIC gives another 10x in performance, and 1x to 100x in performance/power

Memory capacity and bandwidth management key to success Customized Memory/PiM mating