
Page 1:

In-Memory Computing based Machine Learning Accelerators: Opportunities and Challenges

KAUSHIK ROY

PURDUE UNIVERSITY

kaushik@purdue.edu

Page 2:

Challenge: Computational Demands and Technology Scaling

➢ Machine Learning (ML) has an enormous computational cost

➢ Dennard scaling has ended -> general-purpose systems have hit a performance/energy wall

➢ Domain-specific acceleration (hardware/software) is the key
 o Industry examples: Google TPU, Microsoft Brainwave, Nvidia Tensor Core

ML computational cost over years

AI and Compute, OpenAI & Strubell et al.

Page 3:

Edge Intelligence: Efficiency Gap

➢Case study: Object recognition in a smart glass with a state-of-the-art accelerator

*300 GOPs/inference

Ref: Venkataramani, S., Roy, K. and Raghunathan, A., "Efficient embedded learning for IoT devices," 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 308-311, IEEE.

Where do the inefficiencies come from?

Inefficiencies arise across the stack: algorithms, hardware architecture, circuits and devices.

RetinaNet DNN* on a smart glass (+ Google Edge TPU):
- Energy/op: 0.5 pJ/op
- Energy/frame: 0.15 J/frame
- Time-to-die (2.1 Wh battery): 64 mins
- Performance: 13.3 frames/sec
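The numbers above are mutually consistent; as a quick sanity check (my own back-of-envelope arithmetic, not from the slides), the frame energy and battery life follow directly from the per-op energy and workload size:

```python
# Back-of-envelope check of the smart-glass case study numbers.
ops_per_inference = 300e9      # 300 GOPs per RetinaNet inference (from the slide)
energy_per_op     = 0.5e-12    # 0.5 pJ/op
battery_wh        = 2.1        # 2.1 Wh battery
fps               = 13.3       # frames per second

energy_per_frame = ops_per_inference * energy_per_op   # -> 0.15 J/frame
battery_joules   = battery_wh * 3600                   # 2.1 Wh = 7560 J
frames_total     = battery_joules / energy_per_frame   # ~50,000 frames
minutes_to_die   = frames_total / fps / 60             # ~63 minutes

print(f"{energy_per_frame:.2f} J/frame, battery lasts ~{minutes_to_die:.0f} min")
```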

Page 4:

Background – In-Memory Computing

➢ Definition: a design approach that performs computation close to memory to overcome memory bottlenecks – bandwidth, energy

➢ Effective for simple arithmetic – bit-wise operations, fixed-point add/multiply, truth tables (ROMs/RAMs)

➢ Typical systems have much higher compute throughput than memory bandwidth

➢ A large fraction of chip area is memory components (>= 50% in the TPU)
 o Caches (L1, L2, …), register file, scratchpad, buffers

TPU floorplan, ISCA 2017; Computer Architecture: A Quantitative Approach

Page 5:

In-Memory Computing for ML

Typical memory access (2 KB array, 128 WLs × 128 BLs): bits accessed = 128; operation = read/write.

In-memory computing, analog: bits accessed = 4 × 128; operation = matrix-vector multiply, Vout = M * Vin, with VinT: 1×4 and Vout: 1×64 (M stored in the 2 KB array).

In-memory computing, digital: bits accessed = 2 × 128; operation = bit-wise AND, Vout = V1 AND V2.

Non-volatile memory: Jain et al. TVLSI'17; Ambrogio et al. Nature'18; Cai et al. Nature Elec.'19; Xue et al. ISSCC'20; Liu et al. ISSCC'20.
SRAM: Biswas et al. ISSCC'18; Valavi et al. JSSC'19; Si et al. ISSCC'19; Jaiswal et al. TVLSI'20; Dong et al. ISSCC'20 (TSMC 7nm).
Peripherals, ROM/RAM: Lee et al. EDL 2013; Lee et al. TVLSI 2013.

Page 6:

Machine Learning (Deep Learning)

➢ Deep learning needs lots of matrix multiplications

Bruce Fleischer et al, IBM Research, 2018

➢ Challenge: sustaining deep learning’s insatiable compute demands

Page 7:

Technology: Non-Volatile Memories

[Device schematics: (i) a neuro-mimetic summing element with excitatory input I+ = +Σ vi·gi and inhibitory input I– = –Σ vi·gi, ferromagnet/non-magnet (FM/NM) layers with a preset terminal, and an integrator producing VOUT from IREF ± |Iin|; (ii) RRAM: a filament of oxygen vacancies forms in a metal oxide between top and bottom electrodes as oxygen ions migrate; (iii) PCM: a phase-change switching region over a heater between top and bottom electrodes, surrounded by insulator; (iv) a three-terminal, gate-controlled device with VS = 0 V, VD, and gate pulses VG modulating the drain current ID.]

Page 8:

Non-volatile Memory Crossbars

[Crossbar schematic: input voltages V0 … VN drive the rows through input encoding and a WL decoder; each cross-point stores a conductance Gij behind an access device (a 2-terminal selector or a transistor); the column currents realize the MVM, Ij = Σi Iij = Σi Vi · Gij. Column peripherals: sample-and-hold (S&H), MUX and TIA, analog-to-digital conversion, and shift-and-add. The crossbar offers efficient MVM and high on-chip storage.]

NVM devices – two-terminal: PCM, RRAM, STT-MTJ; three-terminal: SHE-MTJ, SHE-DWM-MTJ.

The peripherals – analog-to-digital conversion: flash ADC (2^N comparators between VREF+ and VREF–, followed by a decoder producing the digital output) or SAR ADC (comparator, DAC, and SAR logic clocked by clk); current-to-voltage conversion: transimpedance amplifier converting the bitline current IBL to VOUT via a feedback resistor RF.
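To make the analog MVM concrete, here is a minimal Python/NumPy sketch (my own illustration, not code from the talk) of an ideal crossbar column current Ij = Σi Vi·Gij, with the input DAC and output ADC modeled as simple uniform quantizers:

```python
import numpy as np

def quantize(x, bits, x_max):
    """Uniform quantizer standing in for a DAC/ADC of the given resolution."""
    levels = 2 ** bits - 1
    x = np.clip(x, 0, x_max)
    return np.round(x / x_max * levels) / levels * x_max

def crossbar_mvm(v_in, G, dac_bits=4, adc_bits=8, v_max=1.0):
    """Ideal crossbar MVM: column currents I_j = sum_i V_i * G_ij."""
    v = quantize(v_in, dac_bits, v_max)          # DAC-limited input voltages
    i_col = v @ G                                # KCL sums the currents per column
    i_max = v_max * G.sum(axis=0).max()          # full-scale current for the ADC
    return quantize(i_col, adc_bits, i_max)      # ADC-limited column readout

# toy example: 4 rows x 64 columns of conductances (arbitrary units)
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 64))        # between G_OFF and G_ON
v_in = rng.uniform(0, 1.0, size=4)
print(crossbar_mvm(v_in, G)[:4])
```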

Page 9:

Architecture: Spatial scalability

Network on Chip

Massively parallel accelerator -> amenable to data-level parallelism -> highly efficient ML inference

Ankit, Roy, et al., "PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference," ASPLOS 2019.

Page 10:

Bit-slicing (weights and inputs)

➢ Bit-slicing: weight slicing and input bit streaming enable low-precision crossbars and low-precision DACs to compose a high-precision MVMU

Shift-and-add (SnA), weight slicing: a 4-bit weight is split into two 2-bit slices stored in adjacent columns, so
Vi * Wij = (Vi * Wij[1:0]) + ((Vi * Wij[3:2]) << 2)

Shift-and-add (SnA), bit streaming: the input is streamed one bit per cycle; with Vi = 5 ([101] in binary), each streamed bit's partial product is shifted by its bit position,
Vi * Wij = (1 * Wij) + ((0 * Wij) << 1) + ((1 * Wij) << 2)
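A small NumPy sketch (my own illustration of the scheme above, using the 2-bit weight slices and 1-bit input streaming described on this slide) showing that shift-and-add over the slices recomposes the full-precision MVM:

```python
import numpy as np

def sliced_mvm(x, W, w_bits=4, slice_bits=2):
    """MVM with 2-bit weight slices and 1-bit input streaming (shift-and-add)."""
    n_slices = w_bits // slice_bits
    x_bits = int(x.max()).bit_length() if x.max() > 0 else 1
    acc = np.zeros(W.shape[1], dtype=np.int64)
    for s in range(n_slices):                       # weight slices (LSB first)
        W_slice = (W >> (s * slice_bits)) & ((1 << slice_bits) - 1)
        for b in range(x_bits):                     # streamed input bits
            x_bit = (x >> b) & 1                    # one bit of every input
            partial = x_bit @ W_slice               # low-precision crossbar MVM
            acc += partial << (s * slice_bits + b)  # shift-and-add (SnA)
    return acc

rng = np.random.default_rng(1)
W = rng.integers(0, 16, size=(8, 4))                # 4-bit unsigned weights
x = rng.integers(0, 8, size=8)                      # 3-bit unsigned inputs
assert np.array_equal(sliced_mvm(x, W), x @ W)      # matches full-precision MVM
print(sliced_mvm(x, W))
```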

Page 11:

Architecture – Heterogeneous Core

One-layer MLP: z = M·x, a = Sig(z)

MLP maps onto one MVMU:
LD $mi1, x, n
MVM 0b01
ALU_SIG $1, $mo1, n
ST $1, a

MLP maps onto multiple MVMUs/cores:
MVM 0b01
LD xxx        // for all partial sums: bring partial sums from other cores
ALU_ADD xxx   // accumulate partial sums
ALU_SIG $1, $mo1, n

*MVMU is implemented using bit-slicing as described in PUMA (Ankit et al., ASPLOS 2019)

Larger matrices require more data movement to compute the effective MVM.
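A minimal sketch (my own illustration, assuming a hypothetical 128×128 MVMU size) of why larger layers need partial-sum movement: the weight matrix is tiled across MVMUs, each tile produces a partial result locally, and the partial sums must be gathered and accumulated before the activation:

```python
import numpy as np

XBAR = 128  # assumed MVMU (crossbar) dimension

def tiled_mvm(x, M):
    """Split M into XBAR x XBAR tiles; each tile is one MVMU's partial MVM."""
    rows, cols = M.shape
    z = np.zeros(cols)
    for r in range(0, rows, XBAR):               # tiles along the input dimension
        for c in range(0, cols, XBAR):           # tiles along the output dimension
            tile = M[r:r+XBAR, c:c+XBAR]         # weights held by one MVMU
            partial = x[r:r+XBAR] @ tile         # partial sum produced locally
            z[c:c+XBAR] += partial               # moved to and accumulated at one core
    return 1.0 / (1.0 + np.exp(-z))              # a = Sig(z)

rng = np.random.default_rng(2)
M = rng.standard_normal((512, 256))              # needs 4 x 2 = 8 MVMUs
x = rng.standard_normal(512)
assert np.allclose(tiled_mvm(x, M), 1 / (1 + np.exp(-(x @ M))))
```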

Page 12:

Wish List….

Device:
 o High ON/OFF ratio
 o Device linearity
 o High endurance
 o Multi-level cells

Macro:
 o Large crossbar size
 o Low peripheral overhead
 o Good selector device
 o Source/sink resistance
 o Line resistance

Architecture/Algorithm:
 o Minimal data movement
 o MVMs, vector operations on NVM
 o Training/mapping on the architecture
 o Sparsity

Page 13:

Challenges: NVM devices

➢ Compared to CMOS:
 ✓ Non-volatility
 ✓ High density
 ✓ Low leakage
 ✓ Capable of in-memory compute
 ✗ Write energy/latency

➢ Current devices are highly non-linear
➢ Expensive write operations and peripheral circuitry
➢ RON/ROFF ratios are limited to ~10×
➢ RRAM has poor endurance
➢ More than 4 bits/cell is not yet reliable

[1] IEDM, 2019

Property          | PCM          | RRAM         | MTJ           | CMOS
Multi-level cell  | Yes          | Yes          | No            | No
Storage density   | High         | High         | High          | Low
RON/ROFF          | High         | High         | Low           | High
Non-volatility    | Yes          | Yes          | Yes           | No
Leakage           | Low          | Low          | Low           | High
Write energy      | 6 nJ         | 2 nJ         | < 1 nJ        | < 0.1 nJ
Write latency     | 150 ns       | 100 ns       | 10 ns         | < 1 ns
Endurance         | 10^7 cycles  | 10^5 cycles  | 10^15 cycles  | > 10^16 cycles
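As a rough illustration of why the write costs in this table matter (my own back-of-envelope, assuming a hypothetical 128×128 array and combining the per-cell write energies above with the per-MVM energy on the next slide), programming a full crossbar costs orders of magnitude more energy than one analog MVM over it, so weights are best written once and reused across many inferences:

```python
# Rough cost of programming one 128x128 crossbar vs. using it (illustrative only).
cells        = 128 * 128
write_energy = {"PCM": 6e-9, "RRAM": 2e-9, "MTJ": 1e-9}   # J/cell, from the table
mvm_energy   = 144.6e-12                                  # J per 8-bit analog MVM (next slide)

for dev, e_w in write_energy.items():
    program_energy = cells * e_w
    print(f"{dev}: programming ~{program_energy * 1e6:.0f} uJ "
          f"= {program_energy / mvm_energy:,.0f}x one analog MVM")
```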

Page 14:

Challenges: NVM Compute Macro

• NVM crossbars can have various non-idealities (parasitics, non-ideal devices)

• ADCs consume 58% and >80% of the total energy and area, respectively

8-bit MVM energy breakdown:

Analog operations   | Energy (pJ)
MVM energy          | 3.84
ADC energy          | 128
Other peripherals   | 12.8
Total               | 144.6

Digital operations  | Energy (pJ)
ALU                 | 25.6
Access (FMA)        | 480
Total               | 505.6

(The slide also shows energy and area breakdown pie charts.)

Source: Ankit et al., ASPLOS 2019

• Such non-idealities introduce varying amounts of functional error depending on the applied voltages and programmed conductances

• Errors increase with larger crossbar sizes

Page 15:

Modeling Crossbars – GENIEx, a Generalized Approach to Emulating Non-ideality in Memristive Crossbars

➢ The column current is a data-dependent non-linear function f of the inputs and the crossbar parasitics:
   I_column = f(Vi, Gij(V), Rsource, Rsink, Rwire)

➢ Neural networks are efficient tools for capturing the close inter-dependence among these inputs

➢ A neural network is therefore trained to model the behavior of non-ideal crossbars (a sketch follows below)

GENIEx provides modeling capability for different types of non-idealities

Chakraborty, Ali, Ankit, Roy, “GENIEx: A Generalized Approach to Emulating Non-Ideality in Memristive Xbars using Neural Networks”, DAC 2020
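A minimal sketch of the idea (my own illustration, not the GENIEx code): fit a small neural network that maps applied voltages and programmed conductances of a crossbar column to the measured column current, here using scikit-learn's MLPRegressor and a toy non-ideal model as stand-ins:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
ROWS, R_SOURCE = 16, 200.0                        # assumed column height and source resistance

def column_current(V, G):
    """Toy non-ideal column: source resistance degrades the effective drive."""
    i_ideal = (V * G).sum(axis=1)                 # ideal I = sum_i V_i * G_ij
    return i_ideal / (1.0 + R_SOURCE * G.sum(axis=1))   # crude IR-drop effect

# training data: random input voltages and conductance states
V = rng.uniform(0, 1, size=(5000, ROWS))
G = rng.uniform(1e-5, 1e-3, size=(5000, ROWS))
I = column_current(V, G)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(np.hstack([V, G]), I)                   # the NN learns f(V, G) -> I_column
print("fit R^2:", model.score(np.hstack([V, G]), I))
```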

Page 16:

Challenges: Architectures
(Wish list: minimal data movement; wide variety of operations with NVM)

The MVMU alone cannot execute ML applications:

• Vector operations require prior writes to the crossbar

• Transcendental operations: the MVMU (even with writes) can only do add/multiply

Data movement between MVMUs (partial sums, activations) is frequent:

• The cost of data movement increases across the hierarchy – core, tile, chip

Mapping of workloads to instructions

Page 17:

Potential Solutions

Improving functional accuracy

• Modifying the training algorithm can recover the functional accuracy lost to crossbar non-idealities

Improving hardware efficiency

• Hardware-software co-design to optimize data reuse and to support different operations/models can lead to high efficiency – low energy, low latency, and high utilization

• Integrated with the Open Neural Network Exchange (ONNX) to execute models from various DNN frameworks

[Non-ideality-aware training loop: the forward pass runs conv/linear layer inputs through the crossbar model (GENIEx) with tiling and bit-slicing (iterative MVM); the backward pass computes gradients of a modified loss function and applies the weight update to FP32 weights.]
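A minimal sketch of such non-ideality-aware training (my own illustration, assuming a straight-through-style scheme where the forward pass goes through a toy non-ideal crossbar model while the gradient is taken as if the layer were ideal, and FP32 master weights are updated):

```python
import numpy as np

rng = np.random.default_rng(4)

def nonideal_forward(x, W, alpha=0.05):
    """Stand-in for a crossbar model: ideal MVM with a mild output compression."""
    z = x @ W
    return z - alpha * z ** 3            # toy non-ideality (saturation-like error)

# tiny regression problem trained through the non-ideal forward pass
W_true = rng.standard_normal((8, 1))
X = rng.standard_normal((256, 8))
y = X @ W_true

W = np.zeros((8, 1))                      # FP32 master weights
lr = 0.05
for step in range(500):
    y_hat = nonideal_forward(X, W)        # forward pass uses the crossbar model
    grad = X.T @ (y_hat - y) / len(X)     # gradient as if the layer were ideal
    W -= lr * grad                        # weight update on the FP32 copy
print("final loss:", float(np.mean((nonideal_forward(X, W) - y) ** 2)))
```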

Page 18:

Modeling and Simulation Ecosystem

Functional Modeling

Performance Modeling

Architectural Model

Page 19:

Results – Inference Energy (Batch size 1)

[Chart: normalized inference energy (log scale, lower is better) for MLP (L4, L5), Deep LSTM (L3, L5), Wide LSTM (Big LSTM), and CNN (VGG-16, VGG-19) benchmarks on Skylake, Pascal, and PUMA.]

• CNNs show up to 13.0× reduction (the least): high weight reuse, even at batch size 1.

• MLPs show up to 80.1× reduction: no weight reuse, small models.

• LSTMs show up to 2446× reduction: little weight reuse, large models (billions of parameters).

• PUMA achieves a large reduction in inference energy over both baseline platforms across all benchmarks.

• Similar trends in inference latency reduction are observed across all benchmarks.

Page 20:

In-Memory Computing… SRAMs, DRAMs

[Taxonomy: Bit-wise Boolean | Lookup table | Dot product]

Page 21:

In-Memory Bit-wise Vector Boolean Operations

Modifying the peripheral sensing circuits to read out a vector Boolean function

Simultaneously activated WLs

Page 22:

SRAM: In-Memory Bit-wise Boolean Operations

6T-SRAM:
➢ Staggered WL activation to avoid short circuits between cells.
➢ Asymmetric sense amplifiers (SAs) help detect bitwise NAND/NOR/XOR.

Agrawal, A., et al., "X-SRAM," IEEE TCAS-I, 2018.

Page 23:

X-SRAM: Bit-Wise Vector Boolean Operations

8T-SRAM:
➢ Decoupled read/write ports allow simultaneous WL activation.
➢ 8T cells are wire-NORed on the read bitline, so bitwise NOR is easy to sense.
➢ Single-ended read inverters for sensing.

[Schematic: an array of 8T cells (A: Cell1, B: Cell2) sharing read bitlines RBL0–RBLn with pre-charge and sensing circuits; asserting two read wordlines RWLi/RWLj makes the RBL settle to data-dependent levels (marked VDD, VDD–VT, VDD/2, 0 for the stored combinations '00'/'11', '10', '01'), from which the sensing circuit resolves bitwise functions such as XOR.]

Agrawal, A., et al., "X-SRAM," IEEE TCAS-I, 2018.
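A behavioral sketch (my own illustration, not the X-SRAM circuit) of the idea that simultaneously reading two rows lets the periphery return bitwise Boolean functions of the stored words instead of the raw words:

```python
import numpy as np

def read_two_rows(row_a, row_b):
    """Behavioral model of a dual-row in-SRAM read: the periphery senses combined bitlines."""
    a, b = np.asarray(row_a, bool), np.asarray(row_b, bool)
    nor  = ~(a | b)            # wired-NOR behavior of the shared read bitline
    nand = ~(a & b)            # sensed with a second/asymmetric reference
    xor  = nand & ~nor         # XOR derived from the two sensed outputs
    return nor, nand, xor

A = np.array([0, 1, 1, 0, 1, 0, 0, 1])
B = np.array([0, 0, 1, 1, 1, 1, 0, 0])
nor, nand, xor = read_two_rows(A, B)
print(xor.astype(int))         # one access returns a whole vector Boolean result
```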

Page 24:

X-SRAM: Bit-Wise Vector Boolean Operations

System for in-memory computing: a conventional processor with the memory enhanced for in-memory computing.

Conventional instructions:
Load  RADDR1 RDEST0
Load  RADDR2 RDEST1
ADD   RDEST0 RDEST1 ROUT
Store RADDR3 ROUT

In-memory instruction:
CiMADD RADDR1 RADDR2 RADDR3

Bitwise: AND/NAND, OR/NOR, XOR, IMP. Arithmetic: ADD/MULT with additional circuitry. Single-cycle Read-Compute-Store.

Agrawal, A., et al., "X-SRAM," IEEE TCAS-I, 2018.

Page 25:

X-SRAM: Bit-Wise Vector Boolean Operations – Results

(System and instruction comparison as on the previous slide.)

Up to 75% fewer memory accesses for an AES encryption application.

Up to 2× energy benefits and 8× latency benefits for Binary/XNOR-Net image classification.

Agrawal, Jaiswal, Roy, IEEE TCAS-I, 2018.

Page 26:

i-SRAM: Interleaved Wordlines for In-Memory Vector Boolean Operations

8T-SRAM:
➢ Each row has two read wordlines, and the bitlines are connected to read and compute blocks.
➢ Interleaved 6T cells with bitlines connected to sense amplifiers, then compute circuits.
➢ Circuit schematic for NAND (AND); circuit scheme for NOR (OR).

[Schematic labels: Word-3/Word-4, Bit-0/Bit-1 interleaved bit positions.]

Jaiswal, et al., "i-SRAM: Interleaved Wordlines for Vector Boolean Operations Using SRAMs," IEEE Trans. CAS I, 2020.

Page 27:

➢ Circuits and architectures that embody computing principles from the brain
 o Near-/in-memory computing
 o Approximate and stochastic hardware
 o CMOS and post-CMOS neuro-mimetic devices and interconnects

In-Memory Computing… SRAMs, DRAMs – bit-wise Boolean:

X-SRAM
▪ 2 wordlines activated
▪ Bitwise NAND/NOR/XOR

i-SRAM
▪ Interleaved wordlines
▪ Bitwise NAND/NOR/XOR

(Also: Lookup table, Dot product)

Agrawal, Amogh, et al., "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Transactions on Circuits and Systems I: Regular Papers 65.12 (2018): 4219-4232.

Page 28:

In-memory Dot Product Computations

[Block diagram: an SRAM array (with ROM & RAM) surrounded by memory peripherals – digital-to-analog converters driving the input vector and analog-to-digital converters producing the output vector.]

In-Memory Dot Product Acceleration by Use of Current-Mode Computations in SRAMs

Page 29:

8T SRAM as a Multi-Bit Dot Product Engine

[Schematic, shown in three parts (a)–(c): (a) an 8T SRAM cell (storage cell with WWL, WBL/WBLB, and a decoupled read port M1/M2 on RWL, SL, RBL); (b) Config-A and (c) Config-B of the modified read port, where an analog voltage vi drives the read port (M1/M2 with a Vbias transistor) and each bit-cell contributes a read-bitline current IRBL to the sensing circuit.]

The dot product Σ Vi·Wi:
- Vi: analog voltages applied on the RWLs (or SLs)
- Wi: data stored in the SRAM cells
- Summation: KCL addition of the bit-cell currents on the RBL

[Array view: each weight's bits Qi3…Qi0 (MSB to LSB) occupy adjacent columns whose read transistors are sized with binary weights 8W/4W/2W/W; inputs Vin0, Vin1, … drive the rows, and the sensing circuit reads the summed column current Iout = Σ f(Vini, Qij).]

A. Jaiswal, I. Chakraborty, et al., under review in TVLSI, 2018.
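A behavioral sketch (my own illustration, assuming an idealized current model with a hypothetical unit conductance) of how binary-weighted columns turn stored multi-bit words and analog wordline voltages into a dot product read as a single summed current:

```python
import numpy as np

G_UNIT = 1e-6   # assumed read-port conductance of the LSB column (arbitrary units)

def dot_product_current(v_in, weights, bits=4):
    """Column currents of an idealized 8T dot-product engine, summed by KCL."""
    i_total = 0.0
    for j in range(bits):                          # one column per weight bit
        q = (weights >> j) & 1                     # bit j of every stored weight
        g_col = (2 ** j) * G_UNIT                  # binary-weighted read transistor
        i_total += np.sum(v_in * q * g_col)        # KCL: currents add on the bitlines
    return i_total

v_in = np.array([0.3, 0.7, 0.5])                   # analog voltages on the RWLs
w = np.array([5, 12, 9])                           # 4-bit weights stored in SRAM
assert np.isclose(dot_product_current(v_in, w), G_UNIT * np.dot(v_in, w))
print(dot_product_current(v_in, w))
```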

Page 30:

8T SRAM as a Multi-Bit Dot Product Engine

Effect of Line Resistances

➢ Profiling the neural network reveals that most weight values sit at level 0.

➢ This means current levels are low, mostly in the low-error (blue) region of the error plot.

➢ However, there is a ~30% increase in bit-cell area.

➢ Multi-VT design can be exploited to reduce the area.

[Histogram: number of occurrences of each weight level for an MNIST network.]

A. Jaiswal, I. Chakraborty, et al., TVLSI, 2019.

Page 31:

➢ Circuits and architectures that embody computing principles from the brain
 o Near-/in-memory computing
 o Approximate and stochastic hardware
 o CMOS and post-CMOS neuro-mimetic devices and interconnects

In-Memory Computing… SRAMs, DRAMs – bit-wise Boolean and dot product:

X-SRAM
▪ 2 wordlines activated
▪ Bitwise NAND/NOR/XOR

i-SRAM
▪ Interleaved wordlines
▪ Bitwise NAND/NOR/XOR

8T Dot Product Engine
▪ Multi-bit dot products
▪ Highly parallel operations

IMAC
▪ In-memory multi-bit multiply-and-accumulate engine
▪ Analog multiplication in the functional read, followed by analog accumulation

(Also: Lookup table)

Agrawal, Amogh, et al., "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Transactions on Circuits and Systems I: Regular Papers 65.12 (2018): 4219-4232.
Akhilesh Jaiswal, Indranil Chakraborty, Amogh Agrawal, Kaushik Roy, "8T SRAM Cell as a Multi-bit Dot Product Engine for Beyond von-Neumann Computing," arXiv:1802.08601v2 [cs.ET].

Page 32:

ROM Embedded RAM

Embedding ROM in CMOS and 1T-1R Arrays, Enabling Near-Memory Computing through Lookup Tables

Page 33:

Embedding ROMs in RAMs (NVMs)

[Array schematic: MTJ bit-cells (free layer FL / pinned layer PL); with Vread applied in this read cycle, the RAM data stored in the cells is read out.]

Both ROM and RAM data are stored in the same bit-cell. The read cycle determines whether the ROM or the RAM data is being read.

D. Lee, K. Roy, IEEE EDL 2013

Page 34:

Embedding ROMs in RAMs (STT-MRAM)

[Array schematic: the same MTJ array read with a different Vread configuration, so that the embedded ROM data is read out instead of the RAM data.]

D. Lee, K. Roy, IEEE EDL 2013

Page 35:


Computing in DRAM

Page 36:

➢ Majority-based vector addition (a behavioral check of these identities appears after this slide):

   Cout = Maj(A, B, Cin)

   S = Maj(A, B, Cin, C̄out, C̄out), where C̄out is the complement of Cout

➢ Store data vectors in column-based fashion

➢ The result of A+B addition is stored in the same column

➢ Massively-parallel vector additions in bit-serial mode

➢ No need for carry shifts across bitlines!

➢ Same subarray peripheral circuits

➢ Add 9 reserved rows for compute (<1% area overhead)

In-DRAM Low-cost Bit-serial Addition

Ali, Jaiswal, Roy, “In-Memory Low-Cost Bit-Serial Addition Using Commodity DRAM Technology,” TCAS-I, 2019
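A quick behavioral check of the majority-based full-adder identities used above (my own sketch, not the DRAM circuit):

```python
from itertools import product

def maj(*bits):
    """Majority of an odd number of bits."""
    return int(sum(bits) > len(bits) // 2)

# verify Cout = Maj(A,B,Cin) and S = Maj(A,B,Cin,~Cout,~Cout) for all inputs
for a, b, cin in product((0, 1), repeat=3):
    cout = maj(a, b, cin)
    s = maj(a, b, cin, 1 - cout, 1 - cout)
    assert cout == (a + b + cin) // 2 and s == (a + b + cin) % 2
print("majority-based full adder verified for all 8 input combinations")
```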

Page 37:

[Figure: computing Maj(A, B, C, D, E) by multiple row activation. (1) Initial state: the five cells hold the operand bits and the bitline is precharged to ½VDD with all WLs and the SAE low. (2) Enable WLs: charge sharing among the five cells shifts the bitline by Δv toward the majority value. (3) Enable the sense amplifier: the bitline resolves to the majority value, which is also restored into the activated cells.]

Multiple row activation to calculate Maj(A,B,C,D,E)

In-DRAM Low-cost Bit-serial Addition

Page 38:

[Figure: computing Cout and C̄out in one step using a dual-contact cell. The carry Cout = Maj(A, B, Cin) develops on the bitline (Δv > 0 for A=1, B=0, Cin=1), the sense amplifier latches Cout on the bitline and C̄out on the complementary bitline, and the dual-contact cell (with wordlines WLn and WLp) can capture either polarity, so the complement needed for the sum is available without an extra cycle.]

Calculate Cout and C̄out in one step.

Remember, S = Maj(A, B, Cin, C̄out, C̄out).

In-DRAM Low-cost Bit-serial Addition

Page 39:

[Figure: 1-bit addition of two 8-element vectors A (10110110) and B (01100011), stored along the bitlines, using reserved compute rows (A, Ā, B, B̄, Cin, C̄in, Cout, C̄out) above the data rows. (0) Initial state: compute rows and the Sum row are cleared. (1) Copy A into the compute rows. (2) Copy B into the compute rows. (3) Calculate Cout/C̄out by multi-row activation of A, B, Cin. (4) Calculate S by five-row activation of A, B, Cin, C̄out, C̄out and write it into the Sum row.]

For n-bit vector addition, 4n+1 operations are needed

An example of 1-bit addition of two vectors A and B
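A behavioral sketch of the bit-serial procedure (my own emulation of the row operations, counting one operation per copy/majority step to show where the 4n+1 figure comes from):

```python
import numpy as np

def in_dram_add(A, B, n_bits):
    """Bit-serial addition of two bit-matrices (rows = bit position, cols = lanes)."""
    lanes = A.shape[1]
    carry = np.zeros(lanes, dtype=int)             # Cin row, cleared once
    S = np.zeros_like(A)
    ops = 1                                        # initialization of the carry row
    for k in range(n_bits):                        # LSB to MSB
        a, b = A[k], B[k]                          # (1) copy A row, (2) copy B row
        cout = ((a + b + carry) >= 2).astype(int)  # (3) Cout = Maj(A, B, Cin)
        s = (a + b + carry + 2 * (1 - cout)) >= 3  # (4) S = Maj(A, B, Cin, ~Cout, ~Cout)
        S[k], carry = s.astype(int), cout
        ops += 4                                   # four row operations per bit position
    return S, carry, ops

rng = np.random.default_rng(5)
n, lanes = 8, 16
A = rng.integers(0, 2, size=(n, lanes))
B = rng.integers(0, 2, size=(n, lanes))
S, carry, ops = in_dram_add(A, B, n)
weights = 1 << np.arange(n)                        # check against ordinary integer addition
assert np.array_equal(weights @ S + (carry << n), weights @ A + weights @ B)
print("operations for", n, "bits:", ops)           # 4n + 1 = 33
```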

Page 40:

Case Study: Compute-in-DRAM based k-NN Acceleration

[Bar chart: normalized execution time of the baseline system vs. the proposed system.]

[System diagram: the host processor performs the top-k sort; main memory with in-DRAM computing performs the Manhattan-distance computation, exchanging add/sub requests and sum/diff data; test and train data reside in memory and the kNN output is returned to the host.]

Simulation setup (using a modified gem5 simulator from S. Xu et al., ICAL'19):
Processor    | x86, 2 GHz
L1 Cache     | 32 KB I- and D-Cache
L2 Cache     | 2 MB
Main memory  | 1024 MB, DDR3-1600, 1 channel, 1 rank, 8 banks

11.7× performance improvement.

➢ k-NN checks the k closest samples; the query vector is assigned to the class that holds the majority among those k samples.

➢ It mainly consists of two computation stages: distance computation and global top-k sort.

➢ The k value is the number of closest samples to be checked.
➢ One-to-one distance computation requires a large number of memory accesses.

Ali, Jaiswal, Roy, “In-Memory Low-Cost Bit-Serial Addition Using Commodity DRAM Technology,” TCAS-I, 2019
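A minimal sketch of the workload split described above (my own illustration): the distance computation, which is dominated by element-wise subtractions/additions and is the part that would be offloaded to in-DRAM compute, followed by the top-k sort on the host:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=5):
    """k-NN with Manhattan distance: distances computed (conceptually) in DRAM,
    global top-k sort and majority vote on the host processor."""
    # stage 1: one-to-one Manhattan distances (element-wise |subtraction| + additions)
    dists = np.abs(train_x - query).sum(axis=1)
    # stage 2: top-k sort and majority vote among the k nearest labels
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(6)
train_x = rng.integers(0, 256, size=(1000, 64))    # e.g. 64-dimensional feature vectors
train_y = rng.integers(0, 10, size=1000)
query = rng.integers(0, 256, size=64)
print("predicted class:", knn_predict(train_x, train_y, query))
```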

Page 41:

Conclusions

➢ While large improvements in inference latency and energy are possible…

➢ There are several challenges…
 o Cross-bar non-idealities: device non-linearity, access transistor/selector device, circuit non-idealities (line resistance, source/sink resistance), process variability
 o Reliability and endurance of non-volatile devices
 o High write cost for NVMs
 o A/D and D/A converters
 o Data movement for partial sums
 o Training/mapping to the hardware – degradation over time for some NVM technologies
 o Need for vector operations and floating-point operations
 o SRAMs are large compared to emerging NVMs

Page 42:

Neuro-electronics Research Laboratory