TRANSCRIPT
In-Memory Computing based Machine Learning Accelerators: Opportunities and Challenges
KAUSHIK ROY
PURDUE UNIVERSITY
Challenge: Computational Demands and Technology Scaling
➢Machine Learning (ML) has humongous computational cost
➢Dennard scaling has ended -> general-purpose systems have hit a performance/energy wall
➢Domain-specific acceleration (hardware/software) is the key
  o Industry examples – Google TPU, Microsoft Brainwave, Nvidia Tensor Core
[Chart: ML computational cost over the years – AI and Compute, OpenAI; Strubell et al.]
Edge Intelligence: Efficiency Gap
➢Case study: Object recognition in a smart glass with a state-of-the-art accelerator
*300 GOPs/inference
Ref: Venkataramani, S., Roy, K. and Raghunathan, A. "Efficient embedded learning for IoT devices." In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 308-311. IEEE.
Where do the inefficiencies come from?
Algorithms | Hardware Architecture | Circuits and Devices

RetinaNet DNN* on a smart glass + Google Edge TPU:
  Energy/op: 0.5 pJ/op
  Energy/frame: 0.15 J/frame
  Battery life – time-to-die (2.1 Wh): 64 mins
  Performance: 13.3 frames/sec
Background – In-Memory Computing
➢Definition: a design approach that performs computation close to (or inside) memory to overcome memory bottlenecks – bandwidth, energy
➢Effective for simple arithmetic – bit-wise operations; fixed-point add and multiply; truth tables (ROMs/RAMs)
➢Typical systems have much higher compute throughput than memory bandwidth
➢Much of the chip area is memory components (>=50% in the TPU)
  o Caches (L1, L2, …), Register File, Scratchpad, Buffers
[Figure: TPU floorplan – ISCA 2017; Computer Architecture: A Quantitative Approach]
In-Memory Computing for ML
[Figure: a 2KB memory array (WL 1 … WL 128, BL 1 … BL 128) accessed three ways]
➢Typical Memory Access: bits accessed = 128; operation = read/write
➢In-Memory Computing (Digital): bits accessed = 2×128; operation = bit-wise AND, Vout = V1 AND V2
➢In-Memory Computing (Analog): bits accessed = 4×128; operation = matrix-vector multiply, Vout = M × Vin (M: 4×64, VinT: 1×4, Vout: 1×64)
NVM: Jain et al. TVLSI'17, Ambrogio et al. Nature'18, Cai et al. Nature Elec.'19, Xue et al. ISSCC'20, Liu et al. ISSCC'20
SRAM: Biswas et al. ISSCC'18, Valavi et al. JSSC'19, Si et al. ISSCC'19, Jaiswal et al. TVLSI'20, Dong et al. ISSCC'20 (TSMC – 7nm)
Peripherals, ROM/RAM: Lee et al. EDL 2013, Lee et al. TVLSI 2013
Machine Learning (Deep Learning)
➢Deep Learning needs lots of matrix multiplications
[Chart: Bruce Fleischer et al., IBM Research, 2018]
➢Challenge: sustaining deep learning's insatiable compute demands
Technology: Non-Volatile Memories
[Figures: NVM device technologies]
➢Neuro-mimetic circuit: excitatory input I+ = +Σ vi·gi and inhibitory input I– = –Σ vi·gi feed an integrator (comparing IREF ± |Iin| against IREF) that produces VOUT
➢RRAM: oxygen ions and oxygen vacancies form a filament in a metal oxide between the top and bottom electrodes
➢PCM: a heater beneath the switching region of the phase-change material, between the top and bottom electrodes, surrounded by insulator
➢Gated (transistor-based) cell: with VS = 0 V and drain at VD, gate pulses VG over time t modulate the drain current ID
Non-volatile Memory Crossbars
[Figure: NVM crossbar]
➢Inputs V0 … VN drive the rows through input encoding / WL decoding; conductances G11 … GNN sit at the crosspoints, each with an access device
➢Each column sums its cell currents by KCL: Ij = Σi Iij = Σi Vi·Gij
➢Efficient MVM with high on-chip storage
➢Column peripherals: MUX & TIA, sample-and-hold (S&H), analog-to-digital conversion, shift-and-add
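As a functional sketch (not a circuit simulation), the column-current relation Ij = Σi Vi·Gij can be checked numerically; the voltages, conductances, and array dimensions below are illustrative assumptions, not values from the talk:

```python
# Ideal crossbar MVM: I_j = sum_i V_i * G_ij (Kirchhoff's current law per column).
def crossbar_mvm(V, G):
    """V: list of row voltages (volts); G: conductance matrix G[i][j] (siemens)."""
    rows, cols = len(G), len(G[0])
    assert len(V) == rows
    return [sum(V[i] * G[i][j] for i in range(rows)) for j in range(cols)]

# Illustrative 3x2 crossbar: conductances in microsiemens.
V = [0.2, 0.0, 0.1]
G = [[50e-6, 10e-6],
     [20e-6, 40e-6],
     [30e-6, 30e-6]]
I = crossbar_mvm(V, G)  # column currents, in amperes
```

Every crosspoint contributes one multiply "for free"; the crossbar computes the whole matrix-vector product in a single read.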
[Figure: NVM device options]
➢Two-terminal devices (terminals T1, T2): PCM, RRAM, STT-MTJ
➢Three-terminal devices (terminals T1, T2, T3): SHE-MTJ, SHE-DWM-MTJ
➢Each NVM device is paired with a 2-terminal selector or a transistor
The Peripherals
[Figure: data-converter circuits]
➢Flash ADC: 2^N comparators between VREF+ and VREF– resolve Vin; a decoder produces the digital output
➢SAR ADC: a comparator with DAC feedback and SAR logic, clocked by clk
➢Current-to-voltage conversion: a transimpedance amplifier with feedback resistor RF converts the bit-line current IBL into VOUT
Architecture: Spatial scalability
Network on Chip
Massively parallel accelerator -> amenable to data-level parallelism -> highly efficient ML inference
Ankit, Roy, et al., "PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference", ASPLOS 2019.
Bit-slicing (weights and inputs)
➢ Bit-slicing: weight slicing and input streaming enable using low-precision crossbars and low-precision DACs to compose a high-precision MVMU
Weight slicing (shift-and-add, SnA): Wij is split into 2-bit slices Wij[1:0] and Wij[3:2], each in its own crossbar column driven by the same Vi:
  Vi · Wij = (Vi · Wij[1:0]) + ((Vi · Wij[3:2]) << 2)
Input bit streaming (shift-and-add, SnA): Vi = 5, i.e. [101], is streamed one bit per cycle:
  Vi · Wij = (1 · Wij) + ((0 · Wij) << 1) + ((1 · Wij) << 2)
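A minimal sketch of how shift-and-add composes the full-precision product from 1-bit input streams and 2-bit weight slices (the slice widths follow the slide; the function name and arguments are my own):

```python
def mvm_bitsliced(v, w, v_bits=3, w_slice=2, w_bits=4):
    """Compute v*w by streaming input bits and slicing the weight into 2-bit chunks."""
    total = 0
    for vb in range(v_bits):                  # input streaming, LSB first
        v_bit = (v >> vb) & 1
        for ws in range(0, w_bits, w_slice):  # weight slicing, 2 bits per crossbar
            w_part = (w >> ws) & ((1 << w_slice) - 1)
            total += (v_bit * w_part) << (vb + ws)  # shift-and-add (SnA)
    return total

assert mvm_bitsliced(5, 11) == 55  # matches the full-precision product
```

The decomposition is exact: each partial product is shifted by the sum of its input-bit and weight-slice positions before accumulation.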
LD $mi1, x, n
MVM 0b01
ALU_SIG $1, $mo1, n
ST $1, a
Architecture – Heterogeneous core
z = M·x
a = Sig(z)
*MVMU is implemented using bit-slicing as described in PUMA (Ankit et al., ASPLOS 2019)
MVM 0b01
LD xxx      // bring partial sums from other cores (for all partial sums)
ALU_ADD xxx // accumulate partial sums
ALU_SIG $1, $mo1, n
➢A 1-layer MLP maps on one MVMU; larger MLPs map on multiple MVMUs/cores
➢Larger matrices require more data movement to compute the effective MVM.
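One way to picture that extra data movement: a large matrix is tiled across crossbar-sized blocks, and each block's partial sum must be transferred and accumulated. A toy sketch (tile size, helper name, and the per-block movement count are illustrative assumptions):

```python
def tiled_mvm(x, M, tile=2):
    """Matrix-vector product with M split into tile x tile crossbar-sized blocks.
    Each block is one 'MVMU' producing a partial sum that must be moved."""
    n_out, n_in = len(M), len(M[0])
    y = [0] * n_out
    moves = 0
    for r0 in range(0, n_out, tile):
        for c0 in range(0, n_in, tile):  # one crossbar block per (r0, c0)
            for r in range(r0, min(r0 + tile, n_out)):
                y[r] += sum(M[r][c] * x[c] for c in range(c0, min(c0 + tile, n_in)))
            moves += 1                   # one partial-sum transfer per block
    return y, moves

M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]
y, moves = tiled_mvm(x, M)  # a 4x4 matrix on 2x2 tiles -> 4 partial-sum movements
```

Doubling the matrix dimension quadruples the number of blocks, so partial-sum traffic grows faster than the output size.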
Wish List….
Device: high ON/OFF ratio, device linearity, high endurance, multi-level cells
Macro: large crossbar size, low peripheral overhead, good selector device, low source/sink resistance, low line resistance
Architecture/Algo: minimal data movement; MVMs and vector operations on NVM; training/mapping on the architecture; sparsity
Challenges: NVM devices
➢Compared to CMOS:
  ✓ Non-volatility
  ✓ High density
  ✓ Low leakage
  ✓ Capable of in-memory compute
  × Write energy/latency
➢Current devices are highly non-linear
➢Expensive write operations and peripheral circuitry
➢RON/ROFF ratios are limited to ~10×
➢RRAM has poor endurance.
➢More than 4 bits/cell is not yet reliable.
[1] IEDM, 2019
Property         | PCM         | RRAM        | MTJ          | CMOS
Multi-level cell | Yes         | Yes         | No           | No
Storage density  | High        | High        | High         | Low
RON/ROFF         | High        | High        | Low          | High
Non-volatility   | Yes         | Yes         | Yes          | No
Leakage          | Low         | Low         | Low          | High
Write energy     | 6 nJ        | 2 nJ        | < 1 nJ       | < 0.1 nJ
Write latency    | 150 ns      | 100 ns      | 10 ns        | < 1 ns
Endurance        | 10^7 cycles | 10^5 cycles | 10^15 cycles | > 10^16 cycles
Challenges: NVM Compute Macro
• NVM crossbars can have various non-idealities (parasitics, non-ideal devices)
• ADCs consume 58% and >80% of the total energy and area, respectively
8-bit MVM energy breakdown:
Analog operations | Energy (pJ) || Digital operations  | Energy (pJ)
MVM energy        | 3.84        || ALU                 | 25.6
ADC energy        | 128         || Memory access (FMA) | 480
Other peripherals | 12.8        ||                     |
Total             | 144.6       || Total               | 505.6
Source: Ankit et al., ASPLOS 2019
• Such non-idealities introduce varying amounts of functional error, depending on the applied voltages and programmed conductances
• Errors increase with larger crossbar sizes
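To see why error grows with array size, here is a deliberately crude first-order model (my own simplification, not the GENIEx model): each cell's conductance is degraded by series source resistance plus the wire resistance accumulated along its row. All resistance values are illustrative assumptions:

```python
# First-order non-ideality model: cell i in a column sees r_source plus
# i segments of line resistance in series with its programmed conductance.
def column_current(V, G_col, r_source=100.0, r_wire=1.0):
    return sum(v / (1.0 / g + r_source + i * r_wire)
               for i, (v, g) in enumerate(zip(V, G_col)))

def relative_error(n_rows):
    """Relative deviation from the ideal sum(V*G) for an n_rows-tall column."""
    V = [0.5] * n_rows
    G = [100e-6] * n_rows                 # 100 uS per cell (illustrative)
    ideal = sum(v * g for v, g in zip(V, G))
    return abs(ideal - column_current(V, G)) / ideal
```

Because rows farther from the driver accumulate more line resistance, the relative error of a 64-row column exceeds that of a 16-row column in this model, consistent with the trend stated above.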
Modeling Cross-bars – GENIEx, A Generalized Approach to Emulating Non-ideality in Memristive Crossbars
➢The column current is a data-dependent non-linear function f:
  Icolumn = f(Vi, Gij(V), Rsource, Rsink, Rwire)
➢Neural networks are efficient tools for capturing the close inter-dependence of these inputs
➢GENIEx therefore trains a neural network to model the behavior of non-ideal crossbars, providing modeling capability for different types of non-idealities
Chakraborty, Ali, Ankit, Roy, "GENIEx: A Generalized Approach to Emulating Non-Ideality in Memristive Xbars using Neural Networks", DAC 2020
Challenges: Architectures
Minimal data movement
Wide variety of operations with NVM
An MVMU alone cannot execute ML apps:
• Vector operations require prior writes to the crossbar
• Transcendental operations: the MVMU (even with writes) can only add/multiply
Data movement between MVMUs – partial sums, activations – is frequent:
• The cost of data movement increases across the hierarchy – core, tile, chip
Mapping of workloads to instructions
Potential Solutions
Improving functional accuracy
• Modifying the training algorithm can recover accuracy lost to crossbar non-idealities
Improving hardware efficiency
• Hardware-software codesign to optimize data reuse and support different operations/models can lead to high efficiency – low energy, low latency, and high utilization.
• Integrated with the Open Neural Network Exchange (ONNX) to execute models from various DNN frameworks
[Figure: non-ideality-aware training loop – the forward pass runs conv/linear layer inputs through the crossbar model (GENIEx) with tiling, bit-slicing, and iterative MVM; the backward pass computes gradients of a modified loss function and applies the weight update to the FP32 weights]
Modeling and Simulation Ecosystem
Functional Modeling | Performance Modeling | Architectural Model
Results – Inference Energy (Batch size 1)
[Chart: normalized energy (lower is better), log scale from 0.1 to 10,000,000, for MLP (L4, L5), Deep LSTM, Wide LSTM (L3, L5, Big), and CNN (Vgg16, Vgg19) benchmarks on Skylake, Pascal, and PUMA]
• CNNs show up to 13.0× reduction (least): high weight reuse, even at batch size 1.
• MLPs show up to 80.1× reduction: no weight reuse, small models.
• LSTMs show up to 2446× reduction: little weight reuse, large models (billions of parameters).
• PUMA achieves a massive reduction in inference energy over all platforms for all benchmarks.
• Similar trends in inference latency reduction are obtained across all benchmarks.
In-Memory Computing…SRAMs, DRAMs..
[Overview: Bit-wise Boolean · Lookup Table · Dot Product]
In-Memory Bit-wise Vector Boolean Operations
Modifying the peripheral sensing circuits to read out a vector Boolean function of simultaneously activated WLs
SRAM: In-Memory Bit-wise Boolean Operations
6T-SRAM
➢ Staggered WL activation to avoid short circuits between cells.
➢ Asymmetric SAs help detect bitwise NAND/NOR/XOR.
Agrawal, A. et al., 2018. "X-SRAM." IEEE TCAS-I.
X-SRAM: Bit-Wise Vector Boolean Operations
[Figure: 8T X-SRAM array – read word-lines RWLi and RWLj activated together; read bit-lines RBL0…RBLn with pre-charge and sensing circuits. The RBL settles to distinct levels (between 0 and VDD, around VDD–VT and VDD/2) that separate operand patterns '00'/'11' from '10' and '01', enabling XOR read-out.]
8T-SRAM
➢ Decoupled read/write: simultaneous WL activation.
➢ 8T cells are wire-NORed; easy to sense bitwise NOR.
➢ Single-ended read inverters for sensing.
Agrawal, A. et al., 2018. "X-SRAM." IEEE TCAS-I.
X-SRAM: Bit-Wise Vector Boolean Operations
Conventional instructions:
  Load RADDR1 RDEST0
  Load RADDR2 RDEST1
  ADD RDEST0 RDEST1 ROUT
  Store RADDR3 ROUT
In-memory instruction (system enhanced for in-memory computing):
  CiMADD RADDR1 RADDR2 RADDR3
Bitwise: AND/NAND, OR/NOR, XOR, IMP. Arithmetic: ADD/MULT with additional circuitry. Single-cycle Read-Compute-Store.
Agrawal, A. et al., 2018. "X-SRAM." IEEE TCAS-I.
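A behavioral sketch (my own simplification, not the circuit) of what a single CiM instruction replaces: activating two rows and sensing each bit-line yields an element-wise Boolean of the two stored words in one access:

```python
# Behavioral model of in-memory bitwise operations on two SRAM rows.
def cim_bitwise(row_a, row_b, op):
    ops = {
        "NOR": lambda a, b: 1 - (a | b),  # the natural wired-NOR read-out
        "AND": lambda a, b: a & b,
        "XOR": lambda a, b: a ^ b,
    }
    return [ops[op](a, b) for a, b in zip(row_a, row_b)]

A = [1, 0, 1, 1, 0, 0, 1, 0]
B = [1, 1, 0, 1, 0, 1, 0, 0]
nor_result = cim_bitwise(A, B, "NOR")
```

The conventional path needs two loads, an ALU op, and a store; the in-memory path produces the same vector result in a single read-compute-store cycle.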
Up to 75% lower memory accesses on an AES encryption application
Up to 2× energy benefits and 8× latency benefits for Binary/XNOR-Net image classification
Agrawal,Jaiswal,Roy TCAS-I, 2018
i-SRAM: Interleaved Wordlines for In-Memory Vector Boolean Operations
8T-SRAM
➢ Each row has two read word-lines, and the bit-lines are connected to read and compute blocks.
➢ Interleaved 6T cells with bit-lines connected to sense amplifiers, then compute circuits.
➢ One circuit schematic realizes NAND (AND); another realizes NOR (OR).
Jaiswal, et al. "i-SRAM: Interleaved Wordlines for Vector Boolean Operations Using SRAMs." IEEE Trans. CAS I (2020).
➢Circuits and architectures that embody computing principles from the brain
  o Near-/In-Memory Computing
  o Approximate and stochastic hardware
  o CMOS and post-CMOS neuro-mimetic devices and interconnects
In-Memory Computing…SRAMs, DRAMs..
[Overview: Bit-wise Boolean · Dot Product · Lookup Table]
X-SRAM: 2 wordlines activated; bitwise NAND/NOR/XOR
i-SRAM: interleaved wordlines; bitwise NAND/NOR/XOR
Agrawal, Amogh, et al. "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories." IEEE Transactions on Circuits and Systems I: Regular Papers 65.12 (2018): 4219-4232.
In-memory Dot Product Computations
[Figure: SRAM array with memory peripherals – digital-to-analog converters drive the input vector, analog-to-digital converters capture the output vector; ROM & RAM]
In-Memory Dot Product Acceleration by use of Current-Mode Computations in SRAMs
8T SRAM as a Multi-Bit Dot Product Engine
[Figure: 8T SRAM bit-cell – write port (WWL, WBL/WBLB) into storage node Q/QB; decoupled read port (RWL, SL, RBL) through transistors M1 and M2]
➢Modified read port with two configurations: Config-A applies the analog input vi on the RWL; Config-B applies vi on the SL with Vbias on the RWL.
➢The Dot Product = Σ Vi·Wi
  o Vi: analog voltages on the RWL or SL
  o Wi: data stored in SRAM (Q)
  o Summation: KCL addition of the bit-cell read currents IRBL on the shared RBL
➢Multi-bit weights Q3j…Q0j (MSB to LSB) are stored across columns with transistors sized 8W : 4W : 2W : W, driven by inputs Vin0, Vin1, …, so the sensed current is Iout = Σ f(Vini, Qij)
A. Jaiswal, I. Chakraborty, et al., TVLSI, 2019
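A functional sketch of the binary-weighted column idea (device behavior is idealized; the 8W:4W:2W:W transistor sizing becomes a power-of-two scale factor, and the input values are illustrative):

```python
# Functional model: column j carries current proportional to sum_i v_i * Q_ij,
# and columns are weighted 8:4:2:1 (MSB..LSB) by transistor sizing.
def dot_product_engine(v, Q):
    """v: analog input per row; Q: per-row weight bits, MSB first."""
    n_bits = len(Q[0])
    out = 0.0
    for j in range(n_bits):                   # one column per weight bit
        col = sum(v[i] * Q[i][j] for i in range(len(v)))  # KCL on the RBL
        out += col * (1 << (n_bits - 1 - j))  # 8W : 4W : 2W : W sizing
    return out

v = [0.5, 1.0]
Q = [[1, 0, 1, 0],   # stored weight 10
     [0, 0, 1, 1]]   # stored weight 3
result = dot_product_engine(v, Q)  # 0.5*10 + 1.0*3 = 8.0
```

The analog summation happens per column; the binary column weighting recombines the bit-planes into a multi-bit dot product without any digital shift-and-add.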
Effect of Line Resistances
➢Profiling the neural network reveals that most weight levels are at 0.
➢This means current levels are low, mostly in the BLUE region of the error plot.
➢However, there is a ~30% increase in bit-cell area.
➢Multi-VT design can be exploited to reduce area.
[Histogram: # of occurrences of weight levels, MNIST]
A. Jaiswal, I. Chakraborty, et al., TVLSI, 2019
➢Circuits and architectures that embody computing principles from the brain
  o Near-/In-Memory Computing
  o Approximate and stochastic hardware
  o CMOS and post-CMOS neuro-mimetic devices and interconnects
In-Memory Computing…SRAMs, DRAMs..
[Overview: Bit-wise Boolean · Dot Product · Lookup Table]
X-SRAM: 2 wordlines activated; bitwise NAND/NOR/XOR
i-SRAM: interleaved wordlines; bitwise NAND/NOR/XOR
8T Dot Product Engine: multi-bit dot products; highly parallel operations
Agrawal, Amogh, et al. "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories." IEEE Transactions on Circuits and Systems I: Regular Papers 65.12 (2018): 4219-4232.
Akhilesh Jaiswal, Indranil Chakraborty, Amogh Agrawal, Kaushik Roy, "8T SRAM Cell as a Multi-bit Dot Product Engine for Beyond von-Neumann Computing," arXiv:1802.08601v2 [cs.ET].
IMAC
▪ In-memory multi-bit multiply-and-accumulate engine
▪ Analog multiplication in a functional read, followed by analog accumulation
ROM Embedded RAM
Embedding ROM in CMOS and 1T-1R arrays enables near-memory computing through lookup tables.
Embedding ROMs in RAMs (NVMs)
[Figure: array of NVM bit-cells (FL/PL) with Vread applied to selected rows to read out the RAM data]
Both ROM and RAM data are stored in the same bit-cell. The read cycle determines whether the ROM or the RAM data is being read.
D. Lee, K. Roy, IEEE EDL 2013
Embedding ROMs in RAMs (STT-MRAM)
[Figure: STT-MRAM array (FL/PL bit-cells) with Vread applied to selected rows to read out the ROM data]
D. Lee, K. Roy, IEEE EDL 2013
Computing in DRAM
➢ Majority-based vector addition:
  Cout = Maj(A, B, Cin)
  S = Maj(A, B, Cin, Cout′, Cout′), where Cout′ is the complement of Cout
➢ Store data vectors in column-based fashion
➢ The result of the A+B addition is stored in the same column
➢ Massively parallel vector additions in bit-serial mode
➢ No need for carry shifts across bitlines!
➢ Same subarray peripheral circuits
➢ Add 9 reserved rows for compute (<1% area overhead)
In-DRAM Low-cost Bit-serial Addition
Ali, Jaiswal, Roy, "In-Memory Low-Cost Bit-Serial Addition Using Commodity DRAM Technology," TCAS-I, 2019
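The two majority identities implement a full adder. A quick behavioral check over all input combinations (pure software model, no DRAM timing):

```python
def maj(*bits):
    """Majority vote over an odd number of bits."""
    return int(sum(bits) > len(bits) // 2)

def full_add(a, b, cin):
    cout = maj(a, b, cin)               # 3-input majority (triple-row activation)
    not_cout = 1 - cout                 # complement from the dual-contact cell
    s = maj(a, b, cin, not_cout, not_cout)  # 5-input majority
    return s, cout

# Verify against ordinary binary addition for all 8 input combinations.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_add(a, b, cin)
            assert 2 * cout + s == a + b + cin
```

Duplicating the complemented carry in the 5-input majority is what biases the vote so that S equals the XOR of the three adder inputs.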
[Figure: multiple-row activation to calculate Maj(A, B, C, D, E) – (1) initial state: bit-line pre-charged to ½VDD, all WLs and SAE at 0; (2) enable the five WLs: charge sharing moves the bit-line to ½VDD + Δv; (3) enable the sense amplifier: the bit-line resolves to VDD or 0, the majority value]
In-DRAM Low-cost Bit-serial Addition
[Figure: a dual-contact cell calculates Cout and its complement Cout′ in one step – once Maj(A, B, Cin) develops on the bit-line (Δv > 0), raising WLn writes Cout while the complementary contact captures Cout′. Remember, S = Maj(A, B, Cin, Cout′, Cout′).]
In-DRAM Low-cost Bit-serial Addition
An example of 1-bit addition of two 8-element vectors A and B, stored column-wise (data rows A0…A7 = 10110110, B0…B7 = 01100011) alongside reserved compute rows (A, B, Cin, Cout, Cout′) and a Sum row:
  (0) Initial state
  (1) Copy A into its compute row
  (2) Copy B into its compute row
  (3) Calculate Cout/Cout′ by multiple-row activation
  (4) Calculate S and write it to the Sum row
For n-bit vector addition, 4n+1 operations are needed.
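A behavioral sketch of the bit-serial flow, counting one operation per row copy or multi-row compute as in the steps above (the bookkeeping of 1 init + 4 ops per bit is my reading of the 4n+1 count):

```python
def bitserial_add(a_bits, b_bits):
    """Add two n-bit numbers (LSB-first bit lists) using only majority
    operations, counting DRAM row operations: 1 init + 4 per bit."""
    maj = lambda *x: int(sum(x) > len(x) // 2)
    ops = 1                      # initialize the reserved compute rows
    cin, s_bits = 0, []
    for a, b in zip(a_bits, b_bits):
        ops += 2                 # copy A, copy B into compute rows
        cout = maj(a, b, cin); ops += 1                    # triple-row activation
        s = maj(a, b, cin, 1 - cout, 1 - cout); ops += 1   # 5-row activation
        s_bits.append(s)
        cin = cout               # carry stays in place: no shift across bitlines
    return s_bits + [cin], ops

s_bits, ops = bitserial_add([1, 0, 1, 0, 0, 0, 0, 0],   # 5, LSB first
                            [1, 1, 0, 0, 0, 0, 0, 0])   # 3, LSB first
```

For n = 8 this performs 4·8 + 1 = 33 operations, matching the stated cost, and every column of the subarray adds its own pair of bits in parallel.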
[Chart: normalized execution time, Baseline vs. Proposed System]
[Figure: the host processor performs the top-k sort; main memory with in-DRAM computing performs the Manhattan-distance computation. Add/sub requests and sum/diff data flow between them, along with test data, train data, and the kNN output.]
Simulation setup (modified gem5 simulator from S. Xu et al., ICAL'19):
  Processor: x86, 2 GHz
  L1 cache: 32 KB I- and D-cache
  L2 cache: 2 MB
  Main memory: 1024 MB, DDR3-1600, 1 channel, 1 rank, 8 banks
11.7× performance improvement
Case Study: Compute-in-DRAM based k-NN Acceleration
➢ k-NN checks the k closest samples; the query vector is assigned to the group that holds the majority among those k samples
➢ Mainly consists of two computation stages: distance computation and global top-k sort
➢ The k value: the number of closest samples to be checked
➢ One-to-one distance computation requires a large number of memory accesses
Ali, Jaiswal, Roy, “In-Memory Low-Cost Bit-Serial Addition Using Commodity DRAM Technology,” TCAS-I, 2019
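The two stages map naturally onto the system above: the memory-bound Manhattan-distance stage is what the case study offloads to in-DRAM addition, while the host keeps the top-k sort. A behavioral sketch (pure software, with made-up toy data; no in-DRAM modeling):

```python
def knn_predict(train, labels, query, k=3):
    """k-NN with Manhattan (L1) distance, then majority vote among the k nearest."""
    # Stage 1: one-to-one distance computation (the add/subtract-heavy,
    # memory-access-heavy part offloaded to in-DRAM computing).
    dists = [sum(abs(q - t) for q, t in zip(query, row)) for row in train]
    # Stage 2: global top-k sort on the host processor.
    top_k = sorted(range(len(train)), key=lambda i: dists[i])[:k]
    votes = [labels[i] for i in top_k]
    return max(set(votes), key=votes.count)   # majority among the k samples

train = [[0, 0], [1, 1], [9, 9], [10, 10], [0, 1]]
labels = ["a", "a", "b", "b", "a"]
pred = knn_predict(train, labels, [0, 2], k=3)  # the 3 nearest are all class "a"
```

Because every training row contributes one distance, the distance stage scales with the dataset size, which is exactly why it dominates memory traffic.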
Conclusions
➢While large improvements in inference latency and energy are possible…
➢There are several challenges…
  o Crossbar non-idealities: device non-linearity, access transistor/selector device, circuit non-idealities (line resistance, source/sink resistance), process variability
  o Reliability and endurance of non-volatile devices
  o High write cost for NVMs
  o A/D and D/A converters
  o Data movement of partial sums
  o Training/mapping to the hardware – degradation over time for some NVM technologies
  o Need for vector operations and floating-point operations
  o SRAMs are large compared to emerging NVMs
Neuro-electronics Research Laboratory