Hardware Efficiency in Neuromorphic Computing: Devices, Circuits, and Algorithms
Yu (Kevin) Cao
School of Electrical, Computer and Energy Engineering
Arizona State University
Acknowledgement: Jae-sun Seo, Shimeng Yu, Sarma Vrudhula, Visar Berisha (ASU); Maxim Bazhenov (UCSD); Jieping Ye (UM)
2016 SIGDA DASS
Neuromorphic Computing On-a-chip
Challenges and Needs
– Efficiency gap in computing and energy
Hardware: Resistive Cross-point Array
– >2000X speedup; realistic issues
Algorithm: Inhibition and Noise
– Motif of feedforward inhibition
– MNIST: >95% accuracy, 3X saving in network size
Summary
From Data to Information
Big gap in information analysis!
[Figure: big data generated, compared with the shares that are useful if tagged and analyzed, tagged, and analyzed; chart label: 77%. Source: IDC, December 2012]
Success of Machine Learning
A top-down approach: better for digital IC
– Pros: mathematical, accurate, scalable
– Cons: big data, heavy computing, off-line learning
Hardware Implementation
Today learning is usually in the data center (cloud)
– Big data
– Power hungry
– Network issue
– Data security
Edge computing (fog): novel hardware/algorithms needed
– Local to the sensor, real-time, reliable, low-power
– On-line, personalized learning with continuous data
[Figure labels: IBM Jeopardy: 2880 3.5GHz P7 cores; Google Cat: 16,000 CPU cores; 30 frames/s]
Performance Gap
Algorithm complexity
– Object diversity (size, pose, orientation, etc.)
– Environmental conditions (illumination, exposure, occlusion, etc.)
Hardware architecture
– Memory intensive
– Memory bandwidth (the von Neumann bottleneck)
[Figure labels: CPU, GPU, Accelerator, Real-time]
[V. Narayanan 2012; J. Cong 2015]
Advances in Neuro-biophysics
A bottom-up approach: better integration with sensors
– Pros: energy efficient, real time, fundamental (10 Nobel Prizes)
– Cons: lacking the dynamics, limited scale and accuracy
Anatomy: C. Golgi, S. R. Cajal, 1906
Ion Channel: R. MacKinnon, 2003
Connectome: 2010 –
Leaky-Integrate-Fire Neuron Model (LIF)
Sparseness, 2005
Spike-Timing-Dependent Plasticity (STDP) of Synapse
– $\Delta t = t_{post} - t_{pre}$, the delay between the pre-spike and the post-spike at the synapse
– $\Delta t > 0$: LTP; $\Delta t < 0$: LTD
[Figure: conductance change ∆G (%) vs. spike timing ∆t (ms)]
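The STDP curve above can be sketched with a pair-based exponential window. This is an illustration only: the slides show the measured curve but give no formula, so the exponential shape, the amplitudes (`a_plus`, `a_minus`), and the time constants (`tau_plus`, `tau_minus`) are all assumed values.

```python
import math

def stdp_delta_g(dt_ms, a_plus=100.0, a_minus=50.0, tau_plus=20.0, tau_minus=20.0):
    """Percent conductance change for dt = t_post - t_pre (in ms).

    Assumed exponential window; all four parameters are hypothetical.
    """
    if dt_ms > 0:   # pre-spike before post-spike: potentiation (LTP)
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:   # post-spike before pre-spike: depression (LTD)
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0

for dt in (-50, -10, 10, 50):
    print(dt, round(stdp_delta_g(dt), 1))
```

The change is largest when the two spikes are close in time and fades as |∆t| grows, which is the qualitative behavior of the plotted curve.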
Neurobiological Basis of Learning
Reward (supervision): global feedback signal
Inhibition: unsupervised sparse feature extraction
Habituation: stabilizes the learning and convergence
Learning: local, feedforward STDP or SRDP on each plastic synapse
Synapse: non-linear, noisy, with retention and endurance issues
[Figure sources: Monkey, parietal cortex, Nature Communications, 2015; Honeybee, olfactory system, Nature Neuroscience, 2007; Mouse, motor cortex, Nature Communications, 2014]
Brain-inspired Computing
Brain vs. CPU (CPU values in brackets)
– Neuron: 4-100μm [22nm]
– Microcircuit: FO = 1K-100K [FO = 4]
– Architecture/System: 100B, 100Hz, 20W, 30% ER/neuron, 95% accuracy [1.4B, 3.7GHz, 45W, <10^-9 BER]
[Figure: machine complexity (log) vs. task complexity (log), for CPU, brain, and neural computer]
Hardware Acceleration
Training / Learning: computationally very expensive
– Involves many parallel operations (data fetch, matrix/vector product, etc.), not suitable for a sequential architecture
– 1.83 minutes to process feature extraction of one HD image, with an 8-core 3.4GHz CPU, using sparse coding
10^3 – 10^5 speedup required to achieve real-time, on-line training of HD images at 30 frames/second
– Conventional hardware is inadequate
Speedup by platform
– GPU: 10 – 30X
– FPGA: 10 – 50X
– ASIC: 10^2 – 10^3X
– Beyond CMOS: >10^3X
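The lower end of the required speedup follows from the two numbers on the slide. A short worked calculation, assuming each frame costs the same 1.83 minutes on the CPU and counting feature extraction only:

```python
cpu_seconds_per_image = 1.83 * 60   # 1.83 minutes per HD image (from the slide)
target_fps = 30                     # real-time target (from the slide)

# Real time means one frame every 1/30 s, so the CPU time must shrink by this factor.
required_speedup = cpu_seconds_per_image * target_fps
print(f"required speedup: {required_speedup:.0f}X")
```

The result is roughly 3.3 x 10^3, which is why only the ASIC and beyond-CMOS rows in the platform list reach the needed range.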
Resistive Cross-point Array
Analog memory to emulate the fully connected synapses
– RRAM/SRAM for the synapse weight
– CMOS periphery circuits for input/output neurons
– Output current: $I_j = \sum_i (1/R_{ij}) \cdot V_i$
[Figure: original image; image patch X (100); dictionary D (1000 x 100); extracted feature Z (1000, sparse)]
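The current summation above is a matrix-vector product carried out by Ohm's law and Kirchhoff's current law. A minimal sketch in plain Python, with a hypothetical 3 x 2 array whose voltages and resistances are chosen only for illustration:

```python
def column_currents(voltages, resistances):
    """I_j = sum_i V_i / R_ij for a cross-point array.

    voltages: input voltages V_i in volts
    resistances: resistances[i][j] is R_ij in ohms
    """
    n_cols = len(resistances[0])
    return [
        sum(v / row[j] for v, row in zip(voltages, resistances))
        for j in range(n_cols)
    ]

# Hypothetical values, not device data.
V = [0.1, 0.2, 0.0]
R = [[1e5, 2e5],
     [1e5, 1e5],
     [5e4, 1e5]]
I = column_currents(V, R)
print([f"{i * 1e6:.2f} uA" for i in I])   # ['3.00 uA', '2.50 uA']
```

Every output line sums its currents at the same time, so the whole product takes one read step regardless of the array size.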
Synaptic Device
A multi-level memory cell to represent the synapse weight
CMOS option: multi-bit transposable SRAM
Metrics | Desired Targets | PCM | RRAM
Device Dimension | <10nm | ~20nm | ~10nm
Programming Voltage | <1V | <3V | <3V
Programming Speed | <μs | ~50ns | ~10ns
Energy Consumption | <10fJ/spike | ~10pJ/spike | ~100fJ/spike
Multi-level States | >100 | ~100 | ~30
Dynamic Range | >5 | >100 | >100
RRAM: Switching Dynamics
On top of CMOS, at the cross point; non-volatile
Cell conductance (1/R or G) for the weight D
G is tuned by the voltage and the pulse number (timing)
Issues: variability, non-linearity, process integration
[Figure: write voltage Vw. Source: S. H. Jo et al., Nano Letters 2009]
Circuits for the Algorithm
All cells are DC connected, different from the memory
The values of Z and X (or r) are represented by the number of voltage pulses; D by the RRAM conductance
[Figure: cross-point array of conductances G_ij (dictionary D), with input neurons (X or r, index i) and output neurons (Z, index j); each neuron has read and write circuits, with signals V_r,i, I_r,i, V_Z,j, I_Z,j]
Task | Operation
D·Z | $I_{r,i} = \sum_j G_{ij} \cdot V_{Z,j}$
D^T·r | $I_{Z,j} = \sum_i G_{ij} \cdot V_{r,i}$
D update | $\Delta G_{ij} = \eta \cdot r \cdot Z$
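The three array operations can be written out as ordinary loops to show which index each one sums over. This is a software sketch only: `G` holds hypothetical conductances with `G[i][j]` connecting input neuron i to output neuron j, and the per-cell form of the update (r taken at index i, Z at index j) is an assumed reading of the slide's η·r·Z.

```python
def d_times_z(G, z):
    """D·Z: drive the output lines with Z, read I_r,i = sum_j G_ij * V_Z,j."""
    return [sum(g * zj for g, zj in zip(row, z)) for row in G]

def dt_times_r(G, r):
    """D^T·r: drive the input lines with r, read I_Z,j = sum_i G_ij * V_r,i."""
    return [sum(G[i][j] * r[i] for i in range(len(G))) for j in range(len(G[0]))]

def update(G, r, z, eta):
    """D update: add eta * r_i * Z_j to every cell, in place."""
    for i, ri in enumerate(r):
        for j, zj in enumerate(z):
            G[i][j] += eta * ri * zj

# 2 input neurons x 3 output neurons, illustrative values only.
G = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
z = [1, 0, 2]
r = [1, 2]
print(d_times_z(G, z))    # [1.0, 6.0]
print(dt_times_r(G, r))   # [1.0, 4.0, 6.0]
update(G, r, z, eta=0.5)
print(G)                  # [[1.5, 2.0, 1.0], [1.0, 1.0, 5.0]]
```

The same array serves both D·Z and D^T·r; only the driven side and the sensed side swap.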
Read: Integrate-and-Fire
A current-to-digital converter, operating as the Integrate-and-Fire neuron model
[Figure: read circuit. Labels: input current I_r,i (or I_Z,j), 0 – 12 μA; capacitance Ccol (Crow); Vin, Vp, Vreset, Vspike, RE; ATB; D flip-flops; 8-bit spike counter Q[0] – Q[7]]
[Figure: Vin, Vspike, and RE waveforms, voltage (V) vs. time (ns), for I = 6μA and I = 1μA]
[Figure: number of pulses vs. current (µA), without and with ATB]
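The read step can be pictured as a capacitor that integrates the array current, emits a spike when it crosses a threshold, resets, and lets a counter record the spikes. The sketch below is a behavioral model, not the circuit on the slide: the capacitance, threshold, reset level, read window, and time step are assumed values, and the 8-bit counter is modeled only as saturation at 255.

```python
def spike_count(current_a, window_s=8e-9, cap_f=400e-15,
                v_th=0.53, v_reset=0.50, dt=1e-12):
    """Count threshold crossings of an integrating capacitor during a read window.

    All electrical parameters are hypothetical, chosen only for illustration.
    """
    v = v_reset
    count = 0
    for _ in range(int(round(window_s / dt))):
        v += current_a * dt / cap_f   # integrate: dV = I * dt / C
        if v >= v_th:                 # fire and reset
            count += 1
            v = v_reset
    return min(count, 255)            # an 8-bit counter cannot exceed 255

for i_ua in (1, 6, 12):
    print(i_ua, "uA ->", spike_count(i_ua * 1e-6), "spikes")
```

A larger current reaches the threshold sooner, so the spike count in a fixed window grows with the current; that count is the digital output.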
Write: SRDP
Write RRAM through the spiking rate between input (X or r) and output (Z) neurons
– Z value for the time window to write
– r value for the pulse number (firing rate)
$\Delta G_{ij} \propto \text{pulse number} = \text{Write Time} \cdot \text{Firing Rate} = \eta \cdot Z \cdot r$
Parallel Write: O(1)
>2000X speedup at 65nm
[Figure: conductance (nS) vs. pulse number; potentiation: +3V, 40ms; depression: -3V, 10ms]
[Figure: parallel write of a 2 x 2 array (cells D1 – D4, columns Col1 and Col2) with Z1 = 1, Z2 = 0.5 and R1 = 16, R2 = 8 for potentiation (R > 0), or R1 = -16, R2 = -8 for depression (R < 0); voltage waveforms vs. time for Z1, Z2, R1, R2]
– Pulse numbers: Z1⋅|R1| = 16, Z1⋅|R2| = 8, Z2⋅|R1| = 8, Z2⋅|R2| = 4
– Figure labels for the change from the initial state: ∆D = 77, ∆D = 39, ∆D = 43, ∆D = 26
[Figure: conductance (nS) vs. pulse number for D1 – D4; +3V, 100ms and -3V, 30ms]
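The pulse numbers in the 2 x 2 example follow from multiplying each write window (Z) by each firing rate (|R|). A small sketch that reproduces them; the function names are illustrative, not from the slides:

```python
def pulse_numbers(z_values, r_values):
    """Pulses seen by each cell: write window (Z) times firing rate (|R|)."""
    return [[z * abs(r) for r in r_values] for z in z_values]

def polarity(r):
    """The sign of R selects potentiation or depression."""
    if r > 0:
        return "potentiation"
    if r < 0:
        return "depression"
    return "none"

Z = [1, 0.5]
print(pulse_numbers(Z, [16, 8]))     # [[16, 8], [8.0, 4.0]]
print(pulse_numbers(Z, [-16, -8]))   # [[16, 8], [8.0, 4.0]]
print(polarity(16), polarity(-16))   # potentiation depression
```

All four cells receive their pulses within the same write window, which is what makes the write step independent of the array size.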
Array Integration
Peripheral circuits consume significant area
Solution: scaling up the array size; non-CMOS neurons
[Figure: 130nm 1T1R array]
Realistic Device Properties (1)
Non-zero off-state conductance; limited levels / precision
Fixed-point computing
– Weight (D): 6 bits (64 levels)
– Output (Z): 4 bits
– On/off ratio needs to be > 25
Solution: spatial redundancy to solve non-zero off-state
[Figure: dictionary array with Z input and column outputs D_{i-1}Z, D_iZ, D_{i+1}Z; a dummy column of devices with minimum conductance; + and - signs at the column outputs]
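One way to read the fixed-point limits and the dummy column is sketched below. The uniform quantizer, the linear weight-to-conductance map, and the idea that the dummy column's current is subtracted from each column output are assumptions made for illustration; the slide states only the bit widths, the on/off ratio, and that spatial redundancy is used.

```python
def quantize(x, bits, x_max=1.0):
    """Round x in [0, x_max] to one of 2**bits uniformly spaced levels."""
    steps = 2 ** bits - 1
    x = min(max(x, 0.0), x_max)
    return round(x / x_max * steps) / steps * x_max

def weight_to_conductance(w, g_min, g_max, bits=6):
    """Map a weight in [0, 1] to a conductance in [g_min, g_max] (6-bit by default)."""
    return g_min + quantize(w, bits) * (g_max - g_min)

def column_output(weights, z, g_min, g_max):
    """Column current minus the current of a dummy column held at g_min."""
    z_q = [quantize(v, 4) for v in z]   # 4-bit outputs
    col = sum(weight_to_conductance(w, g_min, g_max) * v
              for w, v in zip(weights, z_q))
    dummy = sum(g_min * v for v in z_q)  # off-state contribution of every cell
    return col - dummy

# Normalized conductances with an on/off ratio of 25 (g_max / g_min).
print(column_output([0.0, 1.0], [1.0, 1.0], g_min=1.0, g_max=25.0))   # 24.0
print(column_output([0.0, 0.0], [1.0, 0.5], g_min=1.0, g_max=25.0))   # 0.0
```

With the subtraction, a column of zero weights reads as zero even though every device still conducts at its minimum level.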
Realistic Device Properties (2)
Nonlinear, noisy, poor endurance (habituation in programming)
[Figure: decay in RRAM write (habituation): ∆conductance (%) vs. number of writes]
[Figure: ideal (software) vs. realistic (with resistive synaptic device); x-axis 10k – 60k, y-axis 40 – 90]
These hardware problems (variations, unreliable synapses) and performance demands (real time, on-line learning, and mobile) co-exist in biological cortical and sensory systems!
A bio-plausible solution: robust, low power, accurate, on-line
[S. Yu, et al., IEDM 2015]
RHINO: A Biomimetic Solution
Inspired by the olfactory system in insects and the network motif that is general in biological processes
[Figure: insect olfactory system: Antennal Lobe (AL); Mushroom Body (MB); Kenyon Cells (KCs), 15,000; Lateral Horn Interneurons (LHIs), 100. Source: Nature Review, 2007]
Network Structure and Rules
Rewarding for associative (supervised) learning
Inhibition to speed up the formation of sparsity
Habituation (decay in learning rate) to achieve convergence
STDP/SRDP rules with rewarding to update the W's
Constructive role of noise and habituation
No global operations (normalization, etc.)
[Figure: network with input (X), 28 x 28; output (E), 2000; inhibition (I), 100; classifier (C); reward]
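The feedforward inhibition motif can be sketched as a rate-based forward pass in which the input drives both the E and I layers and the I layer then subtracts from E. This is a simplification under stated assumptions: the slides describe spiking neurons, while the rectified rates, the weight shapes, and the `threshold` parameter here are hypothetical.

```python
def matvec(W, x):
    """Plain matrix-vector product; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, w_x2e, w_x2i, w_i2e, threshold=0.0):
    """X drives E and I directly; I then suppresses E (feedforward inhibition)."""
    inhibition = matvec(w_x2i, x)          # I layer activity
    excitation = matvec(w_x2e, x)          # direct drive to E
    suppress = matvec(w_i2e, inhibition)   # inhibitory input from I to E
    return [max(e - s - threshold, 0.0) for e, s in zip(excitation, suppress)]

x = [1, 2]
w_x2e = [[1, 0], [0, 1], [1, 1]]   # 3 E neurons, 2 inputs
w_x2i = [[0.5, 0.5]]               # 1 I neuron
w_i2e = [[1], [1], [1]]            # I inhibits every E neuron equally
print(forward(x, w_x2e, w_x2i, w_i2e))   # [0.0, 0.5, 1.5]
```

Because the inhibition scales with the overall input strength, weakly driven E neurons are silenced first, which pushes the output toward a sparse code.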
Training Procedure
Initialization
– WX2E and WX2I are initialized randomly, with 50% connectivity; WI2E is uniformly initialized
Training through global feedback from C, no local iteration
Training is full-image based, mainly feedforward
Steps: initialize; compute reward and train WE2C; train excitation WX2E and WX2I; train inhibition WI2E
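A reward-gated, rate-based update with a decaying learning rate might look like the sketch below. The slides name rewarded SRDP and habituation but give no formulas, so the product rule, the 1/(1 + decay·n) schedule, and every parameter value are assumptions made for illustration.

```python
def habituated_rate(eta0, n_updates, decay=1e-3):
    """Learning rate that shrinks with the number of updates (assumed schedule)."""
    return eta0 / (1.0 + decay * n_updates)

def rewarded_srdp_step(W, pre, post, reward, eta):
    """W[j][i] += eta * reward * post_j * pre_i.

    Uses only the local pre/post rates plus one global reward signal.
    """
    for j, post_j in enumerate(post):
        for i, pre_i in enumerate(pre):
            W[j][i] += eta * reward * post_j * pre_i
    return W

W = [[0.0, 0.0]]
rewarded_srdp_step(W, pre=[1, 2], post=[1], reward=+1, eta=habituated_rate(0.1, 0))
print(W)   # [[0.1, 0.2]]
print(habituated_rate(0.1, 1000))   # 0.05
```

A positive reward strengthens the synapses between co-active neurons and a negative reward weakens them, with no normalization or other global operation on the weights.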
Demonstration: MNIST
MNIST for handwriting recognition
– Data represented by 0 – 50 spikes
– Full image 28 x 28
– No pooling or normalization
– 50% connectivity of WX2E and WX2I
Network: X: 28 x 28; E: 2000; I: 100; C: 10
[Figure: without inhibition vs. with inhibition; x-axis 0k – 60k, y-axis 4 – 16]
[Figure: RHINO vs. sparse coding vs. no feedforward I; x-axis 0 – 60, y-axis 82 – 96]
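The 0 – 50 spike representation can be illustrated by mapping each pixel intensity to a spike count. The linear map and the 8-bit pixel range are assumptions; the slide gives only the spike range.

```python
def pixel_to_spikes(pixel, max_spikes=50, max_pixel=255):
    """Linear map from a pixel intensity to a spike count in 0..max_spikes."""
    pixel = min(max(pixel, 0), max_pixel)   # clamp out-of-range values
    return round(pixel * max_spikes / max_pixel)

print([pixel_to_spikes(p) for p in (0, 51, 128, 255)])   # [0, 10, 25, 50]
```

Applied to all 28 x 28 pixels, this gives 784 spike counts that feed the input layer X directly, with no pooling or normalization step.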
Neuron Firing Rate
Homeostatic balance, which controls overfiring of the output neurons, is essential for learning
[Figure: firing rate of 2000 E neurons for handwritten digits (10 categories, 0 – 9): before training, with homeostatic balance, and without homeostatic balance]
Sparsity and Noise for Accuracy
Sparsity under thresholding: an appropriate range is necessary
Initial randomness: without noise, learning cannot start
Habituation, similar to the learning rate, is critical for convergence
[Figure: accuracy (%) vs. number of training images (k), for percentage of firing neurons: 2.5%, 5%, 10%, 15%, 20%]
[Figure: curves labeled 100%, 40%, 80%, 20%, 60%, 10%; x-axis 0K – 8K, y-axis 75 – 90]
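Sparsity under thresholding can be sketched by picking the threshold that lets a target fraction of neurons fire. The selection rule below (fire the most active neurons) is an assumption for illustration; the slide specifies only the percentage of firing neurons.

```python
def fire_top_fraction(activations, fraction):
    """Return a 0/1 firing pattern in which about `fraction` of the neurons fire."""
    k = max(1, round(fraction * len(activations)))
    threshold = sorted(activations, reverse=True)[k - 1]
    # Ties at the threshold all fire, so the count can slightly exceed k.
    return [1 if a >= threshold else 0 for a in activations]

acts = [0.1, 0.9, 0.4, 0.7, 0.2, 0.3, 0.8, 0.5, 0.6, 0.0]
pattern = fire_top_fraction(acts, 0.2)
print(pattern)        # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
print(sum(pattern))   # 2
```

Raising `fraction` lowers the threshold and makes the code denser; the slide's point is that the fraction has to sit in an appropriate range.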
Size Reduction
With 100 I neurons, the network size of E is reduced by 3X at the same accuracy of 95%
The mechanism is similar to the residual net [Microsoft, 2015]
[Figure: w/ inhibition (E + I) vs. w/o inhibition (E only); diagram of AL, KCs, and LHIs with + and ‒ connections]
Results Comparison
Reference | Input | Data format and precision | Learning rules | Number of neurons | Number of parameters | Number of images | Accuracy
Mushroom body | 28x28 | Spike | Rewarded STDP | 50000 | 5E5 | 60000 | 87%
Two layer SNN (Querlioz, 2013) | 28x28 | Spike | STDP | 300 | 2.4E5 | 60000x3 | 93.5%
Unsupervised ETH | 28x28 | Spike | STDP | 6400 | 4.6E7 | 200000 | 95.0%
This work | 28x28 | Spike rate in a 50 window | Rewarded SRDP | 2100 | 8.4E5 | 60000 | 95.0%
This work | 28x28 | Spike rate in a 50 window | Rewarded SRDP | 6000 | 2.4E6 | 60000 | 96.2%
Spiking RBM | 28x28 | Spike rate | Contrastive divergence | 500 | 3.9E5 | 20000 | 92.6%
Sparse Coding | 10x10 patch | 3-bit number | Gradient | 300 | 3E4 | 60000x10 | 94.0%
Two layer NN | 28x28 | Floating number | Gradient descent | 1000 | 7.8E5 | 60000 | 95.5%
Spiking CNN | 28x28 | Spike timing | Regenerative learning | 5.6E4 | 1.2E5 | 60000 | 99.08%
Summary
Resistive cross-point array: an analog platform for synaptic operations
– >2000X speedup; accuracy degradation due to device issues
RHINO: a bio-inspired algorithm
– >95% accuracy and 3X size reduction, using “negative” effects
Future: brain-inspired hardware-algorithm for low precision, compact network, and high energy efficiency