Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: impact of conductance response

Severin Sidler∗‡, Irem Boybat∗‡, Robert M. Shelby∗, Pritish Narayanan∗, Junwoo Jang∗†, Alessandro Fumarola∗‡, Kibong Moon†, Yusuf Leblebici‡, Hyunsang Hwang†, and Geoffrey W. Burr∗

∗IBM Research–Almaden, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927–1512, Email: [email protected]
†Department of Material Science and Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea
‡EPFL, Lausanne, CH–1015 Switzerland
Abstract—We assess the impact of the conductance response of Non-Volatile Memory (NVM) devices employed as the synaptic weight element for on-chip acceleration of the training of large-scale artificial neural networks (ANN). We briefly review our previous work towards achieving competitive performance (classification accuracies) for such ANN with both Phase-Change Memory (PCM) [1], [2] and non-filamentary ReRAM based on PrCaMnO (PCMO) [3], and towards assessing the potential advantages for ML training over GPU–based hardware in terms of speed (up to 25× faster) and power (from 120–2850× lower power) [4]. We then discuss the “jump-table” concept, previously introduced to model real-world NVM such as PCM [1] or PCMO, to describe the full cumulative distribution function (CDF) of conductance-change at each device conductance value, for both potentiation (SET) and depression (RESET). Using several types of artificially–constructed jump-tables, we assess the relative importance of deviations from an ideal NVM with perfectly linear conductance response.
I. INTRODUCTION
By performing computation at the location of data, non-Von Neumann (non–VN) computing ought to provide significant power and speed benefits (Fig. 1) on specific and presumably important tasks. For one such non–VN approach — on-chip training of large-scale ANN using NVM-based synapses [1]–[4] — viability will require several things. First, despite the inherent imperfections of NVM devices such as Phase Change Memory (PCM) [1], [2] or Resistive RAM (RRAM) [3], such NVM-based networks must achieve competitive performance levels (e.g., classification accuracies) when compared to ANN
Fig. 1. In the Von Neumann architecture (a), data (both operations and operands) must move to and from the dedicated Central Processing Unit (CPU) along a bus. In contrast, in a Non–Von Neumann architecture (b), distributed computations take place at the location of the data, reducing the time and energy spent moving data around [1].
trained using CPUs or GPUs. Second, the benefits of performing computation at the data (Fig. 2) must confer a decided advantage in either training power or speed (or preferably, both). And finally, any on-chip accelerator should be applicable towards networks of different types (fully–connected “Deep” NN or Convolutional NN) and/or be reconfigurable for networks of different shapes (wide with many neurons, or deep with many layers).
We briefly review our work [1]–[4] in assessing the accuracy, speed and power potential of on-chip NVM–based ML.
A. Potential for competitive classification accuracies
Using 2 phase-change memory (PCM) devices per synapse, we demonstrated a 3–layer perceptron with 164,885 synapses [1], trained with backpropagation [5] on a subset (5,000 examples) of the MNIST database of handwritten digits [6] (Fig. 3), using a modified weight-update rule compatible with NVM+selector crossbar arrays [1]. We showed that this modification does not degrade the high “test” (generalization) accuracies such a 3–layer network inherently delivers on this problem when trained in software [1]. However, nonlinearity and asymmetry in PCM conductance response limited both “training” and “test” accuracy in these initial experiments to 82–83% [1] (Fig. 4).
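As a rough illustration of what a crossbar-compatible weight update entails (a generic sketch under our own assumptions, not the modified rule published in [1]): the analog backpropagation update ∆w = η·x·δ must be quantized into a small integer number of identical programming pulses of the appropriate polarity, since the array can only fire fixed pulses. The function name, pulse cap, and rounding policy below are all illustrative:

```python
# Hedged sketch: quantize a backprop update eta*x*delta into a small
# number of identical programming pulses (positive = SET direction,
# negative = RESET direction). Not the published update rule of [1].
def pulses_for_update(eta, x, delta, dg_per_pulse, max_pulses=4):
    """Return a signed pulse count approximating the analog update."""
    dw = eta * x * delta
    n = int(round(abs(dw) / dg_per_pulse))  # nearest whole pulse count
    n = min(n, max_pulses)                  # hardware caps pulses/update
    polarity = 1 if dw >= 0 else -1
    return polarity * n
```

Note how small updates round to zero pulses and large updates saturate at the cap, which is one source of the quantization effects the crossbar-compatible rule must tolerate.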
Fig. 2. Neuro-inspired non-Von Neumann computing [1]–[4], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs [1].
Asymmetry (between the gentle conductance increases of PCM partial–SET and the abruptness of PCM RESET) was mitigated by an occasional RESET strategy, which could be both infrequent and inaccurate [1]. While in these initial experiments network parameters such as the learning rate η had to be tuned very carefully, a modified ‘LG’ algorithm offered wider tolerance to η, higher classification accuracies, and lower training energy [4].
Tolerancing results showed that all NVM-based ANN can be expected to be highly resilient to random effects (NVM variability, yield, and stochasticity), but highly sensitive to “gradient” effects that act to steer all synaptic weights [1]. We showed that a bidirectional NVM with a symmetric, linear conductance response of finite but large dynamic range (e.g., each conductance step is relatively small) can deliver the same high classification accuracies on the MNIST digits as a conventional, software-based implementation (Fig. 5). One key observation is the importance of avoiding constraints on weight magnitude that arise when the two conductances are either both small or both large — e.g., synapses should remain in the center stripe of the “G-diamond” [2].
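The “G-diamond” constraint can be illustrated with a short sketch of the two-conductance weight encoding. The conductance bounds and the re-centering routine below are illustrative assumptions, not the published occasional-RESET pulse scheme:

```python
# Hedged sketch: a signed weight encoded as the difference of two
# bounded conductances (arbitrary units; bounds are illustrative).
G_MIN, G_MAX = 0.0, 1.0

def weight(g_plus, g_minus):
    """Effective synaptic weight of a conductance pair."""
    return g_plus - g_minus

def headroom(g_plus, g_minus):
    """Remaining room to raise W (via G+) and to lower W (via G-).
    Pairs where both conductances are near G_MAX have little headroom
    in either direction, which is why synapses should stay near the
    center stripe of the 'G-diamond'."""
    return G_MAX - g_plus, G_MAX - g_minus

def recenter(g_plus, g_minus):
    """Illustrative re-centering: preserve W = G+ - G- while shifting
    both conductances down toward G_MIN to restore headroom."""
    shift = min(g_plus, g_minus) - G_MIN
    return g_plus - shift, g_minus - shift
```

For example, the pair (0.9, 0.7) has almost no headroom left, but re-centering maps it to (0.2, 0.0) with the same effective weight of 0.2.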
In this paper, we extend upon this observation to explore the impact of specific deviations from such an idealized linear conductance response.

Fig. 3. In forward evaluation of a multilayer perceptron, each layer’s neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by input (for instance, pixels from successive MNIST images, cropped to 22×24); the 10 output neurons classify which digit was presented [1].

Fig. 4. Training accuracy for a 3–layer perceptron of 164,885 hardware synapses [1], with all weight operations taking place on a 500 × 661 array of mushroom-cell PCM devices (2 PCM/synapse × 164,885 synapses, plus 730 unused PCM). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment [1].
B. Comparative analysis of speed and power
We have also assessed the potential advantages, in terms of speed and power, of on-chip machine learning (ML) of large-scale artificial neural networks (ANN) using Non-Volatile Memory (NVM)-based synapses, in comparison to conventional GPU–based hardware [4].

Under moderately-aggressive assumptions for parallel–read and –write speed, PCM-based on-chip machine learning can potentially offer lower power and faster training (per ANN example) than GPU-based training for both large and small networks (Fig. 6), even with the time and energy required for occasional RESET (forced by the large asymmetry between gentle partial-SET and abrupt RESET in PCM). Critical here is the design of area-efficient read/write circuitry, so that many copies of this circuitry can operate in parallel (each handling a small number of columns (rows), cs).
Fig. 5. When the dynamic range of the linear response is large, the classification accuracy can reach that of the original network (a test accuracy of 94% when trained with 5,000 images; of 97% when trained with all 60,000 images) [2].

Fig. 6. Predicted training time (per ANN example) and power for 5 ANNs, ranging from 0.2 GB to nearly 6 GB [4]. Under moderately-aggressive assumptions for parallel–read and –write speed, PCM-based on-chip machine learning can offer lower power and faster training for both large and small networks [4].

II. JUMP-TABLE CONCEPT

A useful concept in modeling the behavior of real NVM devices for neuromorphic applications is that of a “jump-table.” For backpropagation training, where one or more copies of the same programming pulse are applied to the NVM to adjust the weights [1], we simply need one jump-table for potentiation (SET) and one for depression (RESET).

With a pair of such jump-tables, we can capture the nonlinearity of conductance response as a function of conductance (e.g., the same pulse might create a large “jump” at low conductance, but a much smaller jump at high conductance), the asymmetry between positive (SET) and negative (RESET) conductance changes, and the inherent stochastic nature of each jump. Fig. 7(a) plots the median conductance change for potentiation (blue) together with the ±1σ stochastic variation about this median change (red). Fig. 7(b) shows the jump-table that fully captures this conductance response, plotting the cumulative probability (in color, from 0 to 100%) of any conductance change ∆G at any given initial conductance G. This table is ideal for computer simulation because a random number r (a uniform deviate between 0.0 and 1.0) can be converted to the resulting ∆G produced by a single pulse by scanning along the row associated with the conductance G (of the device before the pulse is applied) to find the point at which the table entry just exceeds r.
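The row-scan procedure just described can be sketched in a few lines. The table shape, ∆G bins, and toy CDF values below are illustrative assumptions, not the measured PCM or PCMO tables:

```python
import numpy as np

# Hedged sketch of drawing one conductance change from a jump-table.
# table[i, :] holds the CDF of dG for conductance bin i; dg_values
# holds the dG associated with each column. Values are illustrative.
def sample_jump(table, dg_values, g_bin, rng):
    """Convert a uniform deviate r into the dG produced by one pulse,
    by scanning the CDF row for this conductance bin until the table
    entry just exceeds r."""
    r = rng.random()                              # uniform in [0, 1)
    col = np.searchsorted(table[g_bin], r, side="right")
    return dg_values[min(col, len(dg_values) - 1)]

# Toy 2-row table: at low G, large jumps are likely; at high G, small.
dg_values = np.array([0.0, 1.0, 2.0, 3.0])
table = np.array([
    [0.0, 0.1, 0.5, 1.0],   # low-G row: big jumps dominate
    [0.5, 0.9, 1.0, 1.0],   # high-G row: small jumps dominate
])
rng = np.random.default_rng(0)
```

Sampling many jumps from the low-G row yields a much larger mean ∆G than the high-G row, reproducing the nonlinearity (large jumps at low conductance, small jumps at high conductance) that the table encodes.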
III. IMPACT OF NONLINEAR CONDUCTANCE RESPONSE
We have previously used a measured jump-table to simulate the SET response of PCM devices [1], and are currently exploring the use of similarly measured jump-tables for PCMO. In order to develop an intuitive understanding of the impact that various features of such jump-tables have on classification performance in the ANN application, we study various artificially-constructed jump-tables. Except for the specific jump-tables, these simulations are identical to those performed in Ref. [1], spanning 20 epochs.
Fig. 7. (a) Example median (blue) and ±1σ (red) conductance response for potentiation. (b) Associated jump-table that fully captures this (artificially constructed, in this case) conductance response, with the cumulative probability (from 0 to 100%) of any conductance change ∆G at any given initial conductance G plotted in color.

Fig. 8. (a) For a set of constructed linear conductance responses where the depression (RESET, magenta) response is steeper than the base potentiation (SET, green) response, the (b) resulting jump-table shows larger (but constant) steps for RESET. (For clarity, only the median response is shown.) (c) Although even a small SET/RESET asymmetry causes performance to fall off steeply (solid curves with filled symbols), the downstream neuron can partially compensate for this asymmetry by firing fewer RESET pulses (or more SET pulses). Inset shows the same data plotted on a linear horizontal scale.

Fig. 9. Impact of the relative extent of the linear region on neural network performance. (RESET conductance response remains linear at all times.) (a) Conductance vs. number of pulses, (b) hypothetical jump-tables studied, and (c) impact on training and test accuracy. A substantial non-linear conductance region (up to ∼50%) can be accommodated without loss in application performance.

The first question we address is the impact of asymmetry in conductance response. Here we assume both conductance responses are linear (Fig. 8(a)), but the RESET conductance response is much steeper than SET, so that the step size of the depression (RESET) jump-table is increased (Fig. 8(b)). As shown by the solid curves with filled symbols in Fig. 8(c), even a small degree of asymmetry can cause classification accuracy to fall steeply. However, each downstream neuron has knowledge of the sign of the backpropagated correction, δ, and thus knows whether it is attempting a SET or a RESET. This implies that asymmetry can be partly offset by “correcting” a steeper RESET response by firing commensurately fewer RESET pulses (or more SET pulses). As shown by the dotted curves with open symbols in Fig. 8(c), this markedly expands the asymmetry that could potentially be accommodated.
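The compensation described above amounts to a simple pulse-count rescaling by the downstream neuron; the rounding policy and the one-pulse floor below are illustrative assumptions:

```python
# Hedged sketch of correcting SET/RESET asymmetry by firing fewer
# RESET pulses. `asymmetry` = |dG_RESET / dG_SET| >= 1; the rounding
# and one-pulse floor are illustrative choices, not from the paper.
def corrected_pulses(n_nominal, asymmetry):
    """Pulse count for a depression step when each RESET pulse moves
    conductance `asymmetry` times farther than a SET pulse."""
    return max(1, round(n_nominal / asymmetry))
```

With an asymmetry of 5, a nominal 10-pulse depression is reduced to 2 pulses, so the total conductance change stays close to the intended update.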
Fig. 9 examines jump-tables that incorporate some degree of initial non-linearity in the SET conductance response (Fig. 9(a)). The relative extent of the linear region is varied from 100% (fully linear) down to near 0% (fully nonlinear). For this and all subsequent studies, we assume that RESET operations remain perfectly linear and symmetric to SET (Fig. 9(b)). We find that a substantial non-linear conductance region (up to ∼50%) can be accommodated without a significant drop-off in neural network performance (Fig. 9(c)).
Fig. 10 examines the impact of the strength of this initial non-linearity on neural network performance. In these experiments, a stronger (weaker) non-linearity implies fewer (more) steps to traverse the extent of the non-linear region (representing 25% of the total conductance range, Fig. 10(a)).
Fig. 10. Impact of the strength of an initial non-linearity on neural network performance. (a) Conductance vs. number of pulses, (b) hypothetical jump-tables studied, and (c) impact on training and test accuracy. The strength of an initial non-linearity does not impact test classification accuracy, so long as a sufficiently large linear region is available.
Fig. 11. Impact of a fully non-linear conductance response. (a) Conductance vs. number of pulses, (b) hypothetical jump-tables studied, and (c) impact on training and test accuracy. Even in the absence of a linear region it is possible to achieve high performance; however, the ratio of minimum to maximum conductance change needs to be sufficiently large (>0.5).
The strength is defined as the ratio between the size of the final (minimum) conductance jump and the initial (maximum) conductance jump (Fig. 10(b)). Again, we find that the strength of the non-linearity has little impact on the test accuracy (Fig. 10(c)), so long as the linear region is sufficiently large.

We also investigate fully non-linear conductance responses of varying strengths (Figs. 11(a) and (b)). We find that it is still possible to achieve high classification accuracies (Fig. 11(c)), so long as the ratio of the minimum to maximum conductance jumps is >0.5. However, larger non-linearities cause a marked drop-off in network performance, as a large portion of the dynamic range can be used up by just a few training pulses.
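One simple way to construct a fully non-linear conductance response with a prescribed minimum-to-maximum jump ratio, in the spirit of those studied here, is a geometrically shrinking jump size. The geometric-decay form and the parameter names are our own assumptions:

```python
# Hedged sketch: build a conductance trace whose per-pulse jump
# shrinks geometrically from dg_max down to ratio*dg_max, so that
# `ratio` = min(dG_SET)/max(dG_SET) as defined in the text.
def nonlinear_response(n_pulses, ratio, g0=0.0, dg_max=1.0):
    """Return the conductance after each of n_pulses SET pulses."""
    decay = ratio ** (1.0 / max(n_pulses - 1, 1))  # per-pulse shrink
    g, dg, trace = g0, dg_max, [g0]
    for _ in range(n_pulses):
        g += dg
        trace.append(g)
        dg *= decay
    return trace

trace = nonlinear_response(50, 0.5)
```

With ratio = 0.5 the last jump is exactly half the first; with a much smaller ratio, the first few pulses consume most of the dynamic range, which is the regime where network performance drops off markedly.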
IV. CONCLUSION
We have assessed the impact of the conductance response of Non-Volatile Memory (NVM) devices employed as the synaptic weight element for on-chip acceleration of the training of large-scale artificial neural networks (ANN). We briefly reviewed our previous work towards achieving competitive performance (classification accuracies) for such ANN with both Phase-Change Memory [1], [2] and non-filamentary ReRAM based on PrCaMnO (PCMO) [3], and towards assessing the potential advantages for ML training over GPU–based hardware in terms of speed (up to 25× faster) and power (from 120–2850× lower power) [4]. We discussed the “jump-table” concept, previously introduced to model real-world NVM such as PCM [1] or PCMO, to describe the full cumulative distribution function (CDF) of the resulting conductance change at each possible conductance value, for both potentiation (SET) and depression (RESET).
Using various artificially–constructed jump-tables, we assessed the relative importance of deviations from an ideal NVM with a linear conductance response. While even a small SET/RESET asymmetry between otherwise linear conductance responses can cause performance to fall off steeply, downstream neurons can partially compensate for this asymmetry by firing fewer RESET pulses (or more SET pulses), allowing reasonable performance even in the presence of a significant asymmetry. We also found that a substantial non-linear conductance region (up to ∼50%) can be accommodated, and that the strength of this initial non-linearity (the ratio of minimum to maximum conductance change) can be significant, so long as a sufficiently large linear region is available. Even with fully nonlinear responses, it is possible to achieve high performance so long as the ratio of minimum to maximum conductance change is sufficiently close to unity (>0.5).
While the ‘LG’ algorithm, together with other approaches, should help a nonlinear, asymmetric NVM (such as PCM) act more like an ideal linear, bidirectional NVM, the identification of NVM devices and/or pulse schemes that can offer a conductance response that is at least partly linear will help significantly in achieving high classification accuracies.
REFERENCES
[1] G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang, “Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element,” in IEDM, 2014, p. 29.5.
[2] G. W. Burr, R. M. Shelby, S. Sidler, C. di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang, “Experimental demonstration and tolerancing of a large–scale neural network (165,000 synapses), using phase–change memory as the synaptic weight element,” IEEE Trans. Electr. Dev., vol. 62, no. 11, pp. 3498–3507, 2015.
[3] J.-W. Jang, S. Park, G. W. Burr, H. Hwang, and Y.-H. Jeong, “Optimization of conductance change in Pr1−xCaxMnO3–based synaptic devices for neuromorphic systems,” IEEE Electron Device Letters, vol. 36, no. 5, pp. 457–459, 2015.
[4] G. W. Burr, P. Narayanan, R. M. Shelby, S. Sidler, I. Boybat, C. di Nolfo, and Y. Leblebici, “Large–scale neural networks implemented with nonvolatile memory as the synaptic weight element: comparative performance analysis (accuracy, speed, and power),” in IEDM Technical Digest, 2015, p. 4.4.
[5] D. Rumelhart, G. E. Hinton, and J. L. McClelland, “A general framework for parallel distributed processing,” in Parallel Distributed Processing. MIT Press, 1986.
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, p. 2278, 1998.