Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: impact of conductance response

Severin Sidler∗‡, Irem Boybat∗‡, Robert M. Shelby∗, Pritish Narayanan∗, Junwoo Jang∗†, Alessandro Fumarola∗‡, Kibong Moon†, Yusuf Leblebici‡, Hyunsang Hwang†, and Geoffrey W. Burr∗

∗IBM Research–Almaden, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927–1512, Email: [email protected]
†Department of Material Science and Engineering, Pohang University of Science and Technology, Pohang 790-784, Korea
‡EPFL, Lausanne, CH–1015 Switzerland
Abstract—We assess the impact of the conductance response of Non-Volatile Memory (NVM) devices employed as the synaptic weight element for on-chip acceleration of the training of large-scale artificial neural networks (ANN). We briefly review our previous work towards achieving competitive performance (classification accuracies) for such ANN with both Phase-Change Memory (PCM) [1], [2] and non-filamentary ReRAM based on PrCaMnO (PCMO) [3], and towards assessing the potential advantages for ML training over GPU–based hardware in terms of speed (up to 25× faster) and power (from 120–2850× lower power) [4]. We then discuss the “jump-table” concept, previously introduced to model real-world NVM such as PCM [1] or PCMO, to describe the full cumulative distribution function (CDF) of conductance-change at each device conductance value, for both potentiation (SET) and depression (RESET). Using several types of artificially–constructed jump-tables, we assess the relative importance of deviations from an ideal NVM with perfectly linear conductance response.
I. INTRODUCTION
By performing computation at the location of data, non-Von Neumann (non–VN) computing ought to provide significant power and speed benefits (Fig. 1) on specific and presumably important tasks. For one such non–VN approach — on-chip training of large-scale ANN using NVM-based synapses [1]–[4] — viability will require several things. First, despite the inherent imperfections of NVM devices such as Phase Change Memory (PCM) [1], [2] or Resistive RAM (RRAM) [3], such NVM-based networks must achieve competitive performance levels (e.g., classification accuracies) when compared to ANN
Fig. 1. In the Von Neumann architecture (a), data (both operations and operands) must move to and from the dedicated Central Processing Unit (CPU) along a bus. In contrast, in a Non–Von Neumann architecture (b), distributed computations take place at the location of the data, reducing the time and energy spent moving data around [1].
trained using CPUs or GPUs. Second, the benefits of performing computation at the data (Fig. 2) must confer a decided advantage in either training power or speed (or preferably, both). And finally, any on-chip accelerator should be applicable towards networks of different types (fully–connected “Deep” NN or Convolutional NN) and/or be reconfigurable for networks of different shapes (wide with many neurons, or deep with many layers).
We briefly review our work [1]–[4] in assessing the accuracy, speed and power potential of on-chip NVM–based ML.
A. Potential for competitive classification accuracies
Using 2 phase-change memory (PCM) devices per synapse, we demonstrated a 3–layer perceptron with 164,885 synapses [1], trained with backpropagation [5] on a subset (5,000 examples) of the MNIST database of handwritten digits [6] (Fig. 3), using a modified weight-update rule compatible with NVM+selector crossbar arrays [1]. We showed that this modification does not degrade the high “test” (generalization) accuracies such a 3–layer network inherently delivers on this problem when trained in software [1]. However, nonlinearity and asymmetry in PCM conductance response limited both “training” and “test” accuracy in these initial experiments to 82–83% [1] (Fig. 4).
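As a rough illustration of what a crossbar-compatible weight update entails (a generic sketch under our own assumptions, not the modified rule published in [1]): the analog backpropagation update ∆w = η·x·δ must be quantized into a small integer number of identical programming pulses of the appropriate polarity, since the array can only fire fixed pulses. The function name, pulse cap, and rounding policy below are all illustrative:

```python
# Hedged sketch: quantize a backprop update eta*x*delta into a small
# number of identical programming pulses (positive = SET direction,
# negative = RESET direction). Not the published update rule of [1].
def pulses_for_update(eta, x, delta, dg_per_pulse, max_pulses=4):
    """Return a signed pulse count approximating the analog update."""
    dw = eta * x * delta
    n = int(round(abs(dw) / dg_per_pulse))  # nearest whole pulse count
    n = min(n, max_pulses)                  # hardware caps pulses/update
    polarity = 1 if dw >= 0 else -1
    return polarity * n
```

Note how small updates round to zero pulses and large updates saturate at the cap, which is one source of the quantization effects the crossbar-compatible rule must tolerate.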
Fig. 2. Neuro-inspired non-Von Neumann computing [1]–[4], in which neurons activate each other through dense networks of programmable synaptic weights, can be implemented using dense crossbar arrays of nonvolatile memory (NVM) and selector device-pairs [1].
Asymmetry (between the gentle conductance increases of PCM partial–SET and the abruptness of PCM RESET) was mitigated by an occasional RESET strategy, which could be both infrequent and inaccurate [1]. While in these initial experiments network parameters such as the learning rate η had to be tuned very carefully, a modified ‘LG’ algorithm offered wider tolerance to η, higher classification accuracies, and lower training energy [4].
Tolerancing results showed that all NVM-based ANN can be expected to be highly resilient to random effects (NVM variability, yield, and stochasticity), but highly sensitive to “gradient” effects that act to steer all synaptic weights [1]. We showed that a bidirectional NVM with a symmetric, linear conductance response of finite but large dynamic range (e.g., each conductance step is relatively small) can deliver the same high classification accuracies on the MNIST digits as a conventional, software-based implementation (Fig. 5). One key observation is the importance of avoiding constraints on weight magnitude that arise when the two conductances are either both small or both large — e.g., synapses should remain in the center stripe of the “G-diamond” [2].
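The “G-diamond” constraint can be illustrated with a short sketch of the two-conductance weight encoding. The conductance bounds and the re-centering routine below are illustrative assumptions, not the published occasional-RESET pulse scheme:

```python
# Hedged sketch: a signed weight encoded as the difference of two
# bounded conductances (arbitrary units; bounds are illustrative).
G_MIN, G_MAX = 0.0, 1.0

def weight(g_plus, g_minus):
    """Effective synaptic weight of a conductance pair."""
    return g_plus - g_minus

def headroom(g_plus, g_minus):
    """Remaining room to raise W (via G+) and to lower W (via G-).
    Pairs where both conductances are near G_MAX have little headroom
    in either direction, which is why synapses should stay near the
    center stripe of the 'G-diamond'."""
    return G_MAX - g_plus, G_MAX - g_minus

def recenter(g_plus, g_minus):
    """Illustrative re-centering: preserve W = G+ - G- while shifting
    both conductances down toward G_MIN to restore headroom."""
    shift = min(g_plus, g_minus) - G_MIN
    return g_plus - shift, g_minus - shift
```

For example, the pair (0.9, 0.7) has almost no headroom left, but re-centering maps it to (0.2, 0.0) with the same effective weight of 0.2.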
In this paper, we extend upon this observation to explore the impact of specific deviations from such an idealized linear conductance response.

Fig. 3. In forward evaluation of a multilayer perceptron, each layer’s neurons drive the next layer through weights wij and a nonlinearity f(). Input neurons are driven by input (for instance, pixels from successive MNIST images, cropped to 22×24); the 10 output neurons classify which digit was presented [1].

Fig. 4. Training accuracy for a 3–layer perceptron of 164,885 hardware synapses [1], with all weight operations taking place on a 500 × 661 array of mushroom-cell PCM devices (2 PCM/synapse × 164,885 synapses, plus 730 unused PCM). Also shown is a matched computer simulation of this NN, using parameters extracted from the experiment [1].
B. Comparative analysis of speed and power
We have also assessed the potential advantages, in terms of speed and power, of on-chip machine learning (ML) of large-scale artificial neural networks (ANN) using Non-Volatile Memory (NVM)-based synapses, in comparison to conventional GPU–based hardware [4].

Under moderately-aggressive assumptions for parallel–read and –write speed, PCM-based on-chip machine learning can potentially offer lower power and faster training (per ANN example) than GPU-based training for both large and small networks (Fig. 6), even with the time and energy required for occasional RESET (forced by the large asymmetry between gentle partial-SET and abrupt RESET in PCM). Critical here is the design of area-efficient read/write circuitry, so that many copies of this circuitry can operate in parallel (each handling a small number of columns (rows), cs).
Fig. 5. When the dynamic range of the linear response is large, the classification accuracy can reach that of the original network (a test accuracy of 94% when trained with 5,000 images; of 97% when trained with all 60,000 images) [2].

Fig. 6. Predicted training time (per ANN example) and power for 5 ANNs, ranging from 0.2 GB to nearly 6 GB [4]. Under moderately-aggressive assumptions for parallel–read and –write speed, PCM-based on-chip machine learning can offer lower power and faster training for both large and small networks [4].

II. JUMP-TABLE CONCEPT

A useful concept in modeling the behavior of real NVM devices for neuromorphic applications is that of a “jump-table.” For backpropagation training, where one or more copies of the same programming pulse are applied to the NVM to adjust the weights [1], we simply need one jump-table for potentiation (SET) and one for depression (RESET).

With a pair of such jump-tables, we can capture the nonlinearity of conductance response as a function of conductance (e.g., the same pulse might create a large “jump” at low conductance, but a much smaller jump at high conductance), the asymmetry between positive (SET) and negative (RESET) conductance changes, and the inherent stochastic nature of each jump. Fig. 7(a) plots the median conductance change for potentiation (blue) together with the ±1σ stochastic variation about this median change (red). Fig. 7(b) shows the jump-table that fully captures this conductance response, plotting the cumulative probability (in color, from 0 to 100%) of any conductance change ∆G at any given initial conductance G. This table is ideal for computer simulation because a random number r (a uniform deviate between 0.0 and 1.0) can be converted to the resulting ∆G produced by a single pulse by scanning along the row associated with the conductance G (of the device before the pulse is applied) to find the point at which the table entry just exceeds r.
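The row-scan procedure just described can be sketched in a few lines. The table shape, ∆G bins, and toy CDF values below are illustrative assumptions, not the measured PCM or PCMO tables:

```python
import numpy as np

# Hedged sketch of drawing one conductance change from a jump-table.
# table[i, :] holds the CDF of dG for conductance bin i; dg_values
# holds the dG associated with each column. Values are illustrative.
def sample_jump(table, dg_values, g_bin, rng):
    """Convert a uniform deviate r into the dG produced by one pulse,
    by scanning the CDF row for this conductance bin until the table
    entry just exceeds r."""
    r = rng.random()                              # uniform in [0, 1)
    col = np.searchsorted(table[g_bin], r, side="right")
    return dg_values[min(col, len(dg_values) - 1)]

# Toy 2-row table: at low G, large jumps are likely; at high G, small.
dg_values = np.array([0.0, 1.0, 2.0, 3.0])
table = np.array([
    [0.0, 0.1, 0.5, 1.0],   # low-G row: big jumps dominate
    [0.5, 0.9, 1.0, 1.0],   # high-G row: small jumps dominate
])
rng = np.random.default_rng(0)
```

Sampling many jumps from the low-G row yields a much larger mean ∆G than the high-G row, reproducing the nonlinearity (large jumps at low conductance, small jumps at high conductance) that the table encodes.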
III. IMPACT OF NONLINEAR CONDUCTANCE RESPONSE
We have previously used a measured jump-table to simulate the SET response of PCM devices [1], and are currently exploring the use of similarly measured jump-tables for PCMO. In order to develop an intuitive understanding of the impact that various features of such jump-tables have on classification performance in the ANN application, we study various artificially-constructed jump-tables. Except for the specific jump-tables, these simulations are identical to those performed in Ref. [1], spanning 20 epochs.
Fig. 7. (a) Example median (blue) and ±1σ (red) conductance response for potentiation. (b) Associated jump-table that fully captures this (artificially constructed, in this case) conductance response, with the cumulative probability (from 0 to 100%) of any conductance change ∆G at any given initial conductance G plotted in color.

Fig. 8. (a) For a set of constructed linear conductance responses where the depression (RESET, magenta) response is steeper than the base potentiation (SET, green) response, the (b) resulting jump-table shows larger (but constant) steps for RESET. (For clarity, only the median response is shown.) (c) Although even a small SET/RESET asymmetry causes performance to fall off steeply (solid curves with filled symbols), the downstream neuron can partially compensate for this asymmetry by firing fewer RESET pulses (or more SET pulses). Inset shows the same data plotted on a linear horizontal scale.

Fig. 9. Impact of the relative extent of the linear region on neural network performance. (RESET conductance response remains linear at all times.) (a) Conductance vs. number of pulses, (b) hypothetical jump-tables studied, and (c) impact on training and test accuracy. A substantial non-linear conductance region (up to ∼50%) can be accommodated without loss in application performance.

The first question we address is the impact of asymmetry in conductance response. Here we assume both conductance responses are linear (Fig. 8(a)), but the RESET conductance response is much steeper than SET, so that the step size of the depression (RESET) jump-table is increased (Fig. 8(b)). As shown by the solid curves with filled symbols in Fig. 8(c), even a small degree of asymmetry can cause classification accuracy to fall steeply. However, each downstream neuron has knowledge of the sign of the backpropagated correction, δ, and thus knows whether it is attempting a SET or a RESET. This implies that asymmetry can be partly offset by “correcting” a steeper RESET response by firing commensurately fewer RESET pulses (or more SET pulses). As shown by the dotted curves with open symbols in Fig. 8(c), this markedly expands the asymmetry that could potentially be accommodated.
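The compensation described above amounts to a simple pulse-count rescaling by the downstream neuron; the rounding policy and the one-pulse floor below are illustrative assumptions:

```python
# Hedged sketch of correcting SET/RESET asymmetry by firing fewer
# RESET pulses. `asymmetry` = |dG_RESET / dG_SET| >= 1; the rounding
# and one-pulse floor are illustrative choices, not from the paper.
def corrected_pulses(n_nominal, asymmetry):
    """Pulse count for a depression step when each RESET pulse moves
    conductance `asymmetry` times farther than a SET pulse."""
    return max(1, round(n_nominal / asymmetry))
```

With an asymmetry of 5, a nominal 10-pulse depression is reduced to 2 pulses, so the total conductance change stays close to the intended update.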
Fig. 9 examines jump-tables that incorporate some degree of initial non-linearity in the SET conductance response (Fig. 9(a)). The relative extent of the linear region is varied from 100% (fully linear) down to near 0% (fully nonlinear). For this and all subsequent studies, we assume that RESET operations remain perfectly linear and symmetric to SET (Fig. 9(b)). We find that a substantial non-linear conductance region (up to ∼50%) can be accommodated without a significant drop-off in neural network performance (Fig. 9(c)).
Fig. 10 examines the impact of the strength of this initial non-linearity on neural network performance. In these experiments, a stronger (weaker) non-linearity implies fewer (more) steps to traverse the extent of the non-linear region (representing 25% of the total conductance range, Fig. 10(a)).
Fig. 10. Impact of the strength of an initial non-linearity on neural network performance. (a) Conductance vs. number of pulses, (b) hypothetical jump-tables studied, and (c) impact on training and test accuracy. The strength of an initial non-linearity does not impact test classification accuracy, so long as a sufficiently large linear region is available.
Fig. 11. Impact of a fully non-linear conductance response. (a) Conductance vs. number of pulses, (b) hypothetical jump-tables studied, and (c) impact on training and test accuracy. Even in the absence of a linear region it is possible to achieve high performance; however, the ratio of minimum to maximum conductance change needs to be sufficiently large (>0.5).
The strength is defined as the ratio between the size of the final (minimum) conductance jump and the initial (maximum) conductance jump (Fig. 10(b)). Again, we find that the strength of the non-linearity has little impact on the test accuracy (Fig. 10(c)), so long as the linear region is sufficiently large.

We also investigate fully non-linear conductance responses of varying strengths (Figs. 11(a) and (b)). We find that it is still possible to achieve high classification accuracies (Fig. 11(c)), so long as the ratio of the minimum to maximum conductance jumps is >0.5. However, larger non-linearities cause a marked drop-off in network performance, as a large portion of the dynamic range can be used up by just a few training pulses.
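One simple way to construct a fully non-linear conductance response with a prescribed minimum-to-maximum jump ratio, in the spirit of those studied here, is a geometrically shrinking jump size. The geometric-decay form and the parameter names are our own assumptions:

```python
# Hedged sketch: build a conductance trace whose per-pulse jump
# shrinks geometrically from dg_max down to ratio*dg_max, so that
# `ratio` = min(dG_SET)/max(dG_SET) as defined in the text.
def nonlinear_response(n_pulses, ratio, g0=0.0, dg_max=1.0):
    """Return the conductance after each of n_pulses SET pulses."""
    decay = ratio ** (1.0 / max(n_pulses - 1, 1))  # per-pulse shrink
    g, dg, trace = g0, dg_max, [g0]
    for _ in range(n_pulses):
        g += dg
        trace.append(g)
        dg *= decay
    return trace

trace = nonlinear_response(50, 0.5)
```

With ratio = 0.5 the last jump is exactly half the first; with a much smaller ratio, the first few pulses consume most of the dynamic range, which is the regime where network performance drops off markedly.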
IV. CONCLUSION
We have assessed the impact of the conductance response of Non-Volatile Memory (NVM) devices employed as the synaptic weight element for on-chip acceleration of the training of large-scale artificial neural networks (ANN). We briefly reviewed our previous work towards achieving competitive performance (classification accuracies) for such ANN with both Phase-Change Memory [1], [2] and non-filamentary ReRAM based on PrCaMnO (PCMO) [3], and towards assessing the potential advantages for ML training over GPU–based hardware in terms of speed (up to 25× faster) and power (from 120–2850× lower power) [4]. We discussed the “jump-table” concept, previously introduced to model real-world NVM such as PCM [1] or PCMO, to describe the full cumulative distribution function (CDF) of the resulting conductance change at each possible conductance value, for both potentiation (SET) and depression (RESET).
Using various artificially–constructed jump-tables, we assessed the relative importance of deviations from an ideal NVM with a linear conductance response. While even a small SET/RESET asymmetry between otherwise linear conductance responses can cause performance to fall off steeply, downstream neurons can partially compensate for this asymmetry by firing fewer RESET pulses (or more SET pulses), allowing reasonable performance even in the presence of a significant asymmetry. We also found that a substantial non-linear conductance region (up to ∼50%) can be accommodated, and that the strength of this initial non-linearity (the ratio of minimum to maximum conductance change) can be significant, so long as a sufficiently large linear region is available. Even with fully nonlinear responses, it is possible to achieve high performance so long as the ratio of minimum to maximum conductance change is sufficiently close to unity (>0.5).
While the ‘LG’ algorithm, together with other approaches, should help a nonlinear, asymmetric NVM (such as PCM) act more like an ideal linear, bidirectional NVM, the identification of NVM devices and/or pulse schemes that can offer a conductance response that is at least partly linear will help significantly in achieving high classification accuracies.
REFERENCES
[1] G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang, “Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element,” in IEDM, 2014, p. 29.5.
[2] G. W. Burr, R. M. Shelby, S. Sidler, C. di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang, “Experimental demonstration and tolerancing of a large–scale neural network (165,000 synapses), using phase–change memory as the synaptic weight element,” IEEE Trans. Electr. Dev., vol. 62, no. 11, pp. 3498–3507, 2015.
[3] J.-W. Jang, S. Park, G. W. Burr, H. Hwang, and Y.-H. Jeong, “Optimization of conductance change in Pr1−xCaxMnO3–based synaptic devices for neuromorphic systems,” IEEE Electron Device Letters, vol. 36, no. 5, pp. 457–459, 2015.
[4] G. W. Burr, P. Narayanan, R. M. Shelby, S. Sidler, I. Boybat, C. di Nolfo, and Y. Leblebici, “Large–scale neural networks implemented with nonvolatile memory as the synaptic weight element: comparative performance analysis (accuracy, speed, and power),” in IEDM Technical Digest, 2015, p. 4.4.
[5] D. Rumelhart, G. E. Hinton, and J. L. McClelland, “A general framework for parallel distributed processing,” in Parallel Distributed Processing. MIT Press, 1986.
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, p. 2278, 1998.