
Memristor-based multilayer neural networks with online gradient descent training

Daniel Soudry, Dotan Di Castro, Asaf Gal, Avinoam Kolodny, and Shahar Kvatinsky

Abstract—Learning in Multilayer Neural Networks (MNNs) relies on continuous updating of large matrices of synaptic weights by local rules. Such locality can be exploited for massive parallelism when implementing MNNs in hardware. However, these update rules require a multiply and accumulate operation for each synaptic weight, which is challenging to implement compactly using CMOS. In this paper, a method for performing these update operations simultaneously ("incremental outer products") using memristor-based arrays is proposed. The method is based on the fact that, approximately, given a voltage pulse, the conductivity of a memristor will increment proportionally to the pulse duration multiplied by the pulse magnitude, if the increment is sufficiently small. The proposed method uses a synaptic circuit composed of a small number of components per synapse: one memristor and two CMOS transistors. This circuit is expected to consume 2% to 8% of the area and static power of previous CMOS-only hardware alternatives. Such a circuit can compactly implement scalable training algorithms based on online gradient descent (e.g., Backpropagation). The utility and robustness of the proposed memristor-based circuit is demonstrated on standard supervised learning tasks.

Keywords— Memristor, Memristive systems, Stochastic Gradient Descent, Backpropagation, Synapse, Hardware, Multilayer neural networks.

I. INTRODUCTION

MULTILAYER Neural Networks (MNNs) have recently been incorporated into numerous commercial products and services such as mobile devices and cloud computing. For realistic large scale learning tasks MNNs can perform impressively well and produce state-of-the-art results when massive computational power is available (e.g., [1–3], and see [4] for related press). However, such computational intensity limits their usability, due to area and power requirements. A new dedicated hardware design approach must therefore be developed to overcome these limitations. In fact, it was recently suggested that such new types of specialized hardware are essential for real progress towards building intelligent machines [5].

D. Soudry is with the Department of Statistics, Center for Theoretical Neuroscience, Columbia University, New York, NY 10027, USA (email: [email protected])

D. Di Castro, A. Gal, A. Kolodny and S. Kvatinsky are with the Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel

MNNs utilize large matrices of values termed synaptic weights. These matrices are continuously updated during the operation of the system, and are constantly being used to interpret new data. The power of MNNs mainly stems from the learning rules used for updating the weights. These rules are usually local, in the sense that they depend only on information available at the site of the synapse. A canonical example of a local learning rule is the Backpropagation algorithm, an efficient implementation of online gradient descent, which is commonly used to train MNNs [6]. The locality of Backpropagation stems from the chain rule used to calculate the gradients. Similar locality appears in many other learning rules used to train neural networks and in various Machine Learning (ML) algorithms.

Implementing ML algorithms such as Backpropagation on conventional general-purpose digital hardware (i.e., the von Neumann architecture) is highly inefficient. A prime reason for this is the physical separation between the memory arrays used to store the values of the synaptic weights and the arithmetic module used to compute the update rules. A general-purpose architecture actually eliminates the advantage of these learning rules - their locality. This locality allows highly efficient parallel computation, as demonstrated by biological brains.

To overcome the inefficiency of general-purpose hardware, numerous dedicated hardware designs, based on CMOS technology, have been proposed in the past two decades [7, and references therein]. These designs perform online learning tasks in MNNs using massively parallel synaptic arrays, where each synapse stores a synaptic weight and updates it locally. However, so far, these designs are not commonly used for practical large scale applications, and it is not clear whether they could be scaled up, since each synapse requires too much power and area (see section VII). This issue of scalability possibly casts doubt on the entire field [8].

Recently, it has been suggested [9, 10] that scalable hardware implementations of neural networks may become possible if a novel device, the memristor [11–13], is used. A memristor is a resistor with a varying, history-dependent resistance. It is a passive analog device with activation-dependent dynamics, which makes it ideal for registering and updating synaptic weights. Furthermore, its relatively small size enables integration of memory with the computing circuit [14] and allows a compact and efficient architecture for learning algorithms. Specifically, it was recently shown [15] that memristors could be used to implement MNNs trained by Backpropagation in a chip-in-a-loop setting (where the weight update calculation is performed using a host computer). However, it remained an open question whether the learning itself, which is a major computational bottleneck, could also be done online (completely implemented using hardware) with massively parallel memristor arrays.

To date, no circuit has been suggested to utilize memristors for implementing scalable learning systems. Most previous implementations of learning rules using memristor arrays have been limited to spiking neurons and focused on Spike Time Dependent Plasticity (STDP) [16–23]. Applications of STDP are usually aimed at explaining biological neuroscience results. At this point, however, it is not clear how useful STDP is algorithmically. For example, the convergence of STDP-based learning is not guaranteed for general inputs [24]. Other learning systems [25–27] that rely on memristor arrays are limited to single layers with a few inputs per neuron. To the best of the authors' knowledge, the learning rules used in the above memristor-based designs are not yet competitive and have not been used for large scale problems - which is precisely where dedicated hardware is required. In contrast, online gradient descent is considered to be very effective in large scale problems [28]. Also, there are guarantees that this optimization procedure converges to some local minimum. Finally, it can achieve state-of-the-art results when executed with massive computational power (e.g., [3, 29]).

The main challenge for general synaptic array circuit design arises from the nature of learning rules: practically all of them contain a multiplicative term [30], which is hard to implement in compact and scalable hardware. In this paper, a novel and general scheme to design hardware for online gradient descent learning rules is presented. The proposed scheme uses a memristor as a memory element to store the weight and temporal encoding as a mechanism to perform a multiplication operation. The proposed design uses a single memristor and two CMOS transistors per synapse, and therefore requires 2% to 8% of the area and static power of previously proposed CMOS-only circuits. The functionality of a circuit utilizing the memristive synapse array is demonstrated numerically on standard supervised learning tasks. On all datasets the circuit performs as well as the software algorithm. Introducing noise levels of about 10% and parameter variability of about 30% only mildly reduced the performance of the circuit, due to the inherent robustness of online gradient descent. The proposed design may therefore allow the use of specialized hardware for MNNs, as well as other ML algorithms, rather than the currently used general-purpose architecture.

The remainder of the paper is organized as follows. In section II, basic background on memristors and online gradient descent learning is given. In section III, the proposed circuit is described, for efficient implementation of Single-layer Neural Networks (SNNs) in hardware. In section IV a modification of the proposed circuit is used to implement a general MNN. In section V the sources of noise and variation in the circuit are estimated. In section VI the circuit operation and learning capabilities are evaluated numerically, demonstrating that it can be used to implement MNNs trainable with online gradient descent. In section VII the novelty of the proposed circuit is discussed, as well as possible generalizations; and in section VIII the paper is summarized. The supplementary material [31] includes detailed circuit schematics, code, and an appendix with additional technicalities.

II. PRELIMINARIES

For convenience, basic background information on memristors and online gradient descent learning is given in this section. For simplicity, the second part is focused on a simple example, the Adaline algorithm - a linear SNN trained using the Mean Square Error (MSE).

A. The memristor

A memristor [11, 12], originally proposed by Chua in 1971 as "the missing fourth fundamental element", is a passive electrical element. Memristors are basically resistors with varying resistance, where the resistance changes according to the time integral of the current through the device or, alternatively, the integrated voltage across the device. In the "classical" representation the conductance of a memristor G depends directly on the integral over time of the voltage across the device, sometimes referred to as the "flux". Formally, a memristor obeys the following equations:

$$ i(t) = G(s(t))\, v(t), \qquad (1) $$
$$ \dot{s}(t) = v(t). \qquad (2) $$

A generalization of this model, which is called a memristive system [32], was proposed in 1976 by Chua and Kang. In memristive devices, s is a general state variable, rather than an integral of the voltage. Memristive devices are discussed in appendix A. In the following sections it is assumed that the variations in the value of s(t) are restricted to be small, so that G(s(t)) can be linearized around some point $s^*$, and the conductivity of the memristor is given, to first order, by

$$ G(s(t)) = \bar{g} + \hat{g}\, s(t), \qquad (3) $$

where $\hat{g} = \left[ dG(s)/ds \right]_{s=s^*}$ and $\bar{g} = G(s^*) - \hat{g}\, s^*$.
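For illustration, the behavior described by (1)-(3) under a voltage pulse can be summarized by the following short Python sketch. This is a minimal behavioral model written for this text, not the simulation code of section VI; the nominal $\bar{g}$ and $\hat{g}$ values are taken from Table I, while the pulse magnitude and duration are placeholders.

```python
import numpy as np

def memristor_step(s, v, dt, g_bar=1e-6, g_hat=180e-6):
    """One Euler step of the linearized memristor model (1)-(3).

    s     : state variable (time integral of the voltage, the "flux")
    v     : applied voltage during this step
    g_bar : conductance offset  (bar-g in Eq. 3), in siemens
    g_hat : conductance slope   (hat-g in Eq. 3), in S/(V*sec)
    Returns the new state and the instantaneous current i = G(s) * v.
    """
    s_new = s + v * dt                  # Eq. (2): ds/dt = v(t)
    i = (g_bar + g_hat * s_new) * v     # Eqs. (1), (3)
    return s_new, i

# A rectangular pulse of magnitude V and duration T_pulse increments the
# conductance by approximately g_hat * V * T_pulse (pulse magnitude x duration),
# as long as the resulting change in G remains small.
s, V, T_pulse, n_steps = 0.0, 0.1, 1e-3, 1000
dt = T_pulse / n_steps
for _ in range(n_steps):
    s, _ = memristor_step(s, V, dt)
print("Delta G =", 180e-6 * s, " expected:", 180e-6 * V * T_pulse)
```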

B. Online gradient descent learning

The field of Machine Learning (ML) is dedicated to the construction and study of systems that can learn from data. For example, consider the following supervised learning task. Assume a learning system that operates on K discrete presentations of inputs ("trials"), indexed by k = 1, 2, ..., K. For brevity, the iteration index is sometimes suppressed where it is clear from the context. On each trial k, the system receives empirical data, a pair of two real column vectors of sizes M and N: a pattern $\mathbf{x}^{(k)} \in \mathbb{R}^M$ and a desired label $\mathbf{d}^{(k)} \in \mathbb{R}^N$, with all pairs sharing the same desired relation $\mathbf{d}^{(k)} = f(\mathbf{x}^{(k)})$. The objective of the system is to estimate ("learn") the function f(·) using the empirical data.

As a simple example, suppose W is a tunable N × M matrix of parameters, and consider the estimator

$$ \mathbf{r}^{(k)} = \mathbf{W}^{(k)} \mathbf{x}^{(k)}, \qquad (4) $$

or

$$ r_n^{(k)} = \sum_m W_{nm}^{(k)} x_m^{(k)}, \qquad (5) $$

which is a SNN. The result of the estimator, r = Wx, should aim to predict the right desired labels d = f(x) for new unseen patterns x. To solve this problem, W is tuned to minimize some measure of error between the estimated and desired labels, over a K0-long subset of the empirical data, called the "training set" (for which k = 1, ..., K0). For example, if we define the error vector

$$ \mathbf{y}^{(k)} \triangleq \mathbf{d}^{(k)} - \mathbf{r}^{(k)}, \qquad (6) $$

then a common measure is the MSE

$$ \mathrm{MSE} \triangleq \sum_{k=1}^{K_0} \left\| \mathbf{y}^{(k)} \right\|^2 . \qquad (7) $$

The performance of the resulting estimator is then tested over a different subset, called the "test set" (k = K0 + 1, ..., K).

As explained in the introduction, a reasonable iterative algorithm for minimizing objective (7) (i.e., updating W, where initially W is arbitrarily chosen) is the following online gradient descent (also called stochastic gradient descent) iteration

$$ \mathbf{W}^{(k)} = \mathbf{W}^{(k-1)} - \frac{1}{2}\eta\, \nabla_{\mathbf{W}} \left\| \mathbf{y}^{(k)} \right\|^2 , \qquad (8) $$

where the 1/2 coefficient is written for mathematical convenience, η is the learning rate, a (usually small) positive constant, and at each iteration k a single empirical sample is chosen randomly and presented at the input of the system. The gradient can be calculated by differentiation using the chain rule, (6) and (4). Defining $\Delta\mathbf{W}^{(k)} \triangleq \mathbf{W}^{(k)} - \mathbf{W}^{(k-1)}$, and $(\cdot)^\top$ to be the transpose operation, we obtain the outer product

$$ \Delta\mathbf{W}^{(k)} = \eta\, \mathbf{y}^{(k)} \left( \mathbf{x}^{(k)} \right)^{\top}, \qquad (9) $$

or

$$ W_{nm}^{(k)} = W_{nm}^{(k-1)} + \eta\, x_m^{(k)} y_n^{(k)}. \qquad (10) $$

Specifically, this update rule is called the "Adaline" algorithm [33], used in adaptive signal processing and control [34]. The parameters of more complicated estimators can also be similarly tuned ("trained"), using online gradient descent or similar methods. Specifically, MNNs (section IV) are commonly trained using Backpropagation, which is an efficient form of online gradient descent [6]. Importantly, note that the update rule in (10) is "local", i.e., the change in the synaptic weight $W_{nm}^{(k)}$ depends only on the related components of the input ($x_m^{(k)}$) and error ($y_n^{(k)}$). This local update, which ubiquitously appears in neural network training (e.g., Backpropagation, and the Perceptron learning rule [35]) and other ML algorithms (e.g., [36–38]), enables a massively parallel hardware design, as explained in section III.

Such massively parallel designs are needed since, for large N and M, learning systems usually become computationally prohibitive in both time and memory space. For example, in the simple Adaline algorithm the main computational burden, in each iteration, comes from (5) and (10), where the number of operations (addition and multiplication) is of order O(M·N). Commonly, these steps have become the main computational bottleneck in executing MNNs (and related ML algorithms) in software. Other algorithmic steps, such as (6) here, are usually linear in either M or N, and therefore have a negligible computational complexity.
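As a concrete software reference for (4)-(10), the following Python sketch runs the Adaline iteration on synthetic Gaussian data with a hypothetical target matrix; the learning rate and problem sizes are placeholders chosen only so that the toy example is stable. It is not the paper's software implementation, but it makes explicit that the two O(M·N) steps per trial are the product (5) and the outer-product update (10).

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K, eta = 30, 1, 5000, 0.01          # placeholder sizes and learning rate
W_true = rng.standard_normal((N, M))      # hypothetical target mapping f(x) = W_true x
W = np.zeros((N, M))                      # arbitrary initial weights

for k in range(K):
    x = rng.standard_normal(M)            # one randomly drawn pattern per trial
    d = W_true @ x                        # desired label
    r = W @ x                             # Eqs. (4)/(5): O(M*N) matrix-vector product
    y = d - r                             # Eq. (6): error vector (O(N) only)
    W += eta * np.outer(y, x)             # Eqs. (9)/(10): O(M*N) outer-product update

print("residual ||W - W_true|| =", np.linalg.norm(W - W_true))
```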


Fig. 1 A simple Adaline learning task, with the proposed "Synaptic Grid" circuit executing (4) and (9), which are the main computational bottlenecks in the algorithm.

III. CIRCUIT DESIGN

Next, dedicated analog hardware for implementing online gradient descent learning is described. For simplicity, this section is focused on simple linear SNNs trained using Adaline, as described in section II. Later, in section IV, the circuit is modified to implement general MNNs, trained using Backpropagation. The derivations are done rigorously, using a single controlled approximation (13) and no heuristics (which can damage performance), thus creating a precisely defined mapping between a "mathematical" learning system and a hardware learning system. Readers who do not wish to go through the detailed derivations can get the general idea from the following overview (section A) together with Figs. 1-4.

A. Circuit overview

To implement these learning systems, a grid of "artificial synapses" is constructed, where each synapse stores a single synaptic weight $W_{nm}$. The grid is a large N × M array of synapses, where the synapses operate simultaneously, each performing a simple local operation. This synaptic grid circuit carries the main computational load in ML algorithms by implementing the two computational bottlenecks, (5) and (10), in a massively parallel way. The matrix×vector product in (5) is done using a resistive grid (of memristors), implementing multiplication through Ohm's law and addition through current summation. The vector×vector outer product in (10) is done by using the fact that, given a voltage pulse, the conductivity of a memristor will increment proportionally to the pulse duration multiplied by the pulse magnitude. Using this method, multiplication requires only two transistors per synapse. Thus, together with a negligible amount of O(M + N) additional operations, these arrays can be used to construct efficient learning systems.


Fig. 2 Synaptic grid (N × M) circuit architecture scheme. Every (n,m) node in the grid is a memristor-based synapse that receives voltage inputs from the shared $u_m$, $\bar{u}_m$ and $e_n$ lines and outputs a current $I_{nm}$ on the $o_n$ line. Each output line collects the total current arriving from all the synapses on the n-th row, and is grounded.

Similarly to the Adaline algorithm described in section II, the circuit operates on discrete presentations of inputs ("trials") - see Fig. 1. On each trial k, the circuit receives an input vector $\mathbf{x}^{(k)} \in [-A, A]^M$ and an error vector $\mathbf{y}^{(k)} \in [-A, A]^N$ (where A is a bound on both the input and error) and produces a result output vector $\mathbf{r}^{(k)} \in \mathbb{R}^N$, which depends on the input through relation (4), where the matrix $\mathbf{W}^{(k)} \in \mathbb{R}^{N \times M}$, called the synaptic weight matrix, is stored in the system. Additionally, on each step, the circuit updates $\mathbf{W}^{(k)}$ according to (9). This circuit can be used to implement ML algorithms. For example, as depicted in Fig. 1, the simple Adaline algorithm can be implemented using the circuit, with training enabled on k = 1, ..., K0. The implementation of the Backpropagation algorithm is shown in section IV.

B. The circuit architecture

B.1 The synaptic grid

The synaptic grid system described in Fig. 1 is implemented by the circuit shown in Fig. 2, where the components of all vectors are shown as individual signals. Each gray cell in Fig. 2 is a synaptic circuit ("artificial synapse") using a memristor (described in Fig. 3a). The synapses are arranged in a two-dimensional N × M grid array as shown in Fig. 2, where each synapse is indexed by (n,m), with m ∈ {1, ..., M} and n ∈ {1, ..., N}. Each (n,m) synapse receives two inputs $u_m$, $\bar{u}_m$, an enable signal $e_n$, and produces an output current $I_{nm}$. Each column of synapses in the array (the m-th column) shares two vertical input lines $u_m$ and $\bar{u}_m$, both connected to a "column input interface". The voltage signals $u_m$ and $\bar{u}_m$ ($\forall m$) are generated by the column input interfaces from the components of the input signal x, upon presentation. Each row of synapses in the array (the n-th row) shares the horizontal enable line $e_n$ and output line $o_n$, where $e_n$ is connected to a "row input interface" and $o_n$ is connected to a "row output interface". The voltage (pulse) signal on the enable line $e_n$ ($\forall n$) is generated by the row input interfaces from the error signal y, upon presentation. The row output interfaces keep the $o_n$ lines grounded (V = 0) and convert the total current from all the synapses in the row flowing to ground, $\sum_m I_{nm}$, into the output signal r.

B.2 The artificial synapse

The proposed memristive synapse is composed of a single memristor, connected to a shared terminal of two MOSFET transistors (P-type and N-type), as shown schematically in Fig. 3a (without the n,m indices). These terminals act as drain or source, interchangeably, depending on the input, similarly to the CMOS transistors in transmission gates. Recall that the memristor dynamics are given by (1)-(3), with s(t) being the state variable of the memristor and $G(s(t)) = \bar{g} + \hat{g}\, s(t)$ its conductivity. Also, the current of the N-type transistor in the linear region is, ideally,

$$ I = K\left( (V_{GS} - V_T)\, V_{DS} - \frac{1}{2} V_{DS}^2 \right), \qquad (11) $$

where $V_{GS}$ is the gate-source voltage, $V_T$ is the threshold voltage, $V_{DS}$ is the drain-source voltage, and K is the conduction parameter of the transistors. The current equation for the P-type transistor is symmetrical to (11), and for simplicity we assume that K and $|V_T|$ are equal for both transistors.

The synapse receives three voltage input signals: u and $\bar{u} = -u$ are connected, respectively, to a terminal of the N-type and of the P-type transistor, and an enable signal e is connected to the gates of both transistors. The enable signal can have a value of 0, $V_{DD}$, or $-V_{DD}$ (with $V_{DD} > |V_T|$) and has a pulse shape of varying duration, as explained below. The output of the synapse is a current I into the grounded line o. The magnitude of the input signal u(t) and the circuit parameters are set so that they fulfill the following two conditions:

1. We assume (somewhat unusually) that

$$ |u(t)| < |V_T|, \qquad (12) $$

so that if e = 0 (i.e., the gate is grounded) both transistors are non-conducting (in the cut-off region).

2. We assume that

$$ K\left( V_{DD} - |u(t)| - |V_T| \right) \gg G(s(t)), \qquad (13) $$

so that if $e = \pm V_{DD}$, then, when in the linear region, both transistors have relatively high conductivity as compared to the conductivity of the memristor.

To satisfy (12) and (13), a proper value of u(t) is chosen. Note that (13) is a reasonable assumption, as shown in [39]. If it does not hold (e.g., if the memristor conductivity is very high), an alternative design can be used instead, as described in appendix B [31]. Under these assumptions, when e = 0 then I = 0 at the output and the voltage across the memristor is zero. In this case, the state variable does not change. If $e = \pm V_{DD}$, from (13), the voltage on the memristor is approximately ±u. When $e = V_{DD}$ the N-type transistor is conducting in the linear region while the P-type transistor is cut off. When $e = -V_{DD}$ the P-type transistor is conducting in the linear region while the N-type transistor is cut off.

C. Circuit operation

The operation of the circuit in each trial (a single presentation of a specific input) is composed of two phases. First, in the computing phase ("read"), the output currents from all the synapses are summed and adjusted to produce the arithmetic operation r = Wx from (4). Second, in the updating phase ("write"), the synaptic weights are incremented according to the update rule $\Delta\mathbf{W} = \eta\, \mathbf{y}\mathbf{x}^\top$ from (9). In the proposed design, for each synapse, the synaptic weight $W_{nm}$ is stored using $s_{nm}$, the memristor state variable of the (n,m) synapse. The parallel "read" and "write" operations are achieved by applying simultaneous voltage signals on the inputs $u_m$ and enable signals $e_n$ ($\forall n,m$). The signals and their effect on the state variable are shown in Fig. 3b.

C.1 Computation phase (“read”)

During each read phase, a vector x is given and encoded in u and $\bar{u}$ component-wise by the column input interfaces, for a duration of $T_{rd}$: $\forall m:\ u_m(t) = a x_m = -\bar{u}_m(t)$, where a is a positive constant converting $x_m$, a unit-less number, to voltage. Recall that A is the maximal value of $|x_m|$, so we require that $aA < |V_T|$, as required in (12). Additionally, the row input interfaces produce a voltage signal on the $e_n$ lines, $\forall n$:

$$ e_n(t) = \begin{cases} V_{DD}, & \text{if } 0 \le t < 0.5\,T_{rd} \\ -V_{DD}, & \text{if } 0.5\,T_{rd} \le t \le T_{rd} \end{cases}. \qquad (14) $$


Fig. 3 Memristor-based synapse. (a) A schematic of a single memristive synapse (without the n,m indices). The synapse receives input voltages u and $\bar{u} = -u$, an enable signal e, and outputs a current I. (b) The "read" and "write" protocols - incoming signals in a single synapse and the increments in the synaptic weight s as determined by (24). $T = T_{wr} + T_{rd}$.

From (2), the total change in the internal state variable is therefore, $\forall n,m$:

$$ \Delta s_{nm} = \int_0^{0.5\,T_{rd}} a x_m\, dt + \int_{0.5\,T_{rd}}^{T_{rd}} (-a x_m)\, dt = 0. \qquad (15) $$

The zero net change in the value of $s_{nm}$ between times 0 and $T_{rd}$ implements a non-destructive "read", as common in many memory devices. To minimize inaccuracies due to changes in the conductance of the memristor during the "read" phase, the row output interface samples the output current at time zero. Using (3), the output current of the synapse to the $o_n$ line at that time is thus

$$ I_{nm} = a\left( \bar{g} + \hat{g}\, s_{nm} \right) x_m. \qquad (16) $$

Therefore, the total current in each output line $o_n$ equals the sum of the individual currents produced by the synapses driving that line, i.e.,

$$ o_n = \sum_m I_{nm} = a \sum_m \left( \bar{g} + \hat{g}\, s_{nm} \right) x_m. \qquad (17) $$

The row output interface measures the output current $o_n$, and outputs

$$ r_n = c\,(o_n - o_{\mathrm{ref}}), \qquad (18) $$

where c is a constant converting the current units of $o_n$ to a unit-less number $r_n$, and

$$ o_{\mathrm{ref}} = a \bar{g} \sum_m x_m. \qquad (19) $$

Defining

$$ W_{nm} = a c \hat{g}\, s_{nm}, \qquad (20) $$

we obtain

$$ \mathbf{r} = \mathbf{W} \mathbf{x}, \qquad (21) $$

as desired.
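The mapping (16)-(21) from stored memristor states to the output can be checked numerically with a short sketch. This is a behavioral check of the algebra under the linearized model, not a circuit simulation; the constants a, c, $\bar{g}$, $\hat{g}$ are taken from Table I for concreteness, and the state values are arbitrary.

```python
import numpy as np

a, c = 0.1, 1e8                               # a = 0.1 V, c^-1 = 10 nA (Table I)
g_bar, g_hat = 1e-6, 180e-6                   # bar-g = 1 uS, hat-g = 180 uS/(V*sec)
N, M = 4, 3
rng = np.random.default_rng(1)
s = rng.uniform(-0.01, 0.01, size=(N, M))     # arbitrary memristor states s_nm
x = rng.uniform(-1.0, 1.0, size=M)            # unit-less input pattern

I = a * (g_bar + g_hat * s) * x               # Eq. (16): per-synapse read currents
o = I.sum(axis=1)                             # Eq. (17): current summation per row
o_ref = a * g_bar * x.sum()                   # Eq. (19): reference current
r = c * (o - o_ref)                           # Eq. (18): row output interface

W = a * c * g_hat * s                         # Eq. (20): synaptic weight definition
print(np.allclose(r, W @ x))                  # Eq. (21): r = W x holds exactly
```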

C.2 Update phase (“write”)

During each write phase, of duration $T_{wr}$, u and $\bar{u}$ maintain their values from the "read" phase, while the signal e changes. In this phase the row input interfaces encode e component-wise, $\forall n$:

$$ e_n(t) = \begin{cases} \mathrm{sign}(y_n)\, V_{DD}, & \text{if } 0 \le t - T_{rd} \le b\,|y_n| \\ 0, & \text{if } b\,|y_n| < t - T_{rd} < T_{wr} \end{cases}. \qquad (22) $$

The interpretation of (22) is that $e_n$ is a pulse with magnitude $V_{DD}$, the same sign as $y_n$, and a duration $b\,|y_n|$ (where b is a constant converting $y_n$, a unit-less number, to time units). Recall that A is the maximal value of $|y_n|$, so we require that $T_{wr} > bA$. The total change in the internal state is therefore

$$ \Delta s_{nm} = \int_{T_{rd}}^{T_{rd} + b|y_n|} a\, \mathrm{sign}(y_n)\, x_m\, dt \qquad (23) $$
$$ = a b\, x_m y_n. \qquad (24) $$

Using (20), the desired update rule for the synaptic weights is therefore obtained,

$$ \Delta \mathbf{W} = \eta\, \mathbf{y} \mathbf{x}^{\top}, \qquad (25) $$

where $\eta = a^2 b c \hat{g}$.
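The time×voltage encoding of the write phase (22)-(25) can be summarized by a similar sketch. The pulse integral is evaluated in closed form rather than simulated, and the constants here are generic placeholders used only to verify the identity $\Delta\mathbf{W} = \eta\,\mathbf{y}\mathbf{x}^\top$ with $\eta = a^2 b c \hat{g}$.

```python
import numpy as np

a, b, c, g_hat = 0.1, 0.02, 1.0, 1.0          # placeholder scaling constants
N, M = 4, 3
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=M)            # inputs, held from the read phase
y = rng.uniform(-1.0, 1.0, size=N)            # error vector: sets pulse sign and duration

# Eq. (23): synapse (n,m) sees voltage a*sign(y_n)*x_m for a duration b*|y_n|
delta_s = (a * np.sign(y)[:, None] * x[None, :]) * (b * np.abs(y)[:, None])
print(np.allclose(delta_s, a * b * np.outer(y, x)))     # Eq. (24)

# Eq. (25): in weight units, Delta W = eta * y x^T with eta = a^2 * b * c * g_hat
eta = a**2 * b * c * g_hat
delta_W = a * c * g_hat * delta_s             # convert via Eq. (20)
print(np.allclose(delta_W, eta * np.outer(y, x)))
```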

IV. A MULTILAYER NEURAL NETWORK CIRCUIT

So far, the circuit operation has been exemplified on an SNN with the simple Adaline algorithm. In this section it is explained how, with a few adjustments, the proposed circuit can be used to implement Backpropagation on a general MNN.

Recall the context of the "supervised learning" setup detailed in section II. Consider an estimator of the form

$$ \mathbf{r} = \mathbf{W}_2\, \sigma\!\left( \mathbf{W}_1 \mathbf{x} \right), $$

with $\mathbf{W}_1$ and $\mathbf{W}_2$ being the two parameter matrices. Such a double-layer MNN can approximate any target function with arbitrary precision [40]. Denote by $\mathbf{r}_1 = \mathbf{W}_1 \mathbf{x}$ and $\mathbf{r}_2 = \mathbf{W}_2\, \sigma(\mathbf{r}_1)$ the outputs of the two layers. Suppose we again use the MSE to quantify the error between d and r. In the Backpropagation algorithm, each update of the parameter matrices is given by an online gradient descent step, from (8),

$$ \Delta \mathbf{W}_1 = \eta\, \mathbf{y}_1 \mathbf{x}_1^{\top}; \qquad \Delta \mathbf{W}_2 = \eta\, \mathbf{y}_2 \mathbf{x}_2^{\top}, $$

with $\mathbf{x}_2 \triangleq \sigma(\mathbf{r}_1)$, $\mathbf{y}_2 \triangleq \mathbf{d} - \mathbf{r}_2$, $\mathbf{x}_1 \triangleq \mathbf{x}$, and $\mathbf{y}_1 \triangleq \left( \mathbf{W}_2^{\top} \mathbf{y}_2 \right) \times \sigma'(\mathbf{r}_1)$, where here $\mathbf{a} \times \mathbf{b} \triangleq (a_1 b_1, \ldots, a_M b_M)$ is a component-wise product. Implementing such an algorithm requires a minor modification of the proposed circuit - it should have an additional "inverted" output $\boldsymbol{\delta} \triangleq \mathbf{W}^{\top} \mathbf{y}$. Once this modification is made to the circuit, it is straightforward to implement the Backpropagation algorithm for two layers or more by cascading such circuits, as shown in Fig. 4.

Fig. 4 Implementing Backpropagation for an MNN with the MSE error, using a modified version of the original circuit. The function boxes denote the operation of either σ(·) or σ′(·) on the input, and the × box denotes a component-wise product. For detailed circuit schematics, see [31].

The additional output $\boldsymbol{\delta} = \mathbf{W}^{\top} \mathbf{y}$ can be generated by the circuit in an additional "read" phase of duration $T_{rd}$, between the original "read" and "write" phases, in which the original roles of the input and output lines are inverted. In this phase the NMOS transistor is on, i.e., $\forall n:\ e_n = V_{DD}$, and the former "output" $o_n$ lines are given the following voltage signal (again, used for a non-destructive read)

$$ o_n(t) = \begin{cases} a y_n, & \text{if } T_{rd} \le t < 1.5\,T_{rd} \\ -a y_n, & \text{if } 1.5\,T_{rd} \le t \le 2\,T_{rd} \end{cases}. $$

The $I_{nm}$ current now exits through the (original "input") $u_m$ terminal, and the sum of all the currents is measured at time $T_{rd}$ by the column interface (before it goes into ground)

$$ u_m = \sum_n I_{nm} = a \sum_n \left( \bar{g} + \hat{g}\, s_{nm} \right) y_n. \qquad (26) $$

The total current on $u_m$ at time $T_{rd}$ gives the output

$$ \delta_m = c\,(u_m - u_{\mathrm{ref}}), \qquad (27) $$

where $u_{\mathrm{ref}} = a \bar{g} \sum_n y_n$. Thus, from (20),

$$ \delta_m = \sum_n W_{nm}\, y_n, $$

as required.

Lastly, note that it is straightforward to use a different error function instead of the MSE. For example, in the two-layer example, recall that $\mathbf{y}_2 = -\frac{1}{2} \frac{\partial}{\partial \mathbf{r}} \left\| \mathbf{d} - \mathbf{r} \right\|^2$. If a different error function $E(\mathbf{r}, \mathbf{d})$ is required, one can re-define

$$ \mathbf{y}_2 \triangleq -\alpha\, \frac{\partial}{\partial \mathbf{r}} E(\mathbf{r}, \mathbf{d}), \qquad (28) $$

where α is some constant that can be used to tune the learning rate. To implement this change in the MNN circuit (Fig. 4), the subtractor should simply be replaced with some other module.

V. SOURCES OF NOISE AND VARIABILITY

Usually, analog computation suffers from reduced robustness to noise as compared to digital computation [41]. ML algorithms are, however, inherently robust to noise, which is a key element in the set of problems they are designed to solve. This suggests that the effects of intrinsic noise on the performance of the analog circuit are relatively small. These effects largely depend on the specific circuit implementation (e.g., the CMOS process). In particular, memristor technology is not yet mature and memristors have not been fully characterized. Therefore, to check the robustness of the circuit, a crude estimation of the magnitude of noise and variability is used in this section. This estimation is based on known sources of noise and variability, which are less dependent on the specific implementation. In section VI, this presumed robustness of the circuit is evaluated by simulating the circuit in the presence of these noise and variation sources.


A. Noise

When one of the transistors is enabled ($e(t) = \pm V_{DD}$), the current is affected by intrinsic thermal noise sources in the transistors and memristor of each synapse. Current fluctuations of thermal origin in a device can be approximated by a white noise signal $I(t)$ with zero mean ($\langle I(t) \rangle = 0$) and autocorrelation $\langle I(t) I(t') \rangle = \sigma^2 \delta(t - t')$, where $\delta(\cdot)$ is Dirac's delta function and $\sigma^2 = 2kTg$ (where k is Boltzmann's constant, T is the temperature, and g is the conductance of the device). For 65 nm transistors (parameters taken from IBM's 10LPe/10RFe process [42]) the characteristic conductivity is $g_1 \sim 10^{-4}\,\Omega^{-1}$, and therefore for $I_1(t)$, the thermal current source of the transistors, $\sigma_1^2 \sim 10^{-24}\,\mathrm{A}^2\,\mathrm{sec}$. Assume that the memristor characteristic conductivity is $g_2 = \epsilon g_1$, so for the thermal current source of the memristor, $\sigma_2^2 = \epsilon \sigma_1^2$. Note that from (13) we have $\epsilon \ll 1$, i.e., the resistance of the transistor is much smaller than that of the memristor. The total voltage on the memristor is thus

$$ V_M(t) = \frac{g_1}{g_1 + g_2}\, u(t) + (g_1 + g_2)^{-1} \left( I_1(t) - I_2(t) \right) = \frac{1}{1 + \epsilon}\, u(t) + \xi(t), $$

where $\xi(t) = g_1^{-1} \left( I_1(t) - I_2(t) \right)$ and $\langle \xi(t) \xi(t') \rangle \sim \sigma_\xi^2\, \delta(t - t')$ with $\sigma_\xi^2 \approx 2kT g_1^{-1} \sim 10^{-16}\,\mathrm{V}^2\,\mathrm{sec}$. The equivalent circuit, including the sources of noise, is shown in Fig. 5. Assuming the minimal trial duration of the circuit is $T = 10\,\mathrm{nsec}$, the maximum root mean square error due to thermal noise would be about $E_T \sim T^{-1/2} \sigma_\xi \sim 10^{-4}\,\mathrm{V}$.

Noise in the inputs u, $\bar{u}$ and e also exists. According to [43], the relative noise in the power supply of the $u/\bar{u}$ inputs is approximately 10% in the worst case. Applying $u = ax$ therefore effectively gives an input of $ax + E_T + E_u$, where $|E_u| \le 0.1\, a\, |x|$. The absolute noise level in the duration of e should be smaller than $T_{\mathrm{clk}}^{\min} \sim 2 \cdot 10^{-10}\,\mathrm{sec}$, assuming a digital implementation of pulse-width modulation with $T_{\mathrm{clk}}^{\min}$ being the shortest clock cycle currently available. On every write cycle $e = \pm V_{DD}$ is therefore applied for a duration of $b|y| + E_e$ (instead of $b|y|$), where $|E_e| < T_{\mathrm{clk}}^{\min}$.

B. Parameter variability

Fig. 5 Noise model for the artificial synapse. (a) During operation, only one transistor is conducting (assume it is the N-type transistor). (b) Thermal noise in a small-signal model: the transistor is converted to a resistor ($g_1$) in parallel with a current source ($I_1$), the memristor is converted to a resistor ($g_2$) in parallel with a current source ($I_2$), and the effects of the sources are summed linearly.

A common estimation of the variability in memristor parameters is a coefficient of variation (CV = standard deviation/mean) of a few percent [44]. In this paper, the circuit is also evaluated with considerably larger variability (CV ∼ 30%), in addition to the noise sources described in section A. This is done by assuming that the memristor parameters are independent random variables sampled from a uniform distribution $g \sim U[0.5g, 1.5g]$. When running the algorithm in software, these variations are equivalent to corresponding changes in the synaptic weights W or the learning rate η in the algorithm. Note that variability in the transistor parameters is not considered, since it can affect the circuit operation only if (12) or (13) are invalidated. This can happen, however, only if the values of K or $V_T$ vary by orders of magnitude, which is unlikely.
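In the numerical evaluation of section VI, these noise and variability sources are modeled at the behavioral level. One possible way to reproduce that setup in software is sketched below; the exact injection points and distributions are not fully specified in the text, so the function signatures and some choices here are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def variable_memristor_params(g_hat_nominal, shape):
    """Per-synapse slope parameters with ~30% variability: g ~ U[0.5*g, 1.5*g]."""
    return g_hat_nominal * rng.uniform(0.5, 1.5, size=shape)

def noisy_input(x, a=0.1, rel_noise=0.1, thermal_rms=1e-4):
    """Input voltage a*x with ~10% relative supply noise plus thermal noise ~E_T."""
    u = a * x
    return (u
            + rel_noise * np.abs(u) * rng.uniform(-1.0, 1.0, size=u.shape)
            + thermal_rms * rng.standard_normal(u.shape))

def noisy_pulse_duration(y, b=0.021, t_clk=2e-10):
    """Write pulse duration b*|y| perturbed by at most one clock cycle (E_e)."""
    return b * np.abs(y) + t_clk * rng.uniform(-1.0, 1.0, size=y.shape)
```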

VI. CIRCUIT EVALUATION

In this section, the proposed circuit is implemented (section A) in a simulated physical model. Then (section B), using standard supervised learning datasets, the implementation of SNNs and MNNs using the proposed circuit is demonstrated, including its robustness to noise and variation.

A. Basic functionality

First, the basic operation of the proposed design is examined numerically. A small 2 × 2 synaptic grid circuit is simulated for a duration of 10T with the simple inputs

$$ (x_1, x_2) = (1, -0.5) \cdot \mathrm{sign}(t - 5T), \qquad (29) $$
$$ (y_1, y_2) = (1, -0.5) \cdot 0.1. \qquad (30) $$

The physical model of the circuit is implemented both in SPICE (appendix D [31]) and SimElectronics [45] (Fig. 6). Both are software tools which enable physical circuit simulation of memristor and MOSFET devices. In these simulations the correct basic operation of the circuit is verified. For example, notice that the conductance of the (n,m) memristor indeed increments linearly and proportionally to $x_m y_n$, following (24).


TABLE I Circuit parameters.

Type           Parameter      Value
Power source   $V_{DD}$       10 V
NMOS           K              5 A/V^2
               $V_T$          1.7 V
PMOS           K              5 A/V^2
               $V_T$          -1.4 V
Memristor      $\bar{g}$      1 µS
               $\hat{g}$      180 µS/(V·sec)
Timing         T              0.1 sec
               $T_{wr}$       0.21 T
Scaling        a              0.1 V
               b              $T_{wr}$
               $c^{-1}$       10 nA

TABLE II Learning parameters for each dataset. In addition to the inputs from the previous layer, each neuron has a constant one input (a bias).

Parameter           Fig. 7a    Fig. 7b
Network             30 × 1     4 × 4 × 3
#Training samples   284        75
#Test samples       285        75
η                   0.1        0.1

Implementation details. The SPICE model is described in appendix D. Next, the SimElectronics synaptic grid circuit model (which was used to construct the model hardware MNNs in section B) is described. The circuit parameters appear in Table I, with the parameters of the (ideal) transistors kept at their defaults. The memristor model is implemented using (1)-(3) with parameters taken from experimental data [46, Fig. 2]. As depicted schematically in Fig. 2, the circuit implementation consists of a synapse grid and the interface blocks. The interface units were implemented using a few standard CMOS and logic components, as can be seen in the detailed schematics [31] (HTML, open using Firefox). The circuit operated synchronously, using global control signals supplied externally. The input and output of the circuit were kept constant in each trial using "sample and hold" units.

B. Learning performance

The synaptic grid circuit model is used to implement an SNN and an MNN, trainable by the online gradient descent algorithm. In order to demonstrate that the algorithm is indeed implemented correctly by the proposed circuit, the circuit performance has been compared to an algorithmic implementation of the MNNs in (Matlab) software. Two standard tasks are used:

1. The Wisconsin Breast Cancer diagnosis task [47] (linearly separable).
2. The Iris classification task [47] (not linearly separable).

The first task was evaluated using an SNN circuit, similarly to Fig. 1. The second task was evaluated on a 2-layer MNN circuit, similarly to Fig. 4. Fig. 7 shows the training phase performance of:

1. The algorithm.
2. The proposed circuit (implementing the algorithm).
3. The circuit, with about 10% noise and 30% variability (section V).

Also, Table III shows the generalization errors after training was over. On each of the two tasks, the performance of the circuit and the algorithm improves similarly, finally reaching a similar generalization error. These results indicate that the proposed circuit design can precisely implement MNNs trainable with online gradient descent, as expected from the derivations in section III. Moreover, the circuit exhibits considerable robustness, as its performance is only mildly affected by significant levels of noise and variation.

Implementation details. The SNN and the MNN were implemented using the (SimElectronics) synaptic grid circuit model described in section A. The detailed schematics of the 2-layer MNN model are again given in [31] (HTML, open using Firefox), together with the code. All simulations were done using a desktop computer with a Core i7 930 processor, a Windows 8.1 operating system, and a Matlab 2013b environment. Each circuit simulation takes about one day to complete (note that the software implementation of the circuit does not exploit the parallel operation of the circuit hardware). For each sample k from the dataset, the attributes are converted to an M-long vector $\mathbf{x}^{(k)}$, and each label is converted to an N-long binary (±1) vector $\mathbf{d}^{(k)}$. The training set is repeatedly presented, with samples in random order. The inputs are centralized and normalized as recommended by [6], with additional rescaling done for dataset 1, to ensure that the inputs fall within the circuit specification (12). The performance was averaged over 10 repetitions and the training error was also averaged over 20 samples. The cross-entropy error was used instead of MSE, since it reduces the classification error [48]. The initial weights were sampled independently from a symmetric uniform distribution. The neuronal activation functions were set as σ(x) = 1.7159 tanh(2x/3), as recommended by [6]. Other parameters are given in Tables I and II.
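For completeness, the preprocessing steps described above can be summarized in a short sketch. The exact rescaling constant and weight-initialization range used by the authors are not specified, so the values below are placeholders, and the function names are illustrative only.

```python
import numpy as np

def preprocess(X, labels, n_classes):
    """Center/normalize attributes and encode labels as +/-1 vectors d."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # centralize and normalize [6]
    X = 0.9 * X / np.max(np.abs(X))                      # placeholder rescaling so inputs fit (12)
    labels = np.asarray(labels)
    D = -np.ones((len(labels), n_classes))
    D[np.arange(len(labels)), labels] = 1.0              # +/-1 label vectors
    return X, D

def init_weights(N, M, scale, rng):
    """Initial weights sampled from a symmetric uniform distribution."""
    return rng.uniform(-scale, scale, size=(N, M))
```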


Fig. 6 Synaptic 2 × 2 grid circuit simulation, during ten operation cycles. Top: circuit inputs ($x_1$, $x_2$) and result outputs ($r_1$, $r_2$). Middle and bottom: voltage $V_{nm}$ (solid black) and conductance change $\hat{g} s_{nm}$ (dashed red) for each memristor in the grid. Notice that the conductance of the (n,m) memristor indeed changes proportionally to $x_m y_n$, following (24). The simulation was done using SimElectronics, with inputs as in (29)-(30), and circuit parameters as in Table I. Similar results were obtained using SPICE with a linear ion drift memristor model and a CMOS 0.18 µm process, as shown in appendix D [31].

Fig. 7 Circuit evaluation - training error for two datasets: (a) Wisconsin breast cancer diagnosis, (b) Iris classification.

VII. DISCUSSION

As explained in sections II and IV, two major computational bottlenecks of Multilayer Neural Networks (MNNs), and of many Machine Learning (ML) algorithms, are the "matrix×vector" product operation (4) and the "vector×vector" outer product operation (9). Both are of order O(M·N), where M and N are the sizes of the input and output vectors.

TABLE III Circuit evaluation - generalization error for: (a) Algorithm, (b) Circuit, (c) Circuit with noise and variability. The standard deviation was calculated as $\sqrt{P_e(1-P_e)/N_{\mathrm{test}}}$, with $N_{\mathrm{test}}$ being the number of samples in the test set and $P_e$ the generalization error.

       Fig. 7a        Fig. 7b
(a)    1.3% ± 0.7%    2.9% ± 2%
(b)    1.5% ± 0.7%    2.8% ± 1.9%
(c)    1.5% ± 0.7%    4.7% ± 2.4%

In this paper, the proposed circuit is designed specifically to deal with these bottlenecks using memristor arrays. This design has a relatively small number of components in each array element - one memristor and two transistors. The physical grid-like structure of the arrays implements the "matrix×vector" product operation (4) using analog summation of currents, while the memristor dynamics enable us to perform the "vector×vector" outer product operation (10) using a "time×voltage" encoding paradigm. The idea to use a resistive grid to perform the "matrix×vector" product operation is not new (e.g., [49]). The main novelty of this paper is the use of memristors together with "time×voltage" encoding, which allows us to perform a mathematically accurate "vector×vector" outer product operation in the learning rule using a small number of components.


TABLE IV Hardware designs of artificial synapses implementing scalable online learning algorithms.

Design            #Transistors        Comments
Proposed design   2 (+1 memristor)
[50]              2                   Also requires UV light; weights decay ~ minutes
[51]              6                   Weights only increase (unusable)
[52, 53]          39                  Must keep training
[54]              52                  Must keep training
[55]              92                  Weights decay ~ hours
[56]              83                  Also requires a "weight unit"
[57]              150

A. Previous CMOS-based designs

As mentioned in the introduction, CMOS hardware designs that specifically implement online learning algorithms remain an unfulfilled promise at this point.

The main incentive for existing hardware solutions is the inherent inefficiency in implementing these algorithms in software running on general purpose hardware (e.g., CPUs, DSPs and GPUs). However, squeezing the required circuitry for both the computation and the update phases (two configurable multipliers and a memory element) into an array cell has proven to be a hard task using currently available CMOS technology. Off-chip or chip-in-the-loop design architectures (e.g., [58], Table 1) have been suggested in many cases as a way around this design barrier. These designs, however, generally deal with the computational bottleneck of the "matrix×vector" product operation in the computation phase, rather than the computational bottleneck of the "vector×vector" outer product operation in the update phase. Additionally, these solutions are only useful in cases where the training is not continuous and is done in a pre-deployment phase or in special reconfiguration phases during the operation. Other designs implement non-standard (e.g., perturbation algorithms [59]) or specifically tailored learning algorithms (e.g., modified Backpropagation for spiking neurons [60]). However, it remains to be seen whether such algorithms are indeed scalable.

Hardware designs of artificial synaptic arrays that are capable of implementing common (scalable) online gradient-descent based learning are listed in Table IV. For large arrays (i.e., large M and N), the effective transistor count per synapse (where resistors and capacitors were also counted as transistors) is approximately proportional to the required area and average static power usage of the circuit.

The smallest synaptic circuit [50] includes two transistors, similarly to our design, but requires the (rather unusual) use of UV illumination during its operation, and has the disadvantage of having volatile weights, which decay within minutes. The next device [51] includes six transistors per synapse, but its update rule can only increase the synaptic weights, which makes the device unusable for practical purposes. The next design [53, 61] suggested a grid of CMOS Gilbert multipliers, resulting in 39 transistors per synapse. In a similar design, [54], 52 transistors are used. Both of these devices use capacitive elements for analog memory and suffer from the limitation of having volatile weights, which vanish after training has stopped. They therefore require constant re-training (practically acting as a "refresh"). Such retraining is also required on each start-up; alternatively, the weights must be read out into an auxiliary memory - a solution which requires a mechanism for reading out the synaptic weights. The larger design in [55] (92 transistors) also has weight decay, but with a slow, hours-long timescale. The device in [56] (83 transistors + an unspecified "weight unit" which stores the weights) does not report weight decay, apparently since digital storage is used. This is also true for [57] (150 transistors).

The proposed memristor-based design should resolve these obstacles and provide a compact, non-volatile circuit. Area and power consumption are expected to be reduced by a factor of 13 to 50 in comparison to standard CMOS technology, if a memristor is counted as an additional transistor (although it is actually more compact). Perhaps the most convincing evidence for the limitations of these designs is the fact that, although most of them are two decades old, they have not been incorporated into commercial products. It is only fair to mention at this point that, while our design is purely theoretical and based on speculative technology, the above reviewed designs are based on mature technology, and have overcome obstacles all the way to manufacturing.

B. Circuit modification and generalizations

The specific parameters that were used for the circuit evaluation (Table I) are not strictly necessary for proper execution of the proposed design, and are only used to demonstrate its applicability. For example, it is straightforward to show that K, $V_T$ and $\bar{g}$ have little effect (as long as (12)-(13) hold), and different values of $\hat{g}$ can be adjusted for by rescaling c. This is important since the feasible range of parameters for memristive devices is still not well characterized, and it seems to be quite broad. For example, the values of the memristive timescales range from picoseconds [62] to milliseconds [46]. Here the parameters were taken from a millisecond-timescale memristor [46]. The actual speed of the proposed circuit would critically depend on the timescales of commercially available memristor devices.

Additional important modifications of the proposed circuit are straightforward. In appendix A it is explained how to modify the circuit to work with more realistic "memristive devices" [13, 32], instead of the classical memristor model [11], given a few conditions. In appendix C, it is shown that it is possible to reduce the transistor count from two to one, at the price of doubling the duration of the "write" phase. Other useful modifications of the circuit are also possible. For example, the input x may be allowed to receive different values during the "read" and "write" operations. Also, it is straightforward to replace the simple outer product update rule in (10) by more general update rules of the form

$$ W_{nm}^{(k)} = W_{nm}^{(k-1)} + \eta \sum_{i,j} f_i\!\left( y_n^{(k)} \right) g_j\!\left( x_m^{(k)} \right), $$

where $f_i$, $g_j$ are some functions. Lastly, it is possible to adaptively modify the learning rate during training (e.g., by modifying α in Eq. (28)).
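In software terms, such a generalized rule is simply a sum of outer products, one per $(f_i, g_j)$ pair, as sketched below; the particular nonlinearities used here are arbitrary examples chosen only for illustration.

```python
import numpy as np

def generalized_update(W, x, y, eta, fs, gs):
    """W_nm += eta * sum_{i,j} f_i(y_n) * g_j(x_m): a sum of outer products."""
    for f in fs:
        for g in gs:
            W += eta * np.outer(f(y), g(x))
    return W

# Example usage with two illustrative terms per factor.
fs = [np.tanh, lambda y: y]
gs = [lambda x: x, np.sign]
W = generalized_update(np.zeros((2, 3)), np.array([0.5, -0.2, 0.1]),
                       np.array([1.0, -0.5]), eta=0.1, fs=fs, gs=gs)
```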

VIII. CONCLUSIONS

A novel method to implement scalable online gradient descent learning in multilayer neural networks through local update rules is proposed, based on the emerging memristor technology. The proposed method is based on an "artificial synapse" using one memristor, to store the "synaptic weight", and two CMOS transistors to control the circuit. The proposed synapse structure exhibits accuracy similar to that of its equivalent software implementation, while showing high robustness and immunity to noise and parameter variability.

Such circuits may be used to implement large-scale online learning in multilayer neural networks, as well as in other learning systems. The circuit is estimated to be significantly smaller than existing CMOS-only designs, opening the opportunity for massive parallelism with millions of adaptive synapses on a single integrated circuit, operating with low static power and good robustness. Such brain-like properties may give a significant boost to the field of neural networks and learning systems.

REFERENCES

[1] Q. V. Le, R. Monga, M. Devin, and G. Corrado, “Buildinghigh-level features using large scale unsupervised learning,”in ICML ’12, 2012, pp. 81–88.

[2] D. Ciresan, “Multi-column deep neural networks for imageclassification,” in CVPR ’12, 2012, pp. 3642–3649.

[3] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V.Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang,and A. Y. Ng, “Large scale distributed deep networks,” NIPS,2012.

[4] R. D. Hof, “Deep Learning,” MIT Technology Review, 2013.[5] D. Hernandez, “Now You Can Build Google’s 1M Artificial

Brain on the Cheap,” Wired, 2013.[6] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, “Efficient

backprop,” in Neural networks: Tricks of the Trade, 2012.[7] J. Misra and I. Saha, “Artificial neural networks in hardware:

A survey of two decades of progress,” Neurocomputing,vol. 74, no. 1-3, pp. 239–255, Dec. 2010.

[8] A. R. Omondi, “Neurocomputers: a dead end?” International Journal of Neural Systems, vol. 10, no. 6, pp. 475–481, 2000.

[9] M. Versace and B. Chandler, “The brain of a new machine,” IEEE Spectrum, vol. 47, no. 12, pp. 30–37, 2010.

[10] M. M. Waldrop, “Neuroelectronics: Smart connections,” Nature, vol. 503, no. 7474, pp. 22–24, Nov. 2013.

[11] L. Chua, “Memristor-the missing circuit element,” IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.

[12] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” Nature, vol. 453, no. 7191, pp. 80–83, 2008.

[13] S. Kvatinsky, E. G. Friedman, A. Kolodny, and U. C. Weiser, “TEAM: ThrEshold Adaptive Memristor Model,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 1, pp. 211–221, 2013.

[14] S. Kvatinsky, Y. H. Nacson, Y. Etsion, E. G. Friedman, A. Kolodny, and U. C. Weiser, “Memristor-based Multithreading,” Computer Architecture Letters, vol. PP, no. 99, p. 1, 2013.

[15] S. P. Adhikari, C. Yang, H. Kim, and L. Chua, “Memristor Bridge Synapse-Based Neural Network and Its Learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 9, pp. 1426–1435, Sep. 2012.

[16] D. Querlioz and O. Bichler, “Simulation of a memristor-based spiking neural network immune to device variations,” The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1775–1781, 2011.

[17] C. Zamarreño Ramos, L. A. Camuñas Mesa, J. A. Pérez-Carrasco, T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco, “On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex,” Frontiers in Neuroscience, vol. 5, p. 26, Jan. 2011.

[18] O. Kavehei, S. Al-Sarawi, K.-R. Cho, N. Iannella, S.-J. Kim, K. Eshraghian, and D. Abbott, “Memristor-based synaptic networks and logical operations using in-situ computing,” 2011 Seventh International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 137–142, Dec. 2011.

[19] A. Nere, U. Olcese, D. Balduzzi, and G. Tononi, “A neuromorphic architecture for object recognition and motion anticipation using burst-STDP,” PLoS ONE, vol. 7, no. 5, p. e36958, Jan. 2012.

[20] Y. Kim, Y. Zhang, and P. Li, “A digital neuromorphic VLSI architecture with memristor crossbar synaptic array for machine learning,” SOC Conference (SOCC), 2012 IEEE . . . , pp. 328–333, 2012.

[21] W. Chan and J. Lohn, “Spike timing dependent plasticity with memristive synapse in neuromorphic systems,” The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–6, Jun. 2012.

[22] T. Serrano-Gotarredona, T. Masquelier, T. Prodromakis, G. Indiveri, and B. Linares-Barranco, “STDP and STDP variations with memristors for spiking neuromorphic learning systems,” Frontiers in Neuroscience, vol. 7, p. 2, Jan. 2013.

[23] D. Querlioz, O. Bichler, P. Dollfus, and C. Gamrat, “Immunity to Device Variations in a Spiking Neural Network With Memristive Nanodevices,” IEEE Transactions on Nanotechnology, vol. 12, no. 3, pp. 288–295, May 2013.

[24] R. Legenstein, C. Naeger, and W. Maass, “What can a neuron learn with spike-timing-dependent plasticity?” Neural Computation, vol. 17, no. 11, pp. 2337–2382, 2005.

[25] D. Chabi and W. Zhao, “Robust neural logic block (NLB) based on memristor crossbar array,” 2011 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), pp. 137–143, Jun. 2011.

[26] H. Manem, J. Rajendran, and G. S. Rose, “Stochastic Gradient Descent Inspired Training Technique for a CMOS/Nano Memristive Trainable Threshold Gate Array,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 5, pp. 1051–1060, 2012.

[27] M. Soltiz, D. Kudithipudi, C. Merkel, G. Rose, and R. Pino, “Memristor-based neural logic blocks for non-linearly separable functions,” IEEE Transactions on Computers, vol. 62, no. 8, pp. 1597–1606, 2013.

[28] L. Bottou and O. Bousquet, “The Tradeoffs of Large-Scale Learning,” in Optimization for Machine Learning, 2011, p. 351.

[29] D. Ciresan and U. Meier, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.

[30] Z. Vasilkoski, H. Ames, B. Chandler, A. Gorchetchnikov, J. Leveille, G. Livitz, E. Mingolla, and M. Versace, “Review of stability properties of neural plasticity rules for implementation on memristive neuromorphic hardware,” The 2011 International Joint Conference on Neural Networks, pp. 2563–2569, Jul. 2011.

[31] “See Supplemental Material.”

[32] L. Chua, “Memristive devices and systems,” Proceedings of the IEEE, vol. 64, no. 2, 1976.

[33] B. Widrow and M. Hoff, “Adaptive switching circuits,” Stanford Electronics Labs, Stanford, CA, Tech. Rep., 1960.

[34] B. Widrow and S. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1985.

[35] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, p. 386, 1958.

[36] E. Oja, “Simplified neuron model as a principal component analyzer,” Journal of Mathematical Biology, vol. 15, no. 3, pp. 267–273, 1982.

[37] C. M. Bishop, Pattern Recognition and Machine Learning. Singapore: Springer, 2006.

[38] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: primal estimated sub-gradient solver for SVM,” Mathematical Programming, vol. 127, no. 1, pp. 3–30, Oct. 2010.

[39] S. Kvatinsky, E. G. Friedman, A. Kolodny, and U. C. Weiser, “Memristor-based Material Implication (IMPLY) Logic: Design Principles and Methodologies,” submitted to IEEE Transactions on Very Large Scale Integration (VLSI), 2013.

[40] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems (MCSS), vol. 2, pp. 303–314, 1989.

[41] R. Sarpeshkar, “Analog versus digital: extrapolating from electronics to neurobiology,” Neural Computation, vol. 10, no. 7, pp. 1601–1638, 1998.

[42] “http://www.mosis.com.”

[43] G. Huang, D. C. Sekar, A. Naeemi, K. Shakeri, and J. D. Meindl, “Compact Physical Models for Power Supply Noise and Chip/Package Co-Design of Gigascale Integration,” in 2007 Proceedings 57th Electronic Components and Technology Conference, IEEE, 2007, pp. 1659–1666.

[44] M. Hu, H. Li, Y. Chen, X. Wang, and R. E. Pino, “Geometry variations analysis of TiO2 thin-film and spintronic memristors,” in 16th Asia and South Pacific Design Automation Conference (ASP-DAC 2011), IEEE, Jan. 2011, pp. 25–30.

[45] “http://www.mathworks.com/products/simelectronics/.”

[46] T. Chang, S.-H. Jo, K.-H. Kim, P. Sheridan, S. Gaba, and W. Lu, “Synaptic behaviors and modeling of a metal oxide memristive device,” Applied Physics A, vol. 102, no. 4, pp. 857–863, Feb. 2011.

[47] K. Bache and M. Lichman, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml, 2013.

[48] P. Simard, D. Steinkraus, and J. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings, vol. 1, pp. 958–963, 2003.

[49] D. B. Strukov and K. K. Likharev, “Reconfigurable nano-crossbar architectures,” in Nanoelectronics and Information Technology, 2012, pp. 543–562.

[50] G. Cauwenberghs, “Analysis and verification of an analog VLSI incremental outer-product learning system,” IEEE Transactions on Neural Networks, 1992.

[51] H. Card, C. R. Schneider, and W. R. Moore, “Hebbian plasticity in MOS synapses,” IEEE Transactions on Audio and Electroacoustics, vol. 138, no. 1, p. 13, 1991.

[52] C. Schneider and H. Card, “Analogue CMOS Hebbian Synapses,” Electronics Letters, pp. 1990–1991, 1991.

[53] H. Card, C. R. Schneider, and R. S. Schneider, “Learning capacitive weights in analog CMOS neural networks,” Journal of VLSI Signal Processing, vol. 8, no. 3, pp. 209–225, Oct. 1994.

[54] M. Valle, D. D. Caviglia, and G. M. Bisio, “An experimental analog VLSI neural network with on-chip back-propagation learning,” Analog Integrated Circuits and Signal Processing, vol. 9, no. 3, pp. 231–245, 1996.

[55] T. Morie and Y. Amemiya, “An all-analog expandable neural network LSI with on-chip backpropagation learning,” IEEE Journal of Solid-State Circuits, vol. 29, no. 9, pp. 1086–1093, 1994.

[56] C. Lu, B. Shi, and L. Chen, “An on-chip BP learning neural network with ideal neuron characteristics and learning rate adaptation,” Analog Integrated Circuits and Signal Processing, pp. 55–62, 2002.

[57] T. Shima and T. Kimura, “Neuro chips with on-chip back-propagation and/or Hebbian learning,” IEEE Journal of Solid-State Circuits, vol. 27, no. 12, 1992.

[58] C. S. Lindsey and T. Lindblad, “Survey of neural network hardware,” in Proc. SPIE, Applications and Science of Artificial Neural Networks, vol. 2492, 1995, pp. 1194–1205.

[59] G. Cauwenberghs, “A learning analog neural network chip with continuous-time recurrent dynamics,” NIPS, 1994.

[60] H. Eguchi, T. Furuta, and H. Horiguchi, “Neural network LSI chip with on-chip learning,” Neural Networks, pp. 453–456, 1991.

[61] C. Schneider and H. Card, “CMOS implementation of analog Hebbian synaptic learning circuits,” IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. i, pp. 437–442, 1991.

[62] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams, “Sub-nanosecond switching of a tantalum oxide memristor,” Nanotechnology, vol. 22, no. 48, p. 485203, Dec. 2011.