
27. Artificial Neural Network Models

Peter Tino, Lubica Benuskova, Alessandro Sperduti

We outline the main models and developments in the broad field of artificial neural networks (ANN). A brief introduction to biological neurons motivates the initial formal neuron model, the perceptron. We then study how such formal neurons can be generalized and connected in network structures. Starting with the biologically motivated layered structure of ANN (feed-forward ANN), the networks are then generalized to include feedback loops (recurrent ANN) and even more abstract generalized forms of feedback connections (recursive neuronal networks) enabling processing of structured data, such as sequences, trees, and graphs. We also introduce ANN models capable of forming topographic lower-dimensional maps of data (self-organizing maps). For each ANN type we outline the basic principles of training the corresponding ANN models on an appropriate data collection.

27.1 Biological Neurons
27.2 Perceptron
27.3 Multilayered Feed-Forward ANN Models
27.4 Recurrent ANN Models
27.5 Radial Basis Function ANN Models
27.6 Self-Organizing Maps
27.7 Recursive Neural Networks
27.8 Conclusion
References

The human brain is arguably one of the most exciting products of evolution on Earth. It is also the most powerful information processing tool so far. Learning based on examples and parallel signal processing lead to emergent macro-scale behavior of neural networks in the brain, which cannot be easily linked to the behavior of individual micro-scale components (neurons). In this chapter, we will introduce artificial neural network (ANN) models motivated by the brain that can learn in the presence of a teacher. During the course of learning, the teacher specifies what the right responses to input examples should be. In addition, we will also mention ANNs that can learn without a teacher, based on principles of self-organization.

To set the context, we will begin by introducing basic neurobiology. We will then describe the perceptron model, which, even though rather old and simple, is an important building block of more complex feed-forward ANN models. Such models can be used to approximate complex non-linear functions or to learn a variety of association tasks. The feed-forward models are capable of processing patterns without temporal association. In the presence of temporal dependencies, e.g., when learning to predict future elements of a time series (with a certain prediction horizon), the feed-forward ANN needs to be extended with a memory mechanism to account for temporal structure in the data. This will naturally lead us to recurrent neural network models (RNN), which besides feed-forward connections also contain feedback loops to preserve, in the form of the information processing state, information about the past. RNN can be further extended to recursive ANNs (RecNN), which can process structured data such as trees and acyclic graphs.

27.1 Biological Neurons

It is estimated that there are about $10^{12}$ neural cells (neurons) in the human brain. Two-thirds of the neurons form a 4-6 mm thick cortex that is assumed to be the center of cognitive processes.

Fig. 27.1 Schematic illustration of the basic information processing structure of the biological neuron (dendrites, soma, axon, axon terminals)

Within each neuron, complex biological processes take place, ensuring that it can process signals from other neurons, as well as send its own signals to them. The signals are of an electro-chemical nature. In a simplified way, signals between the neurons can be represented by real numbers quantifying the intensity of the incoming or outgoing signals. The point of signal transmission from one neuron to the other is called the synapse. Within a synapse the incoming signal can be reinforced or damped. This is represented by the weight of the synapse. A single neuron can have up to $10^3$-$10^5$ such points of entry (synapses). The input to the neuron is organized along the dendrites and the soma (Fig. 27.1). Thousands of dendrites form a rich tree-like structure on which most synapses reside.

Signals from other neurons can be either excitatory (positive) or inhibitory (negative), relayed via excitatory or inhibitory synapses. When the sum of the positive and negative contributions (signals) from other neurons, weighted by the synaptic weights, becomes greater than a certain excitation threshold, the neuron will generate an electric spike that will be transmitted over the output channel called the axon. At the end of the axon, there are thousands of output branches whose terminals form synapses on other neurons in the network. Typically, as a result of input excitation, the neuron can generate a series of spikes of some average frequency, about 1-100 Hz. The frequency is proportional to the overall stimulation of the neuron.

The first principle of information coding and representation in the brain is redundancy. It means that each piece of information is processed by a redundant set of neurons, so that in the case of partial brain damage the information is not lost completely. As a result, and crucially, in contrast to conventional computer architectures, gradually increasing damage to the computing substrate (neurons plus their interconnection structure) will only result in gradually decreasing processing capabilities (graceful degradation). Furthermore, it is important what set of neurons participates in coding a particular piece of information (distributed representation). Each neuron can participate in coding of many pieces of information, in conjunction with other neurons. The information is thus associated with patterns of distributed activity on sets of neurons.

27.2 Perceptron

The perceptron is a simple neuron model that takes input signals (patterns) coded as (real) input vectors $\bar{x} = (x_1, x_2, \ldots, x_{n+1})$ through the associated (real) vector of synaptic weights $\bar{w} = (w_1, w_2, \ldots, w_{n+1})$. The output $o$ is determined by

$$o = f(\mathrm{net}) = f(\bar{w} \cdot \bar{x}) = f\left(\sum_{j=1}^{n+1} w_j x_j\right) = f\left(\sum_{j=1}^{n} w_j x_j - \theta\right), \qquad (27.1)$$

where net denotes the weighted sum of inputs (i.e., the dot product of the weight and input vectors) and $f$ is the activation function. By convention, if there are $n$ inputs to the perceptron, the $(n+1)$-th input is fixed to $-1$ and the associated weight to $w_{n+1} = \theta$, which is the value of the excitation threshold.

Fig. 27.2 Schematic illustration of the perceptron model (inputs $x_1, \ldots, x_n$, fixed input $x_{n+1} = -1$ with weight $w_{n+1} = \theta$, output $o = f(\bar{w} \cdot \bar{x})$)


In 1958, Rosenblatt [27.1] introduced a discrete perceptron model with a bipolar activation function (Fig. 27.2)

$$f(\mathrm{net}) = \mathrm{sign}(\mathrm{net}) = \begin{cases} +1 & \text{if } \mathrm{net} \ge 0, \text{ i.e., } \sum_{j=1}^{n} w_j x_j \ge \theta, \\ -1 & \text{if } \mathrm{net} < 0, \text{ i.e., } \sum_{j=1}^{n} w_j x_j < \theta. \end{cases} \qquad (27.2)$$

The boundary equation

$$\sum_{j=1}^{n} w_j x_j - \theta = 0 \qquad (27.3)$$

parameterizes a hyperplane in $n$-dimensional space with normal vector $\bar{w}$.

The perceptron can classify input patterns into two classes, if the classes can indeed be separated by an $(n-1)$-dimensional hyperplane (27.3). In other words, the perceptron can deal with linearly separable problems only, such as the logical functions AND or OR. XOR, on the other hand, is not linearly separable (Fig. 27.3). Rosenblatt showed that there is a simple training rule that will find the separating hyperplane, provided that the patterns are linearly separable.

As we shall see, a general rule for training many ANN models (not only the perceptron) can be formulated as follows: the weight vector $\bar{w}$ is changed proportionally to the product of the input vector and a learning signal $s$. The learning signal $s$ is a function of $\bar{w}$, $\bar{x}$, and possibly a teacher feedback $d$,

$$s = s(\bar{w}, \bar{x}, d) \quad \text{or} \quad s = s(\bar{w}, \bar{x}). \qquad (27.4)$$

In the former case, we talk about supervised learning (with direct guidance from a teacher); the latter case is known as unsupervised learning. The update of the $j$-th weight can be written as

$$w_j(t+1) = w_j(t) + \Delta w_j(t) = w_j(t) + \alpha s(t) x_j(t). \qquad (27.5)$$

Fig. 27.3 Linearly separable and non-separable problems

The positive constant $0 < \alpha \le 1$ is called the learning rate.

In the case of the perceptron, the learning signal is the disproportion (difference) between the desired (target) and the actual (produced by the model) response, $s = d - o = \delta$. The update rule is known as the $\delta$ (delta) rule

$$\Delta w_j = \alpha (d - o) x_j. \qquad (27.6)$$

The same rule can, of course, be used to update the activation threshold $w_{n+1} = \theta$.

Consider a training set

$$A_{\mathrm{train}} = \{(\bar{x}^1, d^1), (\bar{x}^2, d^2), \ldots, (\bar{x}^p, d^p), \ldots, (\bar{x}^P, d^P)\}$$

consisting of $P$ (input, target) couples. The perceptron training algorithm can be formally written as:

- Step 1: Set $\alpha \in (0, 1]$. Initialize the weights randomly from $(-1, 1)$. Set the counters to $k = 1$, $p = 1$ ($k$ indexes sweeps through $A_{\mathrm{train}}$, $p$ indexes individual training patterns).
- Step 2: Consider input $\bar{x}^p$, calculate the output $o = \mathrm{sign}\left(\sum_{j=1}^{n+1} w_j x_j^p\right)$.
- Step 3: Weight update: $w_j \leftarrow w_j + \alpha (d^p - o^p) x_j^p$, for $j = 1, \ldots, n+1$.
- Step 4: If $p < P$, set $p \leftarrow p + 1$, go to step 2. Otherwise go to step 5.
- Step 5: Fix the weights and calculate the cumulative error $E$ on $A_{\mathrm{train}}$.
- Step 6: If $E = 0$, finish training. Otherwise, set $p = 1$, $k \leftarrow k + 1$ and go to step 2. A new training epoch starts.
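As an illustration, the six steps above can be sketched in Python with NumPy. The AND task, the learning rate, and the 100-epoch cap are choices made for this sketch, not part of the algorithm in the text:

```python
import numpy as np

def train_perceptron(X, d, alpha=0.2, max_epochs=100, seed=0):
    """Perceptron training (steps 1-6). X: (P, n) inputs; d: (P,)
    bipolar targets in {-1, +1}. The threshold is folded in as an
    (n+1)-th input fixed to -1 with weight w_{n+1} = theta."""
    rng = np.random.default_rng(seed)
    P, n = X.shape
    Xa = np.hstack([X, -np.ones((P, 1))])        # append x_{n+1} = -1
    w = rng.uniform(-1.0, 1.0, n + 1)            # step 1
    for epoch in range(max_epochs):
        for p in range(P):                       # steps 2-4
            o = 1.0 if Xa[p] @ w >= 0 else -1.0  # o = sign(net)
            w += alpha * (d[p] - o) * Xa[p]      # delta rule (27.6)
        o_all = np.where(Xa @ w >= 0, 1.0, -1.0) # steps 5-6
        if np.all(o_all == d):                   # cumulative error E = 0
            return w, epoch + 1
    return w, max_epochs

# AND is linearly separable, so the rule is guaranteed to converge.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([-1.0, -1.0, -1.0, 1.0])
w, epochs = train_perceptron(X, d)
```

Running the same code on XOR targets would simply exhaust `max_epochs` without ever reaching $E = 0$, mirroring the non-separability shown in Fig. 27.3.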


27.3 Multilayered Feed-Forward ANN Models

A breakthrough in our ability to construct and train more complex multilayered ANNs came in 1986, when Rumelhart et al. [27.2] introduced the error back-propagation method. It is based on making the transfer functions differentiable (hence the error functional to be minimized is differentiable as well) and finding a local minimum of the error functional by the gradient-based steepest descent method.

We will show the derivation of the back-propagation algorithm for a two-layer feed-forward ANN, as demonstrated, e.g., in [27.3]. Of course, the same principles can be applied to a feed-forward ANN architecture with any (finite) number of layers. In feed-forward ANNs, neurons are organized in layers. There are no connections among neurons within the same layer; connections only exist between successive layers. Each neuron from layer $l$ has connections to each neuron in layer $l+1$.

As has already been mentioned, the activation functions need to be differentiable and are usually of sigmoid shape. The most common activation functions are:

- Unipolar sigmoid:

$$f(\mathrm{net}) = \frac{1}{1 + \exp(-\lambda\,\mathrm{net})} \qquad (27.7)$$

- Bipolar sigmoid (hyperbolic tangent):

$$f(\mathrm{net}) = \frac{2}{1 + \exp(-\lambda\,\mathrm{net})} - 1. \qquad (27.8)$$

The constant $\lambda > 0$ determines the steepness of the sigmoid curve and is commonly set to 1. In the limit $\lambda \to \infty$ the bipolar sigmoid tends to the sign function (used in the perceptron) and the unipolar sigmoid tends to the step function.
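These two activation functions, with $\lambda = 1$, and the derivative identities used later in back-propagation ($f' = o(1-o)$ for the unipolar sigmoid, $f' = \frac{1}{2}(1-o^2)$ for the bipolar one) can be checked numerically; the grid and tolerance below are illustrative:

```python
import numpy as np

def unipolar(net, lam=1.0):
    """Unipolar sigmoid (27.7)."""
    return 1.0 / (1.0 + np.exp(-lam * net))

def bipolar(net, lam=1.0):
    """Bipolar sigmoid (27.8); equals tanh(lam * net / 2)."""
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

# Back-propagation uses the identities f' = o(1 - o) (unipolar) and
# f' = (1/2)(1 - o^2) (bipolar); compare them to central differences.
net = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
num_uni = (unipolar(net + eps) - unipolar(net - eps)) / (2 * eps)
num_bip = (bipolar(net + eps) - bipolar(net - eps)) / (2 * eps)
o_uni, o_bip = unipolar(net), bipolar(net)
# num_uni matches o_uni * (1 - o_uni); num_bip matches 0.5 * (1 - o_bip**2)
```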

Consider the single-layer ANN in Fig. 27.4. The input and output vectors are $\bar{y} = (y_1, \ldots, y_j, \ldots, y_J)$ and $\bar{o} = (o_1, \ldots, o_k, \ldots, o_K)$, respectively, where $o_k = f(\mathrm{net}_k)$ and

$$\mathrm{net}_k = \sum_{j=1}^{J} w_{kj} y_j. \qquad (27.9)$$

Set $y_J = -1$ and $w_{kJ} = \theta_k$, a threshold for the $k = 1, \ldots, K$ output neurons. The desired output is $\bar{d} = (d_1, \ldots, d_k, \ldots, d_K)$.

After training, we would like, for all training patterns $p = 1, \ldots, P$ from $A_{\mathrm{train}}$, the model output to closely resemble the desired values (targets). The training problem is transformed into an optimization one by defining the error function

$$E_p = \frac{1}{2} \sum_{k=1}^{K} (d_{pk} - o_{pk})^2, \qquad (27.10)$$

where $p$ is the training point index. $E_p$ is the sum of squares of errors on the output neurons. During learning we seek to find the weight setting that minimizes $E_p$. This will be done using gradient-based steepest descent on $E_p$,

$$\Delta w_{kj} = -\alpha \frac{\partial E_p}{\partial w_{kj}} = -\alpha \frac{\partial E_p}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = \alpha \delta_{ok} y_j, \qquad (27.11)$$

where $\alpha$ is a positive learning rate. Note that $-\partial E_p / \partial \mathrm{net}_k = \delta_{ok}$, which is the generalized training signal on the $k$-th output neuron. The partial derivative $\partial \mathrm{net}_k / \partial w_{kj}$ is equal to $y_j$ (27.9). Furthermore,

$$\delta_{ok} = -\frac{\partial E_p}{\partial \mathrm{net}_k} = -\frac{\partial E_p}{\partial o_k} \frac{\partial o_k}{\partial \mathrm{net}_k} = (d_{pk} - o_{pk}) f'_k, \qquad (27.12)$$

where $f'_k$ denotes the derivative of the activation function with respect to $\mathrm{net}_k$. For the unipolar sigmoid (27.7), we have $f'_k = o_k (1 - o_k)$. For the bipolar sigmoid (27.8), $f'_k = \frac{1}{2}(1 - o_k^2)$. The rule for updating the $j$-th weight of the $k$-th output neuron reads

$$\Delta w_{kj} = \alpha (d_{pk} - o_{pk}) f'_k y_j, \qquad (27.13)$$

where $(d_{pk} - o_{pk}) f'_k = \delta_{ok}$ is the generalized error signal flowing back through all connections ending in the $k$-th output neuron. Note that if we put $f'_k = 1$, we obtain the perceptron learning rule (27.6).

Fig. 27.4 A single-layer ANN


We will now extend the network with another layer, called the hidden layer (Fig. 27.5).

The input to the network is identical with the input vector $\bar{x} = (x_1, \ldots, x_i, \ldots, x_I)$ for the hidden layer. The output neurons process as inputs the outputs $\bar{y} = (y_1, \ldots, y_j, \ldots, y_J)$, $y_j = f(\mathrm{net}_j)$, of the hidden layer. Hence,

$$\mathrm{net}_j = \sum_{i=1}^{I} v_{ji} x_i. \qquad (27.14)$$

As before, the last (in this case the $I$-th) input is fixed to $-1$. Recall that the same holds for the output of the $J$-th hidden neuron. Activation thresholds for hidden neurons are $v_{jI} = \theta_j$, for $j = 1, \ldots, J$.

Equations (27.11)-(27.13) describe the modification of weights from the hidden to the output layer. We will now show how to modify the weights from the input to the hidden layer. We would still like to minimize $E_p$ (27.10) through steepest descent.

The hidden weight $v_{ji}$ will be modified as follows

$$\Delta v_{ji} = -\alpha \frac{\partial E_p}{\partial v_{ji}} = -\alpha \frac{\partial E_p}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial v_{ji}} = \alpha \delta_{yj} x_i. \qquad (27.15)$$

Again, �@[email protected]/D ıyj is the generalized trainingsignal on the j-th hidden neuron that should flow on theinput weights. As before, @.netj/=@vji D xi (27.14). Fur-thermore,

$$\delta_{yj} = -\frac{\partial E_p}{\partial \mathrm{net}_j} = -\frac{\partial E_p}{\partial y_j} \frac{\partial y_j}{\partial \mathrm{net}_j} = -\frac{\partial E_p}{\partial y_j} f'_j, \qquad (27.16)$$

Fig. 27.5 A two-layer feed-forward ANN

where $f'_j$ is the derivative of the activation function in the hidden layer with respect to $\mathrm{net}_j$, and

$$\frac{\partial E_p}{\partial y_j} = -\sum_{k=1}^{K} (d_{pk} - o_{pk}) \frac{\partial f(\mathrm{net}_k)}{\partial y_j} = -\sum_{k=1}^{K} (d_{pk} - o_{pk}) \frac{\partial f(\mathrm{net}_k)}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial y_j}. \qquad (27.17)$$

Since $f'_k$ is the derivative of the output neuron sigmoid with respect to $\mathrm{net}_k$ and $\partial \mathrm{net}_k / \partial y_j = w_{kj}$ (27.9), we have

$$\frac{\partial E_p}{\partial y_j} = -\sum_{k=1}^{K} (d_{pk} - o_{pk}) f'_k w_{kj} = -\sum_{k=1}^{K} \delta_{ok} w_{kj}. \qquad (27.18)$$

Plugging this into (27.16), we obtain

$$\delta_{yj} = \left( \sum_{k=1}^{K} \delta_{ok} w_{kj} \right) f'_j. \qquad (27.19)$$

Finally, the weights from the input to the hidden layer are modified as follows

$$\Delta v_{ji} = \alpha \left( \sum_{k=1}^{K} \delta_{ok} w_{kj} \right) f'_j x_i. \qquad (27.20)$$

Consider now the general case of $m$ hidden layers. For the $n$-th hidden layer we have

$$\Delta v^n_{ji} = \alpha \delta^n_{yj} x^{n-1}_i, \qquad (27.21)$$

where

$$\delta^n_{yj} = \left( \sum_{k=1}^{K} \delta^{n+1}_{ok} w^{n+1}_{kj} \right) (f^n_j)', \qquad (27.22)$$

and $(f^n_j)'$ is the derivative of the activation function of the $n$-th layer with respect to $\mathrm{net}^n_j$.

Often, the learning speed can be improved by using the so-called momentum term

$$\Delta w_{kj}(t) \leftarrow \Delta w_{kj}(t) + \mu\, \Delta w_{kj}(t-1), \qquad \Delta v_{ji}(t) \leftarrow \Delta v_{ji}(t) + \mu\, \Delta v_{ji}(t-1), \qquad (27.23)$$

where $\mu \in (0, 1]$ is the momentum rate. Consider a training set

$$A_{\mathrm{train}} = \{(\bar{x}^1, \bar{d}^1), (\bar{x}^2, \bar{d}^2), \ldots, (\bar{x}^p, \bar{d}^p), \ldots, (\bar{x}^P, \bar{d}^P)\}.$$

The back-propagation algorithm for training feed-forward ANNs can be summarized as follows:


- Step 1: Set $\alpha \in (0, 1]$. Randomly initialize weights to small values, e.g., in the interval $(-0.5, 0.5)$. Counters and the error are initialized as follows: $k = 1$, $p = 1$, $E = 0$. $E$ denotes the accumulated error across training patterns,

$$E = \sum_{p=1}^{P} E_p, \qquad (27.24)$$

where $E_p$ is given in (27.10). Set a tolerance threshold $\varepsilon$ for the error. The threshold will be used to stop the training process.
- Step 2: Apply input $\bar{x}^p$ and compute the corresponding $\bar{y}^p$ and $\bar{o}^p$.
- Step 3: For every output neuron, calculate $\delta_{ok}$ (27.12); for every hidden neuron, determine $\delta_{yj}$ (27.19).
- Step 4: Modify the weights: $w_{kj} \leftarrow w_{kj} + \alpha \delta_{ok} y_j$ and $v_{ji} \leftarrow v_{ji} + \alpha \delta_{yj} x_i$.
- Step 5: If $p < P$, set $p \leftarrow p + 1$ and go to step 2. Otherwise go to step 6.
- Step 6: Fixing the weights, calculate $E$. If $E < \varepsilon$, stop training; otherwise permute the elements of $A_{\mathrm{train}}$, set $E = 0$, $p = 1$, $k \leftarrow k + 1$, and go to step 2.

Consider a feed-forward ANN with fixed weights and a single output unit. It can be considered a real-valued function $G$ on $I$-dimensional vectorial inputs,

$$G(\bar{x}) = f\left( \sum_{j=1}^{J} w_j\, f\left( \sum_{i=1}^{I} v_{ji} x_i \right) \right).$$

There has been a series of results showing that such a parameterized function class is sufficiently rich in the space of reasonable functions (see, e.g., [27.4]). For example, for any smooth function $F$ over a compact domain and a precision threshold $\varepsilon$, for a sufficiently large number $J$ of hidden units there is a weight setting so that $G$ is not further away from $F$ than $\varepsilon$ (in L2 norm).

When training a feed-forward ANN, a key decision must be made about how complex the model should be; in other words, how many hidden units $J$ one should use. If $J$ is too small, the model will be too rigid (high bias) and it will not be able to sufficiently adapt to the data. However, under different samples from the same data generating process, the resulting trained models will vary relatively little (low variance). On the other hand, if $J$ is too high, the model will be too complex, modeling even irrelevant features of the data, such as output noise. The particular data will be interpolated exactly (low bias), but the variability of fitted models under different training samples from the same process will be immense. It is, therefore, important to set $J$ to an optimal value, reflecting the complexity of the data generating process. This is usually achieved by splitting the data into three disjoint sets: training, validation, and test sets. Models with different numbers of hidden units are trained on the training set; their performance is then checked on the held-out validation set. The optimal number of hidden units is selected based on the (smallest) validation error. Finally, the test set is used for independent comparison of selected models from different model classes.

If the data set is not large enough, one can perform such model selection using k-fold cross-validation. The data for model construction (the data that would be considered training and validation sets in the scenario above) is split into $k$ disjoint folds. One fold is selected as the validation fold; the other $k-1$ are used for training. This is repeated $k$ times, yielding $k$ estimates of the validation error. The validation error is then calculated as the mean of those $k$ estimates.
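The fold logic can be sketched as follows; the `fit` and `error` callables are hypothetical placeholders for whatever training and validation-error routines one actually uses:

```python
import numpy as np

def kfold_validation_error(X, d, fit, error, k=5, seed=0):
    """k-fold cross-validation: split into k disjoint folds, train on
    k-1 of them, validate on the held-out fold, average the k errors.
    fit(X_tr, d_tr) -> model and error(model, X_va, d_va) -> float are
    caller-supplied (hypothetical) callables."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)               # k disjoint folds
    errs = []
    for i in range(k):
        va = folds[i]                            # validation fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[tr], d[tr])
        errs.append(error(model, X[va], d[va]))
    return float(np.mean(errs))                  # mean of the k estimates

# Toy usage: the "model" is just the training-set mean, squared loss.
X = np.arange(20, dtype=float).reshape(-1, 1)
d = X[:, 0] * 2.0
cv = kfold_validation_error(
    X, d,
    fit=lambda Xt, dt: float(np.mean(dt)),
    error=lambda m, Xv, dv: float(np.mean((dv - m) ** 2)),
)
```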

We have described data-based methods for model selection. Other alternatives are available. For example, by turning an ANN into a probabilistic model (e.g., by including an appropriate output noise model), under some prior assumptions on the weights (e.g., a priori small weights are preferred), one can perform Bayesian model selection (through model evidence) [27.5].

There are several seminal books on feed-forward ANNs with well-documented theoretical foundations and practical applications, e.g., [27.3, 6, 7]. We refer the interested reader to those books as good starting points, as the breadth of theory and applications of feed-forward ANNs is truly immense.

27.4 Recurrent ANN Models

Consider a situation where the associations in the training set we would like to learn are of the following (abstract) form: a → α, b → β, b → α, b → γ, c → α, c → γ, d → α, etc., where the Latin and Greek letters stand for input and output vectors, respectively. It is clear that now, for one input item, there can be different output associations, depending on the temporal context in which the training items are presented. In other words, the model output is determined not only by the input, but also by the history of presented items so far.


Obviously, the feed-forward ANN model described in the previous section cannot be used in such cases, and the model must be further extended so that the temporal context is properly represented.

The architecturally simplest solution is provided by the so-called time delay neural network (TDNN) (Fig. 27.6). The input window into the past has a finite length $D$. If the output is an estimate of the next item of the input time series, such a network realizes a non-linear autoregressive model of order $D$.

If we are lucky, even such a simple solution can be sufficient to capture the temporal structure hidden in the data. An advantage of the TDNN architecture is that some training methods developed for feed-forward networks can be readily used. A disadvantage of TDNN networks is that fixing a finite order $D$ may not be adequate for modeling the temporal structure of the data generating source. TDNN enables the feed-forward ANN to see, besides the current input at time $t$, the other inputs from the past up to time $t - D$. Of course, during training it is now imperative to preserve the order of training items in the training set. TDNN has been successfully applied in many fields where spatial-temporal structures are naturally present, such as robotics, speech recognition, etc. [27.8, 9].
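Preparing training data for a TDNN amounts to sliding a window of length $D + 1$ over the series; a sketch (the sine series is just an illustrative signal):

```python
import numpy as np

def make_tdnn_dataset(series, D):
    """Build (window, next value) pairs: the input at time t is
    (x(t-D), ..., x(t)) and the target is x(t+1), so a feed-forward
    net trained on these pairs acts as a nonlinear AR model of order D."""
    X, y = [], []
    for t in range(D, len(series) - 1):
        X.append(series[t - D:t + 1])    # window of length D + 1
        y.append(series[t + 1])          # next item of the series
    return np.array(X), np.array(y)

series = np.sin(0.3 * np.arange(100))    # illustrative signal
X, y = make_tdnn_dataset(series, D=4)
# X has shape (95, 5): 95 ordered training items, window length D + 1 = 5.
```

Note that the pairs must be presented in their original order during training, as stressed in the text.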

In order to extend the ANN architecture so that a variable (even unbounded) length of the input window can be flexibly considered, we need a different way of capturing the temporal context. This is achieved through the so-called state space formulation. In this case, we will need to change our outlook on training. The new architectures of this type are known as recurrent neural networks (RNN).

Fig. 27.6 TDNN of order $D$ (inputs $x(t), x(t-1), \ldots, x(t-D)$, hidden layer, output layer)

As in feed-forward ANNs, there are connections between the successive layers. In addition, and in contrast to feed-forward ANNs, connections between neurons of the same layer are allowed, but subject to a time delay. It also may be possible to have connections from a higher-level layer to a lower layer, again subject to a time delay. In many cases it is, however, more convenient to introduce an additional fictional context layer that contains delayed activations of neurons from the selected layer(s), and to represent the resulting RNN architecture as a feed-forward architecture with some fixed one-to-one delayed connections. As an example, consider the so-called simple recurrent network (SRN) of Elman [27.10] shown in Fig. 27.7. The output of the SRN at time $t$ is given by

$$o^{(t)}_k = f\left( \sum_{j=1}^{J} m_{kj} y^{(t)}_j \right), \qquad y^{(t)}_j = f\left( \sum_{i=1}^{J} w_{ji} y^{(t-1)}_i + \sum_{i=1}^{I} v_{ji} x^{(t)}_i \right). \qquad (27.25)$$

The hidden layer constitutes the state of the input-driven dynamical system, whose role it is to represent the relevant (with respect to the output) information

Fig. 27.7 Schematic depiction of the SRN architecture

Fig. 27.8 Schematic depiction of Jordan's RNN architecture


about the input history seen so far. The state (as in a generic state space model) is updated recursively.
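The state-space update (27.25) can be sketched directly; the weights below are random, purely to illustrate how the hidden state carries the input history:

```python
import numpy as np

def srn_forward(x_seq, V, W, M, f=np.tanh):
    """SRN forward pass per (27.25): y(t) = f(W y(t-1) + V x(t)),
    o(t) = f(M y(t)), with the context y(0) fixed to zeros."""
    y = np.zeros(W.shape[0])
    ys, os = [], []
    for x in x_seq:
        y = f(W @ y + V @ x)            # hidden state: the temporal context
        ys.append(y)
        os.append(f(M @ y))
    return np.array(ys), np.array(os)

rng = np.random.default_rng(0)
V = rng.normal(0.0, 0.5, (5, 3))        # input -> hidden
W = rng.normal(0.0, 0.5, (5, 5))        # hidden -> hidden (delayed feedback)
M = rng.normal(0.0, 0.5, (2, 5))        # hidden -> output
x_seq = rng.normal(size=(10, 3))
ys, os = srn_forward(x_seq, V, W, M)
# The same input presented at different times generally yields different
# outputs, because y(t) depends on the whole history of the sequence.
```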

Many variations on such architectures with time-delayed feedback loops exist. For example, Jordan [27.11] suggested feeding back the outputs as the relevant temporal context, and Bengio et al. [27.12] mixed the temporal context representations of the SRN and the Jordan network into a single architecture. Schematic representations of these architectures are shown in Figs. 27.8 and 27.9.

Training in such architectures is more complex than training of feed-forward ANNs. The principal problem is that changes in weights propagate in time, and this needs to be explicitly represented in the update rules. We will briefly mention two approaches to training RNNs, namely back-propagation through time (BPTT) [27.13] and real-time recurrent learning (RTRL) [27.14]. We will demonstrate BPTT on a classification task, where the label of the input sequence is known only after $T$ time steps (i.e., after $T$ input items have been processed). The RNN is unfolded in time to form a feed-forward network with $T$ hidden layers. Figure 27.10 shows a simple two-neuron RNN and Fig. 27.11 represents its unfolded form for $T = 2$ time steps.

Fig. 27.9 Schematic depiction of Bengio's RNN architecture

Fig. 27.10 A two-neuron SRN

The first input comes at time $t = 1$ and the last at $t = T$. Activities of the context units are initialized at the beginning of each sequence to some fixed numbers. The unfolded network is then trained as a feed-forward network with $T$ hidden layers. At the end of the sequence, the model output is determined as

$$o^{(T)}_k = f\left( \sum_{j=1}^{J} m^{(T)}_{kj} y^{(T)}_j \right), \qquad y^{(t)}_j = f\left( \sum_{i=1}^{J} w^{(t)}_{ji} y^{(t-1)}_i + \sum_{i=1}^{I} v^{(t)}_{ji} x^{(t)}_i \right). \qquad (27.26)$$

Having the model output enables us to compute the error

$$E(T) = \frac{1}{2} \sum_{k=1}^{K} \left( d^{(T)}_k - o^{(T)}_k \right)^2. \qquad (27.27)$$

The hidden-to-output weights are modified according to

$$\Delta m^{(T)}_{kj} = -\alpha \frac{\partial E(T)}{\partial m_{kj}} = \alpha \delta^{(T)}_k y^{(T)}_j, \qquad (27.28)$$

Fig. 27.11 Two-neuron SRN unfolded in time for $T = 2$


where

$$\delta^{(T)}_k = \left( d^{(T)}_k - o^{(T)}_k \right) f'\left( \mathrm{net}^{(T)}_k \right). \qquad (27.29)$$

The other weight updates are calculated as follows:

$$\Delta w^{(T)}_{hj} = -\alpha \frac{\partial E(T)}{\partial w_{hj}} = \alpha \delta^{(T)}_h y^{(T-1)}_j; \qquad \delta^{(T)}_h = \left( \sum_{k=1}^{K} \delta^{(T)}_k m^{(T)}_{kh} \right) f'\left( \mathrm{net}^{(T)}_h \right) \qquad (27.30)$$

$$\Delta v^{(T)}_{ji} = -\alpha \frac{\partial E(T)}{\partial v_{ji}} = \alpha \delta^{(T)}_j x^{(T)}_i; \qquad \delta^{(T)}_j = \left( \sum_{k=1}^{K} \delta^{(T)}_k m^{(T)}_{kj} \right) f'\left( \mathrm{net}^{(T)}_j \right) \qquad (27.31)$$

$$\Delta w^{(T-1)}_{hj} = \alpha \delta^{(T-1)}_h y^{(T-2)}_j; \qquad \delta^{(T-1)}_h = \left( \sum_{j=1}^{J} \delta^{(T)}_j w^{(T)}_{jh} \right) f'\left( \mathrm{net}^{(T-1)}_h \right) \qquad (27.32)$$

$$\Delta v^{(T-1)}_{ji} = \alpha \delta^{(T-1)}_j x^{(T-1)}_i; \qquad \delta^{(T-1)}_j = \left( \sum_{h=1}^{J} \delta^{(T)}_h w^{(T)}_{hj} \right) f'\left( \mathrm{net}^{(T-1)}_j \right), \qquad (27.33)$$

etc. The final weight updates are the averages of the $T$ partial weight update suggestions calculated on the unfolded network,

$$\Delta w_{hj} = \frac{\sum_{t=1}^{T} \Delta w^{(t)}_{hj}}{T} \qquad \text{and} \qquad \Delta v_{ji} = \frac{\sum_{t=1}^{T} \Delta v^{(t)}_{ji}}{T}. \qquad (27.34)$$

For every new training sequence (of possibly different length $T$), the network is unfolded to the desired length and the weight update process is repeated. In some cases (e.g., continual prediction on time series), it is necessary to set a maximum unfolding length $L$ that will be used in every update step. Of course, in such cases we can lose vital information from the past. This problem is eliminated in the RTRL methodology.
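A compact BPTT sketch for an SRN labeled only at the end of the sequence, using tanh activations (so $f' = 1 - y^2$); the dimensions and data are illustrative. Since the per-step update suggestions sum to $-\alpha$ times the exact gradient of $E(T)$, the unfolding can be checked against a numerical gradient:

```python
import numpy as np

def bptt_updates(x_seq, d, V, W, M, alpha=0.1):
    """One BPTT pass for an SRN whose target is given only at the end of
    the sequence: unfold for T steps, back-propagate the deltas, then
    average the per-step update suggestions as in (27.34). f = tanh."""
    T = len(x_seq)
    J = W.shape[0]
    ys = [np.zeros(J)]                            # y(0): fixed initial context
    for t in range(T):
        ys.append(np.tanh(W @ ys[-1] + V @ x_seq[t]))
    o = np.tanh(M @ ys[-1])                       # output at time T
    delta_o = (d - o) * (1 - o ** 2)              # output delta, f' = 1 - o^2
    dM = alpha * np.outer(delta_o, ys[-1])        # hidden-to-output update
    delta_h = (M.T @ delta_o) * (1 - ys[-1] ** 2) # hidden delta at layer T
    dW_sum, dV_sum = np.zeros_like(W), np.zeros_like(V)
    for t in range(T, 0, -1):                     # unfolded hidden layers T..1
        dW_sum += alpha * np.outer(delta_h, ys[t - 1])
        dV_sum += alpha * np.outer(delta_h, x_seq[t - 1])
        delta_h = (W.T @ delta_h) * (1 - ys[t - 1] ** 2)  # push delta back
    return dV_sum / T, dW_sum / T, dM             # averages, as in (27.34)

rng = np.random.default_rng(0)
V = rng.normal(0.0, 0.4, (3, 2))    # input -> hidden
W = rng.normal(0.0, 0.4, (3, 3))    # hidden -> hidden (recurrent)
M = rng.normal(0.0, 0.4, (2, 3))    # hidden -> output
x_seq = rng.normal(size=(5, 2))
d = np.array([1.0, -1.0])
dV, dW, dM = bptt_updates(x_seq, d, V, W, M)
# T * dV equals -alpha times the exact gradient dE(T)/dV, because the
# gradient with respect to a shared weight is the sum over the T copies.
```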

Consider again the SRN architecture in Fig. 27.7. In RTRL the weights are updated on-line, i.e., at every

time step $t$:

$$\Delta w^{(t)}_{kj} = -\alpha \frac{\partial E(t)}{\partial w^{(t)}_{kj}}, \qquad \Delta v^{(t)}_{ji} = -\alpha \frac{\partial E(t)}{\partial v^{(t)}_{ji}}, \qquad \Delta m^{(t)}_{kj} = -\alpha \frac{\partial E(t)}{\partial m^{(t)}_{kj}}. \qquad (27.35)$$

The updates of the hidden-to-output weights are straightforward,

$$\Delta m^{(t)}_{kj} = \alpha \delta^{(t)}_k y^{(t)}_j = \alpha \left( d^{(t)}_k - o^{(t)}_k \right) f'_k\left( \mathrm{net}^{(t)}_k \right) y^{(t)}_j. \qquad (27.36)$$

For the other weights we have

$$\Delta v^{(t)}_{ji} = \alpha \sum_{k=1}^{K} \delta^{(t)}_k \left( \sum_{h=1}^{J} m_{kh} \frac{\partial y^{(t)}_h}{\partial v_{ji}} \right), \qquad \Delta w^{(t)}_{ji} = \alpha \sum_{k=1}^{K} \delta^{(t)}_k \left( \sum_{h=1}^{J} m_{kh} \frac{\partial y^{(t)}_h}{\partial w_{ji}} \right), \qquad (27.37)$$

where

$$\frac{\partial y^{(t)}_h}{\partial v_{ji}} = f'\left( \mathrm{net}^{(t)}_h \right) \left( x^{(t)}_i \delta^{\mathrm{Kron}}_{jh} + \sum_{l=1}^{J} w_{hl} \frac{\partial y^{(t-1)}_l}{\partial v_{ji}} \right) \qquad (27.38)$$

$$\frac{\partial y^{(t)}_h}{\partial w_{ji}} = f'\left( \mathrm{net}^{(t)}_h \right) \left( y^{(t-1)}_i \delta^{\mathrm{Kron}}_{jh} + \sum_{l=1}^{J} w_{hl} \frac{\partial y^{(t-1)}_l}{\partial w_{ji}} \right), \qquad (27.39)$$

and $\delta^{\mathrm{Kron}}_{jh}$ is the Kronecker delta ($\delta^{\mathrm{Kron}}_{jh} = 1$ if $j = h$; $\delta^{\mathrm{Kron}}_{jh} = 0$ otherwise). The partial derivatives required for the weight updates can be recursively updated using (27.37)-(27.39). To initialize training, the partial derivatives at $t = 0$ are usually set to 0.
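The sensitivity recursions can be sketched with tanh activations. The tensors `P_v[h, j, i]` $= \partial y_h / \partial v_{ji}$ and `P_w[h, j, i]` $= \partial y_h / \partial w_{ji}$ are initialized to zero at $t = 0$ as described; the network dimensions are illustrative, and the direct term for the recurrent weights uses the previous state:

```python
import numpy as np

def rtrl_step(x, y_prev, P_v, P_w, V, W):
    """One RTRL step: advance the state y(t) = tanh(W y(t-1) + V x(t))
    and the sensitivity tensors P_v[h, j, i] = dy_h/dv_ji and
    P_w[h, j, i] = dy_h/dw_ji."""
    J = W.shape[0]
    y = np.tanh(W @ y_prev + V @ x)
    fp = 1 - y ** 2                                  # f'(net_h) for tanh
    eye = np.eye(J)
    # direct terms: nonzero only when j == h (the Kronecker delta)
    direct_v = np.einsum('hj,i->hji', eye, x)        # x_i(t)
    direct_w = np.einsum('hj,i->hji', eye, y_prev)   # y_i(t-1)
    # recursive terms: sum_l w_hl * (sensitivity at t-1)
    P_v = fp[:, None, None] * (direct_v + np.einsum('hl,lji->hji', W, P_v))
    P_w = fp[:, None, None] * (direct_w + np.einsum('hl,lji->hji', W, P_w))
    return y, P_v, P_w

rng = np.random.default_rng(0)
V = rng.normal(0.0, 0.5, (3, 2))
W = rng.normal(0.0, 0.5, (3, 3))
xs = rng.normal(size=(4, 2))
y = np.zeros(3)
P_v = np.zeros((3, 3, 2))     # dy_h/dv_ji, zero at t = 0
P_w = np.zeros((3, 3, 3))     # dy_h/dw_ji, zero at t = 0
for x in xs:
    y, P_v, P_w = rtrl_step(x, y, P_v, P_w, V, W)
```

After running the sequence, `P_v` agrees with a numerical perturbation of each $v_{ji}$, which is a convenient way to validate such an implementation.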

There is a well-known problem associated with gradient-based parameter fitting in recurrent networks (and, in fact, in any parameterized state space models of similar form) [27.15]. In order to latch an important piece of past information for future use, the state-transition dynamics (27.25) should have an attractive set.

However, in the neighborhood of such an attractive set, the derivatives of the dynamic state-transition map


vanish. Vanishingly small derivatives cannot be reliably propagated back through time in order to form a useful latching set. This is known as the information latching problem. Several suggestions for dealing with the information latching problem have been made, e.g., [27.16]. The most prominent include long short-term memory (LSTM) RNN [27.17] and reservoir computation models [27.18].

LSTM models operate with a specially designed formal neuron model that contains so-called gate units. The gates determine when the input is significant enough (in terms of the given task) to be remembered, whether the neuron should continue to remember the value, and when the value should be output. The LSTM architecture is especially suitable for situations where there are long time intervals of unknown size between important events. LSTM models have been shown to provide superior results over traditional RNNs in a variety of applications (e.g., [27.19, 20]).
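The gating just described can be made concrete with a single LSTM cell step. This is a minimal sketch with hypothetical weight names and sizes; one matrix `W` is assumed to produce all four gate pre-activations from the concatenated previous state and input.

```python
import numpy as np

# One LSTM cell step: input, forget, and output gates decide what to
# write into, keep in, and read out of the cell state c.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """W maps the concatenated [h, x] to the four gate pre-activations."""
    z = W @ np.concatenate([h, x]) + b
    H = h.size
    i = sigmoid(z[0:H])           # input gate: is the input significant?
    f_ = sigmoid(z[H:2 * H])      # forget gate: keep remembering c?
    o = sigmoid(z[2 * H:3 * H])   # output gate: expose the value now?
    g = np.tanh(z[3 * H:4 * H])   # candidate cell update
    c_new = f_ * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(2)
H, X = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, H + X))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(10):
    h, c = lstm_step(rng.normal(size=X), h, c, W, b)
```

Because the forget gate can stay close to 1, the cell state `c` can carry information across long time intervals without the vanishing derivatives of an ordinary sigmoid state transition.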

Reservoir computation models try to avoid the information latching problem by fixing the state-transition part of the RNN. Only a linear readout from the state activations (hidden recurrent layer) producing the output is fit to the data. The state space with the associated dynamic state-transition structure is called the reservoir. The main idea is that the reservoir should be sufficiently complex so as to capture a large number of potentially useful features of the input stream that can then be exploited by the simple readout.

The reservoir computing models differ in how the fixed reservoir is constructed and what form the readout takes. For example, echo state networks (ESN) [27.21] have fixed RNN dynamics (27.25), but with a linear hidden-to-output layer map. Liquid state machines (LSM) [27.22] also have (mostly) linear readouts, but the reservoirs are realized through the dynamics of a set of coupled spiking neuron models. Fractal prediction machines (FPM) [27.23] are reservoir RNN models for processing discrete sequences. The reservoir dynamics is driven by an affine iterative function system and the readout is constructed as a collection of multinomial distributions. Reservoir models have been successfully applied in many practical applications with competitive results, e.g., [27.21, 24, 25].
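An ESN of the kind just described can be sketched in a few lines. This is a minimal illustration under assumptions not taken from the text: the reservoir size, the spectral-radius scaling of 0.9, the washout length, and the toy sine prediction task are all hypothetical choices.

```python
import numpy as np

# Echo state network sketch: a fixed random reservoir with only the
# linear readout fitted to the data by least squares.
rng = np.random.default_rng(3)
N, washout = 50, 20

W_res = rng.normal(size=(N, N))
W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))   # spectral radius < 1
W_in = rng.uniform(-0.5, 0.5, size=N)

# toy task: predict the next value of a sine wave
u = np.sin(0.3 * np.arange(300))
states = np.zeros((len(u), N))
x = np.zeros(N)
for t in range(len(u) - 1):
    x = np.tanh(W_res @ x + W_in * u[t])            # fixed dynamics
    states[t] = x

# linear readout fitted on reservoir states after a washout period
X = states[washout:len(u) - 1]
y = u[washout + 1:len(u)]
w_out, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w_out
```

Note that only `w_out` is learned; the reservoir weights are never touched, which is exactly what sidesteps the gradient-propagation problem.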

Several books solely dedicated to RNNs have appeared, e.g., [27.26–28]; they contain a much deeper elaboration on the theory and practice of RNNs than we were able to provide here.

27.5 Radial Basis Function ANN Models

In this section we will introduce another implementation of the idea of feed-forward ANN. The activations of hidden neurons are again determined by the closeness of inputs x̄ = (x_1, x_2, ..., x_n) to weights c̄ = (c_1, c_2, ..., c_n). Whereas in the feed-forward ANN of Sect. 27.3 the closeness is determined by the dot product of x̄ and c̄, followed by the sigmoid activation function, in radial basis function (RBF) networks the closeness is determined by the squared Euclidean distance of x̄ and c̄, transferred through the inverse exponential. The output of the j-th hidden unit with input weight vector c̄_j is given by

$$
\varphi_j(\bar{x}) = \exp\!\left( - \frac{\left\| \bar{x} - \bar{c}_j \right\|^2}{\sigma_j^2} \right) ,
\tag{27.40}
$$

where σ_j is the activation strength parameter of the j-th hidden unit and determines the width of the spherical (un-normalized) Gaussian. The output neurons are usually linear (for regression tasks)

$$
o_k(\bar{x}) = \sum_{j=1}^{J} w_{kj}\, \varphi_j(\bar{x}) .
\tag{27.41}
$$

The RBF network in this form can simply be viewed as a form of kernel regression. The J functions φ_j form a set of J linearly independent basis functions (e.g., if all the centers c̄_j are different) whose span (the set of all their linear combinations) forms a linear subspace of functions realizable by the given RBF architecture (with given centers c̄_j and kernel widths σ_j).
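The forward pass of (27.40) and (27.41) is short enough to write out directly. The centers, widths, and output weights below are arbitrary illustrative values, not parameters from the text.

```python
import numpy as np

# RBF network response: Gaussian hidden activations around centers c_j
# with widths sigma_j, followed by a linear output layer.
def rbf_forward(x, centers, sigmas, W_out):
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1) / sigmas ** 2)
    return W_out @ phi, phi

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.array([0.5, 0.5])
W_out = np.array([[1.0, -1.0]])
o, phi = rbf_forward(np.array([0.0, 0.0]), centers, sigmas, W_out)
# at the first center the first basis function responds maximally:
# phi = [1, exp(-8)]
```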

For the training of RBF networks, it is important that the basis functions φ_j(x̄) cover the structure of the input space faithfully. Given the training inputs x̄_p from A_train = {(x̄_1, d̄_1), (x̄_2, d̄_2), ..., (x̄_p, d̄_p), ..., (x̄_P, d̄_P)}, many RBF-ANN training algorithms determine the centers c̄_j and widths σ_j based on the inputs {x̄_1, x̄_2, ..., x̄_P} only. One can employ different clustering algorithms, e.g., k-means [27.29], which attempts to position the centers among the training inputs so that the overall sum of (Euclidean) distances between the centers and the inputs they represent (i.e., the inputs falling in their respective Voronoi compartments, the set of inputs for which the current center is the closest among all the centers) is minimized:


- Step 1: Set J, the number of hidden units. The optimum value of J can be obtained through a model selection method, e.g., cross-validation.
- Step 2: Randomly select J training inputs that will form the initial positions of the J centers c̄_j.
- Step 3: At time step t:
  a) Pick a training input x̄(t) and find the center c̄(t) closest to it.
  b) Shift the center c̄(t) towards x̄(t),

$$
\bar{c}(t) \leftarrow \bar{c}(t) + \eta(t) \left( \bar{x}(t) - \bar{c}(t) \right) ,
\quad \text{where } 0 \le \eta(t) \le 1 .
\tag{27.42}
$$

The learning rate η(t) usually decreases in time towards zero. The training is stopped once the centers settle in their positions and move only slightly (some norm of the weight updates is below a certain threshold). Since k-means is guaranteed to find only locally optimal solutions, it is worth re-initializing the centers and re-running the algorithm several times, keeping the solution with the lowest quantization error.
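The on-line k-means procedure above can be sketched as follows. The two-cluster toy data, the 1/t learning-rate schedule, and the deterministic choice of one initial center per cluster (a simplification of the random selection in step 2, made for reproducibility) are all assumptions of this illustration.

```python
import numpy as np

# On-line k-means for RBF center placement: repeatedly shift the
# closest center towards the current input with a decaying rate (27.42).
rng = np.random.default_rng(4)
inputs = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
                    rng.normal(loc=+2.0, size=(50, 2))])
J = 2
# one initial center per cluster (deterministic simplification of step 2)
centers = inputs[[0, 50]].copy()

for t in range(1, 501):
    x = inputs[rng.integers(len(inputs))]
    j = np.argmin(np.linalg.norm(centers - x, axis=1))  # closest center
    eta = 1.0 / t                                       # decaying rate
    centers[j] += eta * (x - centers[j])                # shift, cf. (27.42)
```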

Once the centers are in their positions, it is easy to determine the RBF widths, and once this is done, the output weights can be solved for using methods of linear regression.

Of course, it is better to position the centers with respect to both the inputs and the target outputs in the training set. This can be formulated, e.g., as a gradient descent optimization. Furthermore, covering the input space with spherical Gaussian kernels may not be optimal, and algorithms have been developed for learning general covariance structures. A comprehensive review of RBF networks can be found, e.g., in [27.30].

Recently, it was shown that if enough hidden units are used, their centers can be set randomly at very little cost, and the only remaining free parameters, the output weights, can be determined cheaply and in closed form through linear regression. Such architectures, known as extreme learning machines [27.31], have shown surprisingly high performance levels. The idea of extreme learning machines can be considered analogous to the idea of reservoir computation, but in the static setting. Of course, extreme learning machines can be built using other implementations of feed-forward ANNs, such as the sigmoid networks of Sect. 27.3.
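The extreme learning machine idea fits in a few lines; here with a random tanh hidden layer (a sigmoid-network variant, as mentioned above) rather than RBF units. The toy target, the number of hidden units, and the weight scales are assumptions of this sketch.

```python
import numpy as np

# Extreme learning machine sketch: random, untrained hidden layer;
# only the output weights are solved in closed form by least squares.
rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] * X[:, 1]                  # toy regression target

J = 100
V = rng.normal(size=(2, J))            # random input weights (fixed)
b = rng.normal(size=J)                 # random biases (fixed)
H = np.tanh(X @ V + b)                 # hidden activations
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)   # closed-form readout

pred = H @ w_out
```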

27.6 Self-Organizing Maps

In this section we will introduce ANN models that learn without any signal from a teacher, i.e., learning is based solely on training inputs; there are no output targets. The ANN architecture designed to operate in this setting was introduced by Kohonen under the name self-organizing map (SOM) [27.32]. This model is motivated by the organization of neuron sensitivities in the brain cortex.

In Fig. 27.12a we show a schematic illustration of one of the principal organizations of biological neural networks. In the bottom layer (grid) there are receptors representing the inputs. Every element of the input (each receptor) has forward connections to all neurons in the upper layer representing the cortex. The neurons are organized spatially on a grid. Outputs of the neurons represent the activation of the SOM network. The neurons, besides receiving connections from the input receptors, have a lateral interconnection structure among themselves, with connections that can be excitatory or inhibitory. In Fig. 27.12b we show a formal SOM architecture: neurons spatially organized on a grid receive inputs (elements of input vectors) through connections with synaptic weights.

A particular feature of the SOM is that it can map the training set onto the neuron grid in a manner that preserves the training set's topology: two input patterns close in the input space will most activate neurons that are close on the SOM grid. Such topological mapping of inputs (feature mapping) has been observed in biological neural networks [27.32] (e.g., visual maps, orientation maps of visual contrasts, or auditory maps, frequency maps of acoustic stimuli).

Fig. 27.12a,b Schematic representation of the SOM ANN architectures

Teuvo Kohonen presented one possible realization of the Hebb rule that is used to train the SOM. Input weights of the neurons are initialized as small random numbers. Consider a training set of inputs A_train = {x̄_p}, p = 1, ..., P, and linear neurons

$$
o_i = \sum_{j=1}^{m} w_{ij}\, x_j = \bar{w}_i \cdot \bar{x} ,
\tag{27.43}
$$

where m is the input dimension and i = 1, ..., n. Training inputs are presented in random order. At each training step, we find the (winner) neuron with the weight vector most similar to the current input x̄. The measure of similarity can be based on the dot product, i.e., the index of the winner neuron is i* = argmax_i (w̄_i^T x̄), or on the (Euclidean) distance, i* = argmin_i ‖x̄ − w̄_i‖. After identifying the winner, learning continues by adapting the winner's weights along with the weights of all its neighbors on the neuron grid. This ensures that nearby neurons on the grid will eventually represent similar inputs in the input space. The adaptation is moderated by a neighborhood function h(i*, i) that, given a winner neuron index i*, quantifies how much the other neurons on the grid should be adapted,

$$
\bar{w}_i(t+1) = \bar{w}_i(t) + \alpha(t)\, h(i^*, i) \left( \bar{x}(t) - \bar{w}_i(t) \right) .
\tag{27.44}
$$

The learning rate α(t) ∈ (0, 1) decays in time, e.g., as 1/t or exp(−kt), where k is a positive time scale constant. This ensures convergence of the training process. The simplest form of the neighborhood function operates with rectangular neighborhoods,

$$
h(i^*, i) =
\begin{cases}
1 , & \text{if } d_M(i^*, i) \le \lambda(t) \\
0 , & \text{otherwise} ,
\end{cases}
\tag{27.45}
$$

where d_M(i*, i) is the (Manhattan) distance between neurons i* and i on the map grid. The neighborhood size λ(t) should decrease in time, e.g., through an exponential decay exp(−qt), with time scale q > 0. Another often used neighborhood function is the Gaussian kernel

$$
h(i^*, i) = \exp\!\left( - \frac{d_E^2(i^*, i)}{\lambda^2(t)} \right) ,
\tag{27.46}
$$

where d_E(i*, i) is the Euclidean distance between i* and i on the grid, i.e., d_E(i*, i) = ‖r_{i*} − r_i‖, where r_i is the co-ordinate vector of the i-th neuron on the SOM grid.

Training of SOM networks can be summarized as follows:

- Step 1: Set α_0, λ_0, and t_max (the maximum number of iterations). Randomly (e.g., with uniform distribution) generate the synaptic weights (e.g., from (−0.5, 0.5)). Initialize the counters: t = 0, p = 1; t indexes time steps (iterations) and p is the input pattern index.
- Step 2: Take input x̄_p and find the corresponding winner neuron.
- Step 3: Update the weights of the winner and its topological neighbors on the grid (as determined by the neighborhood function). Increment t.
- Step 4: Update α and λ.
- Step 5: If p < P, set p ← p + 1 and go to step 2 (we can also use randomized selection); otherwise go to step 6.
- Step 6: If t = t_max, finish the training process. Otherwise set p = 1 and go to step 2. A new training epoch begins.
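The steps above can be sketched as follows for a one-dimensional grid with the Gaussian neighborhood (27.46). The data distribution, grid size, iteration count, and the particular exponential decay schedules for the learning rate and neighborhood width are assumptions of this illustration.

```python
import numpy as np

# SOM training sketch: 1-D grid of neurons, Gaussian neighborhood,
# decaying learning rate alpha(t) and neighborhood width lam(t).
rng = np.random.default_rng(6)
data = rng.uniform(0, 1, size=(200, 2))         # training inputs
n = 10                                          # neurons on a 1-D grid
grid = np.arange(n, dtype=float)                # neuron coordinates r_i
W = rng.uniform(-0.5, 0.5, size=(n, 2))         # synaptic weights (step 1)

t_max = 2000
for t in range(t_max):
    alpha = 0.5 * np.exp(-3.0 * t / t_max)      # learning rate decay
    lam = 3.0 * np.exp(-3.0 * t / t_max)        # neighborhood width decay
    x = data[rng.integers(len(data))]
    winner = np.argmin(np.linalg.norm(W - x, axis=1))    # step 2
    h = np.exp(-(grid - grid[winner]) ** 2 / lam ** 2)   # cf. (27.46)
    W += alpha * h[:, None] * (x - W)                    # cf. (27.44)
```

After training, neighboring neurons on the grid hold weight vectors that are close in the data space, i.e., the map is topology preserving.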

The SOM network can be used as a tool for non-linear data visualization (grid dimensions 1, 2, or 3). In general, SOM implements constrained vector quantization, where the codebook vectors (vector quantization centers) cannot move freely in the data space during adaptation, but are constrained to lie on a lower-dimensional manifold in the data space. The dimensionality of this manifold is equal to the dimensionality of the neural grid. The neural grid can be viewed as a discretized version of a local co-ordinate system (e.g., the computer screen), and the weight vectors in the data space (connected by the neighborhood structure on the neuron grid) as its image in the data space. In this interpretation, the neuron positions on the grid represent co-ordinate functions (in the sense of differential geometry) mapping elements of the manifold to the co-ordinate system. Hence, the SOM algorithm can also be viewed as one particular implementation of manifold learning.

There have been numerous successful applications of SOM in a wide variety of areas, e.g., image processing, computer vision, robotics, bioinformatics, process analysis, and telecommunications. A good survey of SOM applications can be found, e.g., in [27.33]. SOMs have also been extended to temporal domains, mostly by the introduction of additional feedback connections, e.g., [27.34–37]. Such models can be used for topographic mapping or constrained clustering of time series data.


27.7 Recursive Neural Networks

In many application domains, data are naturally organized in structured form, where each data item is composed of several components related to each other in a non-trivial way, and the specific nature of the task to be performed relates not only to the information stored at each component, but also to the structure connecting the components. Examples of structured data are parse trees obtained by parsing sentences in natural language, and molecular graphs describing chemical compounds.

Recursive neural networks (RecNN) [27.38, 39] are neural network models able to directly process structured data, such as trees and graphs. For the sake of presentation, here we focus on positional trees. Positional trees are trees in which each child has an associated index, its position with respect to its siblings. Let us understand how a RecNN processes a tree by analogy with the unfolding of a RNN processing a sequence; a sequence can be understood as a special case of tree in which each node v possesses a single child.

In Fig. 27.13 (top) we show the unfolding in time of a sequence when considering a graphical model (recursive network) representing, for a generic node v, the functional dependencies among the input information x_v, the state variable (hidden node) y_v, and the output variable o_v. The operator q^{−1} represents the shift operator in time (unit time delay), i.e., q^{−1} y_t = y_{t−1}, which applied to node v in our framework returns the child of node v. At the bottom of Fig. 27.13 we have reported the unfolding of a binary tree, where the recursive network uses a generalization of the shift operator which, given an index i and a variable associated with a vertex v, returns the variable associated with the i-th child of v, i.e., q_i^{−1} y_v = y_{ch_i[v]}. So, while in RNN the network is unfolded in time, in RecNN the network is unfolded on the structure. The result of unfolding, in both cases, is the encoding network. The encoding network for the sequence specifies how the components implementing the different parts of the recurrent network (e.g., each node of the recurrent network could be instantiated by a layer of neurons or by a full feed-forward neural network with hidden units) need to be interconnected. In the case of the tree, the encoding network has the same semantics; this time, however, a set of parameters (weights) for each child should be considered, leading to a network that, given a node v, can be described by the equations

$$
o_k^{(v)} = f\!\left( \sum_{j=1}^{J} m_{kj}\, y_j^{(v)} \right) ,
$$

$$
y_j^{(v)} = f\!\left( \sum_{s=1}^{d} \sum_{i=1}^{J} w_{ji}^{s}\, y_i^{(\mathrm{ch}_s[v])} + \sum_{i=1}^{I} v_{ji}\, x_i^{(v)} \right) ,
$$

where d is the maximum number of children an input node can have, and the weights w^s_ji are indexed by the s-th child. Note that it is not difficult to generalize all the learning algorithms devised for RNN to these extended equations.

Fig. 27.13a,b Generation of the encoding network for (a) a sequence and (b) a tree. Initial states are represented by squared nodes

Fig. 27.14 The causality style of computation induced by the use of recursive networks is made explicit by using nested boxes to represent the recursive dependencies of the hidden variable associated to the root of the tree

Fig. 27.15a,b Schematic illustration of a principal organization of biological self-organizing neural network (a) and its formal counterpart SOM ANN architecture (b)
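The recursive state and output equations for a tree can be sketched as a simple depth-first traversal. This is a minimal illustration for binary trees (d = 2) with a tanh activation and hypothetical weight names (v for the node label, one matrix per child position, m for the output); missing children contribute the zero "frontier" state.

```python
import numpy as np

# RecNN forward pass on a positional binary tree: the state of a node
# combines its label with the states of its (up to d) children.
rng = np.random.default_rng(7)
I, J, K, d = 3, 4, 2, 2
v = rng.normal(scale=0.5, size=(J, I))        # label-to-hidden weights
w = rng.normal(scale=0.5, size=(d, J, J))     # one weight matrix per child
m = rng.normal(scale=0.5, size=(K, J))        # hidden-to-output weights

def state(node):
    """node: (label_vector, [child, ...]) with up to d children;
    missing children contribute the zero frontier state."""
    label, children = node
    net = v @ label
    for s, child in enumerate(children):
        net += w[s] @ state(child)            # child-position-specific weights
    return np.tanh(net)

def output(node):
    return np.tanh(m @ state(node))

leaf = lambda x: (np.array(x), [])
tree = (np.array([1.0, 0.0, 0.0]),
        [leaf([0.0, 1.0, 0.0]), leaf([0.0, 0.0, 1.0])])
o = output(tree)
```

Replacing the recursion over children with a single recursion over one child recovers the ordinary RNN unfolding over a sequence.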

It should be remarked that recursive networks clearly introduce a causal style of computation, i.e., the computation of the hidden and output variables for a vertex v only depends on the information attached to v and the hidden variables of the children of v. This dependence is satisfied recursively by all of v's descendants and is clearly shown in Fig. 27.14. In the figure, nested boxes are used to make explicit the recursive dependencies among hidden variables that contribute to the determination of the hidden variable y_v associated with the root of the tree.

Although an encoding network can be generated for a directed acyclic graph (DAG), this style of computation limits the discriminative ability of RecNN to the class of trees. In fact, the hidden state is not able to encode information about the parents of nodes. The introduction of contextual processing, however, allows us to discriminate, with some specific exceptions, among DAGs [27.40]. Recently, Micheli [27.41] also showed how contextual processing can be used to extend RecNN to the treatment of cyclic graphs.

The same idea described above for supervised neural networks can be adapted to unsupervised models, where the output value of a neuron typically represents the similarity of the weight vector associated with the neuron to the input vector. Specifically, in [27.37] SOMs were extended to the processing of structured data (SOM-SD). Moreover, a general framework for self-organized processing of structured data


was proposed in [27.42]. The key concepts introduced are:

i) The explicit definition of a representation space R equipped with a similarity measure d_R(·,·) to evaluate the similarity between two hidden states.

ii) The introduction of a general representation function, denoted rep(), which transforms the activation of the map for a given input into a hidden state representation.

In these models, each node v of the input structure is represented by a tuple [x̄_v, r̄_{v_1}, ..., r̄_{v_d}], where x̄_v is a real-valued vectorial encoding of the information attached to vertex v, and the r̄_{v_i} are real-valued vectorial representations of the hidden states returned by the rep() function when processing the activation of the map for the i-th neighbor of v. Each neuron n_j in the map is associated with a weight vector [w̄_j, c̄_j^1, ..., c̄_j^d]. The computation of the winner neuron is based on the joint contribution of the similarity measures d_x(·,·) for the input information and d_R(·,·) for the hidden states, i.e., the internal representations. Some parts of a SOM-SD map trained on DAGs representing visual patterns are shown in Fig. 27.15. Even in this case the style of computation is causal, ruling out the treatment of undirected and/or cyclic graphs. In order to cope with general graphs, a new model, named GraphSOM [27.43], was recently proposed.
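The joint winner computation just described can be sketched as follows. The function and parameter names, the squared-Euclidean choice for both similarity measures, and the mixing weight `mu` are assumptions of this illustration, not notation from [27.42].

```python
import numpy as np

# SOM-SD-style winner computation: each map neuron stores a weight
# vector for the node label and one coded state per child; the winner
# minimizes a weighted sum of label and state distances.
def winner(x_v, child_states, W_label, W_states, mu=0.5):
    """x_v: node label; child_states: list of d child-state codes;
    W_label: (n, |x|) label weights; W_states: (n, d, |r|) state weights."""
    d_x = np.sum((W_label - x_v) ** 2, axis=1)                  # label part
    d_r = np.sum((W_states - np.stack(child_states)) ** 2,
                 axis=(1, 2))                                   # state part
    return np.argmin(mu * d_x + (1.0 - mu) * d_r)

rng = np.random.default_rng(8)
n, d = 16, 2
W_label = rng.normal(size=(n, 3))
W_states = rng.normal(size=(n, d, 2))
i_star = winner(rng.normal(size=3),
                [rng.normal(size=2), rng.normal(size=2)],
                W_label, W_states)
```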

27.8 Conclusion

The field of artificial neural networks (ANN) has grown enormously in the past 60 years. There are many journals and international conferences specifically devoted to neural computation and neural-network-related models and learning machines. The field has come a long way from its beginnings in the form of simple threshold units existing in isolation (e.g., the perceptron, Sect. 27.2) or connected in circuits. Since then we have learnt how to generalize such networks as parameterized differentiable models of various sorts that can be fit to data (trained), usually by transforming the learning task into an optimization one.

ANN models have found numerous successful practical applications in many diverse areas of science and engineering, such as astronomy, biology, finance, and geology. In fact, even though the basic feed-forward ANN architectures were introduced a long time ago, they continue to surprise us with successful applications, most recently in the form of deep networks [27.44]. For example, a form of deep ANN recently achieved the best performance on a well-known benchmark problem, the recognition of handwritten digits [27.45]. This is quite remarkable, since such a simple ANN architecture trained in a purely data-driven fashion was able to outperform the current state-of-the-art techniques, formulated in more sophisticated frameworks and possibly incorporating domain knowledge.

ANN models have been formulated to operate in supervised (e.g., feed-forward ANN, Sect. 27.3; RBF networks, Sect. 27.5), unsupervised (e.g., SOM models, Sect. 27.6), semi-supervised, and reinforcement learning scenarios, and have been generalized to process inputs that are much more general than simple vector data of fixed dimensionality (e.g., the recurrent and recursive networks discussed in Sects. 27.4 and 27.7). Of course, we were not able to cover all important developments in the field of ANNs. We can only hope that we have sufficiently motivated the interested reader with the variety of modeling possibilities based on the idea of interconnected networks of formal neurons, so that he/she will further consult some of the many (much more comprehensive) monographs on the topic, e.g., [27.3, 6, 7].

We believe that ANN models will continue to play an important role in modern computational intelligence. Especially the inclusion of ANN-like models in the field of probabilistic modeling can provide techniques that incorporate both explanatory model-based and data-driven approaches, while preserving a much fuller modeling capability through operating with full distributions instead of simple point estimates.


References

27.1 F. Rosenblatt: The perceptron: A probabilistic model for information storage and organization in the brain, Psychol. Rev. 65, 386–408 (1958)

27.2 D.E. Rumelhart, G.E. Hinton, R.J. Williams: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, ed. by D.E. Rumelhart, J.L. McClelland (MIT Press/Bradford Books, Cambridge 1986) pp. 318–363

27.3 J. Zurada: Introduction to Artificial Neural Systems (West Publ., St. Paul 1992)

27.4 K. Hornik, M. Stinchcombe, H. White: Multilayer feedforward networks are universal approximators, Neural Netw. 2, 359–366 (1989)

27.5 D.J.C. MacKay: Bayesian interpolation, Neural Comput. 4(3), 415–447 (1992)

27.6 S. Haykin: Neural Networks and Learning Machines (Prentice Hall, Upper Saddle River 2009)

27.7 C. Bishop: Neural Networks for Pattern Recognition (Oxford Univ. Press, Oxford 1995)

27.8 T. Sejnowski, C. Rosenberg: Parallel networks that learn to pronounce English text, Complex Syst. 1, 145–168 (1987)

27.9 A. Waibel: Modular construction of time-delay neural networks for speech recognition, Neural Comput. 1, 39–46 (1989)

27.10 J.L. Elman: Finding structure in time, Cogn. Sci. 14, 179–211 (1990)

27.11 M.I. Jordan: Serial order: A parallel distributed processing approach. In: Advances in Connectionist Theory, ed. by J.L. Elman, D.E. Rumelhart (Erlbaum, Hillsdale 1989)

27.12 Y. Bengio, R. Cardin, R. DeMori: Speaker independent speech recognition with neural networks and speech knowledge. In: Advances in Neural Information Processing Systems II, ed. by D.S. Touretzky (Morgan Kaufmann, San Mateo 1990) pp. 218–225

27.13 P.J. Werbos: Generalization of backpropagation with application to a recurrent gas market model, Neural Netw. 1(4), 339–356 (1988)

27.14 R.J. Williams, D. Zipser: A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1(2), 270–280 (1989)

27.15 Y. Bengio, P. Simard, P. Frasconi: Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5(2), 157–166 (1994)

27.16 T. Lin, B.G. Horne, P. Tino, C.L. Giles: Learning long-term dependencies with NARX recurrent neural networks, IEEE Trans. Neural Netw. 7(6), 1329–1338 (1996)

27.17 S. Hochreiter, J. Schmidhuber: Long short-term memory, Neural Comput. 9(8), 1735–1780 (1997)

27.18 M. Lukosevicius, H. Jaeger: Overview of Reservoir Recipes, Technical Report, Vol. 11 (School of Engineering and Science, Jacobs University, Bremen 2007)

27.19 A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber: A novel connectionist system for improved unconstrained handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. 31(5) (2009)

27.20 S. Hochreiter, M. Heusel, K. Obermayer: Fast model-based protein homology detection without alignment, Bioinformatics 23(14), 1728–1736 (2007)

27.21 H. Jaeger, H. Haas: Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication, Science 304, 78–80 (2004)

27.22 W. Maass, T. Natschlager, H. Markram: Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Comput. 14(11), 2531–2560 (2002)

27.23 P. Tino, G. Dorffner: Predicting the future of discrete sequences from fractal representations of the past, Mach. Learn. 45(2), 187–218 (2001)

27.24 M.H. Tong, A. Bicket, E. Christiansen, G. Cottrell: Learning grammatical structure with echo state networks, Neural Netw. 20, 424–432 (2007)

27.25 K. Ishii, T. van der Zant, V. Becanovic, P. Ploger: Identification of motion with echo state network, Proc. OCEANS 2004 MTS/IEEE-TECHNO-OCEAN Conf., Vol. 3 (2004) pp. 1205–1210

27.26 L. Medsker, L.C. Jain: Recurrent Neural Networks: Design and Applications (CRC, Boca Raton 1999)

27.27 J. Kolen, S.C. Kremer: A Field Guide to Dynamical Recurrent Networks (IEEE, New York 2001)

27.28 D. Mandic, J. Chambers: Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability (Wiley, New York 2001)

27.29 J.B. MacQueen: Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. Math. Stat. Probab. (Univ. California Press, Oakland 1967) pp. 281–297

27.30 M.D. Buhmann: Radial Basis Functions: Theory and Implementations (Cambridge Univ. Press, Cambridge 2003)

27.31 G.-B. Huang, Q.-Y. Zhu, C.-K. Siew: Extreme learning machine: Theory and applications, Neurocomputing 70, 489–501 (2006)

27.32 T. Kohonen: Self-Organizing Maps, Springer Series in Information Sciences, Vol. 30 (Springer, Berlin, Heidelberg 2001)

27.33 T. Kohonen, E. Oja, O. Simula, A. Visa, J. Kangas: Engineering applications of the self-organizing map, Proc. IEEE 84(10), 1358–1384 (1996)

27.34 T. Koskela, M. Varsta, J. Heikkonen, K. Kaski: Recurrent SOM with local linear models in time series prediction, 6th Eur. Symp. Artif. Neural Netw. (1998) pp. 167–172

27.35 T. Voegtlin: Recursive self-organizing maps, Neural Netw. 15(8/9), 979–992 (2002)

27.36 M. Strickert, B. Hammer: Merge SOM for temporal data, Neurocomputing 64, 39–72 (2005)


27.37 M. Hagenbuchner, A. Sperduti, A. Tsoi: Self-organizing map for adaptive processing of structured data, IEEE Trans. Neural Netw. 14(3), 491–505 (2003)

27.38 A. Sperduti, A. Starita: Supervised neural networks for the classification of structures, IEEE Trans. Neural Netw. 8(3), 714–735 (1997)

27.39 P. Frasconi, M. Gori, A. Sperduti: A general framework for adaptive processing of data structures, IEEE Trans. Neural Netw. 9(5), 768–786 (1998)

27.40 B. Hammer, A. Micheli, A. Sperduti: Universal approximation capability of cascade correlation for structures, Neural Comput. 17(5), 1109–1159 (2005)

27.41 A. Micheli: Neural network for graphs: A contextual constructive approach, IEEE Trans. Neural Netw. 20(3), 498–511 (2009)

27.42 B. Hammer, A. Micheli, A. Sperduti, M. Strickert: A general framework for unsupervised processing of structured data, Neurocomputing 57, 3–35 (2004)

27.43 M. Hagenbuchner, A. Sperduti, A.-C. Tsoi: Graph self-organizing maps for cyclic and unbounded graphs, Neurocomputing 72(7–9), 1419–1430 (2009)

27.44 Y. Bengio, Y. LeCun: Greedy layer-wise training of deep networks. In: Advances in Neural Information Processing Systems 19, ed. by B. Schölkopf, J. Platt, T. Hofmann (MIT Press, Cambridge 2006) pp. 153–160

27.45 D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber: Deep, big, simple neural nets for handwritten digit recognition, Neural Comput. 22(12), 3207–3220 (2010)