soft computing unit-2 by arun pratap singh

Upload: arunpratapsingh

Post on 03-Jun-2018

  • 8/12/2019 Soft Computing Unit-2 by Arun Pratap Singh

    1/74

    UNIT : II

    SOFT COMPUTING, II SEMESTER (MCSE 205)

    PREPARED BY ARUN PRATAP SINGH

    NEURAL NETWORK:

    These networks are simplified models of the biological neuron system, which is a massively parallel distributed processing system made up of highly interconnected neural computing elements. Their ability to learn makes neural networks powerful and flexible: they acquire knowledge and make it available for use. These networks are also called neural nets or artificial neural networks. In a neural network there is no need to devise an algorithm for performing a specific task. These networks are also well suited to real-time systems, because their parallel architecture gives short computation times and a fast response.

    ARTIFICIAL NEURAL NETWORK (ANN):

    In computer science and related fields, artificial neural networks (ANNs) are computational models inspired by animals' central nervous systems (in particular the brain) that are capable of machine learning and pattern recognition. Artificial neural networks are generally presented as systems of interconnected "neurons" which can compute values from inputs.

    For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are passed on to other neurons. This process is repeated until, finally, an output neuron is activated. This determines which character was read.

    Like other machine learning methods (systems that learn from data), neural networks have been used to solve a wide variety of tasks that are hard to solve using ordinary rule-based programming, including computer vision and speech recognition.

    DIFFERENT ACTIVATION FUNCTION:

    SINGLE LAYER PERCEPTRON:

    In machine learning, the perceptron is an algorithm for supervised classification of an input into one of two classes (it is a binary classifier). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.
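    As a rough illustration of the online, one-example-at-a-time learning described above, the following minimal sketch trains a single perceptron on the logical AND function. The function names, the threshold at zero, the learning rate and the AND data set are assumptions chosen for the example; they do not come from the notes.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a hard threshold (output 0 or 1)."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=20):
    """Online perceptron learning: weights change after each misclassified example."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


# Hypothetical training set: logical AND, which is linearly separable.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
```

    Because the AND classes can be separated by a straight line, the loop stops after a few passes; on data that is not linearly separable it would simply run until `max_epochs`.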

    The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced.

    WIDROW-HOFF/DELTA LEARNING RULE:

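    For a neuron $j$ with activation function $g$, the delta rule for the $i$-th weight $w_{ji}$ is

    $$\Delta w_{ji} = \alpha\,(t_j - y_j)\,g'(h_j)\,x_i$$

    where

    $\alpha$ is a small constant called the learning rate

    $g(x)$ is the neuron's activation function

    $t_j$ is the target output

    $h_j$ is the weighted sum of the neuron's inputs

    $y_j$ is the actual output

    $x_i$ is the $i$-th input.

    It holds that $h_j = \sum_i x_i w_{ji}$ and $y_j = g(h_j)$.

    The delta rule is commonly stated in simplified form for a neuron with a linear activation function as

    $$\Delta w_{ji} = \alpha\,(t_j - y_j)\,x_i$$

    While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses the Heaviside step function as the activation function $g(h)$, which means that $g'(h)$ does not exist at zero and is equal to zero elsewhere, which makes the direct application of the delta rule impossible.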
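    The simplified delta rule for a linear neuron (weight change = learning rate × (target − actual output) × input) can be sketched in a few lines. This is only an illustrative sketch: the function name, the learning rate, the number of epochs and the small data set (generated from the made-up relation t = 2·x1 − x2 + 0.5) are all assumptions.

```python
def train_delta_rule(samples, learning_rate=0.1, epochs=500):
    """Online delta-rule (Widrow-Hoff / LMS) training of one linear neuron."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)  # the last weight plays the role of a bias
    for _ in range(epochs):
        for x, target in samples:
            inputs = list(x) + [1.0]  # constant input of 1 for the bias weight
            y = sum(w * xi for w, xi in zip(weights, inputs))  # linear output
            # Linear activation, so g'(h) = 1: dw_i = rate * (t - y) * x_i
            weights = [w + learning_rate * (target - y) * xi
                       for w, xi in zip(weights, inputs)]
    return weights


# Hypothetical data generated from t = 2*x1 - x2 + 0.5 (an assumption).
samples = [((0, 0), 0.5), ((0, 1), -0.5), ((1, 0), 2.5), ((1, 1), 1.5)]
weights = train_delta_rule(samples)
```

    Since the targets here are an exact linear function of the inputs, the learned weights approach 2, −1 and 0.5; with noisy targets the rule would instead settle near a least-squares fit.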

    WINNER-TAKE-ALL LEARNING RULE:

    Winner-take-all is a computational principle applied in computational models of neural networks by which neurons in a layer compete with each other for activation. In the classical form, only the neuron with the highest activation stays active while all other neurons shut down. However, other variations that allow more than one neuron to be active do exist, for example the soft winner-take-all, by which a power function is applied to the neurons.

    In the theory of artificial neural networks, winner-take-all networks are a case of competitive learning in recurrent neural networks. Output nodes in the network mutually inhibit each other, while simultaneously activating themselves through reflexive connections. After some time, only one node in the output layer will be active, namely the one corresponding to the strongest input. Thus the network uses nonlinear inhibition to pick out the largest of a set of inputs. Winner-take-all is a general computational primitive that can be implemented using different types of neural network models, including both continuous-time and spiking networks (Grossberg, 1973; Oster et al. 2009).

    Winner-take-all networks are commonly used in computational models of the brain, particularly for distributed decision-making or action selection in the cortex. Important examples include hierarchical models of vision (Riesenhuber et al. 1999), and models of selective attention and recognition (Carpenter and Grossberg, 1987; Itti et al. 1998). They are also common in artificial neural networks and neuromorphic analog VLSI circuits. It has been formally proven that the winner-take-all operation is computationally powerful compared to other nonlinear operations, such as thresholding (Maass 2000).

    In many practical cases, it is not a single neuron that becomes the only active one; instead, exactly k neurons become active, for a fixed number k. This principle is referred to as k-winners-take-all.
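    The end result of the competition (not the recurrent inhibition dynamics themselves) can be sketched as a simple selection over a layer's activations. The function names below are hypothetical, and setting the losing neurons to 0.0 is an assumption made for the illustration.

```python
def winner_take_all(activations):
    """Keep only the largest activation; every other neuron is shut down."""
    winner = max(range(len(activations)), key=lambda i: activations[i])
    return [a if i == winner else 0.0 for i, a in enumerate(activations)]


def k_winners_take_all(activations, k):
    """Keep the k largest activations; every other neuron is shut down."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


layer = [0.2, 0.9, 0.5]
single = winner_take_all(layer)        # only the neuron with 0.9 stays active
pair = k_winners_take_all(layer, k=2)  # the neurons with 0.9 and 0.5 stay active
```

    A recurrent winner-take-all network reaches the same outcome gradually through mutual inhibition; the direct selection above only shows what it converges to.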

    LINEAR SEPARABILITY:

    Linear separability is an important concept in neural networks. The idea is to check whether two sets of points in an n-dimensional space can be separated by an (n-1)-dimensional hyperplane.

    Lost it? Here's a simpler explanation.

    One Dimension

    Let's say you're on a number line. You take any two numbers. Now, there are two possibilities:

    1. You choose two different numbers

    2. You choose the same number

    If you choose two different numbers, you can always find another number between them. This number separates the two numbers you chose.

    So, you say that these two numbers are linearly separable.

    But if both numbers are the same, you simply cannot separate them. They're the same. So, they're linearly inseparable. (Not just linearly; they aren't separable at all. You cannot separate something from itself.)

    Two Dimensions

    On extending this idea to two dimensions, some more possibilities come into existence. Consider

    the following:

    Here, we'd like to separate the point (1,1) from the other points. You can see that there exists a line that does this. In fact, there are infinitely many such lines. So, these two classes of points are linearly separable. The first class consists of the point (1,1) and the other class has (0,1), (1,0) and (0,0).

    Now consider this:

    In this case, you just cannot use one single line to separate the two classes (one containing the

    black points and one containing the red points). So, they are linearly inseparable.
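    The two-dimensional cases can be checked with a small brute-force search for a separating line w1·x + w2·y + b = 0. This is only a sketch: the function name and the small integer grid are assumptions, and because the second figure is not reproduced here, the XOR-style arrangement used for the inseparable case is also an assumption. A failed grid search is not a proof in general, although for the XOR arrangement no separating line exists at all.

```python
from itertools import product


def find_separating_line(positives, negatives, grid=range(-3, 4)):
    """Search small integer (w1, w2, b) with w1*x + w2*y + b > 0 on one
    class and < 0 on the other; return the first match, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if (all(w1 * x + w2 * y + b > 0 for x, y in positives)
                and all(w1 * x + w2 * y + b < 0 for x, y in negatives)):
            return (w1, w2, b)
    return None


# Case from the text: (1,1) against (0,1), (1,0) and (0,0).
and_line = find_separating_line([(1, 1)], [(0, 1), (1, 0), (0, 0)])

# Assumed XOR-style case: opposite corners share a class.
xor_line = find_separating_line([(0, 1), (1, 0)], [(0, 0), (1, 1)])
```

    Here `and_line` is some valid triple such as (2, 2, -3), while `xor_line` is `None`.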

    Three dimensions

    Extending the above example to three dimensions, you need a plane to separate the two classes.

    The dashed plane separates the red point from the blue points, so the classes are linearly separable. If the bottom right point on the opposite side were red too, they would become linearly inseparable.

    Extending to n dimensions

    Neural networks often work in many dimensions. To separate classes in n dimensions, you need an (n-1)-dimensional hyperplane.

    Multilayer Perceptron Neural Network Model

    The following diagram illustrates a perceptron network with three layers:

    This network has an input layer (on the left) with three neurons, one hidden layer (in the middle) with three neurons and an output layer (on the right) with three neurons.

    There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used to represent the N categories of the variable.

    Input Layer: A vector of predictor variable values (x_1...x_p) is presented to the input layer. The input layer (or processing before the input layer) standardizes these values so that the range of each variable is -1 to 1. The input layer distributes the values to each of the neurons in the hidden layer. In addition to the predictor variables, there is a constant input of 1.0, called the bias, that is fed to each of the hidden layers; the bias is multiplied by a weight and added to the sum going into the neuron.

    Hidden Layer: Arriving at a neuron in the hidden layer, the value from each input neuron is multiplied by a weight (w_ji), and the resulting weighted values are added together, producing a combined value u_j. The weighted sum (u_j) is fed into a transfer function, σ, which outputs a value h_j. The outputs from the hidden layer are distributed to the output layer.

    Output Layer: Arriving at a neuron in the output layer, the value from each hidden layer neuron is multiplied by a weight (w_kj), and the resulting weighted values are added together, producing a combined value v_k. The weighted sum (v_k) is fed into a transfer function, σ, which outputs a value y_k. The y values are the outputs of the network.

    If a regression analysis is being performed with a continuous target variable, then there is a single neuron in the output layer, and it generates a single y value. For classification problems with categorical target variables, there are N neurons in the output layer producing N values, one for each of the N categories of the target variable.
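    The flow of values through the three layers can be sketched as a forward pass. This is a minimal illustration, not the implementation of any particular package: the function names are hypothetical, the logistic function is assumed as the transfer function, and the bias is represented as one extra weight per neuron applied to the constant input 1.0.

```python
import math


def logistic(u):
    """Assumed transfer function (the notes do not fix a specific one here)."""
    return 1.0 / (1.0 + math.exp(-u))


def layer(inputs, weights, bias_weights, transfer):
    """One layer: each neuron sums its weighted inputs plus bias_weight * 1.0,
    then applies the transfer function."""
    return [transfer(sum(w * x for w, x in zip(row, inputs)) + b * 1.0)
            for row, b in zip(weights, bias_weights)]


def forward(x, w_hidden, b_hidden, w_output, b_output):
    """Input values -> hidden values h_j -> output values y_k."""
    h = layer(x, w_hidden, b_hidden, logistic)
    y = layer(h, w_output, b_output, logistic)
    return h, y


# Hypothetical 3-3-3 network with all weights zero: every neuron outputs 0.5.
zeros = [[0.0, 0.0, 0.0] for _ in range(3)]
h, y = forward([0.5, -1.0, 1.0], zeros, [0.0] * 3, zeros, [0.0] * 3)
```

    Each row of `w_hidden` holds the weights w_ji of one hidden neuron j, and each row of `w_output` holds the weights w_kj of one output neuron k.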

    MULTILAYER PERCEPTRON ARCHITECTURE:

    The network diagram shown above is a fully connected, three-layer, feed-forward perceptron neural network. Fully connected means that the output from each input and hidden neuron is distributed to all of the neurons in the following layer. Feed-forward means that the values only move from input to hidden to output layers; no values are fed back to earlier layers (a recurrent network allows values to be fed backward).

    All neural networks have an input layer and an output layer, but the number of hidden layers may vary. Here is a diagram of a perceptron network with two hidden layers and four total layers:

    When there is more than one hidden layer, the output from one hidden layer is fed into the next hidden layer, and separate weights are applied to the sum going into each layer.

    Training Multilayer Perceptron Networks

    The goal of the training process is to find the set of weight values that will cause the output from the neural network to match the actual target values as closely as possible. There are several issues involved in designing and training a multilayer perceptron network:

    - Selecting how many hidden layers to use in the network.

    - Deciding how many neurons to use in each hidden layer.

    - Finding a globally optimal solution that avoids local minima.

    This picture is highly simplified because it represents only a single weight value (on the horizontal axis). With a typical neural network, you would have a rough, 200-dimensional surface with many local valleys.

    Optimization methods such as steepest descent and conjugate gradient are highly susceptible to finding local minima if they begin the search in a valley near a local minimum. They have no ability to see the big picture and find the global minimum.

    Several methods have been tried to avoid local minima. The simplest is just to try a number of random starting points and use the one with the best value. A more sophisticated technique called simulated annealing improves on this by trying widely separated random values and then gradually reducing ("cooling") the random jumps in the hope that the location is getting closer to the global minimum.

    DTREG uses the Nguyen-Widrow algorithm to select the initial range of starting weight values. It then uses the conjugate gradient algorithm to optimize the weights. Conjugate gradient usually finds the optimum weights quickly, but there is no guarantee that the weight values it finds are globally optimal. So it is useful to allow DTREG to try the optimization multiple times with different sets of initial random weight values. The number of tries allowed is specified on the Multilayer Perceptron property page.
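    The simplest remedy mentioned above, trying several random starting points and keeping the best result, can be sketched on a one-weight error curve. Everything specific here is an assumption made for illustration: the toy error function w⁴ − 3w² + w (which has a local minimum near w ≈ 1.13 and a lower, global minimum near w ≈ −1.30), plain steepest descent as the optimizer, and the step size. It does not reproduce what DTREG does internally.

```python
import random


def error(w):
    """Toy error curve with one local and one global minimum (an assumption)."""
    return w ** 4 - 3 * w ** 2 + w


def gradient(w):
    return 4 * w ** 3 - 6 * w + 1


def gradient_descent(w, learning_rate=0.01, steps=2000):
    """Plain steepest descent; it settles in whichever valley it starts in."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


def multi_start(starts):
    """Run descent from every starting point and keep the lowest-error result."""
    finals = [gradient_descent(w0) for w0 in starts]
    return min(finals, key=error)


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best_w = multi_start(starts)
```

    A single run started at w = 1.5 ends in the shallower valley on the right; a set of starts that includes at least one point to the left of the hump finds the deeper valley.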

    MADALINE :

    MADALINE (Many ADALINE) is a three-layer (input, hidden, output), fully connected, feed-forward artificial neural network architecture for classification that uses ADALINE units in its hidden and output layers, i.e. its activation function is the sign function. The three-layer network uses memistors. MADALINE networks cannot be trained using backpropagation because the sign function is not differentiable, so three different training algorithms have been suggested for them, called Rule I, Rule II and Rule III. The first of these dates back to 1962 and cannot adapt the weights of the hidden-output connection. The second training algorithm improved on Rule I and was described in 1988. The third "Rule" applied to a modified network with sigmoid activations instead of signum; it was later found to be equivalent to backpropagation.

    The Rule II training algorithm is based on a principle called "minimal disturbance". It proceeds by looping over training examples; then, for each example, it:

    - finds the hidden layer unit (ADALINE classifier) with the lowest confidence in its prediction,

    - tentatively flips the sign of the unit,

    - accepts or rejects the change based on whether the network's error is reduced,

    - stops when the error is zero.

    Additionally, when flipping single units' signs does not drive the error to zero for a particular

    example, the training algorithm starts flipping pairs of units' signs, then triples of units, etc.
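    The "minimal disturbance" trial for one training example can be sketched as follows. This is a simplified illustration under stated assumptions, not the full Rule II procedure: confidence is taken to be the absolute value of a unit's weighted sum, only single-unit flips are tried, and the weight adaptation that would make an accepted flip permanent is left out. All function names and the tiny two-unit network are hypothetical.

```python
def sign(value):
    return 1 if value >= 0 else -1


def net_input(weights, bias, inputs):
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def network_output(hidden_outputs, out_weights, out_bias):
    return sign(net_input(out_weights, out_bias, hidden_outputs))


def trial_flip(x, target, hidden_units, out_weights, out_bias):
    """Return the index of the least-confident hidden unit whose sign flip
    makes the network output equal the target, or None if no flip is
    needed or no single flip helps."""
    nets = [net_input(w, b, x) for w, b in hidden_units]
    outputs = [sign(n) for n in nets]
    if network_output(outputs, out_weights, out_bias) == target:
        return None  # this example is already classified correctly
    # Try units in order of increasing confidence |net| (assumed measure).
    for j in sorted(range(len(nets)), key=lambda j: abs(nets[j])):
        trial = list(outputs)
        trial[j] = -trial[j]  # tentatively flip the unit's sign
        if network_output(trial, out_weights, out_bias) == target:
            return j  # accept: the error on this example is removed
    return None


# Hypothetical network: two hidden ADALINE units and one output unit.
hidden_units = [((1.0, 1.0), -0.1), ((1.0, -1.0), 2.0)]
flipped = trial_flip((0.2, 0.0), -1, hidden_units, (1.0, 1.0), -1.5)
```

    In this example the first hidden unit has the weighted sum closest to zero, so it is tried first, and flipping it corrects the output.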

    DIFFERENCE BETWEEN HUMAN BRAIN AND ANN:

    BACK PROPAGATION:

    Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function.

    Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.
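    To make the layer-by-layer use of the chain rule concrete, here is a minimal sketch of one backward pass through a tiny network with one hidden layer. The network shape (two inputs, two hidden neurons, one output), the logistic activation, the squared-error loss, the absence of bias terms and all names are assumptions chosen to keep the example short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_output):
    """w_hidden: one weight row per hidden neuron; w_output: one weight per
    hidden neuron feeding the single output neuron (no biases, for brevity)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * hj for w, hj in zip(w_output, h)))
    return h, y


def loss(x, target, w_hidden, w_output):
    _, y = forward(x, w_hidden, w_output)
    return 0.5 * (target - y) ** 2


def backprop(x, target, w_hidden, w_output):
    """Gradient of the loss with respect to every weight, via the chain rule."""
    h, y = forward(x, w_hidden, w_output)
    # Output layer: dE/dnet = (y - t) * y * (1 - y)
    delta_out = (y - target) * y * (1.0 - y)
    grad_output = [delta_out * hj for hj in h]
    # Hidden layer: propagate delta_out backward through each output weight.
    delta_hidden = [delta_out * w_output[j] * h[j] * (1.0 - h[j])
                    for j in range(len(h))]
    grad_hidden = [[dj * xi for xi in x] for dj in delta_hidden]
    return grad_hidden, grad_output


x, target = [0.5, -0.3], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_output = [0.7, -0.5]
grad_hidden, grad_output = backprop(x, target, w_hidden, w_output)
```

    An optimizer such as gradient descent would then subtract a small multiple of these gradients from the corresponding weights.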

    DERIVATION OF ERROR BACK PROPAGATION ALGORITHM (EBPA) :

    Derivation-

    Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is:

    $$E = \tfrac{1}{2}(t - y)^2,$$

    $E$ = the squared error

    $t$ = target output

    $y$ = actual output of the output neuron

    (The factor of $\tfrac{1}{2}$ is included to cancel the exponent when differentiating.) Therefore the error, $E$, depends on the output $y$. However, the output $y$ depends on the weighted sum of all its inputs:

    $$y = \sum_{i=1}^{n} w_i x_i$$

    $n$ = the number of input units to the neuron

    $w_i$ = the $i$-th weight

    $x_i$ = the $i$-th input value to the neuron

    The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of the inputs). In general, a non-linear, differentiable activation function, $\varphi$, is used. Thus, more correctly:

    $$y = \varphi(\mathrm{net}), \qquad \mathrm{net} = \sum_{i=1}^{n} w_i x_i$$

    This lays the groundwork for calculating the partial derivative of the error with respect to a weight $w_i$ using the chain rule:

    $$\frac{\partial E}{\partial w_i} = \frac{dE}{dy}\,\frac{dy}{d\,\mathrm{net}}\,\frac{\partial\,\mathrm{net}}{\partial w_i}$$

    $\frac{\partial E}{\partial w_i}$ = How the error changes when the weights are changed

    $\frac{dE}{dy}$ = How the error changes when the output is changed

    $\frac{dy}{d\,\mathrm{net}}$ = How the output changes when the weighted sum changes

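    $\frac{\partial\,\mathrm{net}}{\partial w_i}$ = How the weighted sum changes as the weights change

    Since the weighted sum $\mathrm{net}$ is just the sum over all products $w_i x_i$, the partial derivative of the sum with respect to a weight $w_i$ is just the corresponding input $x_i$. Similarly, the partial derivative of the sum with respect to an input value $x_i$ is just the weight $w_i$:

    $$\frac{\partial\,\mathrm{net}}{\partial w_i} = x_i, \qquad \frac{\partial\,\mathrm{net}}{\partial x_i} = w_i$$

    The derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is simply the derivative of the activation function $\varphi$:

    $$\frac{dy}{d\,\mathrm{net}} = \frac{d\varphi}{d\,\mathrm{net}}$$

    This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:

    $$y = \frac{1}{1 + e^{-z}}$$

    which has a nice derivative of:

    $$\frac{dy}{dz} = y(1 - y)$$

    For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is the same as the derivative of the logistic function:

    $$\frac{dy}{d\,\mathrm{net}} = y(1 - y)$$

    Finally, the derivative of the error $E$ with respect to the output $y$ is:

    $$\frac{dE}{dy} = \frac{d}{dy}\,\tfrac{1}{2}(t - y)^2 = y - t$$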

    Putting it all together:

    $$\frac{\partial E}{\partial w_i} = \frac{dE}{dy}\,\frac{dy}{d\,\mathrm{net}}\,\frac{\partial\,\mathrm{net}}{\partial w_i} = (y - t)\,y(1 - y)\,x_i$$

    If one were to use a different activation function, the only difference would be that the $y(1 - y)$ term is replaced by the derivative of the newly chosen activation function.

    To update the weight $w_i$ using gradient descent, one must choose a learning rate, $\alpha$. The change in weight after learning is then the product of the learning rate and the gradient, taken with a negative sign so that the error decreases:

    $$\Delta w_i = -\alpha\,\frac{\partial E}{\partial w_i}$$

    For a linear neuron, the derivative of the activation function is 1, which yields:

    $$\Delta w_i = \alpha\,(t - y)\,x_i$$

    This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule. In backpropagation and perceptron learning, when the output $y$ matches the desired output $t$, the change in weight $\Delta w_i$ is zero, which is exactly what is desired.
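    A small numeric sketch may help tie the pieces of the derivation together for a single logistic neuron: the gradient of the squared error with respect to each weight is (output − target) × output × (1 − output) × input, and one gradient-descent step moves each weight against that gradient. The starting weights, inputs, target and learning rate below are arbitrary values assumed for the illustration, and the function names are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs):
    net = sum(w * x for w, x in zip(weights, inputs))  # weighted sum
    return logistic(net)


def squared_error(weights, inputs, target):
    return 0.5 * (target - neuron_output(weights, inputs)) ** 2


def error_gradient(weights, inputs, target):
    """dE/dw_i = (y - t) * y * (1 - y) * x_i for a logistic neuron."""
    y = neuron_output(weights, inputs)
    return [(y - target) * y * (1.0 - y) * x for x in inputs]


def update(weights, inputs, target, learning_rate=0.5):
    """One gradient-descent step: w_i <- w_i - learning_rate * dE/dw_i."""
    grad = error_gradient(weights, inputs, target)
    return [w - learning_rate * g for w, g in zip(weights, grad)]


weights, inputs, target = [0.5, -0.5], [1.0, 2.0], 1.0
new_weights = update(weights, inputs, target)
error_before = squared_error(weights, inputs, target)
error_after = squared_error(new_weights, inputs, target)
```

    With these values the output starts below the target, so both weights increase and the squared error after the step is smaller than before it.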

    MOMENTUM:

    Empirical evidence shows that the use of a term called momentum in the backpropagation algorithm can be helpful in speeding up convergence and avoiding local minima.

    The idea behind momentum is to stabilize the weight change by making non-radical revisions, combining the gradient-decreasing term with a fraction of the previous weight change:

    $$\Delta w(t) = -\alpha\,\frac{\partial E_e}{\partial w(t)} + a\,\Delta w(t-1)$$

    where $\alpha$ is the learning rate, $E_e$ is the error being minimized, $a$ is the momentum coefficient, taken in the range $0 \le a \le 0.9$, and $t$ is the index of the current weight change.

    This gives the system a certain amount of inertia, since the weight vector will tend to continue moving in the same direction unless opposed by the gradient term.

    The momentum has the following effects:

    - it smooths the weight changes and suppresses cross-stitching, that is, it cancels side-to-side oscillations across the error valley;

    - when all weight changes are in the same direction, the momentum amplifies the learning rate, causing faster convergence;

    - it makes it possible to escape from small local minima on the error surface.

    The hope is that the momentum will allow a larger learning rate and that this will speed convergence and avoid local minima. On the other hand, a learning rate of 1 with no momentum will be much faster when no problem with local minima or non-convergence is encountered.
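    The effect of adding a fraction of the previous weight change can be seen in a two-step sketch on a one-weight error function. The error function E = ½w² (whose gradient is simply w), the learning rate of 0.1, the starting weight and the function name are assumptions made for the illustration; `a` is the momentum coefficient.

```python
def momentum_steps(w, gradient, learning_rate=0.1, a=0.9, steps=2):
    """Apply dw(t) = -learning_rate * gradient(w) + a * dw(t-1) repeatedly
    and return the weight after each step."""
    previous_change = 0.0
    history = []
    for _ in range(steps):
        change = -learning_rate * gradient(w) + a * previous_change
        w += change
        previous_change = change
        history.append(w)
    return history


def grad(w):
    """Gradient of the assumed error function E = 0.5 * w**2."""
    return w


with_momentum = momentum_steps(1.0, grad, a=0.9)  # about [0.9, 0.72]
without_momentum = momentum_steps(1.0, grad, a=0.0)  # about [0.9, 0.81]
```

    Both runs take the same first step, but because the second step points in the same direction, the momentum term adds to it and the weight moves further toward the minimum at zero.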

    LIMITATIONS OF NEURAL NETWORK :

    There are many advantages and limitations to neural network analysis, and to discuss this subject properly we would have to look at each individual type of network, which isn't necessary for this general discussion. In reference to backpropagational networks, however, there are some specific issues potential users should be aware of.

    Backpropagational neural networks (and many other types of networks) are in a sense the ultimate "black boxes". Apart from defining the general architecture of a network and perhaps initially seeding it with random numbers, the user has no other role than to feed it input, watch it train and await the output. In fact, it has been said that with backpropagation, "you almost don't know what you're doing". Some freely available software packages (NevProp, bp, Mactivation) do allow the user to sample the network's "progress" at regular time intervals, but the learning itself progresses on its own. The final product of this activity is a trained network that provides no equations or coefficients defining a relationship (as in regression) beyond its own internal mathematics. The network "is" the final equation of the relationship.

    Backpropagational networks also tend to be slower to train than other types of networks and sometimes require thousands of epochs. If run on a truly parallel computer system this issue is not really a problem, but if the BPNN is being simulated on a standard serial machine (i.e. a single SPARC, Mac or PC), training can take some time. This is because the machine's CPU must compute the function of each node and connection separately, which can be problematic in very large networks with a large amount of data. However, the speed of most current machines is such that this is typically not much of an issue.