
An Introduction To Neural Networks

Neural Networks

In order to combine the powers of the machine and the human brain, Neural Networks try to mimic the structure and function of our nervous system.

Biological Motivation #1

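[Figure: a biological neuron with dendrites, axon, and synapses; in the network analogy, neurons correspond to nodes and synapses to weights (positive or negative)]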

Biological Neural Systems

  Neuron switching time: > $10^{-3}$ secs
  Connections (synapses) per neuron: ~$10^4$–$10^5$
  Number of neurons in the human brain: ~$10^{10}$
  Face recognition: 0.1 secs
  High degree of parallel computation
  Distributed representations

Biological Motivation #2

Appropriate Problem Domains for Neural Network (backprop) Learning

  Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
  Output is discrete or real-valued
  Output is a vector of values
  Possibly noisy data
  Form of target function is unknown
  Fast evaluation may be required
  Humans do not need to interpret the results (black box model)

[Figure: a single perceptron with inputs x1 to x4, weights w1 to w4, threshold T, and output y]

If $w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge T$, then the output y of the perceptron is 1. Otherwise, the output y is 0.
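The threshold rule can be written as a few lines of Python. This is a minimal sketch; the function name `perceptron_output` and the example weights, inputs, and threshold are hypothetical and not taken from the slides.

```python
def perceptron_output(weights, inputs, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Hypothetical values for illustration: four inputs, four weights, T = 1.0.
example = perceptron_output([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0], 1.0)
```

Here two active inputs contribute 0.5 each, so the weighted sum reaches the threshold and the unit fires.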

A Single Perceptron

Linearly Separable

x1  x2 | x1 and x2 | x1 or x2 | x1 xor x2
 0   0 |     0     |    0     |     0
 0   1 |     0     |    1     |     1
 1   0 |     0     |    1     |     1
 1   1 |     1     |    1     |     0

[Figure: each of the three functions plotted in the (x1, x2) plane]
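One way to get a feel for linear separability is to search for a single perceptron that reproduces each truth table. The sketch below tries every (w1, w2, T) on a small grid; the grid and the helper name `find_perceptron` are assumptions made for illustration, and a failed grid search is only suggestive, not a proof. For XOR, however, it is known that no weights and threshold exist at all.

```python
from itertools import product

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def find_perceptron(table, grid):
    """Return the first (w1, w2, T) on the grid that matches the table, or None."""
    for w1, w2, t in product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 >= t else 0) == y
               for (x1, x2), y in table.items()):
            return (w1, w2, t)
    return None

grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0

and_solution = find_perceptron(AND, grid)
or_solution = find_perceptron(OR, grid)
xor_solution = find_perceptron(XOR, grid)  # no single perceptron computes XOR
```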

Perceptrons

  1969 book by Marvin Minsky and Seymour Papert
  The problem is that perceptrons only work for classification problems that are linearly separable
  Insufficiently expressive
  The authors called multilayer networks an “important research problem” to investigate, although they were pessimistic about their value

Perceptrons: another view

[Figure: perceptron computing AND with inputs x1 and x2, weights W1 = 1 and W2 = 1, threshold T = 2, and output y]

Inputs are either 0 or 1

Output is 1 only if all inputs are 1

[Figure: perceptron computing AND with an additional input x0: W0 = ?, W1 = ?, W2 = ?]

[Figure: perceptron computing AND with an additional input x0: W0 = -0.8, W1 = 0.5, W2 = 0.5]
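The weights W0 = -0.8, W1 = 0.5, W2 = 0.5 can be checked against the AND truth table. The sketch below assumes that x0 is a constant input fixed at 1 and that the unit fires when the weighted sum is greater than 0; neither detail is stated on this slide, so both are assumptions.

```python
def perceptron_with_bias(weights, inputs):
    """Fire when the weighted sum exceeds 0; weights[0] multiplies x0 = 1 (assumed)."""
    net = weights[0] * 1 + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else 0

and_weights = [-0.8, 0.5, 0.5]  # W0, W1, W2 from the slide
and_table = {(x1, x2): perceptron_with_bias(and_weights, (x1, x2))
             for x1 in (0, 1) for x2 in (0, 1)}
```

Only the input (1, 1) gives a positive sum (about 0.2); a single active input gives about -0.3 and no active input gives -0.8.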

Training

 Train a perceptron to respond to certain inputs with certain desired outputs

 After training, the perceptron should give reasonable outputs for any input

  If it wasn’t trained for that input, it should try to find the best possible output depending on how it was trained

Perceptron Training Rule

  Begin with random weights
  Apply the perceptron to each training example (each pass through examples is called an epoch)
  If it misclassifies an example, modify the weights
  Continue until the perceptron classifies all training examples correctly

Modifying the Weights

$w_i \leftarrow w_i + \Delta w_i$

$\Delta w_i = \text{LearningRate} \cdot (\text{DesiredOutput} - \text{ActualOutput}) \cdot x_i$

LearningRate: usually set to some small value like 0.1. It moderates the degree to which the weights are changed at each step and keeps the update from overshooting.

(DesiredOutput − ActualOutput): the difference between what we wanted the output to be and what it actually was. If the desired and actual outputs are equal, then this is 0 and the weight won’t change.

$x_i$: the value of the input itself. If this value was 0, then it had no impact on the error, and so its weight shouldn’t be adjusted.
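A single application of the update rule might look as follows in Python. This is a sketch only; the function name `update_weights` and the example weights and inputs are hypothetical.

```python
def update_weights(weights, inputs, desired, actual, learning_rate=0.1):
    """Apply w_i <- w_i + learning_rate * (desired - actual) * x_i to every weight."""
    error = desired - actual
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

# Hypothetical example: the perceptron output 0 where 1 was desired.
new_weights = update_weights([0.2, -0.4, 0.1], [1, 0, 1], desired=1, actual=0)
```

The first and third weights each grow by 0.1, while the second weight is untouched because its input was 0.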

EXAMPLE

  Begin with random weights
  Apply the perceptron to each training example (each pass through examples is called an epoch)
  If it misclassifies an example, modify the weights:
    $w_i \leftarrow w_i + \text{LearningRate} \cdot (\text{DesiredOutput} - \text{ActualOutput}) \cdot x_i$
  Continue until the perceptron classifies all training examples correctly
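The training steps listed here can be sketched as a short loop, shown below learning AND. Several details are assumptions not given in the slides: a bias weight on a constant input x0 = 1, firing when the sum is greater than 0, initial weights drawn from [-0.5, 0.5] with a fixed seed, and a cap of 1000 epochs as a safety stop.

```python
import random

def predict(weights, inputs):
    """weights[0] is a bias weight on a constant input x0 = 1 (assumed)."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else 0

def train_perceptron(examples, learning_rate=0.1, max_epochs=1000, seed=0):
    """Repeat epochs of the perceptron rule until no example is misclassified."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for inputs, desired in examples:
            actual = predict(weights, inputs)
            if actual != desired:
                mistakes += 1
                error = desired - actual
                weights[0] += learning_rate * error  # bias input x0 = 1
                for i, x in enumerate(inputs, start=1):
                    weights[i] += learning_rate * error * x
        if mistakes == 0:
            return weights, epoch
    return weights, max_epochs

and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
trained_weights, epochs_used = train_perceptron(and_examples)
```

Because AND is linearly separable, the loop ends with every training example classified correctly; on a non-separable problem such as XOR it would stop only at the epoch cap.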

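Gradient Descent Learning Rule

  Consider a linear unit without threshold and with continuous output o (not just 0, 1):
    $o = w_0 + w_1 x_1 + \dots + w_n x_n$
  Train the $w_i$'s such that they minimize the squared error
    $E[w_0, \dots, w_n] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
  where D is the set of training examples, $t_d$ is the target output for example d, and $o_d$ is the unit's output for example d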
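The linear unit and its squared error translate directly into code. In this sketch the function names and the three training examples are hypothetical; the examples were generated from t = 1 + 2·x1 so that the weights [1, 2] give zero error.

```python
def linear_output(weights, inputs):
    """o = w0 + w1*x1 + ... + wn*xn, with weights[0] playing the role of w0."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))

def squared_error(weights, examples):
    """E = 1/2 * sum over examples d of (t_d - o_d)^2."""
    return 0.5 * sum((t - linear_output(weights, x)) ** 2 for x, t in examples)

# Hypothetical data generated by t = 1 + 2*x1, so weights [1, 2] fit exactly.
examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
error_at_fit = squared_error([1.0, 2.0], examples)
error_at_zero = squared_error([0.0, 0.0], examples)
```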

Gradient Descent

[Figure: a gradient descent step in weight space, from (w1, w2) to (w1 + Δw1, w2 + Δw2)]

Incremental (Stochastic) Gradient Descent

  Batch mode: gradient descent over the entire data D
    $w \leftarrow w - \eta \nabla E_D[w]$, where $E_D[w] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
  Incremental mode: gradient descent over individual training examples d
    $w \leftarrow w - \eta \nabla E_d[w]$, where $E_d[w] = \frac{1}{2} (t_d - o_d)^2$

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough
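The two modes differ only in when the weights change: once per pass over D, or after every example. The sketch below applies both to a linear unit; the function names, the toy data (t = 1 + 2·x1), η = 0.05, and the 2000 passes are all assumptions chosen for illustration.

```python
def linear_output(weights, inputs):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))

def batch_step(weights, examples, eta):
    """One batch update: sum the gradient over all of D, then change w once."""
    grad = [0.0] * len(weights)
    for x, t in examples:
        delta = t - linear_output(weights, x)
        grad[0] -= delta                  # dE/dw0 = -sum_d (t_d - o_d)
        for i, xi in enumerate(x, start=1):
            grad[i] -= delta * xi         # dE/dwi = -sum_d (t_d - o_d) * x_id
    return [w - eta * g for w, g in zip(weights, grad)]

def incremental_epoch(weights, examples, eta):
    """One pass over D, updating w after every individual example d."""
    weights = list(weights)
    for x, t in examples:
        delta = t - linear_output(weights, x)
        weights[0] += eta * delta
        for i, xi in enumerate(x, start=1):
            weights[i] += eta * delta * xi
    return weights

examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]  # hypothetical data
w_batch = [0.0, 0.0]
w_incr = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_step(w_batch, examples, eta=0.05)
    w_incr = incremental_epoch(w_incr, examples, eta=0.05)
```

On this noise-free data both variants end up at essentially the same weights, close to [1, 2].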

Comparison: Perceptron and Gradient Descent Rule

Perceptron learning rule is guaranteed to succeed (perfectly classifying training examples) if
  Training examples are linearly separable
  Sufficiently small learning rate η

Linear unit training rules using gradient descent
  Guaranteed to converge to hypothesis with minimum squared error
  Given sufficiently small learning rate η
  Even when training data contains noise
  Even when training data not separable by H

Restaurant Problem: Will I wait for a table?

  Alternate – whether there is a suitable alternative restaurant nearby
  Bar – whether the restaurant has a comfortable bar area to wait in
  Fri/Sat – true on Fridays and Saturdays
  Hungry – whether we are hungry
  Patrons – how many people are in the restaurant (None, Some, or Full)
  Price – the restaurant's price range ($, $$, $$$)
  Raining – whether it is raining outside
  Reservation – whether we made a reservation
  Type – the kind of restaurant (French, Italian, Thai, or Burger)
  WaitEstimate – the wait estimated by the host (0-10 minutes, 10-30, 30-60, > 60)

Multilayer Network

A compromise function

  Perceptron: $output = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$

  Linear: $output = net = \sum_{i=0}^{n} w_i x_i$

  Sigmoid (Logistic): $output = \sigma(net) = \frac{1}{1 + e^{-net}}$
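The three unit types share the same weighted sum and differ only in what they do with it. A minimal Python sketch follows; the function names are hypothetical, and the input list is assumed to start with a constant x0 = 1 so that w0 acts as a bias.

```python
import math

def net_input(weights, inputs):
    """net = sum over i = 0..n of w_i * x_i, where inputs[0] is x0 = 1 (assumed)."""
    return sum(w * x for w, x in zip(weights, inputs))

def perceptron_unit(net):
    """Hard threshold: 1 if net > 0, else 0."""
    return 1 if net > 0 else 0

def linear_unit(net):
    """Identity: the output is the weighted sum itself."""
    return net

def sigmoid_unit(net):
    """Logistic function: a smooth, differentiable squashing of net into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))
```

The sigmoid sits between the other two: it is bounded like the perceptron output but differentiable like the linear unit, which is what gradient descent needs.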

Learning in Multilayer Networks

  Same method as for Perceptrons
  Example inputs are presented to the network
  If the network computes an output that matches the desired, nothing is done
  If there is an error, then the weights are adjusted to balance the error
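For a network of sigmoid units, the weight adjustment is usually done with error backpropagation. The sketch below performs one update on one example for a network with a single hidden layer, using the squared error E = ½(target − output)². The network size, starting weights, learning rate, and function names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    """w_hidden[j] = [bias, w_j1, ..., w_jn]; w_out = [bias, v_1, ..., v_H]."""
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
              for w in w_hidden]
    output = sigmoid(w_out[0] + sum(v * h for v, h in zip(w_out[1:], hidden)))
    return hidden, output

def backprop_step(w_hidden, w_out, x, target, eta):
    """One gradient descent update on a single example."""
    hidden, output = forward(w_hidden, w_out, x)
    # Error term of the output unit: sigmoid derivative times the error.
    delta_out = output * (1 - output) * (target - output)
    # Error terms of the hidden units: output error propagated back through w_out.
    delta_hidden = [h * (1 - h) * v * delta_out
                    for h, v in zip(hidden, w_out[1:])]
    new_w_out = [w_out[0] + eta * delta_out] + [
        v + eta * delta_out * h for v, h in zip(w_out[1:], hidden)]
    new_w_hidden = [
        [w[0] + eta * d] + [wi + eta * d * xi for wi, xi in zip(w[1:], x)]
        for w, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out

# Hypothetical starting weights: 2 inputs, 2 hidden units, 1 output.
w_hidden = [[0.1, 0.4, -0.3], [-0.2, 0.2, 0.5]]
w_out = [0.05, 0.3, -0.6]
x, target = (1.0, 0.0), 1.0
_, before = forward(w_hidden, w_out, x)
w_hidden2, w_out2 = backprop_step(w_hidden, w_out, x, target, eta=0.5)
_, after = forward(w_hidden2, w_out2, x)
```

After the step, the output for this example moves toward the target; repeating such steps over many examples and epochs is the full training procedure.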

BackPropagation Learning

Alternative Error Measures

Neural Network Model

[Figure: a feed-forward network. Inputs (independent variables): Age = 34, Gender = 2, Stage = 4. Weights lead from the inputs into a hidden layer of summation (Σ) units, and further weights lead from the hidden layer to an output unit. Output (dependent variable, prediction): 0.6, the “Probability of being Alive”. Edge weights shown: .6, .5, .8, .2, .1, .3, .7, .2, .4, .2]

Getting an answer from a NN

[Figure: the same network evaluated step by step, highlighting the weights into each hidden unit in turn and then the full network producing the output 0.6]
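Getting an answer from a trained network is a forward pass: each hidden unit combines the weighted inputs, and the output unit combines the weighted hidden values. The sketch below is only illustrative. The inputs come from the slide, but which weight sits on which edge cannot be recovered from it, so the weight assignment is hypothetical, as is the use of sigmoid units; the result is not meant to reproduce the 0.6 in the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(inputs, hidden_weights, output_weights):
    """Feed inputs through one hidden layer of sigmoid units to one sigmoid output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

inputs = [34, 2, 4]                   # Age, Gender, Stage from the slide
# ASSUMPTION: the mapping of weights to edges is hypothetical, chosen only
# to make the sketch runnable; it is not read off the figure.
hidden_weights = [[0.6, 0.1, 0.8],    # weights into hidden unit 1
                  [0.5, 0.2, 0.3]]    # weights into hidden unit 2
output_weights = [0.4, 0.2]           # weights from the hidden units to the output
probability = predict(inputs, hidden_weights, output_weights)
```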

Minimizing the Error

[Figure: error surface plotted against a weight w, showing the move from w initial (initial error) to w trained (final error); labels: positive change, negative derivative, local minimum]

Representational Power (FFNN)

  Boolean functions: 2 layers of units
  Continuous functions: 2 layers of units (sigmoid then linear)
  Arbitrary functions: 3 layers of units (sigmoids then linear)
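As a small instance of the Boolean case, XOR can be computed with two layers of threshold units even though no single perceptron can compute it. The weights and thresholds below are hand-chosen for illustration and are not taken from the slides.

```python
def step(weighted_sum, threshold):
    return 1 if weighted_sum >= threshold else 0

def xor_network(x1, x2):
    """XOR from two layers of threshold units (hand-chosen, hypothetical weights)."""
    h_or = step(1 * x1 + 1 * x2, 1)        # hidden unit 1 computes x1 OR x2
    h_and = step(1 * x1 + 1 * x2, 2)       # hidden unit 2 computes x1 AND x2
    return step(1 * h_or - 1 * h_and, 1)   # output fires for OR but not AND

xor_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```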

Hypothesis Space and Inductive Bias

Hidden Layer Representations

Overfitting

Handwritten Character Recognition

  Le Cun et al. (1989) implemented a neural network to read zip codes on hand-addressed envelopes, for sorting purposes
  To identify the digits, the network uses a 16x16 array of pixels as input, 3 hidden layers, and a distributed output encoding with 10 output units for digits 0-9
  256 input nodes, 10 output units (1 for the likelihood of each digit)
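The input and output encodings described here can be sketched without the network itself: the 16x16 image is flattened into 256 inputs, and the predicted digit is read off the 10 output units. Picking the unit with the highest activation is a common convention and an assumption here; the function names and the example activations are hypothetical.

```python
def flatten_image(image):
    """Turn a 16x16 grid of pixel values into a list of 256 network inputs."""
    return [pixel for row in image for pixel in row]

def decode_digit(output_activations):
    """Pick the digit whose output unit has the highest activation."""
    return max(range(len(output_activations)), key=lambda d: output_activations[d])

blank_image = [[0.0] * 16 for _ in range(16)]
inputs = flatten_image(blank_image)
# Hypothetical output activations, one per digit 0-9.
digit = decode_digit([0.01, 0.02, 0.05, 0.9, 0.03, 0.01, 0.0, 0.1, 0.02, 0.04])
```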

ALVINN Drives 70 mph on a public highway

  Camera image: 30x32 pixels as inputs
  4 hidden units
  30 outputs for steering

[Figure: the 30x32 weights into one of the four hidden units]

Neural Nets for Face Recognition

Learning Hidden Unit Weights

Interpreting Satellite Imagery for Automated Weather Forecasting

Summary

  Perceptrons (one-layer networks) are insufficiently expressive
  Multilayer networks are sufficiently expressive and can be trained by error backpropagation
  Many applications, including speech, driving, handwritten character recognition, and fraud detection
