AN INTRODUCTION TO NEURAL NETWORKS
Scott Kuindersma
November 12, 2009
SUPERVISED LEARNING
• We are given some training data: pairs (x1, y1), ..., (xn, yn)
• We must learn a function f such that f(xi) ≈ yi
• If y is discrete, we call it classification
• If it is continuous, we call it regression
ARTIFICIAL NEURAL NETWORKS
• Artificial neural networks are one technique that can be used to solve supervised learning problems
• Very loosely inspired by biological neural networks
• real neural networks are much more complicated, e.g. using spike timing to encode information
• Neural networks consist of layers of interconnected units
PERCEPTRON UNIT
• The simplest computational neural unit is called a perceptron
• The input of a perceptron is a real vector x
• The output is either 1 or -1
• Therefore, a perceptron can be applied to binary classification problems
• Whether or not it will be useful depends on the problem... more on this later...
PERCEPTRON UNIT [MITCHELL 1997]
SIGN FUNCTION
sign(y) = 1 if y > 0, and -1 otherwise
EXAMPLE
• Suppose we have a perceptron with 3 weights:
• On input x1 = 0.5, x2 = 0.0, the perceptron outputs:
• where x0 = 1
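The slide's specific weight values and the resulting computation did not survive the transcript. As a sketch, assume the hypothetical weights w0 = -0.3, w1 = 0.5, w2 = 0.5 (the same values the later AND example uses); the forward pass then looks like:

```python
# Perceptron forward pass: o(x) = sign(w . x), with bias input x0 = 1.
# The weights here are assumptions (the slide's originals were lost);
# they match the AND example that appears later in the deck.

def sign(y):
    """Threshold activation: +1 if y > 0, else -1."""
    return 1 if y > 0 else -1

def perceptron(w, x):
    """Compute sign(w0*1 + w1*x1 + ... + wn*xn)."""
    total = w[0]  # bias term, multiplied by x0 = 1
    for wi, xi in zip(w[1:], x):
        total += wi * xi
    return sign(total)

w = [-0.3, 0.5, 0.5]               # hypothetical weights
print(perceptron(w, [0.5, 0.0]))   # -0.3 + 0.25 + 0.0 = -0.05 -> -1
```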
LEARNING RULE
• Now that we know how to calculate the output of a perceptron, we would like to find a way to modify the weights to produce output that matches the training data
• This is accomplished via the perceptron learning rule: wi ← wi + α (t − o) xi
• for an input/target pair (x, t) where, again, x0 = 1
• Loop through the training data until (nearly) all examples are classified correctly
MATLAB EXAMPLE
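The original MATLAB demo is not in the transcript. As a hedged stand-in, here is a minimal Python sketch of the perceptron learning rule, looping over the data until every example is classified correctly; the dataset and learning rate α are assumptions:

```python
# Perceptron learning rule: wi <- wi + alpha * (t - o) * xi, with x0 = 1.
# Dataset and alpha are illustrative assumptions, not from the slides.

def train_perceptron(data, alpha=0.1, max_epochs=100):
    """data: list of (x, t) pairs with x a list of inputs, t in {-1, +1}."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight (x0 = 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            if o != t:
                errors += 1
                w[0] += alpha * (t - o)           # update via x0 = 1
                for i, xi in enumerate(x):
                    w[i + 1] += alpha * (t - o) * xi
        if errors == 0:              # all examples classified correctly
            break
    return w

# Hypothetical linearly separable data: class is the sign of x1 + x2.
data = [([1, 1], 1), ([2, 1], 1), ([-1, -1], -1), ([-2, -1], -1)]
w = train_perceptron(data)
```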
LIMITATIONS OF THE PERCEPTRON MODEL
• Can only distinguish between linearly separable classes of inputs
• Consider the following data:
PERCEPTRONS AND BOOLEAN FUNCTIONS
• Suppose we let the values (1,-1) correspond to true and false, respectively
• Can we describe a perceptron capable of computing the AND function? What about OR? NAND? NOR? XOR?
• Let’s think about it geometrically
BOOLEAN FUNCS CONT’D
[Plots: perceptron decision boundaries for AND, OR, NAND, and NOR]
EXAMPLE: AND
• Let pAND(x1,x2) be the output of the perceptron with weights w0 = -0.3, w1 = 0.5, w2 = 0.5 on input x1, x2
x1   x2   pAND(x1, x2)
-1   -1   -1
-1    1   -1
 1   -1   -1
 1    1    1
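The table above can be checked directly by evaluating sign(-0.3 + 0.5*x1 + 0.5*x2) at each corner:

```python
# Verify the AND perceptron from the slide: weights w0 = -0.3,
# w1 = 0.5, w2 = 0.5, with 1 for true and -1 for false.

def p_and(x1, x2):
    y = -0.3 + 0.5 * x1 + 0.5 * x2
    return 1 if y > 0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, p_and(x1, x2))  # matches the truth table above
```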
XOR
• XOR cannot be represented by a perceptron, but it can be represented by a small network of perceptrons, e.g.,
[Diagram: x1 and x2 each feed an OR unit and a NAND unit; the outputs of those two units feed an AND unit]
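A concrete sketch of that network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). The diagram's unit labels survive but its weights do not, so the weights below are assumptions chosen to realize each Boolean function over {-1, +1}:

```python
# XOR built from three perceptrons. All weight triples (w0, w1, w2)
# are assumed values that implement OR, NAND, and AND over {-1, +1}.

def unit(w, x1, x2):
    y = w[0] + w[1] * x1 + w[2] * x2
    return 1 if y > 0 else -1

OR   = ( 0.3,  0.5,  0.5)
NAND = ( 0.3, -0.5, -0.5)
AND  = (-0.3,  0.5,  0.5)

def xor(x1, x2):
    # Hidden layer: OR and NAND; output layer: AND of their outputs.
    return unit(AND, unit(OR, x1, x2), unit(NAND, x1, x2))
```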
PERCEPTRON CONVERGENCE
• The perceptron learning rule is not guaranteed to converge if the data is not linearly separable
• We can remedy this situation by considering a linear unit and applying gradient descent
• The linear unit is equivalent to a perceptron without the sign function. That is, its output is given by: o = w · x = Σi wi xi
• where x0 = 1
LEARNING RULE DERIVATION
• Goal: a weight update rule of the form wi ← wi + Δwi
• First we define a suitable measure of error: E(w) = ½ Σd (td − od)²
• Typically we choose a quadratic function so we have a single global minimum
ERROR SURFACE [MITCHELL 1997]
LEARNING RULE DERIVATION
• The learning algorithm should update each weight in the direction that minimizes the error according to our error function
• That is, the weight change should look something like Δwi = −α ∂E/∂wi
GRADIENT DESCENT
• Good: guaranteed to converge to the minimum error weight vector regardless of whether the training data are linearly separable (given that α is sufficiently small)
• Bad: still can only correctly classify linearly separable data
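A minimal sketch of batch gradient descent for a single linear unit. Expanding Δwi = −α ∂E/∂wi with E(w) = ½ Σd (td − od)² gives Δwi = α Σd (td − od) xid. The dataset and α below are assumptions:

```python
# Batch gradient descent for a linear unit o = w . x (x0 = 1),
# minimizing E(w) = 1/2 * sum_d (t_d - o_d)^2.
# The dataset, alpha, and epoch count are illustrative assumptions.

def linear_output(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def gradient_descent(data, alpha=0.05, epochs=500):
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)           # accumulates -(dE/dw_i)
        for x, t in data:
            err = t - linear_output(w, x)
            grad[0] += err               # x0 = 1
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        for i in range(n + 1):
            w[i] += alpha * grad[i]      # w_i <- w_i + alpha * sum_d (t-o) x_i
    return w

# Hypothetical data generated by t = 1 + 2*x1; gradient descent
# recovers w approximately equal to [1, 2].
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([-1.0], -1.0)]
w = gradient_descent(data)
```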
NETWORKS
• In general, many-layered networks of threshold units are capable of representing a rich variety of nonlinear decision surfaces
• However, to use our gradient descent approach on multi-layered networks, we must avoid the non-differentiable sign function
• Multiple layers of linear units can still only represent linear functions
• Introducing the sigmoid function...
SIGMOID FUNCTION
σ(y) = 1 / (1 + e^(−y))
SIGMOID UNIT [MITCHELL 1997]
EXAMPLE
• Suppose we have a sigmoid unit k with 3 weights:
• On input x1 = 0.5, x2 = 0.0, the unit outputs:
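The slide's weight values for unit k were lost in the transcript. As a sketch, reusing the hypothetical weights from the earlier perceptron example (w0 = -0.3, w1 = 0.5, w2 = 0.5), the unit's output is σ(-0.05) ≈ 0.49 rather than a hard ±1:

```python
import math

# Sigmoid unit: o = sigma(w . x), sigma(y) = 1 / (1 + exp(-y)), x0 = 1.
# The weights are assumptions (the slide's values did not survive).

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(net)

w = [-0.3, 0.5, 0.5]               # hypothetical weights
o = sigmoid_unit(w, [0.5, 0.0])    # sigma(-0.05), roughly 0.49
```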
NETWORK OF SIGMOID UNITS
[Diagram: inputs x0–x3 feed hidden units 2, 3, 4 (with outputs o2, o3, o4), which feed output units 0 and 1; example weight labels: w02, w31]
EXAMPLE
[Diagram: inputs x0–x2 feed hidden sigmoid units 1 and 2, which feed output unit 3; weight values shown: .1, .2, .30, −.2, 3.2, .5, −.5, 1.0]
EXAMPLE
[Same network as above, shown with a surface plot of its output over x1, x2 ∈ [−2, 2]; the output ranges from roughly 0.65 to 0.8]
BACK-PROPAGATION
• Really just applying the same gradient descent approach to our network of sigmoid units
• We use the error function: E(w) = ½ Σd Σk∈outputs (tkd − okd)²
BACKPROP ALGORITHM
BACKPROP CONVERGENCE
• Unfortunately, there may exist many local minima in the error function
• Therefore we cannot guarantee convergence to an optimal solution as in the single linear unit case
• Time to convergence is also a concern
• Nevertheless, backprop does reasonably well in many cases
MATLAB EXAMPLE
• Quadratic decision boundary
• Single linear unit vs. Three-sigmoid unit backprop network... GO!
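The MATLAB demo itself is not in the transcript. As a hedged sketch of what the backprop side looks like, here is a tiny network of three sigmoid units (two hidden, one output) trained with the stochastic backprop updates above; the dataset (XOR with {0, 1} targets), learning rate, and initialization are all assumptions:

```python
import math
import random

# A sketch of backprop on a 2-2-1 network of sigmoid units (three
# sigmoid units total). Dataset, alpha, and init are assumptions.

random.seed(0)

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

class TinyNet:
    def __init__(self):
        # w_h[j] = [bias, w1, w2] for hidden unit j;
        # w_o = [bias, from h1, from h2] for the output unit.
        self.w_h = [[random.uniform(-0.5, 0.5) for _ in range(3)]
                    for _ in range(2)]
        self.w_o = [random.uniform(-0.5, 0.5) for _ in range(3)]

    def forward(self, x):
        self.h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in self.w_h]
        self.o = sigmoid(self.w_o[0] + self.w_o[1] * self.h[0]
                         + self.w_o[2] * self.h[1])
        return self.o

    def backprop(self, x, t, alpha=0.5):
        o = self.forward(x)
        # Error terms: delta = o(1-o) * (downstream error), per Mitchell ch. 4.
        delta_o = o * (1 - o) * (t - o)
        delta_h = [self.h[j] * (1 - self.h[j]) * self.w_o[j + 1] * delta_o
                   for j in range(2)]
        # Weight updates: w <- w + alpha * delta * input.
        self.w_o[0] += alpha * delta_o
        for j in range(2):
            self.w_o[j + 1] += alpha * delta_o * self.h[j]
            self.w_h[j][0] += alpha * delta_h[j]
            self.w_h[j][1] += alpha * delta_h[j] * x[0]
            self.w_h[j][2] += alpha * delta_h[j] * x[1]

# Hypothetical training data: XOR with {0, 1} targets.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
net = TinyNet()
for _ in range(5000):
    for x, t in data:
        net.backprop(x, t)
```

Note that, as the convergence slide warns, this network can settle in a local minimum; rerunning with a different random initialization can change the result.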
BACK TO ALVINN
• ALVINN was a 1989 project at CMU in which an autonomous vehicle learned to drive by watching a person drive
• ALVINN's architecture consists of a single hidden layer back-propagation network
• The input layer of the network is a 30x32-unit two-dimensional "retina" which receives input from the vehicle's video camera
• The output layer is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road
ALVINN
REPRESENTATIONAL POWER OF NEURAL NETWORKS
• Every boolean function can be represented by a network with two layers of units
• Every bounded continuous function can be approximated to arbitrary accuracy by a two-layer network of sigmoid hidden units and linear output units
• Any function can be approximated to arbitrary accuracy by a three-layer network of sigmoid hidden units and linear output units
READING SUGGESTIONS
• Mitchell, Machine Learning, Chapter 4
• Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapter 20