AN INTRODUCTION TO NEURAL NETWORKS
Scott Kuindersma
November 12, 2009
SUPERVISED LEARNING
• We are given some training data: pairs (x1, y1), ..., (xn, yn)
• We must learn a function f such that f(xi) ≈ yi
• If y is discrete, we call it classification
• If it is continuous, we call it regression
ARTIFICIAL NEURAL NETWORKS
• Artificial neural networks are one technique that can be used to solve supervised learning problems
• Very loosely inspired by biological neural networks
• real neural networks are much more complicated, e.g. using spike timing to encode information
• Neural networks consist of layers of interconnected units
PERCEPTRON UNIT
• The simplest computational neural unit is called a perceptron
• The input of a perceptron is a real vector x
• The output is either 1 or -1
• Therefore, a perceptron can be applied to binary classification problems
• Whether or not it will be useful depends on the problem... more on this later...
PERCEPTRON UNIT [MITCHELL 1997]
SIGN FUNCTION
sign(y) = 1 if y > 0, and -1 otherwise
EXAMPLE
• Suppose we have a perceptron with 3 weights:
• On input x1 = 0.5, x2 = 0.0, the perceptron outputs:
• where x0 = 1
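The slide's specific weight values and the resulting computation did not survive the transcript. As a sketch, assume the hypothetical weights w0 = -0.3, w1 = 0.5, w2 = 0.5 (the same values the later AND example uses); the forward pass then looks like:

```python
# Perceptron forward pass: o(x) = sign(w . x), with bias input x0 = 1.
# The weights here are assumptions (the slide's originals were lost);
# they match the AND example that appears later in the deck.

def sign(y):
    """Threshold activation: +1 if y > 0, else -1."""
    return 1 if y > 0 else -1

def perceptron(w, x):
    """Compute sign(w0*1 + w1*x1 + ... + wn*xn)."""
    total = w[0]  # bias term, multiplied by x0 = 1
    for wi, xi in zip(w[1:], x):
        total += wi * xi
    return sign(total)

w = [-0.3, 0.5, 0.5]               # hypothetical weights
print(perceptron(w, [0.5, 0.0]))   # -0.3 + 0.25 + 0.0 = -0.05 -> -1
```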
LEARNING RULE
• Now that we know how to calculate the output of a perceptron, we would like to find a way to modify the weights to produce output that matches the training data
• This is accomplished via the perceptron learning rule: wi ← wi + α (t − o) xi
• for an input/target pair (x, t) where, again, x0 = 1
• Loop through the training data until (nearly) all examples are classified correctly
MATLAB EXAMPLE
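The original MATLAB demo is not in the transcript. As a hedged stand-in, here is a minimal Python sketch of the perceptron learning rule, looping over the data until every example is classified correctly; the dataset and learning rate α are assumptions:

```python
# Perceptron learning rule: wi <- wi + alpha * (t - o) * xi, with x0 = 1.
# Dataset and alpha are illustrative assumptions, not from the slides.

def train_perceptron(data, alpha=0.1, max_epochs=100):
    """data: list of (x, t) pairs with x a list of inputs, t in {-1, +1}."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight (x0 = 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            if o != t:
                errors += 1
                w[0] += alpha * (t - o)           # update via x0 = 1
                for i, xi in enumerate(x):
                    w[i + 1] += alpha * (t - o) * xi
        if errors == 0:              # all examples classified correctly
            break
    return w

# Hypothetical linearly separable data: class is the sign of x1 + x2.
data = [([1, 1], 1), ([2, 1], 1), ([-1, -1], -1), ([-2, -1], -1)]
w = train_perceptron(data)
```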
LIMITATIONS OF THE PERCEPTRON MODEL
• Can only distinguish between linearly separable classes of inputs
• Consider the following data:
PERCEPTRONS AND BOOLEAN FUNCTIONS
• Suppose we let the values (1,-1) correspond to true and false, respectively
• Can we describe a perceptron capable of computing the AND function? What about OR? NAND? NOR? XOR?
• Let’s think about it geometrically
BOOLEAN FUNCS CONT’D
[Plots: perceptron decision boundaries for AND, OR, NAND, and NOR]
EXAMPLE: AND
• Let pAND(x1,x2) be the output of the perceptron with weights w0 = -0.3, w1 = 0.5, w2 = 0.5 on input x1, x2
x1   x2   pAND(x1, x2)
-1   -1   -1
-1    1   -1
 1   -1   -1
 1    1    1
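The table above can be checked directly by evaluating sign(-0.3 + 0.5*x1 + 0.5*x2) at each corner:

```python
# Verify the AND perceptron from the slide: weights w0 = -0.3,
# w1 = 0.5, w2 = 0.5, with 1 for true and -1 for false.

def p_and(x1, x2):
    y = -0.3 + 0.5 * x1 + 0.5 * x2
    return 1 if y > 0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, p_and(x1, x2))  # matches the truth table above
```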
XOR
• XOR cannot be represented by a perceptron, but it can be represented by a small network of perceptrons, e.g.,
[Diagram: x1 and x2 each feed an OR unit and a NAND unit; the outputs of those two units feed an AND unit]
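A concrete sketch of that network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). The diagram's unit labels survive but its weights do not, so the weights below are assumptions chosen to realize each Boolean function over {-1, +1}:

```python
# XOR built from three perceptrons. All weight triples (w0, w1, w2)
# are assumed values that implement OR, NAND, and AND over {-1, +1}.

def unit(w, x1, x2):
    y = w[0] + w[1] * x1 + w[2] * x2
    return 1 if y > 0 else -1

OR   = ( 0.3,  0.5,  0.5)
NAND = ( 0.3, -0.5, -0.5)
AND  = (-0.3,  0.5,  0.5)

def xor(x1, x2):
    # Hidden layer: OR and NAND; output layer: AND of their outputs.
    return unit(AND, unit(OR, x1, x2), unit(NAND, x1, x2))
```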
PERCEPTRON CONVERGENCE
• The perceptron learning rule is not guaranteed to converge if the data is not linearly separable
• We can remedy this situation by considering a linear unit and applying gradient descent
• The linear unit is equivalent to a perceptron without the sign function. That is, its output is given by: o = w · x = Σi wi xi
• where x0 = 1
LEARNING RULE DERIVATION
• Goal: a weight update rule of the form wi ← wi + Δwi
• First we define a suitable measure of error: E(w) = ½ Σd (td − od)²
• Typically we choose a quadratic function so we have a single global minimum
ERROR SURFACE [MITCHELL 1997]
LEARNING RULE DERIVATION
• The learning algorithm should update each weight in the direction that minimizes the error according to our error function
• That is, the weight change should look something like Δwi = −α ∂E/∂wi
GRADIENT DESCENT
• Good: guaranteed to converge to the minimum error weight vector regardless of whether the training data are linearly separable (given that α is sufficiently small)
• Bad: still can only correctly classify linearly separable data
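A minimal sketch of batch gradient descent for a single linear unit. Expanding Δwi = −α ∂E/∂wi with E(w) = ½ Σd (td − od)² gives Δwi = α Σd (td − od) xid. The dataset and α below are assumptions:

```python
# Batch gradient descent for a linear unit o = w . x (x0 = 1),
# minimizing E(w) = 1/2 * sum_d (t_d - o_d)^2.
# The dataset, alpha, and epoch count are illustrative assumptions.

def linear_output(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def gradient_descent(data, alpha=0.05, epochs=500):
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)           # accumulates -(dE/dw_i)
        for x, t in data:
            err = t - linear_output(w, x)
            grad[0] += err               # x0 = 1
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        for i in range(n + 1):
            w[i] += alpha * grad[i]      # w_i <- w_i + alpha * sum_d (t-o) x_i
    return w

# Hypothetical data generated by t = 1 + 2*x1; gradient descent
# recovers w approximately equal to [1, 2].
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([-1.0], -1.0)]
w = gradient_descent(data)
```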
NETWORKS
• In general, many-layered networks of threshold units are capable of representing a rich variety of nonlinear decision surfaces
• However, to use our gradient descent approach on multi-layered networks, we must avoid the non-differentiable sign function
• Multiple layers of linear units can still only represent linear functions
• Introducing the sigmoid function...
SIGMOID FUNCTION
σ(y) = 1 / (1 + e^(−y))
SIGMOID UNIT [MITCHELL 1997]
EXAMPLE
• Suppose we have a sigmoid unit k with 3 weights:
• On input x1 = 0.5, x2 = 0.0, the unit outputs:
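The slide's weight values for unit k were lost in the transcript. As a sketch, reusing the hypothetical weights from the earlier perceptron example (w0 = -0.3, w1 = 0.5, w2 = 0.5), the unit's output is σ(-0.05) ≈ 0.49 rather than a hard ±1:

```python
import math

# Sigmoid unit: o = sigma(w . x), sigma(y) = 1 / (1 + exp(-y)), x0 = 1.
# The weights are assumptions (the slide's values did not survive).

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(net)

w = [-0.3, 0.5, 0.5]               # hypothetical weights
o = sigmoid_unit(w, [0.5, 0.0])    # sigma(-0.05), roughly 0.49
```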
NETWORK OF SIGMOID UNITS
[Diagram: inputs x0–x3 feed hidden units 2, 3, 4 (with outputs o2, o3, o4), which feed output units 0 and 1; example weight labels: w02, w31]
EXAMPLE
[Diagram: inputs x0–x2 feed hidden sigmoid units 1 and 2, which feed output unit 3; weight values shown: .1, .2, .30, −.2, 3.2, .5, −.5, 1.0]
EXAMPLE
[Same network as above, shown with a surface plot of its output over x1, x2 ∈ [−2, 2]; the output ranges from roughly 0.65 to 0.8]
BACK-PROPAGATION
• Really just applying the same gradient descent approach to our network of sigmoid units
• We use the error function: E(w) = ½ Σd Σk∈outputs (tkd − okd)²
BACKPROP ALGORITHM
BACKPROP CONVERGENCE
• Unfortunately, there may exist many local minima in the error function
• Therefore we cannot guarantee convergence to an optimal solution as in the single linear unit case
• Time to convergence is also a concern
• Nevertheless, backprop does reasonably well in many cases
MATLAB EXAMPLE
• Quadratic decision boundary
• Single linear unit vs. Three-sigmoid unit backprop network... GO!
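The MATLAB demo itself is not in the transcript. As a hedged sketch of what the backprop side looks like, here is a tiny network of three sigmoid units (two hidden, one output) trained with the stochastic backprop updates above; the dataset (XOR with {0, 1} targets), learning rate, and initialization are all assumptions:

```python
import math
import random

# A sketch of backprop on a 2-2-1 network of sigmoid units (three
# sigmoid units total). Dataset, alpha, and init are assumptions.

random.seed(0)

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

class TinyNet:
    def __init__(self):
        # w_h[j] = [bias, w1, w2] for hidden unit j;
        # w_o = [bias, from h1, from h2] for the output unit.
        self.w_h = [[random.uniform(-0.5, 0.5) for _ in range(3)]
                    for _ in range(2)]
        self.w_o = [random.uniform(-0.5, 0.5) for _ in range(3)]

    def forward(self, x):
        self.h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in self.w_h]
        self.o = sigmoid(self.w_o[0] + self.w_o[1] * self.h[0]
                         + self.w_o[2] * self.h[1])
        return self.o

    def backprop(self, x, t, alpha=0.5):
        o = self.forward(x)
        # Error terms: delta = o(1-o) * (downstream error), per Mitchell ch. 4.
        delta_o = o * (1 - o) * (t - o)
        delta_h = [self.h[j] * (1 - self.h[j]) * self.w_o[j + 1] * delta_o
                   for j in range(2)]
        # Weight updates: w <- w + alpha * delta * input.
        self.w_o[0] += alpha * delta_o
        for j in range(2):
            self.w_o[j + 1] += alpha * delta_o * self.h[j]
            self.w_h[j][0] += alpha * delta_h[j]
            self.w_h[j][1] += alpha * delta_h[j] * x[0]
            self.w_h[j][2] += alpha * delta_h[j] * x[1]

# Hypothetical training data: XOR with {0, 1} targets.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
net = TinyNet()
for _ in range(5000):
    for x, t in data:
        net.backprop(x, t)
```

Note that, as the convergence slide warns, this network can settle in a local minimum; rerunning with a different random initialization can change the result.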
BACK TO ALVINN
• ALVINN was a 1989 project at CMU in which an autonomous vehicle learned to drive by watching a person drive
• ALVINN's architecture consists of a single hidden layer back-propagation network
• The input layer of the network is a 30x32-unit two-dimensional "retina" which receives input from the vehicle's video camera
• The output layer is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road
ALVINN
REPRESENTATIONAL POWER OF NEURAL NETWORKS
• Every boolean function can be represented by a network with two layers of units
• Every bounded continuous function can be approximated to arbitrary accuracy by a two-layer network of sigmoid hidden units and linear output units
• Any function can be approximated to arbitrary accuracy by a three-layer network of sigmoid hidden units and linear output units
READING SUGGESTIONS
• Mitchell, Machine Learning, Chapter 4
• Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapter 20