
Neural Networks Lecture 5: The Perceptron
September 21, 2010

Page 1: Supervised Function Approximation

In supervised learning, we train an ANN with a set of vector pairs, so-called exemplars.

Each pair (x, y) consists of an input vector x and a corresponding output vector y.

Whenever the network receives input x, we would like it to provide output y.

The exemplars thus describe the function that we want to “teach” our network.

Besides learning the exemplars, we would like our network to generalize, that is, give plausible output for inputs that the network had not been trained with.

Page 2: Supervised Function Approximation

There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate).

This problem is similar to fitting a function to a given set of data points.

Let us assume that you want to find a fitting function f: R → R for a set of three data points.

You try to do this with polynomials of degree one (a straight line), two, and nine.
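To make the curve-fitting analogy concrete, here is a minimal, illustrative sketch using NumPy's polyfit (not part of the lecture); it uses ten noisy samples of a quadratic rather than three points so that the degree-9 fit is well defined:

# Curve-fitting analogy: low-degree fits generalize, a degree-9 fit memorizes.
# Illustrative sketch only; the data and degrees are my own assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)                            # ten sample inputs
y = 2.0 * x**2 - x + 0.5 + rng.normal(0, 0.05, x.size)    # noisy quadratic "exemplars"

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, degree)                     # least-squares polynomial fit
    x_test = np.linspace(-1.2, 1.2, 5)                    # points just outside the training range
    y_test = np.polyval(coeffs, x_test)
    print(f"degree {degree}: predictions at unseen x = {np.round(y_test, 2)}")

The degree-1 and degree-2 fits give moderate predictions just outside the training range, while the degree-9 fit typically produces wildly implausible values there.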

Page 3: Supervised Function Approximation

Obviously, the polynomial of degree 2 provides the most plausible fit.

[Figure: three data points in the f(x) vs. x plane, fitted by polynomials of degree 1, 2, and 9.]

Page 4: Supervised Function Approximation

The same principle applies to ANNs:

• If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function.

• If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then shows poor generalization ability.

Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.

Page 5: Evaluation of Networks

• Basic idea: define an error function and measure the error for untrained data (the testing set).

• Typical error function:

  E = Σi (di - oi)²

  where d is the desired output, and o is the actual output.

• For classification:

  E = number of misclassified samples / total number of samples
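A small sketch of how these two measures might be computed on a held-out testing set; the array names and values are illustrative assumptions, not lecture material:

# Sketch of the two evaluation measures on a testing set.
import numpy as np

desired = np.array([1.0, -1.0, 1.0, -1.0])   # d: target outputs for the testing set
actual  = np.array([0.8, -0.9, -0.2, -1.0])  # o: network outputs on the same inputs

sse = np.sum((desired - actual) ** 2)        # E = sum_i (d_i - o_i)^2
classification_error = np.mean(np.sign(actual) != np.sign(desired))  # misclassified / total

print(f"sum-of-squares error: {sse:.3f}")
print(f"classification error: {classification_error:.2f}")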

Page 6: The Perceptron

[Figure: unit i with inputs x1, x2, …, xn, weights w1, w2, …, wn, a net input signal, a threshold θ, and output f(x1, x2, …, xn).]

net = Σi=1..n wi xi

f(net) = 1, if net ≥ θ
f(net) = -1, otherwise
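A minimal sketch of this computation; the function and variable names are my own, only the formula comes from the slide:

# Forward computation of a perceptron with an explicit threshold.
import numpy as np

def perceptron_output(x, w, theta):
    """Return 1 if the net input w.x reaches the threshold theta, else -1."""
    net = np.dot(w, x)                          # net = sum_i w_i * x_i
    return 1 if net >= theta else -1

x = np.array([0.5, -1.0, 2.0])                  # example input vector
w = np.array([1.0, 0.5, 0.25])                  # example weight vector
print(perceptron_output(x, w, theta=0.3))       # prints 1, since net = 0.5 >= 0.3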

Page 7: The Perceptron

[Figure: the same unit i, now with an additional input x0 = 1 and weight w0; the activation threshold is fixed at 0.]

net = Σi=0..n wi xi

f(net) = 1, if net ≥ 0
f(net) = -1, otherwise

w0 corresponds to -θ.

Here, only the weight vector is adaptable, but not the threshold.
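For comparison, the same decision with the threshold folded into the weights, as described on this slide; the values are again illustrative:

# Same unit with the threshold absorbed into the weights (w0 = -theta, x0 = 1).
import numpy as np

def perceptron_output_bias(x, w):
    """w and x include the 0th components w0 = -theta and x0 = 1."""
    net = np.dot(w, x)                           # net = sum_{i=0..n} w_i * x_i
    return 1 if net >= 0 else -1

theta = 0.3
x = np.array([1.0, 0.5, -1.0, 2.0])              # x0 = 1 prepended to the input
w = np.array([-theta, 1.0, 0.5, 0.25])           # w0 = -theta prepended to the weights
print(perceptron_output_bias(x, w))              # same decision as thresholding at theta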

Page 8: Perceptron Computation

Similar to a TLU, a perceptron divides its n-dimensional input space by an (n-1)-dimensional hyperplane defined by the equation:

w0 + w1x1 + w2x2 + … + wnxn = 0

For w0 + w1x1 + w2x2 + … + wnxn > 0, its output is 1, and

for w0 + w1x1 + w2x2 + … + wnxn ≤ 0, its output is -1.

With the right weight vector (w0, …, wn)^T, a single perceptron can compute any linearly separable function.

We are now going to look at an algorithm that determines such a weight vector for a given function.
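As a concrete instance of a linearly separable function, a single perceptron with a hand-picked weight vector can compute logical AND; the chosen weights below are my own illustration, not from the slides:

# A single perceptron computing logical AND via the hyperplane -1.5 + x1 + x2 = 0.
import numpy as np

w = np.array([-1.5, 1.0, 1.0])       # (w0, w1, w2)

def perceptron(x1, x2):
    net = np.dot(w, [1.0, x1, x2])   # x0 = 1 carries the bias weight w0
    return 1 if net > 0 else -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron(x1, x2))   # only (1, 1) maps to +1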

Page 9: Perceptron Training Algorithm

Algorithm Perceptron;
  Start with a randomly chosen weight vector w0;
  Let k = 1;
  while there exist input vectors that are misclassified by wk-1, do
    Let ij be a misclassified input vector;
    Let xk = class(ij)·ij, implying that wk-1·xk < 0;
    Update the weight vector to wk = wk-1 + ηxk;
    Increment k;
  end-while;
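A runnable Python sketch of this training loop, using the bias-augmented convention from Page 7; the toy data set and the learning rate are illustrative assumptions:

# Perceptron training loop: repeatedly pick a misclassified vector and add eta * class * input.
import numpy as np

def train_perceptron(inputs, classes, eta=1.0, rng=np.random.default_rng(0)):
    """inputs: bias-augmented input vectors (x0 = 1); classes: +1 / -1 labels."""
    w = rng.normal(size=inputs.shape[1])                  # randomly chosen weight vector w0
    k = 0
    while True:
        # collect the input vectors that are misclassified by the current weights
        wrong = [(i, c) for i, c in zip(inputs, classes)
                 if (1 if np.dot(w, i) > 0 else -1) != c]
        if not wrong:                                     # everything classified correctly
            return w, k
        i_j, c = wrong[0]                                 # pick a misclassified input vector
        x_k = c * i_j                                     # x_k = class(i_j) * i_j, so w . x_k < 0
        w = w + eta * x_k                                 # w_k = w_{k-1} + eta * x_k
        k += 1

# toy linearly separable problem: class is +1 when x1 > x2, otherwise -1
inputs = np.array([[1.0, 2.0, 0.5],
                   [1.0, 0.3, 1.7],
                   [1.0, 1.2, 0.1],
                   [1.0, 0.2, 0.9]])
classes = np.array([1, -1, 1, -1])
w, steps = train_perceptron(inputs, classes)
print("learned weights:", np.round(w, 2), "after", steps, "updates")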

Page 10: Perceptron Training Algorithm

For example, for some input i with class(i) = -1,

if w·i > 0, then we have a misclassification.

Then the weight vector needs to be modified to w + Δw

with (w + Δw)·i < w·i to possibly improve classification.

We can choose Δw = -ηi, because

(w + Δw)·i = (w - ηi)·i = w·i - η(i·i) < w·i,

and i·i is the square of the length of vector i and is thus positive.

If class(i) = 1, things are the same but with opposite signs; we introduce x to unify these two cases.
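A tiny numeric check of this argument; the numbers are made up for illustration:

# Update for a misclassified input with class(i) = -1: w . i must decrease.
import numpy as np

w = np.array([0.0, 1.0, 1.0])       # current weights
i = np.array([1.0, 2.0, 0.5])       # input with class(i) = -1
eta = 1.0

print(np.dot(w, i))                 # 2.5 > 0: misclassified
w_new = w - eta * i                 # delta_w = -eta * i
print(np.dot(w_new, i))             # 2.5 - |i|^2 = 2.5 - 5.25 = -2.75: moved the right way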

Page 11: Learning Rate and Termination

• Terminate when all samples are correctly classified.

• If the number of misclassified samples has not changed in a large number of steps, the problem could be the choice of the learning rate η:

• If η is too large, classification may just be swinging back and forth and take a long time to reach the solution;

• On the other hand, if η is too small, changes in classification can be extremely slow.

• If changing η does not help, the samples may not be linearly separable, and training should terminate.

• If it is known that there will be a minimum number of misclassifications, train until that number is reached.
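One way to express these termination heuristics in code, extending the training-loop sketch above; the step limit and patience values are arbitrary assumptions:

# Stop when all samples are correct, or when the misclassification count stops improving.
import numpy as np

def train_with_termination(inputs, classes, eta=0.1, max_steps=10_000, patience=1_000,
                           rng=np.random.default_rng(1)):
    w = rng.normal(size=inputs.shape[1])
    best_errors, since_improvement = len(inputs) + 1, 0
    for _ in range(max_steps):
        wrong = [(i, c) for i, c in zip(inputs, classes)
                 if (1 if np.dot(w, i) > 0 else -1) != c]
        if not wrong:
            return w                                  # all samples correctly classified
        if len(wrong) < best_errors:
            best_errors, since_improvement = len(wrong), 0
        else:
            since_improvement += 1
            if since_improvement >= patience:         # no progress: maybe not linearly separable
                return w
        i_j, c = wrong[0]
        w = w + eta * c * i_j                         # same update rule as before
    return w

# XOR-like labels are not linearly separable, so training stops via the patience heuristic:
xor_inputs = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
xor_classes = np.array([-1, 1, 1, -1])
print(np.round(train_with_termination(xor_inputs, xor_classes), 2))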

Page 12: Guarantee of Success: Novikoff (1963)

Theorem 2.1: Given training samples from two linearly separable classes, the perceptron training algorithm terminates after a finite number of steps, and correctly classifies all elements of the training set, irrespective of the initial random non-zero weight vector w0.

Let wk be the current weight vector.

We need to prove that there is an upper bound on k.

Page 13: Guarantee of Success: Novikoff (1963)

Proof: Assume η = 1, without loss of generality.

After k steps of the learning algorithm, the current weight vector is

wk = w0 + x1 + x2 + … + xk.   (2.1)

Since the two classes are linearly separable, there must be a vector of weights w* that correctly classifies them, that is, sgn(w*·ik) = class(ik).

Multiplying each side of eq. 2.1 with w*, we get:

w*·wk = w*·w0 + w*·x1 + w*·x2 + … + w*·xk.

Page 14: Guarantee of Success: Novikoff (1963)

w*·wk = w*·w0 + w*·x1 + w*·x2 + … + w*·xk.

For each input vector ij, the dot product w*·ij has the same sign as class(ij).

Since the corresponding element of the training sequence is xj = class(ij)·ij, we can be assured that

w*·xj = w*·(class(ij)·ij) > 0.

Therefore, there exists an ε > 0 such that w*·xi > ε for every member xi of the training sequence.

Hence:

w*·wk > w*·w0 + kε.   (2.2)

Page 15: Guarantee of Success: Novikoff (1963)

w*·wk > w*·w0 + kε.   (2.2)

By the Cauchy-Schwarz inequality:

|w*·wk|² ≤ ||w*||² ||wk||².   (2.3)

We may assume that ||w*|| = 1, since the unit-length vector w*/||w*|| also correctly classifies the same samples.

Using this assumption and eqs. 2.2 and 2.3, we obtain a lower bound for the square of the length of wk:

||wk||² > (w*·w0 + kε)².   (2.4)

Page 16: Guarantee of Success: Novikoff (1963)

Since wj = wj-1 + xj, the following upper bound can be obtained for this vector’s squared length:

||wj||² = wj·wj
        = wj-1·wj-1 + 2 wj-1·xj + xj·xj
        = ||wj-1||² + 2 wj-1·xj + ||xj||²

Since wj-1·xj < 0 whenever a weight change is required by the algorithm, we have:

||wj||² - ||wj-1||² < ||xj||²

Summation of the above inequalities over j = 1, …, k gives an upper bound:

||wk||² - ||w0||² < k max ||xj||²

Page 17: Guarantee of Success: Novikoff (1963)

||wk||² - ||w0||² < k max ||xj||²

Combining this with inequality 2.4,

||wk||² > (w*·w0 + kε)²   (2.4)

gives us:

(w*·w0 + kε)² < ||wk||² < ||w0||² + k max ||xj||²

Now the lower bound of ||wk||² increases at the rate of k², and its upper bound increases at the rate of k.

Therefore, there must be a finite value of k such that:

(w*·w0 + kε)² > ||w0||² + k max ||xj||²

This means that k cannot increase without bound, so the algorithm must eventually terminate.
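For reference, a compact restatement of the bound in LaTeX; this is my own rearrangement, assuming for simplicity w0 = 0 and writing R² for max ||xj||²:

% Assumptions: w_0 = 0, \|w^*\| = 1, R^2 = \max_j \|x_j\|^2, and \varepsilon the margin from eq. 2.2.
% With w_0 = 0, the lower bound (2.4) becomes \|w_k\|^2 > k^2 \varepsilon^2,
% and the upper bound becomes \|w_k\|^2 < k R^2.  Both can hold only while
\[
  k^2 \varepsilon^2 \;<\; \|w_k\|^2 \;<\; k R^2
  \quad\Longrightarrow\quad
  k \;<\; \frac{R^2}{\varepsilon^2},
\]
% so the number of weight updates is bounded by R^2 / \varepsilon^2.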