
Chapter 4: The Perceptron

4.1 Introduction

The perceptron is the simplest form of a neural network used for the classification of a special type of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with adjustable synaptic weights and threshold, as shown in Fig. 4.1. The algorithm used to adjust the free parameters of this neural network first appeared in a learning procedure developed by Rosenblatt (1958, 1962) for his perceptron brain model.¹ Indeed, Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly separable classes, then the perceptron algorithm converges and positions the decision surface in the form of a hyperplane between the two classes. The proof of convergence of the algorithm is known as the perceptron convergence theorem.

The single-layer perceptron depicted in Fig. 4.1 has a single neuron. Such a perceptron is limited to performing pattern classification with only two classes. By expanding the output (computation) layer of the perceptron to include more than one neuron, we may correspondingly perform classification with more than two classes. However, the classes would have to be linearly separable for the perceptron to work properly. Accordingly, insofar as the basic theory of a single-layer perceptron as a pattern classifier is concerned, we need consider only the single-neuron configuration shown in Fig. 4.1. The extension of the theory so developed to the case of more than one neuron is a trivial matter.

Organization of the Chapter

This chapter on the single-layer perceptron is relatively short, but it is important for historical and theoretical reasons. In Section 4.2 we present some basic considerations involved in the operation of the perceptron. Then, in Section 4.3 we describe Rosenblatt's original algorithm for adjusting the synaptic weight vector of the perceptron for pattern classification of linearly separable classes, and demonstrate convergence of the algorithm. This is followed by the description of a performance measure for the single-layer perceptron in Section 4.4. In Section 4.5 we consider the relationship between the single-layer perceptron and the maximum-likelihood Gaussian classifier. The chapter concludes with a general discussion in Section 4.6.

¹ The network organization of the original version of the perceptron, as envisioned by Rosenblatt (1962), has three types of units: sensory units, association units, and response units. The connections from the sensory units to the association units have fixed weights, and the connections from the association units to the response units have variable weights. The association units act as preprocessors designed to extract a pattern from the environmental input. Insofar as the variable weights are concerned, the operation of Rosenblatt's original perceptron is essentially the same as that depicted in Fig. 4.1 for the case of a single response unit.



FIGURE 4.1 Single-layer perceptron.

4.2 Basic Considerations

Basic to the operation of Rosenblatt's perceptron is the McCulloch-Pitts model of a neuron. From Chapter 1, we recall that such a neuron model consists of a linear combiner followed by a hard limiter, as depicted in Fig. 4.2. The summing node of the neuron model computes a linear combination of the inputs applied to its synapses, and also accounts for an externally applied threshold. The resulting sum is applied to a hard limiter. Accordingly, the neuron produces an output equal to +1 if the hard-limiter input is positive, and -1 if it is negative.

In the signal-flow graph model of Fig. 4.2, the synaptic weights of the single-layer perceptron are denoted by $w_1, w_2, \ldots, w_p$. (We have simplified the notation here by not including an additional subscript to identify the neuron, since we only have to deal with a single neuron.) Correspondingly, the inputs applied to the perceptron are denoted by $x_1, x_2, \ldots, x_p$. The externally applied threshold is denoted by $\theta$. From the model of Fig. 4.2 we thus find that the linear combiner output (i.e., hard-limiter input) is

$$v = \sum_{i=1}^{p} w_i x_i - \theta \qquad (4.1)$$

The purpose of the perceptron is to classify the set of externally applied stimuli $x_1, x_2, \ldots, x_p$ into one of two classes, $\mathcal{C}_1$ or $\mathcal{C}_2$, say. The decision rule for the classification is to assign the point represented by the inputs $x_1, x_2, \ldots, x_p$ to class $\mathcal{C}_1$ if the perceptron output $y$ is $+1$ and to class $\mathcal{C}_2$ if it is $-1$.
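The decision rule just described is compact enough to state in a few lines of code. The following sketch (in Python with NumPy; the variable names and sample values are ours, chosen purely for illustration) evaluates the linear combiner output of Eq. (4.1) and applies the hard limiter to reach a class decision.

```python
import numpy as np

def perceptron_output(w, x, theta):
    """Hard-limited response of the single-neuron perceptron."""
    v = np.dot(w, x) - theta      # linear combiner output, Eq. (4.1)
    return +1 if v > 0 else -1    # hard limiter: +1 -> class C1, -1 -> class C2
                                  # (the borderline case v = 0 is mapped to -1 here)

# Illustrative weights, threshold, and input
w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
print(perceptron_output(w, x, theta=0.1))
```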

To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions in the $p$-dimensional signal space spanned by the $p$ input variables $x_1, x_2, \ldots, x_p$. In the case of an elementary perceptron, there are two decision regions separated by a hyperplane defined by

$$\sum_{i=1}^{p} w_i x_i - \theta = 0 \qquad (4.2)$$

FIGURE 4.2 Signal-flow graph of the perceptron (inputs $x_1, \ldots, x_p$, synaptic weights $w_1, \ldots, w_p$, threshold $\theta$, hard limiter).


This is illustrated in Fig. 4.3 for the case of two input variables $x_1$ and $x_2$, for which the decision boundary takes the form of a straight line. A point $(x_1, x_2)$ that lies above the boundary line is assigned to class $\mathcal{C}_1$, and a point $(x_1, x_2)$ that lies below the boundary line is assigned to class $\mathcal{C}_2$. Note also that the effect of the threshold $\theta$ is merely to shift the decision boundary away from the origin.

The synaptic weights $w_1, w_2, \ldots, w_p$ of the perceptron can be fixed or adapted on an iteration-by-iteration basis. For the adaptation, we may use an error-correction rule known as the perceptron convergence algorithm, developed in the next section.

4.3 The Perceptron Convergence Theorem

For the development of the error-correction learning algorithm for a single-layer perceptron, we find it more convenient to work with the modified signal-flow graph model of Fig. 4.4. In this second model, which is equivalent to that of Fig. 4.2, the threshold $\theta(n)$ is treated as a synaptic weight connected to a fixed input equal to $-1$. We may thus define the $(p+1)$-by-1 input vector

$$\mathbf{x}(n) = [-1, x_1(n), x_2(n), \ldots, x_p(n)]^T \qquad (4.3)$$

Correspondingly, we define the $(p+1)$-by-1 weight vector

$$\mathbf{w}(n) = [\theta(n), w_1(n), w_2(n), \ldots, w_p(n)]^T \qquad (4.4)$$

Accordingly, the linear combiner output is written in the compact form

$$v(n) = \mathbf{w}^T(n)\mathbf{x}(n) \qquad (4.5)$$

For fixed $n$, the equation $\mathbf{w}^T\mathbf{x} = 0$, plotted in a $p$-dimensional space with coordinates $x_1, x_2, \ldots, x_p$, defines a hyperplane as the decision surface between two different classes of inputs.
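The bookkeeping behind Eqs. (4.3)-(4.5) is easy to check numerically. In the sketch below (Python with NumPy; the raw inputs, weights, and threshold are arbitrary illustrative values of our own), the augmented $(p+1)$-by-1 vectors are formed and the compact inner product of Eq. (4.5) is verified against the explicit sum of Eq. (4.1).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
x_raw = rng.standard_normal(p)     # externally applied inputs x_1, ..., x_p
w_raw = rng.standard_normal(p)     # synaptic weights w_1, ..., w_p
theta = 0.5                        # externally applied threshold

# Augmented vectors of Eqs. (4.3) and (4.4)
x = np.concatenate(([-1.0], x_raw))    # first element fixed at -1
w = np.concatenate(([theta], w_raw))   # first element equal to the threshold

v_compact = w @ x                      # Eq. (4.5): v = w^T x
v_explicit = w_raw @ x_raw - theta     # Eq. (4.1)
assert np.isclose(v_compact, v_explicit)
```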

FIGURE 4.3 Illustrating linear separability for a two-dimensional, two-class pattern-classification problem; the decision boundary is the line $w_1 x_1 + w_2 x_2 - \theta = 0$.

FIGURE 4.4 Equivalent signal-flow graph of the perceptron (fixed input $x_0 = -1$, linear combiner, hard limiter).

Suppose, then, that the input variables of the single-layer perceptron originate from two linearly separable classes that fall on opposite sides of some hyperplane. Let $\mathcal{X}_1$ be the subset of training vectors $\mathbf{x}_1(1), \mathbf{x}_1(2), \ldots$ that belong to class $\mathcal{C}_1$, and let $\mathcal{X}_2$ be the subset of training vectors $\mathbf{x}_2(1), \mathbf{x}_2(2), \ldots$ that belong to class $\mathcal{C}_2$. The union of $\mathcal{X}_1$ and $\mathcal{X}_2$ is the complete training set $\mathcal{X}$. Given the sets of vectors $\mathcal{X}_1$ and $\mathcal{X}_2$ to train the classifier, the training process involves the adjustment of the weight vector $\mathbf{w}$ in such a way that the two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are separable. These two classes are said to be linearly separable if a realizable setting of the weight vector $\mathbf{w}$ exists. Conversely, if the two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are known to be linearly separable, then there exists a weight vector $\mathbf{w}$ such that we may state

$$\mathbf{w}^T\mathbf{x} \geq 0 \quad \text{for every input vector } \mathbf{x} \text{ belonging to class } \mathcal{C}_1$$

and    (4.6)

$$\mathbf{w}^T\mathbf{x} < 0 \quad \text{for every input vector } \mathbf{x} \text{ belonging to class } \mathcal{C}_2$$

Given the subsets of training vectors $\mathcal{X}_1$ and $\mathcal{X}_2$, the training problem for the elementary perceptron is then to find a weight vector $\mathbf{w}$ such that the two inequalities of Eq. (4.6) are satisfied.

The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:

1. If the $n$th member of the training set, $\mathbf{x}(n)$, is correctly classified by the weight vector $\mathbf{w}(n)$ computed at the $n$th iteration of the algorithm, no correction is made to the weight vector of the perceptron, as shown by

$$\mathbf{w}(n+1) = \mathbf{w}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) \geq 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1$$

and    (4.7)

$$\mathbf{w}(n+1) = \mathbf{w}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) < 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2$$

2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta(n)\mathbf{x}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) \geq 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2$$

and    (4.8)

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\mathbf{x}(n) \quad \text{if } \mathbf{w}^T(n)\mathbf{x}(n) < 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1$$

where the learning-rate parameter $\eta(n)$ controls the adjustment applied to the weight vector at iteration $n$.

If $\eta(n) = \eta > 0$, where $\eta$ is a constant independent of the iteration number $n$, we have a fixed-increment adaptation rule for the perceptron.

In the sequel, we first prove the convergence of the fixed-increment adaptation rule for which $\eta = 1$. Clearly, the value of $\eta$ is unimportant, so long as it is positive. A value of $\eta \neq 1$ merely scales the pattern vectors without affecting their separability. The case of a variable $\eta(n)$ is considered later in the section.
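Expressed in code, Eqs. (4.7) and (4.8) reduce to a single conditional update per pattern presentation. The sketch below is a minimal rendering (Python; the label convention of +1 for class $\mathcal{C}_1$ and -1 for class $\mathcal{C}_2$ anticipates Eq. (4.24) and is our own shorthand here).

```python
import numpy as np

def fixed_increment_step(w, x, label, eta=1.0):
    """One presentation of the error-correction rules of Eqs. (4.7)-(4.8).

    label is +1 if x belongs to class C1 and -1 if x belongs to class C2.
    """
    v = w @ x
    if v >= 0 and label == -1:      # misclassified member of C2
        return w - eta * x          # first line of Eq. (4.8)
    if v < 0 and label == +1:       # misclassified member of C1
        return w + eta * x          # second line of Eq. (4.8)
    return w                        # correctly classified: Eq. (4.7), no correction
```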

The proof is presented for the initial condition $\mathbf{w}(0) = \mathbf{0}$. Suppose that $\mathbf{w}^T(n)\mathbf{x}(n) < 0$ for $n = 1, 2, \ldots$, and that the input vector $\mathbf{x}(n)$ belongs to the subset $\mathcal{X}_1$. That is, the perceptron incorrectly classifies the vectors $\mathbf{x}(1), \mathbf{x}(2), \ldots$, since the first condition of Eq. (4.6) is violated. Then, with the constant $\eta(n) = 1$, we may use the second line of Eq. (4.8) to write

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mathbf{x}(n) \quad \text{for } \mathbf{x}(n) \text{ belonging to class } \mathcal{C}_1 \qquad (4.9)$$

Given the initial condition $\mathbf{w}(0) = \mathbf{0}$, we may iteratively solve this equation for $\mathbf{w}(n+1)$, obtaining the result

$$\mathbf{w}(n+1) = \mathbf{x}(1) + \mathbf{x}(2) + \cdots + \mathbf{x}(n) \qquad (4.10)$$

Since the classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are assumed to be linearly separable, there exists a solution $\mathbf{w}_0$ for which $\mathbf{w}_0^T\mathbf{x}(n) > 0$ for the vectors $\mathbf{x}(1), \ldots, \mathbf{x}(n)$ belonging to the subset $\mathcal{X}_1$. For a fixed solution $\mathbf{w}_0$, we may then define a positive number $\alpha$ by the relation

$$\alpha = \min_{\mathbf{x}(n) \in \mathcal{X}_1} \mathbf{w}_0^T\mathbf{x}(n) \qquad (4.11)$$

where $\mathbf{x}(n) \in \mathcal{X}_1$ stands for "$\mathbf{x}(n)$ belongs to subset $\mathcal{X}_1$." Hence, multiplying both sides of Eq. (4.10) by the row vector $\mathbf{w}_0^T$, we get

$$\mathbf{w}_0^T\mathbf{w}(n+1) = \mathbf{w}_0^T\mathbf{x}(1) + \mathbf{w}_0^T\mathbf{x}(2) + \cdots + \mathbf{w}_0^T\mathbf{x}(n) \qquad (4.12)$$

Accordingly, in light of the definition given in Eq. (4.11), we have

$$\mathbf{w}_0^T\mathbf{w}(n+1) \geq n\alpha \qquad (4.13)$$

Next, we make use of an inequality known as the Cauchy-Schwarz inequality. Given two vectors $\mathbf{w}_0$ and $\mathbf{w}(n+1)$, the Cauchy-Schwarz inequality states that

$$\|\mathbf{w}_0\|^2 \, \|\mathbf{w}(n+1)\|^2 \geq [\mathbf{w}_0^T\mathbf{w}(n+1)]^2 \qquad (4.14)$$

where $\|\cdot\|$ denotes the Euclidean norm of the enclosed argument vector, and the inner product $\mathbf{w}_0^T\mathbf{w}(n+1)$ is a scalar quantity. We now note from Eq. (4.13) that $[\mathbf{w}_0^T\mathbf{w}(n+1)]^2$ is equal to or greater than $n^2\alpha^2$. Moreover, from Eq. (4.14) we note that $\|\mathbf{w}_0\|^2\|\mathbf{w}(n+1)\|^2$ is equal to or greater than $[\mathbf{w}_0^T\mathbf{w}(n+1)]^2$. It follows therefore that

$$\|\mathbf{w}_0\|^2\|\mathbf{w}(n+1)\|^2 \geq n^2\alpha^2$$

or, equivalently,

$$\|\mathbf{w}(n+1)\|^2 \geq \frac{n^2\alpha^2}{\|\mathbf{w}_0\|^2} \qquad (4.15)$$

Next, we follow another development route. In particular, we rewrite Eq. (4.9) in the form

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mathbf{x}(k) \quad \text{for } k = 1, \ldots, n \text{ and } \mathbf{x}(k) \in \mathcal{X}_1 \qquad (4.16)$$

Hence, taking the squared Euclidean norm of both sides of Eq. (4.16), we get

$$\|\mathbf{w}(k+1)\|^2 = \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2 + 2\mathbf{w}^T(k)\mathbf{x}(k) \qquad (4.17)$$

But, under the assumption that the perceptron incorrectly classifies an input vector $\mathbf{x}(k)$ belonging to the subset $\mathcal{X}_1$, we have $\mathbf{w}^T(k)\mathbf{x}(k) < 0$. We therefore deduce from Eq. (4.17) that

$$\|\mathbf{w}(k+1)\|^2 \leq \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2$$


or, equivalently,

$$\|\mathbf{w}(k+1)\|^2 - \|\mathbf{w}(k)\|^2 \leq \|\mathbf{x}(k)\|^2, \quad k = 1, \ldots, n \qquad (4.18)$$

Adding these inequalities for $k = 1, \ldots, n$, and invoking the initial condition $\mathbf{w}(0) = \mathbf{0}$, we get the following condition:

$$\|\mathbf{w}(n+1)\|^2 \leq \sum_{k=1}^{n} \|\mathbf{x}(k)\|^2 \leq n\beta \qquad (4.19)$$

where $\beta$ is a positive number defined by

$$\beta = \max_{\mathbf{x}(k) \in \mathcal{X}_1} \|\mathbf{x}(k)\|^2 \qquad (4.20)$$

Equation (4.19) states that the squared Euclidean norm of the weight vector w(n + 1) grows at most linearly with the number of iterations n.

Clearly, the second result of Eq. (4.19) is in conflict with the earlier result of Eq. (4.15) for sufficiently large values of $n$. Indeed, we can state that $n$ cannot be larger than some value $n_{\max}$ for which Eqs. (4.15) and (4.19) are both satisfied with the equality sign. That is, $n_{\max}$ is the solution of the equation

$$\frac{n_{\max}^2 \alpha^2}{\|\mathbf{w}_0\|^2} = n_{\max}\beta$$

Solving for $n_{\max}$ for a solution vector $\mathbf{w}_0$, we find that

$$n_{\max} = \frac{\beta\,\|\mathbf{w}_0\|^2}{\alpha^2} \qquad (4.21)$$

We have thus proved that for $\eta(n) = 1$ for all $n$, and $\mathbf{w}(0) = \mathbf{0}$, and given that a solution vector $\mathbf{w}_0$ exists, the rule for adapting the synaptic weights connecting the association units to the response unit of the perceptron must terminate after at most $n_{\max}$ iterations. Note also from Eqs. (4.11), (4.20), and (4.21) that there is no unique solution for $\mathbf{w}_0$ or $n_{\max}$.

We may now state the fixed-increment convergence theorem for a single-layer perceptron as follows (Rosenblatt, 1962):

Let the subsets of training vectors $\mathcal{X}_1$ and $\mathcal{X}_2$ be linearly separable. Let the inputs presented to the single-layer perceptron originate from these two subsets. The perceptron converges after some $n_0$ iterations, in the sense that

$$\mathbf{w}(n_0) = \mathbf{w}(n_0+1) = \mathbf{w}(n_0+2) = \cdots$$

is a solution vector for $n_0 \leq n_{\max}$.
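The bound $n_{\max} = \beta\|\mathbf{w}_0\|^2/\alpha^2$ can be probed numerically. The sketch below (Python with NumPy; the synthetic data, the separating vector standing in for $\mathbf{w}_0$, and the sample sizes are all illustrative choices of ours) flips the sign of the class-$\mathcal{C}_2$ vectors so that every pattern must satisfy $\mathbf{w}^T\mathbf{z} > 0$, runs the fixed-increment rule with $\eta = 1$ and $\mathbf{w}(0) = \mathbf{0}$, and compares the number of corrections with the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_class, p = 50, 2

# Two well-separated Gaussian clouds: class C1 around (3, 3), class C2 around (-3, -3)
X1 = rng.standard_normal((n_per_class, p)) + 3.0
X2 = rng.standard_normal((n_per_class, p)) - 3.0
X = np.vstack([X1, X2])
d = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

Xa = np.hstack([-np.ones((X.shape[0], 1)), X])   # augmented inputs, Eq. (4.3)
Z = Xa * d[:, None]                              # sign-flip C2 patterns: need w @ z > 0 for all z

# A known separating vector (zero threshold, direction mu1 - mu2) plays the role of w0
w0 = np.array([0.0, 6.0, 6.0])
alpha = np.min(Z @ w0)                           # as in Eq. (4.11)
beta = np.max(np.sum(Z**2, axis=1))              # as in Eq. (4.20)
n_max = beta * np.dot(w0, w0) / alpha**2         # Eq. (4.21)

w = np.zeros(p + 1)                              # w(0) = 0, eta = 1
corrections, changed = 0, True
while changed:                                   # cycle until a full pass makes no correction
    changed = False
    for z in Z:
        if w @ z <= 0:                           # error (the tie w @ z = 0 is counted as an error)
            w = w + z
            corrections += 1
            changed = True

print(f"corrections = {corrections}, bound n_max = {n_max:.1f}")
```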

Consider next the absolute error-correction procedure for the adaptation of a single-layer perceptron, for which $\eta(n)$ is variable. In particular, let $\eta(n)$ be the smallest integer for which

$$\eta(n)\,\mathbf{x}^T(n)\mathbf{x}(n) > |\mathbf{w}^T(n)\mathbf{x}(n)|$$

With this procedure we find that if the inner product $\mathbf{w}^T(n)\mathbf{x}(n)$ at iteration $n$ has an incorrect sign, then $\mathbf{w}^T(n+1)\mathbf{x}(n)$ at iteration $n+1$ would have the correct sign. This suggests that if $\mathbf{w}^T(n)\mathbf{x}(n)$ has an incorrect sign, we may modify the training sequence at iteration $n+1$ by setting $\mathbf{x}(n+1) = \mathbf{x}(n)$. In other words, each pattern is presented repeatedly to the perceptron until that pattern is classified correctly.
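Taking the inequality above at face value, the required $\eta(n)$ is simply the smallest integer strictly greater than $|\mathbf{w}^T(n)\mathbf{x}(n)| / \mathbf{x}^T(n)\mathbf{x}(n)$; a small sketch (Python, with made-up values of ours) illustrates the point.

```python
import numpy as np

def absolute_correction_eta(w, x):
    """Smallest integer eta with eta * (x @ x) > abs(w @ x)."""
    ratio = abs(w @ x) / (x @ x)
    return int(np.floor(ratio)) + 1

w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 1.0, 1.0])
eta = absolute_correction_eta(w, x)
print(eta, eta * (x @ x) > abs(w @ x))   # the update eta * x is then large enough to flip the sign of w^T x
```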

Note also that the use of an initial value w(0) different from the null condition merely results in a decrease or increase in the number of iterations required to converge, depending


on how $\mathbf{w}(0)$ relates to the solution $\mathbf{w}_0$. Hence, regardless of the value assigned to $\mathbf{w}(0)$, the single-layer perceptron of Fig. 4.1 is assured of convergence.

Summary

In Table 4.1 we present a summary of the perceptron convergence algorithm (Lippmann, 1987). The symbol $\mathrm{sgn}(\cdot)$, used in step 3 of the table for computing the actual response of the perceptron, stands for the signum function:

$$\mathrm{sgn}(v) = \begin{cases} +1 & \text{if } v > 0 \\ -1 & \text{if } v < 0 \end{cases} \qquad (4.22)$$

TABLE 4.1 Summary of the Perceptron Convergence Algorithm

Variables and Parameters:

$\mathbf{x}(n)$ = $(p+1)$-by-1 input vector $= [-1, x_1(n), x_2(n), \ldots, x_p(n)]^T$
$\mathbf{w}(n)$ = $(p+1)$-by-1 weight vector $= [\theta(n), w_1(n), w_2(n), \ldots, w_p(n)]^T$
$\theta(n)$ = threshold
$y(n)$ = actual response (quantized)
$d(n)$ = desired response
$\eta$ = learning-rate parameter, a positive constant less than unity

Step 1: Initialization. Set $\mathbf{w}(0) = \mathbf{0}$. Then perform the following computations for time $n = 1, 2, \ldots$.

Step 2: Activation. At time $n$, activate the perceptron by applying the continuous-valued input vector $\mathbf{x}(n)$ and desired response $d(n)$.

Step 3: Computation of Actual Response. Compute the actual response of the perceptron:

$$y(n) = \mathrm{sgn}[\mathbf{w}^T(n)\mathbf{x}(n)]$$

where $\mathrm{sgn}(\cdot)$ is the signum function.

Step 4: Adaptation of Weight Vector. Update the weight vector of the perceptron:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n)$$

where

$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2 \end{cases}$$

Step 5: Increment time $n$ by one unit, and go back to step 2.
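The five steps of Table 4.1 translate directly into a short training loop. The sketch below is one possible rendering in Python with NumPy (the data set, learning rate, and the stopping criterion of one error-free pass are illustrative choices of ours, not part of the table).

```python
import numpy as np

def sgn(v):
    """Signum function of Eq. (4.22); the tie v = 0 is mapped to -1 here."""
    return +1 if v > 0 else -1

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron convergence algorithm, following Table 4.1.

    X : (N, p) array of raw input vectors
    d : (N,) array of desired responses, +1 for class C1 and -1 for class C2
    """
    N, p = X.shape
    Xa = np.hstack([-np.ones((N, 1)), X])    # first element fixed at -1, Eq. (4.3)
    w = np.zeros(p + 1)                      # Step 1: initialization, w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(Xa, d):          # Step 2: activation
            y_n = sgn(w @ x_n)               # Step 3: actual response, Eq. (4.23)
            w = w + eta * (d_n - y_n) * x_n  # Step 4: adaptation, Eq. (4.25)
            errors += int(y_n != d_n)
        if errors == 0:                      # no corrections in a full pass: converged
            break
    return w                                 # w[0] is the threshold, w[1:] the synaptic weights

# Illustrative linearly separable data
rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((30, 2)) + 2.5,
               rng.standard_normal((30, 2)) - 2.5])
d = np.concatenate([np.ones(30), -np.ones(30)])
w = train_perceptron(X, d)
```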


In effect, the signum function is a shorthand notation for the asymmetric form of hard limiting. Accordingly, the quantized response y(n) of the perceptron is written in the compact form

$$y(n) = \mathrm{sgn}[\mathbf{w}^T(n)\mathbf{x}(n)] \qquad (4.23)$$

Note that the input vector $\mathbf{x}(n)$ is a $(p+1)$-by-1 vector whose first element is fixed at $-1$ throughout the computation, in accordance with the definition of Eq. (4.3). Correspondingly, the weight vector $\mathbf{w}(n)$ is a $(p+1)$-by-1 vector whose first element equals the threshold $\theta(n)$, as defined in Eq. (4.4). In Table 4.1 we have also introduced a quantized desired response $d(n)$, defined by

$$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_1 \\ -1 & \text{if } \mathbf{x}(n) \text{ belongs to class } \mathcal{C}_2 \end{cases} \qquad (4.24)$$

Thus, the adaptation of the weight vector w(n) is summed up nicely in the form of an error-correction learning rule, as shown by

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n) \qquad (4.25)$$

where $\eta$ is the learning-rate parameter, and the difference $d(n) - y(n)$ plays the role of an error signal. The learning-rate parameter is a positive constant limited to the range $0 < \eta \leq 1$. In assigning a value to it inside this range, we have to keep in mind two conflicting requirements (Lippmann, 1987):

■ Averaging of past inputs to provide stable weight estimates, which requires a small $\eta$.
■ Fast adaptation with respect to real changes in the underlying distributions of the process responsible for the generation of the input vector $\mathbf{x}$, which requires a large $\eta$.

4.4 Performance Measure

In the previous section we derived the perceptron convergence algorithm without any reference to a performance measure, yet in Chapter 2 we emphasized that improvement in the performance of a learning machine takes place over time in accordance with a performance measure. What, then, is a suitable performance measure for the single-layer perceptron? An obvious choice would be the average probability of classification error, defined as the average probability of the perceptron making a decision in favor of a particular class when the input vector is drawn from the other class. Unfortunately, such a performance measure does not lend itself readily to the analytic derivation of a learning algorithm. Shynk (1990) has proposed the use of a performance function that befits the operation of a perceptron, as shown by the statistical expectation

$$J = -E[e(n)v(n)] \qquad (4.26)$$

where $e(n)$ is the error signal defined as the difference between the desired response $d(n)$ and the actual response $y(n)$ of the perceptron, and $v(n)$ is the linear combiner output (i.e., hard-limiter input). Note that the error signal $e(n)$ is not a quantized signal, but rather the difference between two quantized signals. The instantaneous estimate of the performance function is the time-varying function

$$\hat{J}(n) = -e(n)v(n) = -[d(n) - y(n)]v(n) \qquad (4.27)$$

Page 9: Chapter 4

114 4 I The Perceptron

The instantaneous gradient vector is defined as the derivative of the estimate $\hat{J}(n)$ with respect to the weight vector $\mathbf{w}(n)$, as shown by

$$\nabla_{\mathbf{w}}\hat{J}(n) = \frac{\partial \hat{J}(n)}{\partial \mathbf{w}(n)} \qquad (4.28)$$

Hence, recognizing that the desired response $d(n)$ is independent of $\mathbf{w}(n)$, and the fact that the actual response $y(n)$ has a constant value equal to $-1$ or $+1$, we find that the use of Eq. (4.27) in (4.28) yields²

$$\nabla_{\mathbf{w}}\hat{J}(n) = -[d(n) - y(n)]\frac{\partial v(n)}{\partial \mathbf{w}(n)} \qquad (4.29)$$

From the defining equation (4.5), we readily find that³

$$\frac{\partial v(n)}{\partial \mathbf{w}(n)} = \mathbf{x}(n) \qquad (4.30)$$

Thus the use of Eq. (4.30) in (4.29) yields the result

$$\nabla_{\mathbf{w}}\hat{J}(n) = -[d(n) - y(n)]\mathbf{x}(n) \qquad (4.31)$$

Hence, according to the error-correction rule described in Chapter 2 we may now express the change applied to the weight vector as

$$\Delta\mathbf{w}(n) = -\eta\nabla_{\mathbf{w}}\hat{J}(n) = \eta[d(n) - y(n)]\mathbf{x}(n) \qquad (4.32)$$

where $\eta$ is the learning-rate parameter. This is precisely the correction made to the weight vector as we progress from iteration $n$ to $n+1$, as described in Eq. (4.25). We have thus demonstrated that the instantaneous performance function $\hat{J}(n)$ defined in Eq. (4.27) is indeed the correct performance measure for a single-layer perceptron (Shynk, 1990).

² To be precise, we should also include a Dirac delta function (i.e., impulse function) whenever the perceptron output $y(n)$ goes through a transition at $v(n) = 0$. Averaged over a long sequence of iterations, however, such a transition has a negligible effect.

³ Consider a scalar quantity $v$ defined as the inner product of two real-valued $p$-by-1 vectors $\mathbf{w}$ and $\mathbf{x}$, as shown by

$$v = \mathbf{w}^T\mathbf{x} = \sum_{i=1}^{p} w_i x_i$$

where $w_i$ and $x_i$ are the $i$th elements of the vectors $\mathbf{w}$ and $\mathbf{x}$, respectively. The partial derivative of the quantity $v$ with respect to the vector $\mathbf{w}$ is defined by (Haykin, 1991)

$$\frac{\partial v}{\partial \mathbf{w}} = \left[\frac{\partial v}{\partial w_1}, \frac{\partial v}{\partial w_2}, \ldots, \frac{\partial v}{\partial w_p}\right]^T$$

From the given definition of $v$, we note that

$$\frac{\partial v}{\partial w_i} = x_i, \qquad i = 1, 2, \ldots, p$$

We therefore have

$$\frac{\partial v}{\partial \mathbf{w}} = \mathbf{x}$$
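A short numerical check (Python, with toy values of our own) makes the equivalence concrete: the gradient-descent correction $-\eta\nabla_{\mathbf{w}}\hat{J}(n)$ of Eq. (4.32) is built from the error $e(n) = d(n) - y(n)$ and coincides exactly with the update of Eq. (4.25).

```python
import numpy as np

eta = 0.5
x = np.array([-1.0, 0.8, -1.2])   # augmented input, first element fixed at -1
w = np.array([0.2, -0.4, 0.3])    # current weights, w[0] playing the role of the threshold

v = w @ x                         # linear combiner output, Eq. (4.5)
y = +1 if v > 0 else -1           # quantized response, Eq. (4.23)
d = +1                            # suppose x actually belongs to class C1
e = d - y                         # error signal (here e = 2: a misclassification)

J_hat = -e * v                    # instantaneous performance estimate, Eq. (4.27); positive on an error
grad = -e * x                     # instantaneous gradient, Eq. (4.31)
delta_w = -eta * grad             # Eq. (4.32)
assert np.allclose(delta_w, eta * (d - y) * x)   # same correction as Eq. (4.25)
```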

4.5 Maximum-Likelihood Gaussian Classifier

A single-layer perceptron bears a close relationship to a classical pattern classifier known as the maximum-likelihood Gaussian classifier, in that they are both examples of linear classifiers (Duda and Hart, 1973).

The maximum-likelihood method is a classical parameter-estimation procedure that views the parameters as quantities whose values are fixed but unknown. The best estimate is defined to be the one that maximizes the probability of obtaining the samples actually observed. To be specific, suppose that we separate a set of samples according to class, so that we have $M$ subsets of samples denoted by $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_M$. Let the samples in subset $\mathcal{X}_j$ be drawn independently according to a probability law of known parametric form. Let this probability law be described by the conditional probability density function $f(\mathbf{x}|\mathbf{w}_j)$, where $\mathbf{x}$ is an observation vector and $\mathbf{w}_j$ represents an unknown parameter vector associated with class $\mathcal{C}_j$, $j = 1, 2, \ldots, M$. We further assume that the samples in $\mathcal{X}_i$ give no information about $\mathbf{w}_j$ if $i \neq j$; that is, the parameters for the different classes are functionally independent. Now, viewed as a function of $\mathbf{w}$, $f(\mathbf{x}|\mathbf{w})$ is called the likelihood of $\mathbf{w}$ with respect to the observation vector $\mathbf{x}$. The maximum-likelihood estimate of $\mathbf{w}$ is, by definition, that value $\hat{\mathbf{w}}$ which maximizes the likelihood $f(\mathbf{x}|\mathbf{w})$.

To proceed with the issue at hand, consider a $p$-by-1 observation vector $\mathbf{x}$ defined by

$$\mathbf{x} = [x_1, x_2, \ldots, x_p]^T \qquad (4.33)$$

Note that the vector $\mathbf{x}$ as defined here is confined exclusively to the externally applied inputs. The observation vector $\mathbf{x}$ is described in terms of a mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{C}$, which are defined by, respectively,

$$\boldsymbol{\mu} = E[\mathbf{x}] \qquad (4.34)$$

and

$$\mathbf{C} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \qquad (4.35)$$

where E is the expectation operator. Assuming that the vector x is Gaussian-distributed, we may express the joint-probability density function of the elements of x as follows (Van Trees, 1968):

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}(\det \mathbf{C})^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{C}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right] \qquad (4.36)$$

where $\det \mathbf{C}$ is the determinant of the covariance matrix $\mathbf{C}$, the matrix $\mathbf{C}^{-1}$ is the inverse of $\mathbf{C}$, and $p$ is the dimension of the vector $\mathbf{x}$.

To be more specific, suppose that we have a two-class problem, with the vector $\mathbf{x}$ characterized as follows, depending on whether it belongs to class $\mathcal{C}_1$ or class $\mathcal{C}_2$:

$$\mathbf{x} \in \mathcal{C}_1: \quad \text{mean vector} = \boldsymbol{\mu}_1, \quad \text{covariance matrix} = \mathbf{C}$$
$$\mathbf{x} \in \mathcal{C}_2: \quad \text{mean vector} = \boldsymbol{\mu}_2, \quad \text{covariance matrix} = \mathbf{C}$$


In other words, the vector $\mathbf{x}$ has a mean vector whose value depends on whether it belongs to class $\mathcal{C}_1$ or class $\mathcal{C}_2$, and a covariance matrix that is the same for both classes. It is further assumed that:

■ The classes $\mathcal{C}_1$ and $\mathcal{C}_2$ are equiprobable.
■ The samples of class $\mathcal{C}_1$ or $\mathcal{C}_2$ are correlated, so that the covariance matrix $\mathbf{C}$ is nondiagonal.
■ The covariance matrix $\mathbf{C}$ is nonsingular, so that the inverse matrix $\mathbf{C}^{-1}$ exists.

We may then use Eq. (4.36) to express the joint-probability density function of the input vector $\mathbf{x}$ as follows:

$$f(\mathbf{x}|\mathcal{C}_i) = \frac{1}{(2\pi)^{p/2}(\det \mathbf{C})^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T\mathbf{C}^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right] \qquad (4.37)$$

where the index i = 1, 2. Taking the natural logarithm of both sides of Eq. (4.37) and then expanding terms, we get

$$\ln f(\mathbf{x}|\mathcal{C}_i) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln(\det \mathbf{C}) - \frac{1}{2}\mathbf{x}^T\mathbf{C}^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^T\mathbf{C}^{-1}\boldsymbol{\mu}_i \qquad (4.38)$$

The first three terms on the right-hand side of Eq. (4.38) are independent of the index $i$. Hence, for the purpose of pattern classification considered here, we may define a log-likelihood as follows:

$$l_i(\mathbf{x}) = \boldsymbol{\mu}_i^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_i^T\mathbf{C}^{-1}\boldsymbol{\mu}_i \qquad (4.39)$$

where $i = 1, 2$. For class $\mathcal{C}_1$, the log-likelihood is

$$l_1(\mathbf{x}) = \boldsymbol{\mu}_1^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\mathbf{C}^{-1}\boldsymbol{\mu}_1 \qquad (4.40)$$

For class $\mathcal{C}_2$, the log-likelihood is

$$l_2(\mathbf{x}) = \boldsymbol{\mu}_2^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_2^T\mathbf{C}^{-1}\boldsymbol{\mu}_2 \qquad (4.41)$$

Hence, subtracting Eq. (4.41) from (4.40), we get

$$l = l_1(\mathbf{x}) - l_2(\mathbf{x}) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{C}^{-1}\mathbf{x} - \frac{1}{2}\left(\boldsymbol{\mu}_1^T\mathbf{C}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T\mathbf{C}^{-1}\boldsymbol{\mu}_2\right) \qquad (4.42)$$

which is linearly related to the input vector $\mathbf{x}$. To emphasize this property of the composite log-likelihood $l$, we rewrite Eq. (4.42) in the equivalent form

$$l = \hat{\mathbf{w}}^T\mathbf{x} - \theta = \sum_{i=1}^{p} \hat{w}_i x_i - \theta \qquad (4.43)$$

where $\hat{\mathbf{w}}$ is the maximum-likelihood estimate of the unknown parameter vector $\mathbf{w}$, defined by

$$\hat{\mathbf{w}} = \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad (4.44)$$

and $\theta$ is a constant threshold, defined by

$$\theta = \frac{1}{2}\left(\boldsymbol{\mu}_1^T\mathbf{C}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T\mathbf{C}^{-1}\boldsymbol{\mu}_2\right) \qquad (4.45)$$


The maximum-likelihood Gaussian classifier, or Gaussian classifier for short, may thus be implemented by a linear combiner as shown in the signal-flow graph of Fig. 4.5.

Equations (4.44) and (4.45) simplify considerably for the following special case. All the samples of the vector $\mathbf{x}$ are uncorrelated with each other, and also have a common variance $\sigma^2$, regardless of the associated class; that is, the covariance matrix $\mathbf{C}$ is a diagonal matrix defined by

$$\mathbf{C} = \sigma^2\mathbf{I} \qquad (4.46)$$

where I is the p-by-p identity matrix. We now have

$$\mathbf{C}^{-1} = \frac{1}{\sigma^2}\mathbf{I} \qquad (4.47)$$

Accordingly, Eqs. (4.44) and (4.45) simplify to, respectively,

$$\hat{\mathbf{w}} = \frac{1}{\sigma^2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad (4.48)$$

and

$$\theta = \frac{1}{2\sigma^2}\left(\|\boldsymbol{\mu}_1\|^2 - \|\boldsymbol{\mu}_2\|^2\right) \qquad (4.49)$$

where $\|\cdot\|$ denotes the Euclidean norm of the enclosed vector. In any case, treating $l$ as the overall classifier output, the Gaussian classifier, characterized by the parameter vector $\hat{\mathbf{w}}$ and threshold $\theta$, solves the pattern-classification problem in accordance with the following rule:

If $l \geq 0$, then $l_1 \geq l_2$, and therefore assign $\mathbf{x}$ to class $\mathcal{C}_1$.    (4.50)
If $l < 0$, then $l_2 > l_1$, and therefore assign $\mathbf{x}$ to class $\mathcal{C}_2$.
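The classifier defined by Eqs. (4.44), (4.45), and (4.50) can be sketched in a few lines. In practice the mean vectors and the covariance matrix would be estimated from training samples; in the sketch below (Python with NumPy) they are simply assumed known, and all numerical values are illustrative choices of ours.

```python
import numpy as np

def gaussian_classifier(mu1, mu2, C):
    """Weight vector and threshold of the maximum-likelihood Gaussian classifier, Eqs. (4.44)-(4.45)."""
    C_inv = np.linalg.inv(C)
    w_hat = C_inv @ (mu1 - mu2)
    theta = 0.5 * (mu1 @ C_inv @ mu1 - mu2 @ C_inv @ mu2)
    return w_hat, theta

def classify(x, w_hat, theta):
    """Decision rule of Eq. (4.50): class C1 if l >= 0, otherwise class C2."""
    l = w_hat @ x - theta
    return 1 if l >= 0 else 2

mu1 = np.array([2.0, 1.0])
mu2 = np.array([-1.0, -1.5])
C = np.array([[1.0, 0.3],
              [0.3, 2.0]])
w_hat, theta = gaussian_classifier(mu1, mu2, C)
print(classify(np.array([1.5, 0.0]), w_hat, theta))

# Special case C = sigma^2 I of Eqs. (4.46)-(4.49)
sigma2 = 1.0
w_simple = (mu1 - mu2) / sigma2
theta_simple = (np.dot(mu1, mu1) - np.dot(mu2, mu2)) / (2.0 * sigma2)
```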

The operation of the Gaussian classifier is analogous to that of the single-layer perceptron in that they are both linear classifiers; see Eqs. (4.2) and (4.43). There are, however, some subtle and important differences between them, which should be carefully noted:

■ The single-layer perceptron operates on the premise that the patterns to be classified are linearly separable. The Gaussian distributions of the two patterns assumed in the derivation of the maximum-likelihood Gaussian classifier certainly overlap each other and are therefore not exactly separable; the extent of the overlap is determined by the mean vectors $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ and the covariance matrices $\mathbf{C}_1$ and $\mathbf{C}_2$. The nature of this overlap is illustrated in Fig. 4.6 for the special case of a scalar input $x$ (i.e., input dimension $p = 1$). When the inputs are nonseparable and their distributions overlap as described here, the perceptron convergence algorithm develops a problem in that the decision boundaries between the different classes may oscillate continuously (Lippmann, 1987).

FIGURE 4.5 Signal-flow graph of the Gaussian classifier.

FIGURE 4.6 Two overlapping, one-dimensional Gaussian distributions.

■ The maximum-likelihood Gaussian classifier minimizes the average probability of classification error. This minimization is independent of the overlap between the underlying Gaussian distributions of the two classes. For example, in the special case illustrated in Fig. 4.6, the maximum-likelihood Gaussian classifier always positions the decision boundary at the point where the two Gaussian distributions cross each other. This positioning of the decision boundary is perfectly in accord with the optimum Bayes hypothesis-testing procedure on the assumption that the two classes are equiprobable (i.e., there is no prior information to be incorporated into the decision making).

■ The perceptron convergence algorithm is nonparametric in the sense that it makes no assumptions concerning the form of the underlying distributions; it operates by concentrating on errors that occur where the distributions overlap. It may thus be more robust than classical techniques, and may work well when the inputs are generated by nonlinear physical mechanisms whose distributions are heavily skewed and non-Gaussian (Lippmann, 1987). In contrast, the maximum-likelihood Gaussian classifier is parametric; its derivation is contingent on the assumption that the underlying distributions are Gaussian, which may limit its area of application.

■ The perceptron convergence algorithm is both adaptive and simple to implement; its storage requirement is confined to the set of synaptic weights and threshold. On the other hand, the design of the maximum-likelihood Gaussian classifier is fixed; it can be made adaptive, but at the expense of increased storage requirement and more complex computations (Lippmann, 1987).

4.6 Discussion

The study of the single-layer perceptron presented in this chapter has been formulated in the context of the McCulloch-Pitts formal model of a neuron. The nonlinear element of this model is represented by a hard limiter at its output end. It would be tempting to think that we can do better if we were to use a sigmoidal nonlinear element in place of the hard limiter. Well, it turns out that the steady-state, decision-making characteristics of a single-layer perceptron are basically the same, regardless of whether we use hard-limiting or soft-limiting as the source of nonlinearity in the neural model (Shynk and Bershad,


1991, 1992; Shynk, 1990). We may therefore state formally that so long as we limit ourselves to the model of a neuron that consists of a linear combiner followed by a nonlinear element, then regardless of the form of nonlinearity used, a single-layer perceptron can perform pattern classification only on linearly separable patterns.

Linear separability requires that the patterns to be classified be sufficiently separated from each other to ensure that the decision surfaces consist of hyperplanes. This requirement is illustrated in Fig. 4.7a for the case of a two-dimensional, single-layer perceptron. If now the two patterns $\mathcal{C}_1$ and $\mathcal{C}_2$ are allowed to move too close to each other, as in Fig. 4.7b, they become nonlinearly separable and the elementary perceptron fails to classify them.

FIGURE 4.7 (a) A pair of linearly separable patterns. (b) A pair of nonlinearly separable patterns.

The first real critique of Rosenblatt's perceptron was presented by Minsky and Selfridge (1961). Minsky and Selfridge pointed out that the perceptron as defined by Rosenblatt could not even generalize toward the notion of binary parity, let alone make general abstractions. They also suggested the roles that connectionist networks might play in the implementation of larger systems.

The computational limitations of Rosenblatt's perceptron were subsequently put on a solid mathematical foundation in the famous book by Minsky and Papert (1969, 1988). This is a thorough and well-written book. After the presentation of some brilliant and highly detailed mathematical analysis of the perceptron, Minsky and Papert proved that the perceptron as defined by Rosenblatt is inherently incapable of making some global generalizations on the basis of locally learned examples. In the last chapter of their book, Minsky and Papert go on to make the conjecture that the limitations of the kind they had discovered for Rosenblatt's perceptron would also hold true for its variants, more specifically, multilayer neural networks. Quoting from Section 13.2 of the book by Minsky and Papert (1969):


The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension to multilayer systems is sterile.

This conjecture cast serious doubt on the computational capabilities of not only the perceptron but neural networks in general up to the mid-1980s.

History has shown, however, that the conjecture made by Minsky and Papert seems to be unjustified in that we now have several advanced forms of neural networks that are computationally more powerful than Rosenblatt’s perceptron. For example, multilayer perceptrons trained with the back-propagation algorithm discussed in Chapter 6 and the radial basis-function networks discussed in Chapter 7 overcome the computational limitations of the single-layer perceptron.

PROBLEMS

4.1 The single-layer perceptron is a linear pattern classifier. Justify this statement.

4.2 Consider two one-dimensional, Gaussian-distributed classes $\mathcal{C}_1$ and $\mathcal{C}_2$ that have a common variance equal to 1. Their mean values are as follows:

$$\mu_1 = -10$$
$$\mu_2 = +10$$

These two classes are essentially linearly separable. Design a classifier that separates these two classes.

4.3 Verify that Eqs. (4.22)-(4.25), summarizing the perceptron convergence algorithm, are consistent with Eqs. (4.7) and (4.8).

4.4 Equations (4.44) and (4.45) define the weight vector and threshold of a maximum-likelihood Gaussian classifier, assuming that the samples of classes $\mathcal{C}_1$ or $\mathcal{C}_2$ are correlated. Suppose now that these samples are uncorrelated but have different variances, so that for both classes we may write

$$\mathbf{C} = \mathrm{diag}[\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2]$$

Find the weight vector and threshold of the classifier so defined.

4.5 In Eq. (4.27) we defined one form of an instantaneous performance function for a single-layer perceptron. Another way of expressing this performance measure is as follows (Shynk, 1990):

$$\hat{J}(n) = |v(n)| - d(n)v(n)$$

where $v(n)$ is the hard-limiter input and $d(n)$ is the desired response. Using this performance measure, find the instantaneous gradient vector $\nabla_{\mathbf{w}}\hat{J}(n)$, and show that the expression so derived is identical with that in Eq. (4.31).

4.6 The perceptron may be used to perform numerous logic functions. Demonstrate the implementation of the binary logic functions AND, OR, and COMPLEMENT. However, a basic limitation of the perceptron is that it cannot implement the EXCLUSIVE OR function. Explain the reason for this limitation.