IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 4, NO. 1, JANUARY 1993

Letters

Cascade Neural Network for Binary Mapping

G. Martinelli, F. M. Mascioli, and G. Bei

Abstract- The problem of choosing a suitable number of neurons for a neural network that realizes any given binary mapping is automatically solved by the proposed cascade architecture. The algorithm used, which is based on linear programming, the complexity of the resulting net, and its generalization capability are discussed.

I. INTRODUCTION

Multilayer perceptrons [1] can perform arbitrary binary mappings. However, no realistic a priori estimate can be made of the number of hidden neurons that are required. Several recently proposed methods get around this problem [2]-[4]. The number of neurons required by these algorithms exceeds the upper bound determined in [5]. It is therefore important to modify the previous methods in order to reduce, with respect to the said bound, the complexity of the resulting neural network. A possible solution to this problem is proposed here. Our method follows a geometrical approach. The geometrical properties of the classical parity problem are explored in order to introduce the particular structure of the perceptron we propose and the algorithm that automatically generates it. In the case of a complete mapping, we demonstrate that the complexity of the proposed architecture is better than the bound determined in [5], both in the parity case and when the number of inputs is less than five. The extension of this performance to a larger number of inputs is suggested by simulation results, some of which are shown in Section VI. Moreover, it is important to consider the case of incomplete mappings, where the main performance of interest is the generalization capability of the network. For this problem, a reasonable "smoothing criterion" is specified.

There are several analogies between our contribution and some papers belonging to the field denoted "Threshold Logic" [6], [7], but our point of view is substantially different. The algorithm we introduce automatically generates a neural network without a priori assumptions; it is always convergent and can accept either binary or real inputs. We focus our attention only on the binary case, mainly to bring out its geometrical aspects in a more intuitive way. The general case will be the object of future investigations.

II. NOTATIONS AND INTRODUCTORY REMARKS

The neurons we consider are all linear threshold units connected by variable weights to the inputs x_i (i = 1, 2, ..., N). We restrict our attention to a feedforward cascade. Therefore, the output y_h of the hth neuron is given by (1).

The threshold coincides with w_0 and will be considered as the weight of a connection to an extra input that is set to 1 for all the input patterns. The second term appearing in (1) represents the interaction among the neurons.

Manuscript received February 1, 1992. The authors are with the INFO-COM Department, Università di Roma, Rome, Italy. IEEE Log Number 9201744.
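Equation (1) itself is not reproduced above, so the following is only a minimal sketch of the form the surrounding description implies: a linear threshold unit whose threshold w_0 acts as the weight of an extra input fixed at 1, plus a second term through which the controlling neuron of the cascade acts on the unit. The symbol d for the control weight, the strict inequality, and the 0/1 output convention are assumptions.

```python
def neuron_output(w, x, d, y_next):
    """Assumed form of (1): a linear threshold unit of the cascade.

    w[0] is the threshold w_0, seen as the weight of an extra input fixed at 1;
    w[1:] are the weights of the N inputs; d * y_next is the second term,
    i.e., the interaction with the controlling neuron of the successive layer.
    """
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) + d * y_next
    return 1 if s > 0 else 0

# e.g., a 2-input unit whose effective threshold is raised by 2 when its controller fires
print(neuron_output([-0.5, 1.0, 1.0], (1, 0), -2.0, 0))   # -> 1
print(neuron_output([-0.5, 1.0, 1.0], (1, 0), -2.0, 1))   # -> 0
```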

The complete mapping is represented by 2^N examples, each constituted by an input pattern and the corresponding output. The input patterns are conveniently represented by the vertices of a hypercube in the N-dimensional input space, labeled with 1 (positive) or 0 (negative) depending on the corresponding output. We are now able to give the following definitions:

Definition 1: A vertex belongs to "class K" if its number of components equal to 1 is K, with 0 ≤ K ≤ N.

Definition 2: A mapping is "linearly separable" when, in the representative hypercube, the positive vertices can be separated from the negative ones by means of only one hyperplane. In the case of a general mapping, the positive and negative vertices require several hyperplanes (i.e., neurons) in order to be separated.

Definition 3: When the mapping to be realized is incomplete, the neural network must supply the missing part in a suitable way. This performance is what we call the "generalization capability" of the network.

Smoothness criterion: The missing part is geometrically represented in the hypercube by vertices without a label. A reasonable criterion, corresponding to a "smoothness strategy," is to label each of them in the same manner as one of the neighboring labeled vertices. This means that the separating hyperplanes are chosen without paying attention to the unlabeled vertices.

III. PARITY PROBLEM

In the parity problem the output should be 1 if the number of inputs x_i equal to 1 is odd, and 0 otherwise. Parity is often cited as a difficult problem for neural networks to learn. A known solution exists, consisting of a single layer of N hidden neurons connected to an output neuron [1]. The study of this problem in the input space is very rewarding. In fact, the positive and negative vertices belong exactly to the previously defined classes. In particular, class K contains positive vertices if K is odd.

Definition 4: We call "diagonal of the hypercube" the straight line joining the vertex belonging to class 0 (V_0) to the vertex belonging to class N (V_N). Then, the following property holds:

Parity geometrical property: The vertices of class K (1 ≤ K ≤ N - 1) lie in the hyperplane orthogonal to the diagonal and intersecting it at a distance K/√N from V_0. The equation of this hyperplane is:

\sum_{i=1}^{N} x_i = K.     (2)

Class K of vertices is consequently contained in the hyper-region delimited by two hyperplanes parallel to (2) and distant from vertex V_0, respectively, (K + b_1)/√N and (K - b_2)/√N, with 0 < b_1, b_2 < 1. Therefore, the parity problem can be solved by using a set of N hyperplanes parallel to (2), characterized by:

\sum_{i=1}^{N} x_i = 0.5 + K,     K = 0, 1, ..., N - 1.     (3)

Parity decision region: The said hyperplanes divide the hypercube of the input space into N + 1 hyper-regions, each containing the vertices of only one class. In the case of N = 3, the parity decision region is shown in Fig. 1.
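As a quick numerical check of this property, the short sketch below enumerates the vertices of the hypercube and counts how many of the N hyperplanes (3) lie below each vertex: the count identifies the hyper-region, and each hyper-region indeed contains the vertices of a single class, with the odd-indexed regions holding the positive vertices.

```python
from itertools import product

N = 3  # the case of Fig. 1; any N works
for x in product((0, 1), repeat=N):
    s = sum(x)                                        # class of the vertex
    region = sum(1 for K in range(N) if s > 0.5 + K)  # hyperplanes (3) lying below the vertex
    assert region == s                                # one class per hyper-region
    assert region % 2 == s % 2                        # odd regions hold the positive vertices
print("parity decision region verified for N =", N)
```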

As will be shown, such a decision region can be obtained by means of a feedforward cascade network where the output of each hidden neuron controls in a suitable way the threshold of the previous neuron, in order to generate parallel hyperplanes.

The application of the algorithm to the parity problem realizes the scheme of Fig. 2. The generated network requires a number M of neurons equal to the smallest integer greater than or equal to (N + 1)/2. Consequently, the resulting complexity is nearly half that required by the known approach.
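The weights of the scheme of Fig. 2 are not reproduced in the text, so the sketch below uses one hand-chosen set that is consistent with it: every neuron sees all the inputs with weight 1 and base threshold h - 0.5, and the controlling connection raises the threshold of neuron h by 2(M - h) when it fires. With M equal to the smallest integer not less than (N + 1)/2, this cascade reproduces parity on all 2^N vertices; it is an illustration of the architecture, not necessarily the solution CPA would generate.

```python
from itertools import product

def cascade_parity(x):
    """Hand-built cascade of M = ceil((N + 1) / 2) threshold neurons computing parity."""
    N = len(x)
    M = (N + 2) // 2           # smallest integer >= (N + 1) / 2
    s = sum(x)                 # every input weight is 1
    y = 0                      # the deepest neuron has no controller
    for h in range(M, 0, -1):  # evaluate from the deepest neuron up to the output neuron
        # base threshold h - 0.5, raised by 2 * (M - h) when the controlling neuron is active
        y = 1 if s - (h - 0.5) - 2 * (M - h) * y > 0 else 0
    return y

for N in range(2, 10):
    assert all(cascade_parity(x) == sum(x) % 2 for x in product((0, 1), repeat=N))
print("parity realized with ceil((N + 1) / 2) cascaded neurons for N = 2, ..., 9")
```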

IV. CASCADE PERCEPTRON

The scheme of Fig. 2 can be generalized as shown in Fig. 3. It will be denoted as "Cascade Perceptron" (CP) because of its form. The CP is characterized by M neurons, each connected to the N inputs and having its threshold controlled by the neuron of the successive layer. This control can be excitatory (when the connection weight is positive) or inhibitory (when it is negative). The resulting architecture is a feedforward cascade.

CP decision region: When a controlling neuron is active (i.e., its output is 1), the threshold of the controlled one changes depending on the sign and the value of the connection weight (d). More precisely, the controlled threshold (T) changes in the following way. Let d_e and d_i be, respectively, the excitatory and the inhibitory connection weights:

a) if the controlling neuron is excitatory, in the hyper-region where it is active T becomes T + d_e;

b) if the controlling neuron is inhibitory, in the hyper-region where it is active T becomes T - d_i.

It is important to note that the hyperplane generated by the controlling neuron can be incident or parallel to the one generated by the controlled neuron. Consequently, the decision region can be tooth-shaped in the first case or sliced in the second, as shown in Fig. 4. When the problem requires a sliced region (i.e., parallel hyperplanes), as in the binary case, the cascade architecture generated by our algorithm is very convenient in terms of the required number of neurons.
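The difference between the two regions of Fig. 4 can be seen with a crude two-neuron example, with weights chosen purely for illustration: when the controlling neuron uses the same input weights as the controlled one, the shifted hyperplane stays parallel and the region comes out sliced; when it uses different weights, the hyperplanes are incident and the region is tooth-shaped.

```python
def cp2(x, w1, t1, w2, t2, d):
    """Two-neuron CP: neuron 2 raises the threshold of neuron 1 by d where it is active."""
    y2 = 1 if w2[0] * x[0] + w2[1] * x[1] - t2 > 0 else 0
    y1 = 1 if w1[0] * x[0] + w1[1] * x[1] - (t1 + d * y2) > 0 else 0
    return y1

configs = {
    "sliced (parallel hyperplanes)":       dict(w1=(1, 1), t1=0.5, w2=(1, 1),  t2=1.5, d=2.0),
    "tooth-shaped (incident hyperplanes)": dict(w1=(1, 1), t1=0.5, w2=(1, -1), t2=0.5, d=2.0),
}
for name, p in configs.items():            # rough picture of the decision region over [0, 2]^2
    print(name)
    for i in range(10, -1, -1):
        print("".join("#" if cp2((0.2 * j, 0.2 * i), **p) else "." for j in range(11)))
    print()
```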

V. CP-GENERATING ALGORITHM

The CP-generating algorithm (CPA) is a modified version of the algorithm presented in [4]. In order to explain CPA, let us introduce the following definitions:

Linear constraint system: It is the system of the training examples written in a linear programming (LP) formalism, in which all the variables are nonnegative,

with c = 1, 2, ..., C, where C is the number of examples or linear constraints. The variables x_{k,c}^{(+)} and x_{k,c}^{(-)} are the kth input components of the cth positive and negative constraints; S_{w,k}, S_c^{(+)}, and S_c^{(-)} are surplus or slack variables. A is a positive quantity introduced for a better separation of the two classes of linear constraints. Its value, arbitrarily chosen, determines the CP "insensitiveness" to noise.
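The printed form of the constraint system is not reproduced above, so the sketch below shows one plausible layout built from the ingredients the text lists: nonnegative LP variables (each weight written as a difference u_k - v_k of two nonnegative variables), a surplus variable for every positive example, a slack variable for every negative example, and the separation constant A. The exact arrangement, signs, and variable names are assumptions.

```python
import numpy as np

def constraint_system(pos, neg, A=0.5):
    """One plausible equality-form layout of the training constraints.

    Variables (all >= 0): [u_0..u_N, v_0..v_N, S+_1..S+_P, S-_1..S-_Q],
    with weights w_k = u_k - v_k and w_0 acting on an extra input fixed at 1.
    Positive example c:  sum_k w_k x+_{k,c} - S+_c = +A
    Negative example c:  sum_k w_k x-_{k,c} + S-_c = -A
    """
    Xp = np.column_stack([np.ones(len(pos)), np.asarray(pos, float)])
    Xn = np.column_stack([np.ones(len(neg)), np.asarray(neg, float)])
    P, Q, n = len(Xp), len(Xn), Xp.shape[1]
    M = np.zeros((P + Q, 2 * n + P + Q))
    M[:P, :n], M[:P, n:2 * n] = Xp, -Xp    # weight variables of the positive constraints
    M[P:, :n], M[P:, n:2 * n] = Xn, -Xn    # weight variables of the negative constraints
    M[:P, 2 * n:2 * n + P] = -np.eye(P)    # surplus variables of the positives
    M[P:, 2 * n + P:] = np.eye(Q)          # slack variables of the negatives
    b = np.concatenate([A * np.ones(P), -A * np.ones(Q)])
    return M, b                            # a starting basis of surplus/slack variables is nonadmissible wherever b < 0

# XOR as an example: positives {01, 10}, negatives {00, 11}
M, b = constraint_system([(0, 1), (1, 0)], [(0, 0), (1, 1)])
print(M.shape, b)
```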


Fig. 1. Parity decision region for N = 3.


Fig. 2. Scheme for solving the parity problem.


Fig. 3. Cascade perceptron.

Fig. 4. Examples of CP decision regions in the case of two neurons and for N = 2: tooth-shaped region and sliced region.

Complementary system: It is the same system with the positive and negative constraints interchanged. In this case the surplus variables are S_{w,k}, S_c^{(-)}, and S_c^{(+)}.

At each step of CPA we solve an LP problem. The objective of each LP problem is to minimize the number of violated constraints. Since this objective cannot be handled in a linear form, we have to find a "trick" in order to utilize LP techniques. What we need is to determine, at each step, a hyperplane (i.e., a neuron) that divides the hypercube into two hyper-regions. Without loss of generality, we can state that the first hyper-region must contain all the positive vertices together with some negative vertices (violated constraints), and the second one must contain only negative vertices. We point out that the optimal choice requires maximizing the number of vertices contained in the second region. This goal can be achieved by a numerical sub-routine inspired by the classical simplex algorithm:

Sub-routine "CUT": Since the linear constraint system is in a nonadmissible form for the simplex, CUT drives the exchange of the base variables by means of successive pivot operations, in order to obtain an admissible system. Any fail-variable S_c^{(-)} different from 0 indicates the failure of the corresponding constraint. The sub-routine tries to avoid, wherever possible, the entry of fail-variables into the base. It proceeds as follows:

1) Choose the nonadmissible constraint whose constant term has maximum magnitude.
2) In this constraint, choose the first variable with a negative coefficient (y).
3) Consider the admissible constraints in which the coefficient of y is positive; calculate for them the ratio between the constant term and the coefficient of y, and carry out the exchange of y using the constraint for which this ratio is minimum.
4) Apply 2) and 3) again until there are no further nonadmissible constraints.
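CUT itself is the pivot heuristic just described; it cannot be reduced to a single LP solve because "minimize the number of violated constraints" is not a linear objective. As a stand-in, the sketch below does the same job approximately with an ordinary LP (via scipy.optimize.linprog): it finds a hyperplane that keeps every positive example on the positive side while minimizing the total slack of the negative constraints, and then reports which negative constraints end up strictly in the second region. The formulation, margin A, and tolerances are our assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def cut_step(pos, neg, A=0.5):
    """LP stand-in for one application of CUT.

    Variables: [w_0..w_N (free), s_1..s_Q (>= 0)].
    Hard constraints : w . [1, x+] >= A         for every positive example
    Soft constraints : w . [1, x-] <= -A + s_c  for every negative example
    Objective        : minimize sum_c s_c
    """
    Xp = np.column_stack([np.ones(len(pos)), np.asarray(pos, float)])
    Xn = np.column_stack([np.ones(len(neg)), np.asarray(neg, float)])
    n, Q = Xp.shape[1], len(Xn)
    c = np.concatenate([np.zeros(n), np.ones(Q)])
    A_ub = np.block([[-Xp, np.zeros((len(Xp), Q))],
                     [ Xn, -np.eye(Q)]])
    b_ub = -A * np.ones(len(Xp) + Q)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * Q)
    w = res.x[:n]
    solved = [i for i in range(Q) if Xn[i] @ w <= -A + 1e-6]  # constraints of the second region
    return w, solved

# AND function: the single positive vertex is linearly separable from the negatives,
# so a single cut should place all three negative constraints in the second region
w, solved = cut_step([(1, 1)], [(0, 0), (0, 1), (1, 0)])
print(w, solved)
```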

The number of exchange operations is usually much less than C.

In general mappings, after the first application of sub-routine CUT, some negative constraints turn out to be violated. This means that the first neuron needs a successive controlling neuron. The latter modifies the threshold of the first in order to solve the violated constraints. The same considerations apply to the new neuron. It is possible to demonstrate that at each application of CUT at least one of the negative constraints is solved. Since the examples corresponding to the solved constraints are disregarded, the number of examples to be considered (C) decreases with the successive applications. Consequently, CPA is always convergent. Moreover, it is important to note that the previous strategy is in accordance with the type of generalization we are interested in, as discussed in Section II. In conclusion, CPA is a recursive call of the following:

Procedure "GENERATION": Consider the examples remaining at the end of the previous call. Apply CUT to the corresponding linear constraint system. Let NC be the number of negative constraints of the second region and w_k the corresponding weights. Apply CUT to the complementary system (where the fail-variable is S_c^{(+)}). Let NC' denote the number of positive constraints of the second region and w_k' the corresponding weights. If NC > NC', choose w_k as the weights of the generated neuron; otherwise, choose w_k'. If the procedure is generating a controlling neuron, with the first choice it will be inhibitory and with the second excitatory. The connection weight (d) is treated as the weight of a further input to the controlled neuron. Stop when all the constraints are satisfied.
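A schematic rendering of this loop is sketched below: at each call it cuts both the direct and the complementary systems, keeps whichever cut strictly separates more constraints, discards the separated examples, and records one neuron, stopping when one of the two classes is exhausted. The LP-based cut is the same stand-in used above, and the sketch does not reproduce the actual wiring of each new neuron as a threshold controller of the previous one; it only illustrates the recursion and the convergence argument.

```python
import numpy as np
from scipy.optimize import linprog

def cut(pos, neg, A=0.5):
    """Hyperplane keeping all of `pos` on the positive side; returns its weights and the
    indices of the `neg` examples already placed strictly on the negative side."""
    Xp = np.column_stack([np.ones(len(pos)), np.asarray(pos, float)])
    Xn = np.column_stack([np.ones(len(neg)), np.asarray(neg, float)])
    n, Q = Xp.shape[1], len(Xn)
    res = linprog(np.concatenate([np.zeros(n), np.ones(Q)]),
                  A_ub=np.block([[-Xp, np.zeros((len(Xp), Q))], [Xn, -np.eye(Q)]]),
                  b_ub=-A * np.ones(len(Xp) + Q),
                  bounds=[(None, None)] * n + [(0, None)] * Q)
    w = res.x[:n]
    return w, [i for i in range(Q) if Xn[i] @ w <= -A + 1e-6]

def generation(pos, neg):
    """Schematic CPA outer loop (one neuron per call)."""
    neurons = []
    pos, neg = list(pos), list(neg)
    while pos and neg:
        w_d, sep_d = cut(pos, neg)   # direct system: second region contains only negatives
        w_c, sep_c = cut(neg, pos)   # complementary system: second region contains only positives
        if not sep_d and not sep_c:
            break                    # safeguard; the text argues this never happens
        if len(sep_d) >= len(sep_c):
            neurons.append(("inhibitory", w_d))
            neg = [v for i, v in enumerate(neg) if i not in sep_d]
        else:
            neurons.append(("excitatory", w_c))
            pos = [v for i, v in enumerate(pos) if i not in sep_c]
    return neurons

# XOR: positives {01, 10}, negatives {00, 11}; two neurons are expected (cf. Case N = 2 below)
print(generation([(0, 1), (1, 0)], [(0, 0), (1, 1)]))
```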

Remark: The computational cost of the previous algorithm can be reduced by resorting to an analog neural coprocessor [9].

VI. COMPLEXITY REQUIREMENTS AND CONCLUSIONS

CPA depends on the preliminary ordering of the training examples. In fact, sub-routine CUT carries out the exchange operations according to the scanning order of the constraints. For this reason, the number of neurons (M) of the generated CP can be different for the same problem. The optimal number coincides with the minimum number of hyperplanes necessary for shaping the decision region. The determination of this number is difficult, a conclusion easily drawn from the examination of similar problems [8]. In the binary case, and for N < 5, we demonstrate that CPA achieves the optimal number of required neurons with a proper ordering of the examples. This ordering leads CPA "to cut the hypercube into slices." The idea is inspired by the considerations made about the parity problem:

Ordering "SLICES": The training examples are ordered in succession class by class, starting from K = 0. In each class the examples are divided into two subclasses on the basis of the output. The first subclass of class K has the same output (1 or 0) as the second subclass of class K - 1.
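A sketch of this ordering follows. The handling of the first class and of classes whose second subclass is empty is our guess, since the text does not specify it; otherwise the sketch simply groups the examples class by class, splitting each class into two subclasses by output so that consecutive subclasses across neighboring classes share the same output.

```python
from itertools import product

def slices_ordering(examples):
    """Order a mapping {input tuple: output} class by class (class K = number of 1's),
    so that the first subclass of class K has the same output as the second subclass
    of class K - 1."""
    N = len(next(iter(examples)))
    ordered, lead = [], 0                  # `lead` = output that opens the current class
    for K in range(N + 1):
        cls = [x for x in examples if sum(x) == K]
        first = [x for x in cls if examples[x] == lead]
        second = [x for x in cls if examples[x] != lead]
        ordered += first + second
        if second:                         # the next class starts with the output that closed this one
            lead = examples[second[-1]]
    return ordered

# example: a 3-input mapping with output x1 XOR x2, so classes 1 and 2 are mixed
f = {x: x[0] ^ x[1] for x in product((0, 1), repeat=3)}
for x in slices_ordering(f):
    print(x, f[x])
```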

TABLE I
Maximum Number of Required Neurons

Number of Inputs    Parity Case    General Case
       5                 3               4
       6                 4               5
       7                 4               7
       8                 5              10
       9                 5              17

Case N = 2: In this case we need a maximum of two neurons, since we need two straight lines for separating the previous groups in the most difficult mapping (XOR).

Case N = 3: Class 1 contains three vertices, which can be divided into any two subclasses by a suitable plane. Since class 2 can also be divided in a similar way, we need a maximum of three planes. Two of them can be chosen parallel to one another, and hence a maximum of two neurons is required.

Case N = 4: Class 2 of the hypercube contains six vertices. These vertices lie on the hyperplane H orthogonal to the diagonal at a distance 2/√4 = 1 from vertex V_0. Since H is three-dimensional, we need, in the corresponding three-dimensional space, a maximum of two planes for separating the six vertices into groups of examples with the same output. It is possible to determine two hyperplanes, each containing one of these two planes, such as to separate class 2 from classes 1 and 3, respectively. Since classes 1 and 3 can each be divided into any two subclasses by a hyperplane, we can obtain any required separation of the vertices of the hypercube by means of five hyperplanes. Moreover, four of them can be arranged in two pairs of parallel hyperplanes. In conclusion, we need a maximum of three neurons.

If we compare the previous numbers with the upper bounds established in [5], i.e., six and eight neurons in the cases N = 3 and N = 4, we can conclude that the CP architecture is convenient. See also [7], where four units are required for N = 4.

Several experiments were carried out with a larger number of inputs. We considered randomized binary functions with N in the range 5 to 9 and with about 50% of the outputs equal to 1. Table I summarizes the CPA performance.

If we use a random ordering instead of SLICES, the number of neurons for the same problem generally increases only in a limited way. In the case of an incomplete mapping, CPA generates fewer neurons for the same number of inputs, since it determines the separating hyperplanes only on the basis of the labeled vertices.

REFERENCES

[1] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
[2] J. Nadal, "Study of a growth algorithm for neural networks," Int. J. Neural Syst., vol. 1, pp. 55-59, 1989.
[3] M. Mezard and J. Nadal, "Learning in feedforward layered networks: The tiling algorithm," J. Phys. A, vol. 22, pp. 2191-2203, 1989.
[4] G. Martinelli, L. Prina-Ricotti, S. Ragazzini, and F. M. Mascioli, "A pyramidal delayed perceptron," IEEE Trans. Circuits Syst., vol. 37, pp. 1176-1181, 1990.
[5] S. C. Huang and Y. F. Huang, "Bounds on the number of hidden neurons in multilayer perceptrons," IEEE Trans. Neural Networks, vol. 2, pp. 47-55, 1991.
[6] D. L. Ostapko and S. S. Yau, "Realization of an arbitrary switching function with a two-level network," IEEE Trans. Computers, pp. 262-269, Mar. 1970.
[7] S. Ghosh, D. Basu, and A. K. Choudhury, "Multigate synthesis of general Boolean functions by threshold logic elements," IEEE Trans. Computers, pp. 451-456, May 1969.
[8] P. E. O'Neil, "Hyperplane cuts of an N-cube," Discrete Mathematics, vol. 1, pp. 193-195, 1971.
[9] A. Rodriguez-Vazquez et al., "Switched-capacitor neural network for linear programming," Electron. Lett., vol. 24, pp. 496-498, 1988.