TRANSCRIPT
Learning in Feed Forward Layered Networks: The Tiling Algorithm
Authors: Marc Mézard and Jean-Pierre Nadal. Dated 21 June 1989.
Published in J. Phys. A: Math. Gen.
Presented by: Anwesha Das, Prachi Garg, Asmita Sharma
ABSTRACT
►Builds a feed forward layered network in order to learn any Boolean function of N Boolean units.
►The number of layers and the number of hidden units are not known in advance.
►Proposes an Algorithm for the network growth, adding layers & units within a layer.
►Convergence is guaranteed.
Outline
1. INTRODUCTION / BACKGROUND
2. TILING ALGORITHM DESCRIPTION
3. XOR PROBLEM
4. SIMULATIONS AND RESULTS
5. CONCLUSIONS AND REMARKS
6. FUTURE WORK
MOTIVATION
►The drawbacks of back propagation:
– The structure of the network has to be guessed.
– The error is not guaranteed to converge to an absolute minimum with zero error.
– It uses analog neurons even where digital neurons would be necessary.
► How does one determine the couplings between neurons in successive layers to build a network achieving a given task?
► How do we address the problem of generalisation, i.e. associating correct outputs with new inputs which are not present in the training set?
Introduction
►Creates a strictly feed forward network where connections exist from each layer only to the immediately succeeding layer.
► Units are added like tiles whenever they are needed. The first unit of each layer is the MASTER UNIT.
► The master unit is trained using the pocket algorithm and checked for the exact output.
►New ancillary units are added to the layer until we get a "faithful representation" of the problem.
– Each layer is constructed so that if two samples i, j belong to different classes, then some node in the layer produces different outputs for i and j.
– Equivalently, any two patterns with distinct outputs have distinct internal representations.
Introduction contd.
In the absence of the faithfulness condition, subsequent layers of nodes would be incapable of distinguishing between samples of different classes. Each subsequent layer will have fewer nodes than the preceding layer, until there is only one node in the outermost layer for the two-class classification problem.
Advantages of this strategy:
– The network structure is dynamically generated through learning, not fixed in advance.
– The network grows until it reaches a size which enables it to implement the desired mapping, facilitating generalisation.
– The algorithm always produces a finite number of layers, which ensures termination.
An Example
Related Work
►Grossman et al. (1988) implemented another strategy which uses digital neurons within a structure fixed beforehand, with one intermediate layer, fixing the hidden-unit values through a trial-and-error method.
►Rujan and Marchand (1988) formulated an algorithm which adds neurons at will (similar to the tiling algorithm) but has only one intermediate layer of hidden units. The couplings between the input and hidden layers are found by an exhaustive search inside a restricted set of possible couplings.
Tiling Algorithm - Basic Notions
1. Layered nets, made of binary units which can be in a plus or minus state.
2. A unit i in the Lth layer is connected to the N_{L-1}+1 units of the preceding layer, and its state S_i(L) is obtained by the threshold rule:

S_i(L) = sign( Σ_{j=0..N_{L-1}} w_{i,j}(L) S_j(L-1) ),

where w_{i,j}(L), j = 0, 1, ..., N_{L-1}, are the couplings. The zeroth unit in each layer acts as a threshold, clamped in the +1 state (S_0(L) = 1), so w_{i,0}(L) is the bias.
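As a sketch of this rule (a toy illustration, not code from the paper; the AND couplings below are our own example), one unit's state can be computed as:

```python
# Threshold rule: S_i(L) = sign( sum_j w_ij(L) * S_j(L-1) ),
# with the zeroth unit clamped to +1 so that w[0] acts as the bias.
def unit_state(w, s_prev):
    """State of one binary unit given couplings w (w[0] = bias)
    and previous-layer states s_prev (each +1 or -1)."""
    total = w[0]  # contribution of the clamped threshold unit S_0 = +1
    for wj, sj in zip(w[1:], s_prev):
        total += wj * sj
    return 1 if total > 0 else -1

# Hypothetical couplings making the unit compute AND of two +/-1 inputs.
w_and = [-1.0, 1.0, 1.0]
print(unit_state(w_and, [1, 1]), unit_state(w_and, [1, -1]))  # -> 1 -1
```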
For a given set of p_0 (distinct) patterns of N_0 binary units, p_0 ≤ 2^{N_0}, we want to learn a given mapping.
Theorem of Convergence
We say that two patterns belong to the same class (for the layer L) if they have the same internal representation, which we call the prototype of the class.
The problem becomes to map these prototypes onto the desired output.
A 3 layer feed forward network through the tiling algorithm
Theorem for convergence
Theorem: Suppose that all the classes in layer L-1 are faithful, and that the number of errors of the master unit, e_{L-1}, is non-zero. Then there exists at least one set of weights w connecting the L-1 layer to the master unit such that e_L ≤ e_{L-1} - 1. Furthermore, one can construct explicitly one such set of weights u.
Proof of convergence
1. Let τ^ν = (τ^ν_j, j = 0, ..., N_{L-1}) be the prototypes in layer L-1 and s^ν be the desired output (+1 or -1).
2. If the master unit of the Lth layer is connected to the (L-1)th layer with the weights w (w_1 = 1, w_j = 0 for j ≠ 1), then e_L = e_{L-1}.
Proof of convergence
Let µ_0 be one of the patterns for which τ^{µ_0}_1 = -s^{µ_0}, and let the set of weights u be u_1 = 1 and u_j = λ s^{µ_0} τ^{µ_0}_j for j ≠ 1. Then

m^µ = sign( τ^µ_1 + λ s^{µ_0} Σ_{j≠1} τ^{µ_0}_j τ^µ_j ),

where m^µ is the value of the master unit for prototype µ obtained from u. The prototype µ_0 is stabilised, i.e. m^{µ_0} = s^{µ_0}, if λ > 1/N_{L-1}.
Proof of convergence
Consider the other patterns, for which τ^µ_1 = s^µ. The quantity

Σ_{j≠1} τ^{µ_0}_j τ^µ_j

can take the values -N_{L-1}, -N_{L-1}+2, ..., N_{L-1}. Because the representations in the L-1 layer are faithful, the value -N_{L-1} can never be obtained. Thus one can choose λ = 1/(N_{L-1} - 1), so that the patterns for which τ^µ_1 = s^µ still remain correct (m^µ = s^µ).
Hence u is one particular solution which, if used to define the master unit of layer L, will give e_L ≤ e_{L-1} - 1.
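As a sanity check (our own illustration, not from the paper), the construction of u can be tested numerically on a small set of random distinct prototypes, confirming that the error count strictly decreases:

```python
import itertools
import random

def sign(z):
    return 1 if z > 0 else -1

def errors(w, protos, targets):
    """Number of prototypes mis-classified by couplings w."""
    return sum(sign(sum(wj * tj for wj, tj in zip(w, tau))) != s
               for tau, s in zip(protos, targets))

random.seed(1)
N = 5  # N_{L-1}; components j = 0..N, with tau_0 = +1 clamped
# Distinct prototypes (faithful by construction) with random targets.
protos = random.sample([(1,) + p for p in itertools.product((-1, 1), repeat=N)], 8)
targets = [random.choice((-1, 1)) for _ in protos]
targets[0] = -protos[0][1]  # force at least one error for the copying unit

# Master unit copying unit 1 of the previous layer: e_L = e_{L-1}.
w = [0] * (N + 1)
w[1] = 1
e_prev = errors(w, protos, targets)

# Pick mu_0 with tau_1 = -s, and build u as in the proof, lambda = 1/(N-1).
mu0 = next(i for i, (tau, s) in enumerate(zip(protos, targets)) if tau[1] == -s)
lam = 1.0 / (N - 1)
u = [lam * targets[mu0] * protos[mu0][j] for j in range(N + 1)]
u[1] = 1.0
print(errors(u, protos, targets) <= e_prev - 1)  # -> True
```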
Generating the master unit
Using ‘pocket algorithm’
If the particular set u of the previous section is taken as the initial set in the pocket algorithm, the output set w will always satisfy e_L ≤ e_{L-1} - 1.
For each set of couplings w visited by the perceptron, we compute the number e(w) of prototypes for which this set would not give the desired output, each prototype ν being weighted by its volume V^ν (the number of training patterns in class ν):

e(w) = Σ_ν V^ν [ 1 - δ( s^ν, sign(Σ_j w_j τ^ν_j) ) ],

where δ is the Kronecker symbol. Eventually we keep the set of couplings which minimises e(w) among the w which have been visited by the perceptron. This 'optimal' set w* gives a certain number of errors e_L = e(w*).
The point of the pocket algorithm is just to speed up this convergence (i.e. generate fewer layers).
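The procedure above can be sketched in a few lines of Python (a toy illustration with unit volumes and cyclic presentation, not the authors' implementation):

```python
def sign(z):
    return 1 if z > 0 else -1

def pocket(protos, targets, volumes, n_epochs=50):
    """Perceptron sweeps with a 'pocket': keep the visited couplings w
    that minimise the volume-weighted error e(w)."""
    n = len(protos[0])
    w = [0.0] * n

    def e(w):
        return sum(v for tau, s, v in zip(protos, targets, volumes)
                   if sign(sum(wj * tj for wj, tj in zip(w, tau))) != s)

    best_w, best_e = list(w), e(w)
    for _ in range(n_epochs):
        for tau, s in zip(protos, targets):
            if sign(sum(wj * tj for wj, tj in zip(w, tau))) != s:
                w = [wj + s * tj for wj, tj in zip(w, tau)]  # perceptron update
                if e(w) < best_e:  # pocket the best couplings seen so far
                    best_w, best_e = list(w), e(w)
    return best_w, best_e

# Toy problem: OR on +/-1 prototypes, tau_0 = +1 acting as the threshold unit.
protos  = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
targets = [-1, 1, 1, 1]
w_star, e_star = pocket(protos, targets, volumes=[1, 1, 1, 1])
print(w_star, e_star)  # -> [1.0, 1.0, 1.0] 0
```

Since OR is linearly separable, the pocket ends with zero errors; on a non-separable class the pocket would instead retain the best imperfect w visited.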
Building the ancillary units - Divide and Conquer
• If the master unit is not equal to the desired output unit, then at least one of the two classes it defines is unfaithful.
• We pick one unfaithful class and add a new unit to learn the mapping for the patterns µ belonging to this class only.
• The class with the smallest size is chosen.
• The above process is repeated until all classes are faithful.
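The faithfulness test driving this divide-and-conquer loop can be sketched as follows (our own illustration; the representation tuples and targets are a hypothetical XOR example):

```python
from collections import defaultdict

def unfaithful_classes(representations, outputs):
    """Group patterns by internal representation; a class is unfaithful
    if its patterns do not all share the same desired output."""
    classes = defaultdict(list)
    for idx, rep in enumerate(representations):
        classes[tuple(rep)].append(idx)
    return {rep: idxs for rep, idxs in classes.items()
            if len({outputs[i] for i in idxs}) > 1}

# XOR after an OR-like master unit alone: (0,1), (1,0), (1,1) all map to (1,)
reps = [(-1,), (1,), (1,), (1,)]  # master-unit states for the 4 inputs
outs = [-1, 1, 1, -1]             # XOR targets
bad = unfaithful_classes(reps, outs)
print(bad)  # -> {(1,): [1, 2, 3]}
```

The smallest unfaithful class returned here is the one that would receive the next ancillary unit.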
Solving the XOR Problem
The two main classes that we have are:
+ve: (0, 1), (1, 0)
-ve: (0, 0), (1, 1)
How do we classify?
Classification into two classes: the first unit separates (0, 0) (-ve) from (0, 1), (1, 0), (1, 1) (+ve). The +ve class mixes both desired outputs, so it is the ERRONEOUS (unfaithful) CLASS. This classification is equivalent to an OR problem.
Computation of the first neuron
The hidden-layer neuron h₁ receives inputs x₁, x₂ with couplings w₁ = 1, w₂ = 1 and threshold Θ = 0.5. Here h₁ computes the OR function (x₁ + x₂). This is the partial network.
FAITHFUL REPRESENTATION - BREAKING THE ERRONEOUS CLASS
The ancillary unit splits the erroneous class {(1, 0), (0, 1), (1, 1)}: (1, 0) and (0, 1) remain +ve, while (1, 1) is separated off as -ve, like (0, 0). The internal representation is now faithful.
Computation of the Ancillary Unit
The ancillary unit h₂, with couplings -1.0, -1.0 from x₁, x₂ and threshold -1.5, is added after the faithful classification; it computes the complement of x₁x₂ (a NAND), while h₁ (couplings 1.0, 1.0, threshold 0.5) computes x₁ + x₂.
FINAL OUTPUT LAYER
The outputs of the hidden layer are ANDed in the next layer to find the final output: y = h₁ · h₂.

x₂ x₁ | h₁ = x₁ + x₂ | h₂ = NOT(x₁x₂) | y = h₁ · h₂
 0  0 |      0       |       1        |     0
 0  1 |      1       |       1        |     1
 1  0 |      1       |       1        |     1
 1  1 |      1       |       0        |     0
The FINAL NETWORK
h₁ (couplings 1.0, 1.0 from x₁, x₂; threshold 0.5) computes x₁ + x₂, and h₂ (couplings -1.0, -1.0; threshold -1.5) computes the complement of x₁x₂. A final unit y with couplings 1.0, 1.0 from h₁, h₂ and threshold 1.5 computes h₁ · h₂.
After this ANDing, the generated output layer consists of a single MASTER UNIT giving the desired output.
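The complete two-layer XOR network can be checked numerically. This is a sketch with 0/1 units; the 1.5 threshold for the final AND is our reading of the construction:

```python
def step(z):
    """Binary threshold unit with 0/1 states."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # OR: x1 + x2
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)  # complement of x1.x2 (NAND)
    return step(1.0 * h1 + 1.0 * h2 - 1.5) # AND of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
# -> 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```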
SIMULATIONS
EXHAUSTIVE LEARNING (USING THE FULL SET OF 2^N PATTERNS)
PARITY TASK
RANDOM BOOLEAN FUNCTIONS
GENERALISATION
QUALITY OF CONVERGENCE
PARITY TASK
In the parity task for N0 Boolean units, the output should be 1 if the number of units in state +1 is even, and –1 otherwise.
Table 1. The network generated by the algorithm when learning the parity task with N0 = 6. There is only one hidden layer of 6 units.

UNIT No. i | Threshold | Couplings from the input layer to hidden unit i
1 | -5 | +1 +1 +1 -1 -1 -1
2 | +3 | -1 -1 -1 +1 +1 +1
3 | -1 | +1 +1 +1 -1 -1 -1
4 | -1 | -1 -1 -1 +1 +1 +1
5 | +3 | +1 +1 +1 -1 -1 -1
6 | -5 | -1 -1 -1 +1 +1 +1

OUTPUT UNIT
Threshold | Couplings from the hidden layer to the output unit
+1 | +1 +3 +3 +3 +3 +1
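Table 1 can be verified directly: taking hidden units h_i = sign(w_i · x - θ_i) and applying the output unit the same way (a convention we assume), the network reproduces the parity of all 2^6 = 64 input patterns:

```python
import itertools

def sign(z):
    return 1 if z > 0 else -1

# Table 1: (threshold, couplings) for the 6 hidden units, N0 = 6.
hidden = [
    (-5, (1, 1, 1, -1, -1, -1)),
    (+3, (-1, -1, -1, 1, 1, 1)),
    (-1, (1, 1, 1, -1, -1, -1)),
    (-1, (-1, -1, -1, 1, 1, 1)),
    (+3, (1, 1, 1, -1, -1, -1)),
    (-5, (-1, -1, -1, 1, 1, 1)),
]
out_theta, out_w = +1, (1, 3, 3, 3, 3, 1)

def net(x):
    h = [sign(sum(wj * xj for wj, xj in zip(w, x)) - theta)
         for theta, w in hidden]
    return sign(sum(cj * hj for cj, hj in zip(out_w, h)) - out_theta)

# Parity: output +1 iff the number of inputs in state +1 is even.
ok = all(net(x) == (1 if x.count(1) % 2 == 0 else -1)
         for x in itertools.product((-1, 1), repeat=6))
print(ok)  # -> True
```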
RANDOM BOOLEAN FUNCTIONS
•A random Boolean function is obtained by drawing at random the output (±1 with equal probability) for each input configuration.
•The numbers of layers and of hidden units increase rapidly with N0.
GENERALISATION
•Once a network has been built by the presentation of a training set, it performs the correct mapping for all the patterns in this training set. The question is: how does it perform on new patterns?
•The number of training patterns is usually smaller than 2^{N_0}.
•The N_0 input neurons are organised in a one-dimensional chain, and the problem is to find out whether the number of domain walls is greater or smaller than three.
•A domain wall is the presence of two neighbouring neurons pointing in opposite directions.
•When the average number of domain walls in the training patterns is three, the problem is harder than for other numbers.
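The domain-wall task is simple to state in code (our own sketch of the task definition, not the authors' code; the tie-breaking convention at exactly three walls is our assumption):

```python
def domain_walls(chain):
    """Number of neighbouring pairs pointing in opposite directions
    in a one-dimensional chain of +/-1 neurons (open chain assumed)."""
    return sum(1 for a, b in zip(chain, chain[1:]) if a != b)

def target(chain, threshold=3):
    """Desired output: +1 if the chain has more than `threshold`
    domain walls, -1 otherwise."""
    return 1 if domain_walls(chain) > threshold else -1

print(domain_walls([1, 1, -1, -1, 1, -1]))  # -> 3
```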
Quality of Convergence
To quantify the quality of convergence, at least two parameters can be considered:
1. The number of errors e_L which would be produced by the network if it were stopped at layer L.
2. The number of distinct internal representations (classes) p_L in each layer L.
It is noticed that in the range 2 ≤ L ≤ 7 the decrease in e_L is linear in L, and this seems to be the case also for p_L. It is tempting to use the slope of this linear decrease of the percentage of errors as a measure of the complexity of the problem to be learnt.
Comments
1. It is useful to limit as much as possible the number of hidden units.
2. There is a lot of freedom in the choice of the unfaithful classes to be learnt.
3. How does one choose the maximum number of iterations allowed before deciding that the perceptron algorithm has not converged? This is an adjustable parameter.
4. Optimisation of the algorithm.
Conclusions
1. Presented a new strategy for building a feed-forward layered network for any given Boolean function.
2. Identified some possible roles of the hidden units: the master units and the ancillary units.
3. The geometry, including the number of units involved and the connections, is not fixed in advance, but generated by a growth process.
FUTURE WORK
1. Improvement of the algorithm by investigating variants of the perceptron algorithm used to build the hidden units.
2. Systematic comparison of the performances (efficiency, computer time, size of the architecture) of the tiling algorithm with those of other algorithms.
3. Generalisation to neurons with continuous values: whether the algorithm works for continuous inputs and binary outputs. What happens when the data is conflicting, i.e. identical patterns have different outputs?
4. Finding a strategy which limits as much as possible the number of units in each layer.
5. Generalisation to several output units.
THANK YOU...