Committee Machines and Mixtures of Experts Neural Networks 12

Page 1: Committee Machines and Mixtures of Experts Neural Networks 12

Committee Machines and Mixtures of Experts

Neural Networks 12

Page 2: Committee Machines and Mixtures of Experts Neural Networks 12

Committee Machines

When generating e.g. an MLP, one has to test and discard many different networks, some of which are only slightly worse than the 'best' one

Such a procedure is very wasteful of resources

Also, the judgement of generalisation performance is noisy because of its dependence on the particular data used

Idea: combine the outputs of several machines and thus reap the benefits of all of the work, with little additional computation

Performance can be better than the best single network in isolation, without the need to determine which network that is

Page 3: Committee Machines and Mixtures of Experts Neural Networks 12

This can be especially useful if one would otherwise have to choose arbitrarily between 2 networks

e.g. an RBFN with regularisation has roughly the same performance as an MLP with pre-processing by PCA. Which one is best? Choose both!

Why should this work? Intuition: take 3 networks, each good at getting 2 classes correct but unable to distinguish a third, with each failing on a different class. Together they have the knowledge to solve the problem exactly …

… but how do we combine their knowledge? E.g. by averaging the results

Page 4: Committee Machines and Mixtures of Experts Neural Networks 12

Averaging Results: Mean Error for Each Network

Suppose we have L trained experts with outputs $y_i(x)$ for a regression problem, approximating $h(x)$, each with an error $e_i$. Then we can write:

$y_i(x) = h(x) + e_i$

Thus the sum-of-squares error for network $y_i$ is:

$E_i = \mathbb{E}[(y_i(x) - h(x))^2] = \mathbb{E}[e_i^2]$

where $\mathbb{E}[\cdot]$ denotes the expectation (average or mean value).

Thus the average error for the networks acting individually is:

$E_{AV} = \frac{1}{L}\sum_{i=1}^{L} E_i = \frac{1}{L}\sum_{i=1}^{L} \mathbb{E}[e_i^2]$

Page 5: Committee Machines and Mixtures of Experts Neural Networks 12

Averaging Results: Mean Error for Committee

Suppose instead we form a committee by averaging the outputs $y_i$ to get the committee prediction:

$y_{COM}(x) = \frac{1}{L}\sum_{i=1}^{L} y_i(x)$

This estimate will have error:

$E_{COM} = \mathbb{E}[(y_{COM}(x) - h(x))^2] = \mathbb{E}\Big[\Big(\frac{1}{L}\sum_{i=1}^{L} e_i\Big)^2\Big]$

Thus, by Cauchy's inequality:

$E_{COM} \le E_{AV}$

Indeed, if the errors are uncorrelated (and have zero mean), $E_{COM} = E_{AV}/L$, but this is unlikely in practice as errors tend to be correlated.
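Writing out the expectation of the squared average makes both statements explicit (this uses the assumption that the errors have zero mean):

$E_{COM} = \mathbb{E}\Big[\Big(\frac{1}{L}\sum_{i=1}^{L} e_i\Big)^2\Big] = \frac{1}{L^2}\sum_{i=1}^{L}\sum_{j=1}^{L}\mathbb{E}[e_i e_j]$

If the errors are uncorrelated, the cross terms $\mathbb{E}[e_i e_j]$ for $i \ne j$ vanish and $E_{COM} = \frac{1}{L^2}\sum_{i=1}^{L}\mathbb{E}[e_i^2] = E_{AV}/L$. In general, Cauchy's inequality $\big(\sum_{i=1}^{L} e_i\big)^2 \le L\sum_{i=1}^{L} e_i^2$ gives $E_{COM} \le E_{AV}$.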

Page 6: Committee Machines and Mixtures of Experts Neural Networks 12

Bias-Variance Trade-off

Previously in network training we have seen a trade-off between getting a good fit to the data and getting a smooth, general mapping (and in probability density estimation, smoothing parameters are needed to smooth without obscuring the data)

To understand this it is useful to decompose the prediction error into bias and variance components

Bias is essentially the error that arises from the network not fitting the data, i.e. the mean square error between the average (over all possible training sets D) of the outputs and the targets

Conversely, variance is the error that arises from the variability between different data sets, i.e. the mean square error between the outputs and the average output

The total error is the sum of the two components (first term bias$^2$, second term variance):

$\mathbb{E}_D[(y(x) - h(x))^2] = (\mathbb{E}_D[y(x)] - h(x))^2 + \mathbb{E}_D[(y(x) - \mathbb{E}_D[y(x)])^2]$
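The decomposition follows by adding and subtracting $\mathbb{E}_D[y(x)]$ inside the square; the cross term vanishes because $\mathbb{E}_D\big[y(x) - \mathbb{E}_D[y(x)]\big] = 0$:

$\mathbb{E}_D[(y(x) - h(x))^2] = \mathbb{E}_D\big[(y(x) - \mathbb{E}_D[y(x)] + \mathbb{E}_D[y(x)] - h(x))^2\big] = \underbrace{(\mathbb{E}_D[y(x)] - h(x))^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D[(y(x) - \mathbb{E}_D[y(x)])^2]}_{\text{variance}}$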

Page 7: Committee Machines and Mixtures of Experts Neural Networks 12

Intuitively one can see there is a trade-off between the two if one considers the size of the training set: a small set gives low bias but high variance; a big set gives higher bias but lower variance

Similarly with the length of training: how much attention do we pay to this particular choice of training data?

E.g. ignore the data: whatever the choice of D, pick y(x) = g(x) for some fixed function g. Then the variance vanishes, since $\mathbb{E}_D[y] = y$, but the bias will generally be large

Page 8: Committee Machines and Mixtures of Experts Neural Networks 12

Alternatively, we can fit the data exactly: here suppose the targets are:

t= h(x) + e where e is added noise

Thus the bias vanishes, since averaging over data sets gives $\mathbb{E}_D[y(x)] = h(x)$ (the noise averages to zero). Therefore all the error is due to variance:

$\mathbb{E}[(y(x) - h(x))^2] = \mathbb{E}[e^2]$

which is the variance of the noise added to the data

Page 9: Committee Machines and Mixtures of Experts Neural Networks 12

The reduction in error can be viewed as coming from a reduction in variance, since we are averaging over several networks

Each individual net should not aim for the bias that minimises the bias-variance trade-off; it should in fact be overtrained to have a low bias, as the extra variance can be removed by averaging

Can we do better? What if we weight the average so that members which make better predictions have more influence?

It can be shown via Lagrange multipliers (pp 367-369, Bishop) that we can do better, and that it is best to increase the spread of the networks' predictions without increasing their errors (a sketch of the resulting weights follows below)

Intuitively appealing: we want accurate experts (low bias) that specialise on different parts of the problem (spread of predictions)
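As a minimal sketch of that weighted committee, assuming (as in the cited Bishop derivation) that we can estimate the error correlation matrix $C_{ij} = \mathbb{E}[e_i e_j]$ from held-out data; the function and variable names here are illustrative, not from the slides:

```python
import numpy as np

def committee_weights(errors):
    """Lagrange-multiplier solution for the weighted committee.

    errors: array of shape (n_points, L), errors[n, i] = y_i(x_n) - t_n on a
    validation set. Minimises alpha^T C alpha subject to sum(alpha) = 1, which
    gives weights proportional to the row sums of C^{-1}.
    """
    C = errors.T @ errors / errors.shape[0]   # estimate C_ij = E[e_i e_j]
    ones = np.ones(C.shape[0])
    Cinv_ones = np.linalg.solve(C, ones)      # C^{-1} 1 without forming the inverse
    return Cinv_ones / (ones @ Cinv_ones)     # normalise so the weights sum to 1

def committee_predict(outputs, alpha):
    """Weighted committee prediction; outputs has shape (n_points, L)."""
    return outputs @ alpha
```

Unlike the simple average, these weights need not be equal (or even positive): networks whose errors are anti-correlated with the rest of the committee gain influence.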

Page 10: Committee Machines and Mixtures of Experts Neural Networks 12

Static committee machines

Static committee machines are ones where the responses of the experts are combined by a mechanism that does not see the input

2 main methods: ensemble averaging and boosting

[Diagram: the input x(n) is fed to Expert 1, Expert 2, ..., Expert L; their outputs y1(n), y2(n), ..., yL(n) go to a combiner, which produces the overall output.]

Page 11: Committee Machines and Mixtures of Experts Neural Networks 12

Ensemble averaging

Perform a weighted average of the outputs (NOT the same as averaging the performance)

Why? If the weights are all equal, many bad classifiers can outweigh a few good ones

Analogous to voting, which is used for classification (machines vote for which class a pattern belongs to: most votes wins)

However, if the weights are based on the performance of each machine, one classifier which is wrong but thinks it is right can outweigh many that are right but are less confident

This is problematic since we want a heterogeneous distribution of expertise: if one net is good everywhere apart from one part of the input space, its overall performance will be good, so it will outweigh another network which knows the part the first one doesn't
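For classification, the two combination rules discussed above can be sketched as follows (plain majority voting versus performance-weighted voting; using validation accuracies as the weights is just one illustrative choice, not something prescribed by the slides):

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (L, n_points) array of integer class labels, one row per expert."""
    n_classes = predictions.max() + 1
    # Count the votes each class receives for every input pattern.
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return votes.argmax(axis=0)               # most votes wins

def weighted_vote(predictions, weights):
    """Each expert's vote counts with its weight (e.g. its validation accuracy)."""
    L, n_points = predictions.shape
    n_classes = predictions.max() + 1
    scores = np.zeros((n_classes, n_points))
    for expert_preds, w in zip(predictions, weights):
        scores[expert_preds, np.arange(n_points)] += w
    return scores.argmax(axis=0)
```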

Page 12: Committee Machines and Mixtures of Experts Neural Networks 12

Boosting

In ensemble averaging all nets are trained on the same data.

In boosting we generate several different subsets of the data and train our possibly weak networks (i.e. nets whose performance is only slightly better than 50%) on them, so that they specialise on different parts of the problem.

Boosting can be used to improve the performance of any learning machine (e.g. by biasing samples towards difficult examples).

We will examine 2 different approaches here:

1. Boosting by filtering. Filter the data via a weak learning machine. Assumes infinite (lots of) data, but has low memory requirements.

2. Boosting by subsampling. A fixed-size data set is 'resampled' according to some probability distribution during training.

Page 13: Committee Machines and Mixtures of Experts Neural Networks 12

Boosting by filtering

Have 3 networks: Expert 1, Expert 2 and Expert 3

1. Train Expert1 on a set of examples N1 of size N

2. Filter the data through Expert1 to get 2nd data set N2 via:

Flip a coin.

If Heads: pass new data through Expert1 until it misclassifies a data point. Add this point to N2.

If Tails: do the opposite, i.e. discard misclassified points until Expert 1 classifies one correctly, and add that point to N2

Repeat until N2 is of size N

Note that if Expert 1 is tested on N2, the distribution of data points is such that it would get 50% correct, i.e. the distribution is different from that of N1

Page 14: Committee Machines and Mixtures of Experts Neural Networks 12

3. Train Expert 2 on N2, then use both experts to generate a new training set N3, via:

Pass a new pattern through Experts 1 and 2. If they agree on their classification, discard the pattern; if they disagree, add it to N3

Continue till N3 is of size N

Expert 3 is now trained on N3

Note that both N2 and N3 contain more “hard-to-learn” patterns since performance of the experts > 50%

The output of the committee of machines is formed by adding the outputs generated by each expert

NB Needs a lot of data
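A minimal sketch of the filtering procedure described on these two slides. Here `stream` (a function returning an iterator of fresh (x, label) pairs), the scikit-learn-style `predict`/`predict_proba` methods on the experts, and the use of summed class scores for the combined output are assumptions of the sketch, not details given in the slides:

```python
import numpy as np

def build_N2(expert1, stream, N, rng):
    """Filter fresh data through Expert 1 to form the second training set N2."""
    X2, y2 = [], []
    while len(X2) < N:
        keep_misclassified = rng.random() < 0.5        # flip a coin
        for x, y in stream():
            wrong = expert1.predict([x])[0] != y
            if wrong == keep_misclassified:            # heads: first error; tails: first correct
                X2.append(x); y2.append(y)
                break
    return np.array(X2), np.array(y2)

def build_N3(expert1, expert2, stream, N):
    """Keep only the patterns on which Experts 1 and 2 disagree."""
    X3, y3 = [], []
    for x, y in stream():
        if expert1.predict([x])[0] != expert2.predict([x])[0]:
            X3.append(x); y3.append(y)
            if len(X3) == N:
                break
    return np.array(X3), np.array(y3)

def committee_output(experts, X):
    """Combined output: add the class scores of the experts, then pick the largest."""
    return sum(e.predict_proba(X) for e in experts).argmax(axis=1)
```

Expert 2 is then trained on the set returned by `build_N2`, Expert 3 on the set returned by `build_N3`, with `rng = np.random.default_rng()` supplying the coin flips.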

Page 15: Committee Machines and Mixtures of Experts Neural Networks 12

[Illustration: Expert 1's classifications of the patterns in N2 compared with their true labels; by construction they match only about 50% of the time.]

Therefore, Expert 1 gets 50% of N2 right and 50% wrong

Since Expert 1's performance is more than 50%, N2 has a different distribution from N1 and contains more 'hard' patterns

Page 16: Committee Machines and Mixtures of Experts Neural Networks 12

[Illustration: Expert 1's and Expert 2's classifications of candidate patterns; the patterns on which they disagree (roughly 50% of the time) are added to N3.]

Here N3 is made up of patterns that one (but not both) of the other two networks cannot classify, and that therefore lie in hard-to-learn parts of the input space

Page 17: Committee Machines and Mixtures of Experts Neural Networks 12

Example: pattern classification. The class boundaries are given by the solid lines; dots are in one class, crosses in the other. The figure shows the distribution of the 3 data sets.

Notice that N1 has a uniform distribution of points, whereas N2 and N3 successively concentrate the data in hard-to-classify regions.

Page 18: Committee Machines and Mixtures of Experts Neural Networks 12

The first 3 figures show the decision regions of the 3 experts, and the last one the region of a combined expert formed by summing the outputs of the 3 experts

[Figures: decision regions for Expert 1 (E = 75%), Expert 2 (E = 71%), Expert 3 (E = 69%) and for the combined expert (E = 92%).]

Page 19: Committee Machines and Mixtures of Experts Neural Networks 12

Boosting by subsampling

The AdaBoost algorithm adaptively resamples the data set, so it can be used with a data set X of fixed size

Again it uses a weak learning model (network), but adjusts adaptively to the errors of the model (hence the name)

The algorithm works as follows: at time n it provides a training sample to the network, drawn from X using a probability distribution Dn, which is used to train a hypothesis (network) hn

Process continues for T timesteps after which the algorithm combines the outputs of the T networks generated using a weighted average

The distribution Dn+1 is calculated from Dn by decreasing the probability of an input pattern being picked if hn classified it correctly, thus focussing on more difficult patterns

Page 20: Committee Machines and Mixtures of Experts Neural Networks 12

AdaBoost Algorithm

Assign every example an equal weight 1/N, i.e. D1(i) = 1/N.

For t = 1, 2, …, T do:

1. Obtain a hypothesis (classifier) h(t), using Dt(i) to generate a training sample.

2. Calculate the weighted error e(t) of h(t) by summing Dt(i) over all the points it classifies incorrectly.

3. If e(t) > 1/2, repeat this iteration with a different sample.

4. Make Dt+1(i) by multiplying the probabilities of all patterns classified correctly by b(t) = e(t)/(1 - e(t)): the lower the error, the smaller b(t) (e = 0.5 gives b = 1; e = 0.2 gives b = 0.25; e = 0.1 gives b ≈ 0.11), so correctly classified patterns are down-weighted more strongly.

5. Normalise Dt+1 so that it sums to 1.

Output a weighted combination of all the hypotheses, with weights determined by their accuracy on the training set: put x in the class that maximises the sum of log(1/b(t)) over the hypotheses that put x in that class.
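A minimal sketch of this algorithm, assuming a `make_weak_learner` factory returning classifiers with scikit-learn-style `fit`/`predict` methods and integer labels 0..n_classes-1; the factory, the retry limit and the small floor on e(t) are assumptions of the sketch:

```python
import numpy as np

def adaboost_fit(X, y, make_weak_learner, T, rng, max_retries=10):
    """Boosting by subsampling: adaptively reweight a fixed data set of size N."""
    N = len(X)
    D = np.full(N, 1.0 / N)                      # D_1(i) = 1/N
    hypotheses, betas = [], []
    for t in range(T):
        for _ in range(max_retries):             # redraw the sample while e(t) > 1/2
            idx = rng.choice(N, size=N, replace=True, p=D)
            h = make_weak_learner().fit(X[idx], y[idx])
            wrong = h.predict(X) != y
            e = D[wrong].sum()                   # weighted error of h(t)
            if e < 0.5:
                break
        beta = max(e, 1e-10) / (1.0 - e)         # b(t) = e(t) / (1 - e(t)), floored for stability
        D[~wrong] *= beta                        # down-weight correctly classified patterns
        D /= D.sum()                             # renormalise to a distribution
        hypotheses.append(h); betas.append(beta)
    return hypotheses, betas

def adaboost_predict(hypotheses, betas, X, n_classes):
    """Weighted vote: each hypothesis votes for its class with weight log(1/b(t))."""
    scores = np.zeros((len(X), n_classes))
    for h, beta in zip(hypotheses, betas):
        # h.predict(X) is assumed to return integer class indices.
        scores[np.arange(len(X)), h.predict(X)] += np.log(1.0 / beta)
    return scores.argmax(axis=1)
```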

Page 21: Committee Machines and Mixtures of Experts Neural Networks 12

Dynamic committee machines

In dynamic committee machines the input signal is directly involved in combining the outputs

E.g. mixtures of experts and hierarchical mixtures of experts

A gating network decides the weighting of each network

[Diagram: the input x(n) is fed to Expert 1, Expert 2, ..., Expert L and to a gating network; the gating network outputs weights g1(n), g2(n), ..., gL(n), which combine the expert outputs y1(n), y2(n), ..., yL(n) into the overall output.]

Page 22: Committee Machines and Mixtures of Experts Neural Networks 12

Mixture of experts

We have K networks, or experts, and it is assumed that different experts work best on different parts of the input space

We also have a gating network which mediates between them

Let the output of the j'th expert be:

$y_j(x) = w_j^T x$

and set the j'th output of the gating network to be the softmax (a sort of differentiable, continuous winner-takes-all):

$g_j(x) = \frac{\exp(u_j)}{\sum_{i=1}^{K} \exp(u_i)}$, where $u_j(x) = a_j^T x$

Thus $g_j$ is the 'probability' of expert j being correct, and the overall output is:

$y = \sum_{j=1}^{K} g_j y_j$
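A minimal numpy sketch of this forward pass (the parameter shapes and the random example values are illustrative only):

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax."""
    z = np.exp(u - u.max())
    return z / z.sum()

def mixture_of_experts(x, W, A):
    """Forward pass: x is a d-vector, W (K, d) holds the expert weights w_j,
    A (K, d) holds the gating weights a_j. Returns the overall output, g and y."""
    y = W @ x                  # y_j = w_j^T x   (linear experts)
    g = softmax(A @ x)         # g_j = exp(a_j^T x) / sum_i exp(a_i^T x)
    return g @ y, g, y

# Example with K = 3 experts and a 4-dimensional input.
rng = np.random.default_rng(0)
W, A, x = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=4)
output, g, y = mixture_of_experts(x, W, A)   # g sums to 1; output = sum_j g_j * y_j
```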

Page 23: Committee Machines and Mixtures of Experts Neural Networks 12

[Figure: two example sets of activation values ('original') and their softmax-transformed versions ('softmaxed'), which are non-negative and sum to 1.]

Page 24: Committee Machines and Mixtures of Experts Neural Networks 12

The parameters a and w are found together via various search algorithms (one possibility is sketched below)
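The slides leave the choice of search algorithm open; as one concrete (hedged) possibility, joint gradient descent on the sum-of-squares error $E = \frac{1}{2}(y - t)^2$ would use the gradients below, obtained by applying the chain rule through the softmax (using $\partial y / \partial u_j = g_j (y_j - y)$):

$\frac{\partial E}{\partial w_j} = (y - t)\, g_j\, x, \qquad \frac{\partial E}{\partial a_j} = (y - t)\, g_j\, (y_j - y)\, x$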

Hierarchical mixtures of experts