Training Data

Concept Map: Practical Design Issues
- Training Data
- Topology
- Initial Weights
- Learning Algorithm
- Generalization: Occam's Razor; cross-validation & early stopping; noise; weight sharing; small size; increase training data
- Network Size: network growing; network pruning (brain damage, weight decay)
- Fast Learning
Fast Learning

Concept Map: Fast Learning
- Training Data: normalize; scale; present at random
- Cost Function
- Activation Function: adaptive slope
- Architecture: modular; committee
- BP variants: no weight learning for correctly classified patterns; η (Chen & Mars); momentum; Fahlmann's; other
- Minimization Method: conjugate gradient
1. Practical Issues

Performance = f (training data, topology, initial weights, learning algorithm, . . .)
            = training error, net size, generalization.

(1) How to prepare training data and test data?
- The training set must contain enough information to learn the task.
- Eliminate redundancy, e.g. by data clustering.
- Training set size: N > W/ε
  (N = number of training samples, W = number of weights,
  ε = classification error permitted on the test data ≈ generalization error)
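The rule of thumb above can be turned into a quick sanity check. A minimal sketch, assuming a small two-layer MLP whose sizes are purely illustrative (the helper name is hypothetical, not from the chapter):

```python
import math

def min_training_set_size(num_weights: int, epsilon: float) -> int:
    """Rule-of-thumb lower bound N > W / epsilon for the training-set size,
    where W is the number of weights and epsilon the permitted test error."""
    if not (0.0 < epsilon < 1.0):
        raise ValueError("epsilon must be in (0, 1)")
    return math.ceil(num_weights / epsilon)

# Illustrative MLP: 10 inputs, 20 hidden units, 1 output (biases included)
W = (10 + 1) * 20 + (20 + 1) * 1   # 241 weights
print(min_training_set_size(W, 0.1))   # -> 2410
```

With a 10% permitted error, the bound already demands roughly ten samples per weight, which is why redundancy elimination matters.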
Chapter 4. Designing & Training MLPs
Ex. Modes of Preparing Training Data for Robot Control
The importance of the training data for tracking performance cannot be overemphasized. Three modes of training data selection are considered here. In the regular mode, the training data are obtained by tessellating the robot's workspace and taking the grid points, as shown on the next page. For better generalization, however, a sufficient amount of random training data can be obtained by observing the light positions in response to uniformly random Cartesian commands to the robot; this is the random mode. The best generalization power is achieved by the semi-random mode, which evenly tessellates the workspace into many cubes and chooses a randomly selected training point within each cube. This mode is essentially a blend of the regular and random modes.
[Figure: Training data acquisition modes - regular mode, random mode, semi-random mode]
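The three acquisition modes can be sketched for a 2-D workspace; the function names and the unit-square workspace are assumptions for illustration, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def regular_mode(n_per_axis, lo=0.0, hi=1.0):
    """Regular mode: grid points from tessellating the workspace."""
    axis = np.linspace(lo, hi, n_per_axis)
    gx, gy = np.meshgrid(axis, axis)
    return np.column_stack([gx.ravel(), gy.ravel()])

def random_mode(n, lo=0.0, hi=1.0):
    """Random mode: uniformly random points over the workspace."""
    return rng.uniform(lo, hi, size=(n, 2))

def semi_random_mode(n_per_axis, lo=0.0, hi=1.0):
    """Semi-random mode: one random point inside each cell of the grid."""
    w = (hi - lo) / n_per_axis   # cell width
    pts = []
    for i in range(n_per_axis):
        for j in range(n_per_axis):
            # uniformly random point inside cell (i, j)
            pts.append([lo + (i + rng.random()) * w,
                        lo + (j + rng.random()) * w])
    return np.array(pts)
```

The semi-random mode keeps the even coverage of the grid while breaking its regularity, which is the blend of modes described above.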
Fig. 10. Comparison of training errors and generalization errors for random and semi-random training methods. [Two plots of RMS error (mm), 0-50, vs. iteration, 0-400: (a) training error, (b) test error; one curve each for Random and Semi-Random.]
(2) Optimal Implementation
A. Network Size
Occam's Razor:
Any learning machine should be sufficiently large to solve a given problem, but no larger. A scientific model should favor simplicity, shaving off the fat in the model. [Occam = 14th-century English friar]
a. Network Growing: start with a few hidden nodes and add more as needed (Ref. Kim, "Modified Error BP Adding Neurons to Hidden Layer," J. of KIEE 92/4). If the training error E stays above a threshold ε1 while its decrease per epoch falls below ε2, add a hidden node. Use the current weights for the existing connections and small random values for the newly added weights as the initial weights for the next round of learning.

b. Network Pruning
① Remove unimportant connections. After this "brain damage," retrain the network. Improves generalization.
② Weight decay: after each epoch, shrink every weight toward zero, w' = (1 - ε)w.

c. Size reduction by dimensionality reduction, or by sparse connectivity in the input layer [e.g. use 4 random connections instead of 8].
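The per-epoch weight decay w' = (1 - ε)w is a one-liner; a minimal sketch, assuming the network's weights are stored as a list of NumPy matrices:

```python
import numpy as np

def apply_weight_decay(weights, eps=1e-3):
    """After each epoch, shrink every weight toward zero: w <- (1 - eps) * w.
    Weights that the gradient updates do not keep re-growing decay away,
    a soft form of network pruning."""
    return [(1.0 - eps) * w for w in weights]

# Toy two-layer weight set
W = [np.ones((3, 2)), np.ones((2, 1))]
W = apply_weight_decay(W, eps=0.1)
print(W[0][0, 0])   # -> 0.9
```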
[Figure: training (X) and test (O) error vs. number of epochs, contrasting good and poor generalization. Legend: T = training data; X = test data; R = NN with good generalization; R' = NN with poor generalization. R fits T smoothly and stays accurate on X; R' fits T exactly but fails on X. Overfitting is due to too many weights relative to the training samples, and to noise.]
B. Generalization: train (memorize) on sample data, then apply to an actual problem (generalize).
The training set is split into a learning subset and a validation subset; the test set is kept separate. [Figure: mean-square error vs. number of epochs for the training and validation samples; the early-stopping point is where the validation error begins to rise.]
For good generalization, train with the learning subset and check on the validation subset [about 10% of the data, every 5-10 iterations]. Determine the best structure based on the validation subset, then train further with the full training set and evaluate on the test set. The statistics of the training (and validation) data must be similar to those of the test (actual problem) data.
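The validation-checking procedure above is the usual early-stopping loop. A minimal sketch, assuming the caller supplies a `train_step` (one epoch on the learning subset) and a `val_error` (MSE on the validation subset); both hooks are hypothetical stand-ins for an MLP trainer:

```python
def train_with_early_stopping(train_step, val_error, max_epochs=1000,
                              check_every=5, patience=3):
    """Train via train_step(), check the validation subset every
    `check_every` epochs, and stop once the validation error has
    failed to improve `patience` checks in a row."""
    best, worse = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_step()
        if epoch % check_every == 0:
            e = val_error()
            if e < best:
                best, worse = e, 0
            else:
                worse += 1
                if worse >= patience:
                    break   # early-stopping point reached
    return epoch, best
```

With simulated validation errors that fall and then rise (5, 4, 3, 4, 5, ...), checked every epoch with patience 2, the loop stops at epoch 5 with best error 3.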
Tradeoff between training error and generalization !
Stopping Criterion
- Classification: stop upon no classification error.
- Function approximation: check both the training error and the validation error.
An Example showing how to prepare the various data sets to learn an unknown function from data samples
Other measures to improve generalization:
• Add noise (1-5 %) to the training data or the weights.
• Hard (soft) weight sharing (using equal values for groups of weights) can improve generalization.
• For fixed training data, the smaller the net, the better the generalization.
• Increase the training set to improve generalization.
• For insufficient training data, use the leave-one(-some)-out method: select an example, train the net without it, and evaluate with this unused example.
• If the net still does not generalize well, retrain with the new problem data.
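The leave-one-out method can be sketched generically; the `fit`/`predict` hooks are hypothetical placeholders for training and evaluating an MLP:

```python
import numpy as np

def leave_one_out_error(fit, predict, X, y):
    """Leave-one-out estimate: for each example, train on all the others
    and test on the one held out.  fit(X, y) returns a model;
    predict(model, x) returns its output for one input."""
    errors = []
    n = len(X)
    for i in range(n):
        keep = [j for j in range(n) if j != i]   # all but example i
        model = fit(X[keep], y[keep])
        errors.append((predict(model, X[i]) - y[i]) ** 2)
    return float(np.mean(errors))
```

As a check with a trivial "model" that just predicts the mean of its training targets, y = [0, 0, 3] gives errors 2.25, 2.25, 9, averaging 4.5.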
C. Speeding Up [Accelerating] Convergence
- Ref. book by Hertz; AI Expert Magazine 91/7
To speed up the calculation itself: reduce the number of floating-point operations by using fixed-point arithmetic, and use a piecewise-linear approximation of the sigmoid.
What will happen if more than 5-10 % of the data is used for validation?
Consider two industrial assembly robots for precision jobs, made by the same company to an identical spec. If the same NN is used for both, the robots will still act differently. Do we need better generalization methods to compensate for this difference?
A large N may bring in more noisy data. However, wouldn't a large N offset the problem by yielding more reliability? How big an influence would noise have upon misguided learning?
I wonder what measures can prevent local minimum traps.
Students’ Questions from 2005
Is there any mathematical validation for the existence of a stopping point in validation samples?
The number of hidden nodes is adjusted by a human. An NN is supposed to self-learn, so there must be a way to adjust the number of hidden nodes automatically.
① Normalize inputs, scale outputs: zero mean, decorrelate (PCA), and equalize the covariances.
② Start with small uniform random initial weights [for tanh]: w(0) ∈ [-r, r], e.g. |w| < 2.4/fan-in.
③ Present training patterns in random (shuffled) order (or mix the different classes).
④ Alternative cost or activation functions.
   Ex. Cost: E = Σ_k |d_k - y_k|^r vs. E = ½ Σ_k (d_k - y_k)².
   Activation: f(s) = 1.716 tanh(2s/3), which satisfies f(0) = 0 and f(±1) = ±1; use with ±1 as the targets.
⑤ Fahlman's bias to ensure a nonzero derivative: use (f'(s_k) + 0.1)(t_k - y_k) in place of f'(s_k)(t_k - y_k), for the output units only or for all units.
⑥ Chen & Mars differential step size: use different learning rates for the outer (output) and inner (hidden) layers, e.g. η_outer = 0.1 η_inner, and for the output units only, drop f' and use (t_k - y_k) directly. Cf. Principe's book recommends a ratio of about 2~5; it is best to try different values.
⑦ Omit redundant learning (Ref. "Accelerating BP Algorithm through Omitting Redundant Learning," J. of KIEE 92/9): if the per-pattern error E_p < ε for the pth training pattern, do not update the weights on that pattern - no BP.
⑧ Ahalt's modular net: divide the mapping among smaller expert MLPs. [Figure: input x feeds MLP 1 and MLP 2, producing outputs y1 and y2; η is varied in each module.]
⑨ Ahalt's adaptive slope (sharpness) parameters: with f(s) = 1/(1 + e^(-λs)), adapt the slope λ by gradient descent on ∂J/∂λ, just as the weights use ∂J/∂w.
⑩ Plaut rule: set each learning rate inversely proportional to the fan-in, η_pq ∝ 1/fan-in.
⑪ Jacobs' learning-rate adaptation [Ref. Neural Networks, Vol. 1, No. 4, 1988].
Reason for Slow Convergence, and remedies:

a. Momentum: Δw(t) = -η ∂J/∂w(t) + α Δw(t-1), 0 ≤ α < 1.
   Unrolling the recursion, Δw(t) = -η Σ_{i=0..t} α^(t-i) ∂J/∂w(i).
   In a plateau, where the gradient is nearly constant, Δw ≈ -(η/(1-α)) ∂J/∂w, so η/(1-α) is the effective learning rate.
   [Figure: weight trajectories on the error surface with and without momentum.]

b. Δ-bar-Δ rule: Δw_i(t) = -η_i(t) ∂J/∂w_i(t), where each weight has its own learning rate η_i(t), adapted by
   Δη_i(t) =  K           if δ̄_i(t-1) δ_i(t) > 0
           = -β η_i(t)    if δ̄_i(t-1) δ_i(t) < 0
           =  0           otherwise,
   with the gradient δ_i(t) = ∂J/∂w_i(t) and its running average δ̄_i(t) = (1-ξ) δ_i(t) + ξ δ̄_i(t-1).
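The momentum update and its plateau behavior can be sketched directly; the function name and the constant-gradient "plateau" below are illustrative:

```python
import numpy as np

def sgd_momentum(grad, w0, eta=0.1, alpha=0.9, steps=100):
    """Gradient descent with momentum:
    dw(t) = -eta * dJ/dw(t) + alpha * dw(t-1).
    On a plateau (nearly constant gradient) the step settles at
    -(eta / (1 - alpha)) * dJ/dw, i.e. effective rate eta / (1 - alpha)."""
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw
        w = w + dw
    return w, dw

# Constant-gradient "plateau": dJ/dw = 1 everywhere
w, dw = sgd_momentum(lambda w: np.ones_like(w), [0.0], eta=0.1, alpha=0.9)
print(round(float(dw[0]), 3))   # -> -1.0, i.e. -eta/(1-alpha) = -0.1/0.1
```

The step has grown to ten times the plain gradient step, matching the η/(1-α) effective-rate argument above.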
For the actual parameter values to use, consult Jacobs' paper and also "Getting a Fast Break with Backprop," Tveter, AI Expert Magazine (excerpted in the pdf files that I provided).
Students’ Questions from 2005
Is there any way to design a spherical error surface for faster convergence ?
Momentum provides inertia to jump over a small peak.
Parameter optimization techniques seem to be a good aid to NN design.
I am afraid that optimizing even the sigmoid slope and the learning rate may expedite overfitting.
In what aspect is it more manageable to remove the mean, decorrelate, etc. ?
How does using a bigger learning rate for the output layer help learning ?
Does the solution always converge if we use the gradient descent ?
Are there any shortcomings in using fast learning algorithms?
In Ahalt's modular net, is it faster for a single output only, or for all the outputs, compared with one MLP?
Various fast learning methods have been proposed. Which is the best one? Is it problem-dependent?
The Jacobs method cannot find the global minimum for an error surface like the one shown.
⑫ Conjugate Gradient: Fletcher & Reeves

Line search: w(n+1) = w(n) + η(n) s(n), where s(n) is the search direction and η(n) minimizes E[w(n) + η s(n)].
- If η is fixed and s(n) = -g(n) = -∇E[w(n)]: gradient descent.
- If η is chosen by Min_η E[w(n) - η g(n)]: steepest descent. At the line minimum, ∇E[w(n) - η* g(n)]^T g(n) = 0, so the new gradient is orthogonal to the old one: g(n+1) ⊥ g(n), and the next direction is s(n+1) = -g(n+1).
[Figure: weight trajectories w(n) → w(n+1) → w(n+2) on the error surface E[w(n)] for gradient descent, steepest descent (gradient descent + line search, with successive steps orthogonal), and conjugate gradient (steepest descent + a momentum-like reuse of the previous direction s(n)).]
1) Line search along s(n); then set s(n+1) = -g(n+1) + β(n) s(n).
2) Choose β(n) such that s(n+1)^T H s(n) = 0 (conjugate directions), where H = [∂²E/∂w_i∂w_j] is the Hessian of the local quadratic model E(w) ≈ E(w_0) + ∇E(w_0)^T (w - w_0) + ½ (w - w_0)^T H (w - w_0).

Polak-Ribiere rule: β(n) = (g(n+1) - g(n))^T g(n+1) / ‖g(n)‖².

From the line search, ∇E[w(n) + η s(n)]^T s(n) = 0 at the minimum, i.e. g(n+1)^T s(n) = 0.
[Flowchart, redrawn as steps:]
START: initialize s(0) = -g(0) = -∇E[w(0)].
Repeat:
  1. Line search: η* = arg min_η E(w(n) + η s(n)).
  2. w(n+1) = w(n) + η* s(n).
  3. g(n+1) = ∇E[w(n+1)].
  4. s(n+1) = -g(n+1) + β(n) s(n).
Until ‖∇E(w(n))‖ < ε1, or E(w(n)) < ε2, or n > n_max. End.
Comparison of SD and CG:
- Each conjugate-gradient step takes a line search.
- For N-variable quadratic functions, CG converges in at most N steps.
- Recommended: alternate steepest descent with n steps of conjugate gradient (SD + n CG steps + SD + n CG steps + ...).
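The flowchart above can be sketched on a quadratic E(w) = ½ wᵀAw - bᵀw, where the line search has the closed form η = -(gᵀs)/(sᵀAs); the function name and the 2×2 example are illustrative:

```python
import numpy as np

def cg_quadratic(A, b, w, n_steps=None):
    """Polak-Ribiere conjugate gradient on E(w) = 0.5 w^T A w - b^T w,
    whose gradient is g = A w - b.  Converges in at most N steps."""
    n_steps = n_steps or len(b)
    g = A @ w - b
    s = -g                                   # s(0) = -g(0)
    for _ in range(n_steps):
        eta = -(g @ s) / (s @ A @ s)         # exact line-search minimum
        w = w + eta * s
        g_new = A @ w - b
        beta = ((g_new - g) @ g_new) / (g @ g)   # Polak-Ribiere rule
        s = -g_new + beta * s
        g = g_new
        if np.linalg.norm(g) < 1e-12:        # stopping criterion
            break
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = cg_quadratic(A, b, np.zeros(2))
print(np.allclose(A @ w, b))   # -> True: minimum reached in N = 2 steps
```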
X. Swarm Intelligence
- What is "swarm intelligence" and why is it interesting?
- Two kinds of swarm intelligence: particle swarm optimization, ant colony optimization
- Some applications
- Discussion
What is "swarm intelligence"?
"Swarm Intelligence is a property of systems of non-intelligent agents exhibiting collectively intelligent behavior."
Characteristics of a swarm:
- distributed; no central control or data source
- no (explicit) model of the environment
- perception of the environment
- ability to change the environment
I can’t do…
We can do…
A group of friends, each carrying a metal detector, is on a treasure-finding mission. Each can communicate the detector signal and current position to the n nearest neighbors. If a neighbor is closer to the treasure than you are, you can move toward that neighbor, thereby improving your own chance of finding the treasure. The treasure may also be found more easily than if you were searching on your own.
Individuals in a swarm interact to solve a global objective more efficiently than any single individual could. A swarm is defined as a structured collection of interacting organisms [ants, bees, wasps, termites, fish in schools, and birds in flocks] or agents. Within a swarm, individuals are simple in structure, but their collective behaviors can be quite complex: the global behavior of a swarm emerges in a nonlinear manner from the behavior of the individuals in it.
The interaction among individuals plays a vital role in shaping the swarm's behavior. Interaction aids in refining experiential knowledge about the environment and enhances the progress of the swarm toward optimality. The interaction is determined genetically or through social interaction.
Applications: function optimization, optimal route finding, scheduling, image and data analysis.
Why is it interesting? The robust nature of animal problem-solving:
simple creatures exhibit complex behavior, modified by a dynamic environment, e.g. ants, bees, birds, fish, etc.
Two kinds of swarm intelligence:
- Particle swarm optimization: proposed in 1995 by J. Kennedy and R. C. Eberhart, based on the behavior of bird flocks and fish schools.
- Ant colony optimization: defined in 1999 by Dorigo, Di Caro and Gambardella, based on the behavior of ant colonies.
1. Particle Swarm Optimization
A population-based method with three main principles:
- a particle has a movement
- the particle wants to go back to the best previously visited position
- the particle tries to get to the position of the best-positioned particles

Four types of neighborhood:
- star (global): all particles are neighbors of all particles
- ring (circle): particles have a fixed number of neighbors K (usually 2)
- wheel: only one particle is connected to all the others and acts as a "hub"
- random: N random connections are made between the particles
Algorithm:
- Initialization: x_id(0) = random value, v_id(0) = 0
- Calculate performance: F(x_id(t)) (F: performance measure)
- Update best particle: if F(x_id(t)) is better than pbest, set pbest = F(x_id(t)) and p_id = x_id(t); same for gbest (see next slide)
- Move each particle
- Repeat until the system converges
Particle Dynamics
v_id(t+1) = v_id(t) + c1 r1 (p_id - x_id(t)) + c2 r2 (p_gd - x_id(t)),  x_id(t+1) = x_id(t) + v_id(t+1),
with r1, r2 uniform random in [0, 1]. For convergence, c1 + c2 < 4 [Kennedy 1998].
Examples
http://uk.geocities.com/markcsinclair/pso.html
http://www.engr.iupui.edu/~shi/PSO/AppletGUI.html
⑬ Fuzzy control of the learning rate and slope (Principe, Chap. 4.16)
⑭ Local Minimum Problem
• Restart with different initial weights, learning rates, and numbers of hidden nodes.
• Add (and anneal) a little noise (zero-mean white Gaussian) to the weights or the training data [the desired output, or the input (for better generalization)].
• Use {simulated annealing} or {genetic-algorithm optimization, then BP}.
⑮ Design aided by a graphical user interface, an "NN oscilloscope": look at the internal weights and node activities with color coding.
Students’ Questions from 2005
When the learning rate is optimized and initialized, there must be a rough boundary for it. Is there only an empirical way to find it?
In Conjugate Gradient, s(n) = -g(n+1) …
The learning-rate annealing just keeps decreasing η as n grows, without looking at where on the error surface the current weights are. Is this OK?
Conjugate Gradient is similar to Momentum in that the old search direction is used in determining the new one. It is also similar to the Δ-bar-Δ rule in using the past trend.
Is CG always faster converging than the SD ?
Do different initial values of the weights affect the output results? How can we choose them?