1
Feed-Forward Artificial Neural Networks
AMIA 2003, Machine Learning TutorialConstantin F. Aliferis & Ioannis Tsamardinos
Discovery Systems LaboratoryDepartment of Biomedical Informatics
Vanderbilt University
2
Binary Classification Example
[Scatter plot: value of predictor 2 vs. value of predictor 1]
Example:
•Classification of breast tumors as malignant or benign from mammograms
•Predictor 1: clump thickness
•Predictor 2: single epithelial cell size
3
Possible Decision Areas
[Plot: value of predictor 2 vs. value of predictor 1, with one decision area for the red-circle class and one for the green-triangle class]
4
Possible Decision Areas
[Plot: another possible partition of the predictor space into decision areas]
5
Possible Decision Areas
[Plot: another possible partition of the predictor space into decision areas]
6
Binary Classification Example
[Plot: value of predictor 2 vs. value of predictor 1]
The simplest non-trivial decision function is the straight line (in general a hyperplane)
One decision surface
Decision surface partitions space into two subspaces
7
Specifying a Line
Line equation: w2x2 + w1x1 + w0 = 0
Classifier: if w2x2 + w1x1 + w0 > 0, output one class; else output the other
[Plot: the line w2x2 + w1x1 + w0 = 0 in the (x1, x2) plane; w2x2 + w1x1 + w0 > 0 on one side of it and w2x2 + w1x1 + w0 < 0 on the other]
8
Classifying with Linear Surfaces
Use 1 for one class and -1 for the other. The classifier becomes sgn(w2x2 + w1x1 + w0) = sgn(w2x2 + w1x1 + w0x0), with x0 = 1 always.
[Plot: as before, the line w2x2 + w1x1 + w0 = 0 separating the regions where the expression is positive and negative]
In general: $\operatorname{sgn}\left(\sum_{i=0}^{n} w_i x_i\right) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x})$, where $n$ is the number of predictors.
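As a concrete illustration (not from the original slides), here is a minimal sketch of this linear classifier in Python, assuming weights ordered ⟨w0, w1, ..., wn⟩ with w0 the bias:

```python
import numpy as np

def linear_classify(w, x):
    """Classify x with sgn(w . x), prepending the constant bias input x0 = 1."""
    x = np.concatenate(([1.0], x))        # x0 = 1 always
    return 1 if np.dot(w, x) > 0 else -1  # sgn of the weighted sum

# Example: the line 1*x1 - 0.5 = 0 (i.e., x1 = 0.5) in a two-predictor space
w = np.array([-0.5, 1.0, 0.0])            # <w0, w1, w2>
print(linear_classify(w, [1.0, 0.0]))     # x1 = 1 is right of the line -> 1
print(linear_classify(w, [0.0, 1.0]))     # x1 = 0 is left of the line -> -1
```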
9
The Perceptron
Input: the attributes of the patient to classify (x4, x3, x2, x1, plus the constant bias input x0 = 1)
Output: classification of the patient (malignant or benign)
[Diagram: a perceptron unit with weights W4=3, W3=0, W2=-2, W1=4, W0=2 on inputs x4, x3, x2, x1, x0]
10
The Perceptron
Input: the attributes of the patient to classify
Output: classification of the patient (malignant or benign)
[Diagram: the same perceptron with concrete input values 2, 3.1, 4, 1, 0 on its inputs]
11
The Perceptron
Input: the attributes of the patient to classify
Output: classification of the patient (malignant or benign)
[Diagram: the same perceptron and inputs; the unit computes the weighted sum]
$\sum_{i=0}^{n} w_i x_i = 3$
12
The Perceptron
Input: the attributes of the patient to classify
Output: classification of the patient (malignant or benign)
[Diagram: the same perceptron and inputs]
$\sum_{i=0}^{n} w_i x_i = 3$, and sgn(3) = 1
13
The Perceptron
Input: the attributes of the patient to classify
Output: classification of the patient (malignant or benign)
[Diagram: the same perceptron and inputs]
$\sum_{i=0}^{n} w_i x_i = 3$, sgn(3) = 1, so the perceptron outputs 1
14
Learning with Perceptrons
Hypothesis space: $H = \{\langle w_0, w_1, \dots, w_n \rangle \mid w_i \in \mathbb{R}\}$
Inductive bias: prefer hypotheses that do not misclassify any of the training instances (or that minimize an error function).
Search method: perceptron training rule, gradient descent, etc.
Remember: the problem is to find "good" weights.
15
Training Perceptrons
Start with random weights
Update them in an intelligent way to improve them
Intuitively:
Decrease the weights that increase the sum
Increase the weights that decrease the sum
Repeat for all training instances until convergence
[Diagram: perceptron with weights 3, 0, -2, 4, 2 and input values 2, 3.1, 4, 1, 0; $\sum_{i=0}^{n} w_i x_i = 3$, so sgn(3) = 1, but the true output is -1]
16
Perceptron Training Rule
t: the output the current example should have
o: the output of the perceptron
xi: the value of predictor variable i
t = o: no change (for correctly classified examples)
$\Delta w_i = \eta\,(t - o)\,x_i$
$w_i' = w_i + \Delta w_i$
($\eta$ is the learning rate; $\eta = 0.5$ in the OR example below.)
17
Perceptron Training Rule
t = -1, o = 1: the sum will decrease
t = 1, o = -1: the sum will increase
$\Delta w_i = \eta\,(t - o)\,x_i, \qquad w_i' = w_i + \Delta w_i$
$\sum_i w_i' x_i = \sum_i (w_i + \Delta w_i)\,x_i = \sum_i w_i x_i + \eta\,(t - o) \sum_i x_i^2$
Since $\sum_i x_i^2 \ge 0$, the correction has the sign of $(t - o)$: the weighted sum inside $o = \operatorname{sgn}\left(\sum_i w_i x_i\right)$ moves toward the target.
18
Perceptron Training Rule
In vector form:
$\Delta \mathbf{w} = \eta\,(t - o)\,\mathbf{x}$
$\mathbf{w}' = \mathbf{w} + \eta\,(t - o)\,\mathbf{x}$
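A minimal sketch of this rule in Python (an illustration; the learning rate eta is a parameter, and the bias is assumed to be a component of w matched by x0 = 1 in x):

```python
import numpy as np

def perceptron_update(w, x, t, eta=0.5):
    """One application of the perceptron training rule.

    w: weight vector (bias weight included)
    x: input vector (with the bias input x0 = 1 included)
    t: target output, +1 or -1
    """
    o = 1 if np.dot(w, x) > 0 else -1  # current perceptron output
    return w + eta * (t - o) * x       # w' = w + eta*(t - o)*x; no change if t == o
```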
19
Example of Perceptron Training: The OR function
[Plot: the four OR training examples in the (x1, x2) plane: (0, 0) labeled -1; (0, 1), (1, 0), and (1, 1) labeled 1]
20
Example of Perceptron Training: The OR function
Initial random weights: $0 \cdot x_2 + 1 \cdot x_1 - 0.5 = 0$
Defines the line x1 = 0.5
[Plot: the OR training examples]
21
Example of Perceptron Training: The OR function
Initial random weights: $0 \cdot x_2 + 1 \cdot x_1 - 0.5 = 0$
Defines the line x1 = 0.5
[Plot: the OR training examples with the line x1 = 0.5; the classifier outputs 1 in the area right of the line and -1 in the area left of it]
22
Example of Perceptron Training: The OR function
[Plot: the OR training examples with the line x1 = 0.5]
Only misclassified example: x1 = 0, x2 = 1
23
Example of Perceptron Training: The OR function
[Plot: the OR training examples with the line x1 = 0.5]
Only misclassified example: x1 = 0, x2 = 1
Old line: $0 \cdot x_2 + 1 \cdot x_1 - 0.5 = 0$
Applying the rule (vectors ordered $\langle w_2, w_1, w_0 \rangle$, with $\mathbf{x} = \langle x_2, x_1, x_0 \rangle = \langle 1, 0, 1 \rangle$, $t = 1$, $o = -1$, $\eta = 0.5$):
$\mathbf{w}' = \mathbf{w} + \eta\,(t - o)\,\mathbf{x}$
$\mathbf{w}' = \langle 0, 1, -0.5 \rangle + 0.5\,(1 - (-1))\,\langle 1, 0, 1 \rangle$
$\mathbf{w}' = \langle 0, 1, -0.5 \rangle + \langle 1, 0, 1 \rangle$
$\mathbf{w}' = \langle 1, 1, 0.5 \rangle$
24
Example of Perceptron Training: The OR function
[Plot: the OR training examples with the old line x1 = 0.5]
Only misclassified example: x1 = 0, x2 = 1
New weights: $1 \cdot x_2 + 1 \cdot x_1 + 0.5 = 0$
For x2 = 0, x1 = -0.5; for x1 = 0, x2 = -0.5
So the new line crosses both axes at -0.5.
25
Example of Perceptron Training: The OR function
[Plot: the OR training examples with the old line x1 = 0.5 and the new line $1 \cdot x_2 + 1 \cdot x_1 + 0.5 = 0$, which crosses the x1 axis at -0.5 (for x2 = 0) and the x2 axis at -0.5 (for x1 = 0)]
26
Example of Perceptron Training: The OR function
[Plot: the OR training examples with the line $1 \cdot x_2 + 1 \cdot x_1 + 0.5 = 0$]
Misclassified example: x1 = 0, x2 = 0
Next iteration:
$\mathbf{w}' = \mathbf{w} + \eta\,(t - o)\,\mathbf{x}$
$\mathbf{w}' = \langle 1, 1, 0.5 \rangle + 0.5\,(-1 - 1)\,\langle 0, 0, 1 \rangle$
$\mathbf{w}' = \langle 1, 1, 0.5 \rangle - \langle 0, 0, 1 \rangle$
$\mathbf{w}' = \langle 1, 1, -0.5 \rangle$
27
Example of Perceptron Training: The OR function
[Plot: the OR training examples with the final line $1 \cdot x_2 + 1 \cdot x_1 - 0.5 = 0$; all four examples are now on the correct side]
Perfect classification
No change occurs next
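The whole trace can be reproduced with a few lines of Python (a sketch assuming the slides' ⟨w2, w1, w0⟩ ordering and η = 0.5):

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])  # rows are <x2, x1, x0>
T = np.array([-1, 1, 1, 1])                                  # OR targets
w = np.array([0.0, 1.0, -0.5])                               # initial line x1 = 0.5
eta = 0.5

for epoch in range(100):
    changed = False
    for x, t in zip(X, T):
        o = 1 if w @ x > 0 else -1
        if o != t:                     # apply the rule only when misclassified
            w = w + eta * (t - o) * x
            changed = True
            print("updated w to", w)   # <1, 1, 0.5>, then <1, 1, -0.5>
    if not changed:                    # an epoch with no updates: converged
        break
print("final weights:", w)
```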
28
Analysis of the Perceptron Training Rule
The algorithm will always converge within a finite number of iterations if the data are linearly separable.
Otherwise, it may oscillate (no convergence).
29
Training by Gradient Descent
Similar, but:
Always converges
Generalizes to training networks of perceptrons (neural networks)
Idea:
Define an error function
Search for weights that minimize the error, i.e., find weights that zero the error gradient
30
Setting Up the Gradient Descent
Mean squared error: $E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
Minima exist where the gradient is zero:
$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d) = \sum_{d \in D} (t_d - o_d) \left(-\frac{\partial o_d}{\partial w_i}\right)$
31
The Sign Function is not Differentiable
$\frac{\partial o_d}{\partial w_i} = 0$ everywhere, except where $\sum_i w_i x_i = 0$ (where the derivative is undefined), so the gradient gives no direction to move in.
[Plot: the OR training examples with the line x1 = 0.5]
32
Use Differentiable Transfer Functions
Replace $\operatorname{sgn}(\mathbf{w} \cdot \mathbf{x})$
with the sigmoid $\operatorname{sig}(\mathbf{w} \cdot \mathbf{x})$, where
$\operatorname{sig}(y) = \frac{1}{1 + e^{-y}}, \qquad \frac{d\,\operatorname{sig}(y)}{dy} = \operatorname{sig}(y)\,(1 - \operatorname{sig}(y))$
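In Python, both the function and its convenient derivative are one-liners (an illustrative sketch):

```python
import numpy as np

def sig(y):
    """Sigmoid transfer function: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

def dsig(y):
    """Its derivative: sig(y) * (1 - sig(y))."""
    s = sig(y)
    return s * (1.0 - s)
```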
33
Calculating the Gradient
$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d) \left(-\frac{\partial o_d}{\partial w_i}\right)$
$= -\sum_{d \in D} (t_d - o_d)\,\frac{\partial}{\partial w_i} \operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d)$
$= -\sum_{d \in D} (t_d - o_d)\,\operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d)\,(1 - \operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d))\,\frac{\partial (\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_i}$
$= -\sum_{d \in D} (t_d - o_d)\,\operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d)\,(1 - \operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d))\,x_{d,i}$
34
Updating the Weights with Gradient Descent
Each weight update goes through all training instances
Each weight update is more expensive but more accurate
Always converges to a local minimum regardless of the data
$\mathbf{w}' = \mathbf{w} - \eta\,\nabla E(\mathbf{w})$, i.e.,
$w_i' = w_i + \eta \sum_{d \in D} (t_d - o_d)\,\operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d)\,(1 - \operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d))\,x_{d,i}$
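Putting the last few slides together, a batch gradient descent sketch for a single sigmoid unit might look like this (illustrative Python; eta and epochs are assumed hyperparameters):

```python
import numpy as np

def sig(y):
    return 1.0 / (1.0 + np.exp(-y))

def train_sigmoid_unit(X, t, eta=0.1, epochs=1000):
    """Batch gradient descent on the mean squared error.

    X: (D, n+1) matrix of training inputs, bias column of 1s included
    t: (D,) vector of targets in [0, 1]
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                # each epoch uses ALL training instances
        o = sig(X @ w)                     # outputs o_d for every example d
        delta = (t - o) * o * (1.0 - o)    # (t_d - o_d) * sig'(w . x_d)
        w += eta * (X.T @ delta)           # w_i += eta * sum_d delta_d * x_d,i
    return w
```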
35
Multiclass Problems
E.g., 4 classes. Two possible solutions:
Use one perceptron (output unit) and interpret the results as follows:
Output 0 - 0.25: class 1
Output 0.25 - 0.5: class 2
Output 0.5 - 0.75: class 3
Output 0.75 - 1: class 4
Use four output units and a binary encoding of the output (1-of-m encoding).
36
1-of-M Encoding
Assign to the class with the largest output
[Diagram: network on inputs x4, x3, x2, x1, x0 with four output units: output for class 1, output for class 2, output for class 3, output for class 4]
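Decoding a 1-of-M output is just an argmax (a sketch with made-up output values):

```python
import numpy as np

outputs = np.array([0.10, 0.70, 0.15, 0.05])   # hypothetical outputs, one unit per class
predicted_class = int(np.argmax(outputs)) + 1  # assign to the class with largest output
print(predicted_class)                         # -> 2
```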
37
Feed-Forward Neural Networks
[Diagram: fully connected feed-forward network: input layer (x4, x3, x2, x1, x0), hidden layer 1, hidden layer 2, output layer]
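A forward pass through such a network is a loop over layers (an illustrative sketch; one weight matrix per layer, with an extra column for the constant bias input 1):

```python
import numpy as np

def sig(y):
    return 1.0 / (1.0 + np.exp(-y))

def forward(x, weight_matrices):
    """Propagate input x through fully connected sigmoid layers.

    weight_matrices: list of (n_out, n_in + 1) arrays, one per layer;
    the extra column multiplies the constant bias input 1.
    """
    for W in weight_matrices:
        x = sig(W @ np.concatenate(([1.0], x)))  # prepend bias, apply the layer
    return x
```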
38
Increased Expressiveness Example: Exclusive OR
[Plot: the four XOR training examples in the (x1, x2) plane: (0, 0) and (1, 1) labeled -1; (0, 1) and (1, 0) labeled 1]
No line (no set of three weights) can separate the training examples (learn the true function).
[Diagram: a single perceptron with weights w2, w1, w0 on inputs x2, x1, x0]
39
Increased Expressiveness Example
[Plot: the four XOR training examples]
[Diagram: two-layer network on inputs x2, x1, x0: two hidden units with weights w2,1, w1,1, w0,1 and w2,2, w1,2, w0,2, feeding an output unit with weights w'1,1, w'2,1]
40
Increased Expressiveness Example
[Plot: the four XOR training examples]
[Diagram: hidden unit H1 with weights 1, -1, 0.5 and hidden unit H2 with weights -1, 1, 0.5 on x2, x1, x0; output unit O with weights 1, 1 on H1, H2]

X1  X2  C
0   0   -1
0   1   1
1   0   1
1   1   -1
41
Increased Expressiveness Example
[Plot: the four XOR training examples]
[Diagram: hidden units H1 (weights 1, -1, 0.5) and H2 (weights -1, 1, 0.5) on x2, x1, x0; output unit O with weights 1, 1]

     X1  X2  C   H1  H2  O
T1   0   0   -1
T2   0   1   1
T3   1   0   1
T4   1   1   -1
42
Increased Expressiveness Example
[Plot: the four XOR training examples; input (0, 0) is being evaluated]
[Diagram: hidden unit H1 with weights 1 on x1, -1 on x2, -0.5 on x0 = 1; hidden unit H2 with weights -1 on x1, 1 on x2, -0.5 on x0; output unit O with weights 1, 1 on H1, H2]

     X1  X2  C   H1  H2  O
T1   0   0   -1
T2   0   1   1
T3   1   0   1
T4   1   1   -1
43
Increased Expressiveness Example
[Plot: the four XOR training examples; evaluating input (0, 0)]
[Diagram: with x1 = 0, x2 = 0: H1 = -1, H2 = -1, and O = -1]

     X1  X2  C   H1  H2  O
T1   0   0   -1  -1  -1  -1
T2   0   1   1
T3   1   0   1
T4   1   1   -1
44
Increased Expressiveness Example
[Plot: the four XOR training examples; evaluating input (0, 1)]
[Diagram: with x1 = 0, x2 = 1: H1 = -1, H2 = 1, and O = 1]

     X1  X2  C   H1  H2  O
T1   0   0   -1  -1  -1  -1
T2   0   1   1   -1  1   1
T3   1   0   1
T4   1   1   -1
45
Increased Expressiveness Example
[Plot: the four XOR training examples; evaluating input (1, 0)]
[Diagram: with x1 = 1, x2 = 0: H1 = 1, H2 = -1, and O = 1]

     X1  X2  C   H1  H2  O
T1   0   0   -1  -1  -1  -1
T2   0   1   1   -1  1   1
T3   1   0   1   1   -1  1
T4   1   1   -1
46
Increased Expressiveness Example
[Plot: the four XOR training examples; evaluating input (1, 1)]
[Diagram: with x1 = 1, x2 = 1: H1 = -1, H2 = -1, and O = -1]

     X1  X2  C   H1  H2  O
T1   0   0   -1  -1  -1  -1
T2   0   1   1   -1  1   1
T3   1   0   1   1   -1  1
T4   1   1   -1  -1  -1  -1

The network classifies all four XOR examples correctly.
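This hand-built network can be checked in a few lines of Python (a sketch; the hidden weights follow the truth table above, and the output unit's bias of +0.5 is inferred from that table, since the transcript shows only the two output weights of 1):

```python
def sgn(y):
    return 1 if y > 0 else -1

def xor_net(x1, x2):
    h1 = sgn(1 * x1 - 1 * x2 - 0.5)    # fires only for (x1, x2) = (1, 0)
    h2 = sgn(-1 * x1 + 1 * x2 - 0.5)   # fires only for (x1, x2) = (0, 1)
    return sgn(1 * h1 + 1 * h2 + 0.5)  # OR of the two hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))  # -1, 1, 1, -1: XOR
```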
49
From the Viewpoint of the Output Layer
[Diagram: output unit O with weights 1, 1 on H1, H2]

     C   H1  H2  O
T1   -1  -1  -1  -1
T2   1   -1  1   1
T3   1   1   -1  1
T4   -1  -1  -1  -1

[Plot: the four examples T1-T4 in the original (x1, x2) space, and, mapped by the hidden layer, in the (H1, H2) space, where T1 and T4 coincide and the classes become linearly separable]
50
From the Viewpoint of the Output Layer
[Plot: the same mapping: T1-T4 in (x1, x2) space, and their images under the hidden layer in (H1, H2) space]
•Each hidden layer maps to a new instance space
•Each hidden node is a new constructed feature
•Original Problem may become separable (or easier)
51
How to Train Multi-Layered Networks
Select a network structure (number of hidden layers, hidden nodes, and connectivity).
Select transfer functions that are differentiable.
Define a (differentiable) error function.
Search for weights that minimize the error function, using gradient descent or another optimization method.
BACKPROPAGATION
52
How to Train Multi-Layered Networks
Select a network structure (number of hidden layers, hidden nodes, and connectivity).
Select transfer functions that are differentiable.
Define a (differentiable) error function: $E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
Search for weights that minimize the error function, using gradient descent or another optimization method.
[Diagram: the two-layer XOR network with weights w2,1, w1,1, w0,1, w2,2, w1,2, w0,2 and w'1,1, w'2,1]
53
Back-Propagating the Error
[Diagram: three-layer network; the error is computed at the output unit(s) and propagated back through the hidden unit(s) to the input unit(s)]
Notation:
w_ij: weight on the connection from unit j to unit i (similarly, w_jk from unit k to unit j)
x_j: the jth output
i: index for the output layer; j: index for the hidden layer; k: index for the input layer
x_i: outputs of the output layer; x_j: outputs of the hidden layer; x_k: the inputs
54
Back-Propagating the Error
For the full training set:
$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\,\operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d)\,(1 - \operatorname{sig}(\mathbf{w} \cdot \mathbf{x}_d))\,x_{d,i}$
Contribution of a single training example:
$\frac{\partial E}{\partial w_i} = -(t - o)\,\operatorname{sig}(\mathbf{w} \cdot \mathbf{x})\,(1 - \operatorname{sig}(\mathbf{w} \cdot \mathbf{x}))\,x_i$
[Diagram: the three-layer network with outputs x_i, x_j, x_k and weights w_ij, w_jk]
55
Back-Propagating the Error
[Diagram: the three-layer network]
For a weight w_ij into an output unit:
$\frac{\partial E}{\partial w_{ij}} = -(t - o)\,\operatorname{sig}(1 - \operatorname{sig})\,x_j$
(sig is evaluated at the output unit's weighted input)
56
Back-Propagating the Error
[Diagram: the three-layer network]
For a weight w_jk into a hidden unit:
$\frac{\partial E}{\partial w_{jk}} = -\,?\;\operatorname{sig}(1 - \operatorname{sig})\,x_k$
(a hidden unit has no target value, so what replaces the error term $(t - o)$?)
57
Back-Propagating the Error
[Diagram: the three-layer network]
$\frac{\partial E}{\partial w_{jk}} = -\left(\sum_{i \in \text{Outputs}} \delta_i\,w_{ij}\right) \operatorname{sig}(1 - \operatorname{sig})\,x_k$
The hidden unit's share of the error is the sum of the output units' error terms $\delta_i$, weighted by its connections $w_{ij}$ to them.
58
Back-Propagating the Error
[Diagram: multi-layer network; unit m in the output layer, with weight w_mi from hidden unit i and output x_m]
Output units: $\delta_m = x_m\,(1 - x_m)\,(t_m - x_m)$
Hidden units: $\delta_i = x_i\,(1 - x_i) \sum_{m \in \text{Outputs}} w_{mi}\,\delta_m$, and in general $\delta_j = x_j\,(1 - x_j) \sum_{i \in \text{NextLayer}} w_{ij}\,\delta_i$
Weight update: $w_{tr}' = w_{tr} - \eta\,\frac{\partial E}{\partial w_{tr}} = w_{tr} + \eta\,\delta_t\,x_r$
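These update rules translate almost line for line into code. Below is a minimal backpropagation sketch in Python for one hidden layer of sigmoid units (illustrative; the weight shapes and learning rate eta are assumptions):

```python
import numpy as np

def sig(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(x, t, W_hid, W_out, eta=0.1):
    """One weight update for a single training example.

    x: input vector with the bias x0 = 1 already included
    t: target output vector
    W_hid: (n_hidden, n_in) weights, input -> hidden
    W_out: (n_out, n_hidden + 1) weights, hidden -> output (column 0 is the bias)
    """
    # Forward pass
    h = sig(W_hid @ x)                  # hidden outputs x_j
    h1 = np.concatenate(([1.0], h))     # add the bias unit for the output layer
    o = sig(W_out @ h1)                 # network outputs x_m

    # Backward pass: the delta terms above
    delta_out = o * (1 - o) * (t - o)                       # x_m(1-x_m)(t_m-x_m)
    delta_hid = h * (1 - h) * (W_out[:, 1:].T @ delta_out)  # x_j(1-x_j) sum_m w_mj delta_m

    # Updates: w <- w + eta * delta * input
    W_out += eta * np.outer(delta_out, h1)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out
```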
59
Training with BackPropagation
Going once through all training examples and updating the weights: one epoch
Iterate until a stopping criterion is satisfied
The hidden layers learn new features and map to new spaces
Training reaches a local minimum of the error surface
60
Overfitting with Neural Networks
If the number of hidden units (and weights) is large, it is easy to memorize the training set (or parts of it) and fail to generalize
Typically, the optimal number of hidden units is much smaller than the number of input units
Each hidden layer maps to a space of smaller dimension
Training is stopped before local minimum is reached
61
Typical Training Curve
[Plot: error vs. epoch. The error on the training set decreases steadily; the real error (or the error on an independent validation set) first decreases and then rises again. The ideal point to stop training is where the validation error is at its minimum]
62
Example of Training Stopping Criteria
Split the data into train-validation-test sets
Train on the training set until the error on the validation set is increasing (by more than epsilon over the last m iterations), or until a maximum number of epochs is reached
Evaluate final performance on the test set
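A sketch of this criterion (the `train_one_epoch` and `error` calls stand in for a hypothetical network interface):

```python
def train_with_early_stopping(net, train_set, val_set,
                              max_epochs=500, eps=1e-4, m=10):
    """Stop when the validation error has risen by more than eps
    over the last m epochs, or when max_epochs is reached."""
    val_errors = []
    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)        # e.g., one pass of backpropagation
        val_errors.append(net.error(val_set))
        if len(val_errors) > m and val_errors[-1] > val_errors[-1 - m] + eps:
            break                             # validation error is increasing
    return net
```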
63
Classifying with Neural Networks
Determine the representation of the input:
E.g., Religion ∈ {Christian, Muslim, Jewish}
Represent as one input taking three different values, e.g., 0.2, 0.5, 0.8
Or represent as three inputs, taking 0/1 values
Determine the representation of the output (for multiclass problems): single output unit vs. multiple binary units
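The two input representations side by side (a sketch; the numeric codes 0.2/0.5/0.8 are the slide's example values):

```python
# One input taking three different values (imposes an artificial ordering):
religion_single = {"Christian": 0.2, "Muslim": 0.5, "Jewish": 0.8}

# Three 0/1 inputs, one per value (no ordering implied):
religion_onehot = {
    "Christian": [1, 0, 0],
    "Muslim":    [0, 1, 0],
    "Jewish":    [0, 0, 1],
}
```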
64
Classifying with Neural Networks
Select:
Number of hidden layers
Number of hidden units
Connectivity
Typically: one hidden layer, a number of hidden units that is a small fraction of the number of input units, and full connectivity
Select an error function
Typically: minimize the mean squared error or maximize the log likelihood of the data
65
Classifying with Neural Networks
Select a training method:
Typically gradient descent (corresponds to vanilla backpropagation)
Other optimization methods can be used:
Backpropagation with momentum
Trust-region methods
Line-search methods
Conjugate gradient methods
Newton and quasi-Newton methods
Select a stopping criterion
66
Classifying with Neural Networks
Select a training method:
May also include searching for an optimal structure
May include extensions to avoid getting stuck in local minima:
Simulated annealing
Random restarts with different weights
67
Classifying with Neural Networks
Split the data into:
Training set: used to update the weights
Validation set: used in the stopping criterion
Test set: used in evaluating generalization error (performance)
68
Other Error Functions in Neural Networks
Adding a penalty term for weight magnitude
Minimizing cross entropy with respect to the target values; the network outputs are then interpretable as probability estimates
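The slide does not spell out the formulas; the usual textbook forms of these two error functions are:

```latex
% Squared error with a weight-magnitude penalty (lambda controls its strength):
E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 + \lambda \sum_i w_i^2

% Cross entropy, for targets t_d in {0, 1} and outputs o_d read as probabilities:
E(\mathbf{w}) = -\sum_{d \in D} \bigl[ t_d \ln o_d + (1 - t_d) \ln (1 - o_d) \bigr]
```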
69
Representational Power
Perceptron: can learn only linearly separable functions
Boolean functions: learnable by a NN with one hidden layer
Continuous functions: learnable by a NN with one hidden layer and sigmoid units
Arbitrary functions: learnable by a NN with two hidden layers and sigmoid units
The required number of hidden units is in all cases unknown
70
Issues with Neural Networks
No principled method for selecting the number of layers and units
Tiling: start with a small network and keep adding units
Optimal brain damage: start with a large network and keep removing weights and units
Evolutionary methods: search the space of structures for one that generalizes well
No principled method for most other design choices
71
Important but not Covered in This Tutorial
Very hard to understand the classification logic from direct examination of the weights
Large recent body of work on extracting symbolic rules and information from neural networks
Recurrent networks, associative networks, self-organizing maps, committees of networks, adaptive resonance theory, etc.
72
Why the Name Neural Networks?
Initial models that simulate real neurons to use for classification
Efforts to improve and understand classification independent of similarity to biological neural networks
Efforts to simulate and understand biological neural networks to a larger degree
73
Conclusions
Can deal with both real and discrete domains
Can also perform density or probability estimation
Very fast classification time
Relatively slow training time (does not scale to thousands of inputs)
One of the most successful classifiers yet
Successful design choices are still a black art
Easy to overfit or underfit if care is not applied