CS 5751 Machine Learning
Chapter 4 Artificial Neural Networks 1
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4-5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
(figure: 30x32 sensor input retina feeding 4 hidden units, with outputs Sharp Left, Straight Ahead, Sharp Right)
Perceptron
(diagram: inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$ feeding a threshold unit)

$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$

Sometimes we will use simpler vector notation:

$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$
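The threshold rule above can be sketched in a few lines of Python (a minimal illustration; the function name and list-based weight representation are my own choices, not from the slides):

```python
def perceptron_output(weights, inputs):
    """weights = [w0, w1, ..., wn]; inputs = [x1, ..., xn] (x0 = 1 is implicit).
    Returns 1 if w . x > 0, else -1, exactly as in the threshold rule."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1
```

With weights [-1.5, 1, 1] this unit computes AND(x1, x2), anticipating the question on the next slide.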
Decision Surface of Perceptron
(figure: two plots in the (x1, x2) plane: a set of examples a single line can separate, and a set no line can separate)
Represents some useful functions
• What weights represent g(x1,x2) = AND(x1,x2)?
But some functions not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
Perceptron Training Rule

$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$

where:
• $t = c(\vec{x})$ is the target value
• $o$ is the perceptron output
• $\eta$ is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge if
• the training data is linearly separable, and
• $\eta$ is sufficiently small
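One application of the training rule can be sketched as follows (a hedged illustration; the function name and the choice to fold in $x_0 = 1$ are mine):

```python
def perceptron_train_step(weights, inputs, target, eta=0.1):
    """One application of w_i <- w_i + eta * (t - o) * x_i, with x0 = 1 implicit."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    o = 1 if net > 0 else -1                      # current perceptron output
    xs = [1.0] + list(inputs)                     # prepend x0 = 1
    return [w + eta * (target - o) * x for w, x in zip(weights, xs)]
```

Note that when the perceptron already classifies the example correctly, $t - o = 0$ and the weights are left unchanged.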
Gradient Descent

To understand, consider a simpler linear unit, where

$o = w_0 + w_1 x_1 + \cdots + w_n x_n$

Idea: learn $w_i$'s that minimize the squared error

$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

where $D$ is the set of training examples.
Gradient Descent
Gradient Descent

Gradient: $\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

Training rule: $\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

i.e., $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Gradient Descent

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)$
$\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})$
Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

- Initialize each $w_i$ to some small random value
- Until the termination condition is met, do
  - Initialize each $\Delta w_i$ to zero
  - For each <x, t> in training_examples, do
    * Input the instance x and compute the output $o$
    * For each linear unit weight $w_i$, do
      $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  - For each linear unit weight $w_i$, do
    $w_i \leftarrow w_i + \Delta w_i$
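The pseudocode above can be sketched directly in Python (a minimal sketch for a linear unit; the function name, list representation, and fixed epoch count standing in for "until the termination condition is met" are my own choices):

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit, following the pseudocode.
    training_examples: list of (x_list, t) pairs. Returns [w0, w1, ..., wn]."""
    n = len(training_examples[0][0])
    # Initialize each w_i to some small random value
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):            # stand-in for the termination condition
        dw = [0.0] * len(w)            # initialize each delta_w_i to zero
        for x, t in training_examples:
            xs = [1.0] + list(x)       # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            # delta_w_i <- delta_w_i + eta * (t - o) * x_i
            dw = [dwi + eta * (t - o) * xi for dwi, xi in zip(dw, xs)]
        # w_i <- w_i + delta_w_i (one batch update per pass over D)
        w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w
```

On noise-free data generated by t = 2x + 1, this converges to weights close to [1, 2], as the summary slide's convergence guarantee suggests.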
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$

$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
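The incremental mode differs from batch mode only in *when* the update is applied: after every single example rather than after a full pass. A hedged sketch for a one-input linear unit (function name and data representation are my own):

```python
def incremental_gd(training_examples, eta=0.05, epochs=500):
    """Incremental (stochastic) mode: update the weights after each example d,
    using the single-example error E_d = 0.5 * (t_d - o_d)^2."""
    w = [0.0, 0.0]  # [w0, w1] for a one-input linear unit
    for _ in range(epochs):
        for x, t in training_examples:
            xs = [1.0, x]
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # w_i <- w_i + eta * (t - o) * x_i, applied immediately per example
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w
```

Compared with the batch version, the accumulator Δw is gone; each example nudges the weights directly.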
Multilayer Networks of Sigmoid Units
(figure: a multilayer network with inputs x_1, x_2, hidden units h_1, h_2, and output o, shown with example decision regions labeled d_0 through d_3)
Multilayer Decision Space
Sigmoid Unit

(diagram: inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$; the unit computes $net$ and outputs $o = \sigma(net)$)

$net = \sum_{i=0}^{n} w_i x_i$

$o = \sigma(net) = \frac{1}{1 + e^{-net}}$

$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.

Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
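The sigmoid and its "nice property" are easy to check numerically (a small sketch; function names are mine):

```python
import math

def sigmoid(x):
    """The sigmoid (logistic) function sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """The nice property from the slide: dsigma/dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Comparing `sigmoid_derivative` against a finite-difference estimate confirms the identity, and this closed form is what makes the backpropagation deltas cheap to compute.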
The Sigmoid Function
(plot: output $\sigma(x) = \frac{1}{1 + e^{-x}}$ versus net input, over the range -6 to 6)

Sort of a rounded step function. Unlike the step function, we can take its derivative (this makes learning possible).
Error Gradient for a Sigmoid Unit

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)$
$= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}$

But we know:
$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)$
$\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$

So:
$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}$
Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
- For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k:
     $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
  3. For each hidden unit h:
     $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k$
  4. Update each network weight $w_{i,j}$:
     $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}$
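Steps 1-4 for a single training example can be sketched for a tiny fixed topology (a hedged illustration for a 2-input, 2-hidden-unit, 1-output sigmoid network; the function names and the nested-list weight layout are my own choices, not from the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(w_in, w_out, x, t, eta=0.5):
    """One training-example update. w_in[j] = [w0, w1, w2] into hidden unit j;
    w_out = [w0, wh1, wh2] into the single output unit."""
    # 1. Input the training example and compute the outputs
    xs = [1.0] + list(x)
    h = [sigmoid(sum(w * xi for w, xi in zip(wj, xs))) for wj in w_in]
    hs = [1.0] + h
    o = sigmoid(sum(w * hi for w, hi in zip(w_out, hs)))
    # 2. Delta for the output unit: o(1-o)(t-o)
    delta_o = o * (1 - o) * (t - o)
    # 3. Delta for each hidden unit: o_h(1-o_h) * sum_k w_hk * delta_k
    delta_h = [h[j] * (1 - h[j]) * w_out[j + 1] * delta_o for j in range(len(h))]
    # 4. Update each weight: w_ij <- w_ij + eta * delta_j * x_ij
    w_out = [w + eta * delta_o * hi for w, hi in zip(w_out, hs)]
    w_in = [[w + eta * delta_h[j] * xi for w, xi in zip(w_in[j], xs)]
            for j in range(len(w_in))]
    return w_in, w_out
```

Repeating this step over all examples until satisfied gives the full algorithm; for a small learning rate, each step moves downhill on the single-example error.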
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
– In practice, often works well (can run multiple times)
• Often include weight momentum
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
– Using network after training is fast
$\Delta w_{i,j}(n) = \eta \, \delta_j \, x_{i,j} + \alpha \, \Delta w_{i,j}(n-1)$
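The momentum update keeps a running record of the previous weight change; a minimal sketch for a single weight (the function name and the stand-in `grad_term` for $\delta_j x_{i,j}$ are my own):

```python
def momentum_update(w, grad_term, prev_dw, eta=0.1, alpha=0.9):
    """Weight momentum: dw(n) = eta * grad_term + alpha * dw(n-1),
    where grad_term stands for delta_j * x_ij and alpha is the momentum constant."""
    dw = eta * grad_term + alpha * prev_dw
    return w + dw, dw
```

When the gradient keeps pointing the same way, successive Δw's reinforce each other, which is why momentum can speed convergence across flat regions.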
Learning Hidden Layer Representations

A network trained to learn the identity function:

Input    → Output
00000001 → 00000001
00000010 → 00000010
00000100 → 00000100
00001000 → 00001000
00010000 → 00010000
00100000 → 00100000
01000000 → 01000000
10000000 → 10000000

(figure: network with 8 input units, a small hidden layer, and 8 output units)
Learning Hidden Layer Representations

Learned hidden layer values:

Input    → Hidden values → Output
00000001 → .01 .94 .60 → 00000001
00000010 → .98 .01 .80 → 00000010
00000100 → .99 .99 .22 → 00000100
00001000 → .02 .05 .03 → 00001000
00010000 → .71 .97 .99 → 00010000
00100000 → .27 .97 .01 → 00100000
01000000 → .88 .11 .01 → 01000000
10000000 → .08 .04 .89 → 10000000
Output Unit Error during Training
Sum of squared errors for each output unit
(plot: per-output-unit error versus training iterations, 0 to 2500)
Hidden Unit Encoding
Hidden unit encoding for one input
(plot: the hidden unit values for one input versus training iterations, 0 to 2500)
Input to Hidden Weights
Weights from inputs to one hidden unit
(plot: weights from the inputs to one hidden unit versus training iterations, 0 to 2500; values range roughly from -5 to 4)
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions become representable as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
Error versus weight updates (example 1)
(plot: training-set and validation-set error versus number of weight updates, 0 to 20000)
Overfitting in ANNs
Error versus weight updates (Example 2)
(plot: training-set and validation-set error versus number of weight updates, 0 to 6000)
Neural Nets for Face Recognition
(figure: 30x32 inputs feeding a network with outputs left, straight, right, up; typical input images shown)

90% accurate at learning head pose and recognizing 1-of-20 faces
Learned Network Weights
(figure: learned network weights visualized alongside typical input images; 30x32 inputs with outputs left, straight, right, up)
Alternative Error Functions

Penalize large weights:
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$

Train on target slopes as well as values:
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$

Tie weights together:
• e.g., in phoneme recognition
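The weight-penalty variant is simple to compute; a hedged sketch for a single training example (the function name, the per-example restriction, and the sample γ value are my own choices):

```python
def penalized_error(targets, outputs, weights, gamma=0.01):
    """Squared error over the output units plus gamma times the sum of
    squared weights, as in the 'penalize large weights' error function.
    Shown for one training example only."""
    squared = 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))
    penalty = gamma * sum(w * w for w in weights)
    return squared + penalty
```

Because the penalty's gradient with respect to each weight is 2γw, gradient descent on this error shrinks every weight toward zero on each step (weight decay).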
Recurrent Networks
(figure: three diagrams. Left: a feedforward network mapping x(t) to y(t+1). Middle: a recurrent network in which y(t+1) is fed back as an input alongside x(t). Right: the recurrent network unfolded in time, with copies receiving x(t-2), x(t-1), x(t) and producing y(t-1), y(t), y(t+1))
Neural Network Summary
• physiologically (neuron) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  - based on gradient descent
  - finds local minimum
  - affected by initial conditions, parameters
• neural units
  - linear
  - linear threshold
  - sigmoid
Neural Network Summary (cont)
• gradient descent
  - convergence
• linear units
  - limitation: hyperplane decision surface
  - learning rule
• multilayer network
  - advantage: can have non-linear decision surface
  - backpropagation to learn
    * backprop learning rule
• learning issues
  - units used
Neural Network Summary (cont)
• learning issues (cont)
  - batch versus incremental (stochastic)
  - parameters
    * initial weights
    * learning rate
    * momentum
  - cost (error) function
    * sum of squared errors
    * can include penalty terms
• recurrent networks
  - simple
  - backpropagation through time