Chapter 4
Artificial Neural Networks
Questions:
• What are ANNs?
• How to train an ANN? (algorithm)
• The representational power of ANNs (advantages and disadvantages)
What are ANNs? ------ Background
Consider humans
• Neuron switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4–10^5
• Scene recognition time: ~0.1 second
→ much parallel computation
• Property of a neuron: thresholded unit
One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed representation.
• Classification
• Voice recognition
• others
What are ANNs? ----- Problems related to ANNs
Properties of artificial neural nets (ANNs)
• Many neuron like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
4.1 Perceptrons
o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, −1 otherwise

To simplify notation, set x_0 = 1; then

o(x) = sgn(w · x) = sgn(Σ_{i=0}^{n} w_i x_i)

where w = (w_0, w_1, ..., w_n) is the weight vector and x = (x_0, x_1, ..., x_n) is the input vector.
Learning a perceptron involves choosing values for the weights. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors:

H = { w | w ∈ R^(n+1) }
We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances.
Two ways to train a perceptron:
Perceptron Training Rule and Gradient Descent
o(x) = sgn(w · x) = sgn(Σ_{i=0}^{n} w_i x_i)
(1). Perceptron Training Rule
w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i

• t = c(x) is the target value
• o is the perceptron output
• η is a small constant called the learning rate

• Initialize each w_i with a random value in the given interval
• Update the value of each w_i according to the training examples
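The rule above can be sketched in Python (a minimal illustration; the function and variable names are mine, not from the slides):

```python
import random

def sgn(v):
    # thresholded perceptron output: +1 if v > 0, else -1
    return 1 if v > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=100, seed=0):
    """examples: list of (x, t) pairs with t in {-1, +1};
    returns weights w, where w[0] is the bias weight (x0 = 1)."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = (1,) + tuple(x)                        # set x0 = 1
            o = sgn(sum(wi * xi for wi, xi in zip(w, xs)))
            for i in range(n + 1):
                w[i] += eta * (t - o) * xs[i]           # w_i <- w_i + eta*(t - o)*x_i
    return w

# Learn AND over {-1, +1} inputs (linearly separable)
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = train_perceptron(data)
```

Because the update is zero whenever t = o, the weights stop changing once every example is classified correctly.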
• A single perceptron can represent many boolean functions, such as AND, OR, NAND, NOR, but fails to represent XOR.
• Eg: g(x1, x2) = AND(x1 ,x2)
o(x1, x2) = sgn(- 0.8 + 0.5x1 + 0.5x2 )
Representation Power of Perceptrons
x1   x2   −0.8 + 0.5x1 + 0.5x2   o
−1   −1   −1.8                   −1
−1    1   −0.8                   −1
 1   −1   −0.8                   −1
 1    1    0.2                    1
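The table can be checked directly (a tiny sketch; `o` simply mirrors the formula above):

```python
def sgn(v):
    return 1 if v > 0 else -1

def o(x1, x2):
    # o(x1, x2) = sgn(-0.8 + 0.5*x1 + 0.5*x2)
    return sgn(-0.8 + 0.5 * x1 + 0.5 * x2)

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, o(x1, x2))  # matches the table: only (1, 1) gives +1
```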
(a) The training rule can be proved to converge if the training data are linearly separable and η is sufficiently small.
(b) But some functions are not representable, e.g. those that are not linearly separable (such as XOR).
(c) Every boolean function can be represented by some network of perceptrons only two levels deep.
(2). Gradient Descent
Key idea: searching the hypothesis space to find the weights that best fit the training examples.
Best fit: minimize the squared error

E(w) = (1/2) Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples.

Gradient:
∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n)

Training rule:
Δw = −η ∇E(w), componentwise Δw_i = −η ∂E/∂w_i
Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d − o_d)²
        = Σ_{d∈D} (t_d − o_d) ∂/∂w_i (t_d − w · x_d)
        = Σ_{d∈D} (t_d − o_d)(−x_id)

Therefore

Δw_i = η Σ_{d∈D} (t_d − o_d) x_id
w_i ← w_i + Δw_i
Gradient Descent Algorithm
• Initialize each w_i to some small random value
• Until the termination condition is met, Do
  – Initialize each Δw_i to zero
  – For each <x, t> in the training examples, Do
    • Input the instance x to the unit and compute the output o
    • For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t − o) x_i
  – For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i
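The batch algorithm above can be transcribed for a linear unit o = w · x (a sketch; names and the example data are illustrative):

```python
import random

def gradient_descent(examples, eta=0.01, epochs=500, seed=0):
    """Batch gradient descent for a linear unit o = w . x.
    examples: list of (x, t); returns weights w, where w[0] is the bias."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        delta = [0.0] * (n + 1)                  # initialize each Delta-w_i to zero
        for x, t in examples:
            xs = (1.0,) + tuple(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))    # unthresholded output
            for i in range(n + 1):
                delta[i] += eta * (t - o) * xs[i]        # accumulate over all of D
        for i in range(n + 1):
            w[i] += delta[i]                     # one weight update per pass over D
    return w

# Fit the linear target t = 1 + 2x
data = [((x,), 1 + 2 * x) for x in (-1.0, 0.0, 1.0, 2.0)]
w = gradient_descent(data)  # w approaches [1.0, 2.0]
```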
When to use gradient descent
• The hypothesis space is continuously parameterized
• The error can be differentiated with respect to the hypothesis parameters
Advantage vs Disadvantage
Advantages
• Guaranteed to converge to a hypothesis with minimum squared error, given a sufficiently small learning rate η;
• works even when the training data contain noise;
• works even when the training data are not linearly separable;
• for a linear unit, the error surface has a single global minimum, so gradient descent converges to it.
Disadvantages
• Convergence can sometimes be very slow;
• no guarantee of converging to the global minimum in cases where there are multiple local minima.
Incremental (Stochastic) Gradient Descent
Standard Gradient Descent
Do until satisfied
• Compute the gradient ∇E_D(w)
• w ← w − η ∇E_D(w)

Stochastic Gradient Descent
Do until satisfied
• For each training example d in D
  – Compute the gradient ∇E_d(w)
  – w ← w − η ∇E_d(w)

where E_D(w) = (1/2) Σ_{d∈D} (t_d − o_d)²  vs.  E_d(w) = (1/2)(t_d − o_d)²
Standard Gradient Descent vs. Stochastic Gradient Descent
• Stochastic Gradient Descent can approximate Standard Gradient Descent arbitrarily closely if η made small enough;
• Stochastic mode can converge faster;
• Stochastic Gradient descent can sometimes avoid falling into local minima.
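The stochastic variant updates the weights after every single example rather than once per pass over D; a sketch for a linear unit (names and the example data are illustrative):

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=200, seed=0):
    """Incremental (stochastic) gradient descent for a linear unit o = w . x.
    examples: list of (x, t); returns weights w, where w[0] is the bias."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = (1.0,) + tuple(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):
                w[i] += eta * (t - o) * xs[i]   # update immediately, per example
    return w

# Fit the linear target t = 1 + 2x
data = [((x,), 1 + 2 * x) for x in (-1.0, 0.0, 1.0, 2.0)]
w = stochastic_gradient_descent(data)  # w approaches [1.0, 2.0]
```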
(3). Perceptron training rule vs. gradient descent

Perceptron training rule
• Thresholded perceptron output: o(x) = sgn(w · x)
• Requires that the examples be linearly separable
• Converges to a hypothesis that perfectly classifies the training data

Gradient descent
• Unthresholded linear output: o(x) = w · x
• Works regardless of whether the training data are linearly separable
• Converges asymptotically toward the minimum-error hypothesis
4.2 Multilayer Networks
[figures: a single perceptron vs. a multilayer network]
Perceptrons can only express linear decision surfaces; we need to express a rich variety of nonlinear decision surfaces.
Sigmoid unit – a differentiable threshold unit

Sigmoid function:
σ(x) = 1 / (1 + e^(−kx))  (here k = 1)

Property:
dσ(x)/dx = σ(x)(1 − σ(x))

Output:
o = σ(net) = 1 / (1 + e^(−net)), where net = Σ_i w_i x_i = w · x

Why do we use the sigmoid instead of a linear unit or sgn(x)?
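The derivative property is what makes the backpropagation updates cheap to compute; a quick numerical check (a sketch, with illustrative names):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # property: sigma'(x) = sigma(x) * (1 - sigma(x))

# Verify against a central finite difference at an arbitrary point
x, h = 0.7, 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numeric - sigmoid_prime(x)) < 1e-8)  # True
```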
The main idea of the backpropagation algorithm:
• compute the input and output of each unit forward;
• modify the weights of unit pairs backward with respect to the errors.
The Backpropagation Algorithm
Error definition:

Batch mode:
E_D(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²

Individual (incremental) mode:
E_d(w) = (1/2) Σ_{k∈outputs} (t_kd − o_kd)²
Notation (for unit j):
• x_ji = the ith input to unit j
• w_ji = the weight associated with the ith input to unit j
• net_j = Σ_i w_ji x_ji (the weighted sum of inputs for unit j)
• o_j = the output computed by unit j
• t_j = the target output for unit j
• outputs = the set of units in the final layer
• Ds(j) = the set of units whose immediate inputs include the output of j
Training rule for Output Unit weights
∂E_d/∂w_ji = (∂E_d/∂net_j)(∂net_j/∂w_ji) = (∂E_d/∂net_j) x_ji

∂E_d/∂net_j = (∂E_d/∂o_j)(∂o_j/∂net_j)

∂E_d/∂o_j = ∂/∂o_j (1/2)(t_j − o_j)² = −(t_j − o_j)

∂o_j/∂net_j = o_j (1 − o_j)

So

∂E_d/∂net_j = −(t_j − o_j) o_j (1 − o_j)

and

Δw_ji = −η ∂E_d/∂w_ji = η (t_j − o_j) o_j (1 − o_j) x_ji
Training Rule for Hidden Unit Weights
Denote the error term for unit j by

δ_j = −∂E_d/∂net_j, so that Δw_ji = η δ_j x_ji

For a hidden unit j, the error propagates through every unit k in Ds(j):

∂E_d/∂net_j = Σ_{k∈Ds(j)} (∂E_d/∂net_k)(∂net_k/∂net_j)
            = Σ_{k∈Ds(j)} (−δ_k)(∂net_k/∂o_j)(∂o_j/∂net_j)
            = Σ_{k∈Ds(j)} (−δ_k) w_kj o_j (1 − o_j)

so we have

δ_j = o_j (1 − o_j) Σ_{k∈Ds(j)} δ_k w_kj

and

Δw_ji = η δ_j x_ji
Backpropagation Algorithm
• Initialize all weights to small random numbers
• Until termination condition is met Do
For each training example Do
//Propagate the input forward
1. Input the training example to the network and compute the network outputs
//Propagate the errors backward
2. For each output unit k:
   δ_k ← o_k (1 − o_k)(t_k − o_k)
3. For each hidden unit h:
   δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_kh δ_k
//Update the weights
4. Update each network weight:
   w_ji ← w_ji + Δw_ji, where Δw_ji = η δ_j x_ji
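The algorithm can be sketched for a single hidden layer of sigmoid units (a minimal illustration; all function names and the XOR example are mine, not from the slides):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_net(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    r = lambda: rng.uniform(-0.5, 0.5)
    # w_h[h][i]: weight from input i to hidden unit h; index 0 is the bias (x0 = 1)
    w_h = [[r() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[r() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return w_h, w_o

def forward(net, x):
    w_h, w_o = net
    xs = [1.0] + list(x)
    h = [sigmoid(sum(w * v for w, v in zip(ws, xs))) for ws in w_h]
    hs = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(ws, hs))) for ws in w_o]
    return xs, hs, o

def train(net, examples, eta=0.5, epochs=5000):
    w_h, w_o = net
    for _ in range(epochs):
        for x, t in examples:
            xs, hs, o = forward(net, x)
            # delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit k
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # delta_h = o_h (1 - o_h) sum_k w_kh delta_k for each hidden unit h
            d_h = [hs[h + 1] * (1 - hs[h + 1]) *
                   sum(w_o[k][h + 1] * d_o[k] for k in range(len(d_o)))
                   for h in range(len(w_h))]
            # w_ji <- w_ji + eta * delta_j * x_ji
            for k, ws in enumerate(w_o):
                for i in range(len(ws)):
                    ws[i] += eta * d_o[k] * hs[i]
            for h, ws in enumerate(w_h):
                for i in range(len(ws)):
                    ws[i] += eta * d_h[h] * xs[i]
    return net

# Learn XOR, which a single perceptron cannot represent
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
net = train(make_net(2, 3, 1), xor)
```

With enough epochs this typically drives the XOR training error toward zero, though backpropagation can also settle in a local minimum.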
Hidden layer Representations
Convergence and local minima
• Converge to some local minimum and not necessarily to the global minimum error
• Use stochastic gradient descent rather than the standard gradient descent
• Initialization influences convergence. Train multiple networks with different random initial weights over the same data, then select the best one
• Training can take thousands of iterations -->slow
• Initialize the weights near zero, so the initial network is nearly linear; increasingly nonlinear functions become possible as training progresses
• Add a momentum term to speed convergence
Δw_ji(n) = η δ_j x_ji + α Δw_ji(n − 1), where 0 ≤ α < 1 is the momentum
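In code, momentum just mixes a fraction α of the previous update into the current one (a sketch; the names and numbers are illustrative):

```python
def momentum_step(w, grad_step, prev_delta, alpha=0.5):
    """w: weights; grad_step: the plain gradient update (eta * delta_j * x_ji);
    prev_delta: the update applied on the previous iteration; 0 <= alpha < 1."""
    delta = [g + alpha * p for g, p in zip(grad_step, prev_delta)]
    w = [wi + di for wi, di in zip(w, delta)]
    return w, delta      # keep delta around for the next iteration

w, d = momentum_step([0.0, 0.0], [0.125, -0.25], [0.0625, 0.0625])
print(w)  # [0.15625, -0.21875]
```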
Expressive Capabilities of ANNs
• Every boolean function can be represented by network with single hidden layer
• Every bounded continuous function can be approximated with arbitrarily small error by network with one hidden layer
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers
• A network with more hidden layers may yield higher precision; however, the possibility of converging to a local minimum increases as well.
When to Consider Neural Networks
• Input is high-dimensional, discrete or real-valued
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Overfitting in ANNs
Strategy applied to avoid overfitting
• A poor strategy: continue training until the error on the training set falls below some threshold
• A good indicator: the number of iterations that produces the lowest error over the validation set
• Once the current weights give significantly higher error over the validation set than the stored (best-so-far) weights, terminate training!
Alternative Error Functions
Recurrent Networks
Thank you !