TRANSCRIPT
Perceptron
Inner-product scalar Perceptron
Perceptron learning rule
XOR problem and linearly separable patterns
Gradient descent and stochastic approximation to gradient descent
LMS, Adaline
Inner-product
A measure of the projection of one vector onto another:
$net = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \cdot \|\vec{x}\| \cdot \cos(\theta)$
$net = \sum_{i=1}^{n} w_i \cdot x_i$
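As a small illustration (not part of the slides; the vectors are made up), the two expressions for net agree numerically:

```python
import numpy as np

# Example vectors (chosen arbitrarily for illustration)
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 0.5])

# net as the inner product (sum of element-wise products)
net_sum = float(np.dot(w, x))

# net via vector norms and the angle between w and x
cos_theta = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
net_geom = np.linalg.norm(w) * np.linalg.norm(x) * cos_theta

print(net_sum, net_geom)  # both give 1.5
```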
Example
Activation function
$o = f(net) = f\left(\sum_{i=1}^{n} w_i \cdot x_i\right)$
Sign function:
$f(x) := sgn(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$
Sigmoid function:
$f(x) := \sigma(x) = \dfrac{1}{1 + e^{-ax}}$
Threshold (step) function:
$f(x) := \varphi(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$
Piecewise-linear function:
$f(x) := \varphi(x) = \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } 0.5 > x > -0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}$
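A small sketch (mine, not from the slides) of these activation functions in Python:

```python
import math

def sgn(x):
    """Sign function: +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1

def sigmoid(x, a=1.0):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * x))

def step(x):
    """Threshold (step) function: 1 for x >= 0, 0 otherwise."""
    return 1 if x >= 0 else 0

def piecewise_linear(x):
    """Piecewise-linear function saturating at 0 and 1, as on the slide."""
    if x >= 0.5:
        return 1
    if x <= -0.5:
        return 0
    return x  # linear region between -0.5 and 0.5

for f in (sgn, sigmoid, step, piecewise_linear):
    print(f.__name__, [round(f(v), 3) for v in (-1.0, -0.25, 0.0, 0.25, 1.0)])
```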
Graphical representation
Similarity to real neurons...
$10^{10}$ neurons, $10^{4\text{--}5}$ connections per neuron
Perceptron: Linear threshold unit (LTU)
[Figure: LTU with inputs $x_1, x_2, \ldots, x_n$, weights $w_1, w_2, \ldots, w_n$, bias weight $w_0$ with fixed input $x_0 = 1$, and output $o$]
$net = \sum_{i=0}^{n} w_i \cdot x_i$
$o = sgn(net) = \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}$
McCulloch-Pitts model of a neuron
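A minimal sketch of the LTU output computation, with the bias handled as $w_0$ and $x_0 = 1$ (weights and inputs below are made up):

```python
import numpy as np

def ltu_output(w, x):
    """Linear threshold unit: o = sgn(sum_{i=0..n} w_i * x_i).

    w includes the bias weight w0; x is the raw input, to which the
    constant x0 = 1 is prepended.
    """
    x_ext = np.concatenate(([1.0], x))      # x0 = 1
    net = np.dot(w, x_ext)                  # net = sum_i w_i * x_i
    return 1 if net >= 0 else -1

# Hypothetical weights implementing an AND-like decision
w = np.array([-1.5, 1.0, 1.0])              # w0, w1, w2
print(ltu_output(w, np.array([1.0, 1.0])))  # prints  1
print(ltu_output(w, np.array([1.0, 0.0])))  # prints -1
```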
The goal of a perceptron is to correctly classify the set of patterns D = {x1, x2, ..., xm} into one of the classes C1 and C2.
The output for class C1 is o = 1 and for C2 it is o = -1.
For n = 2
Linearly separable patterns
$o = sgn\left(\sum_{i=0}^{n} w_i x_i\right)$
$\sum_{i=0}^{n} w_i x_i \ge 0$ for class C1
$\sum_{i=0}^{n} w_i x_i < 0$ for class C2
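For n = 2 the decision rule reduces to a line in the $(x_1, x_2)$ plane, which is why only linearly separable pattern sets can be classified; the rewriting below is implied by the slides rather than shown on them:

```latex
% Decision boundary for n = 2: the set where the weighted sum changes sign
w_0 + w_1 x_1 + w_2 x_2 = 0
\;\Longleftrightarrow\;
x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2} \qquad (w_2 \neq 0)
% Points with w_0 + w_1 x_1 + w_2 x_2 \ge 0 are assigned to C1, the rest to C2.
```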
Perceptron learning rule
Consider linearly separable problems.
How do we find appropriate weights?
Check whether the output o for a pattern belongs to the desired class, i.e. has the desired value d.
$\eta$ is called the learning rate, $0 < \eta \le 1$
$w^{new} = w^{old} + \Delta w$
$\Delta w = \eta\,(d - o)\,x$
In supervised learning the network has its output compared with known correct answers: supervised learning = learning with a teacher.
(d - o) plays the role of the error signal.
The algorithm converges to the correct classification if the training data is linearly separable and $\eta$ is sufficiently small.
When assigning a value to $\eta$ we must keep in mind two conflicting requirements:
Averaging of past inputs to provide stable weight estimates, which requires a small $\eta$
Fast adaptation with respect to real changes in the underlying distribution of the process responsible for the generation of the input vector x, which requires a large $\eta$
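A compact sketch of the perceptron learning rule in Python (the toy dataset, learning rate, and epoch limit are my own choices for illustration):

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=100):
    """Perceptron learning rule: w <- w + eta * (d - o) * x.

    X: patterns, one per row (without the bias input).
    d: desired outputs in {+1, -1}.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, d):
            o = 1 if np.dot(w, x) >= 0 else -1       # perceptron output
            if o != target:
                w += eta * (target - o) * x          # weight update
                errors += 1
        if errors == 0:                              # converged on separable data
            break
    return w

# Linearly separable toy problem: class +1 only for the pattern (1, 1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1])
print(train_perceptron(X, d))
```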
Several nodes
$o_1 = sgn\left(\sum_{i=0}^{n} w_{1i}\, x_i\right)$
$o_2 = sgn\left(\sum_{i=0}^{n} w_{2i}\, x_i\right)$
$o_j = sgn\left(\sum_{i=0}^{n} w_{ji}\, x_i\right)$
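With several output nodes the weights form a matrix W with one row $w_j$ per node, so all outputs can be computed at once; a small sketch (the weight values are arbitrary):

```python
import numpy as np

def layer_outputs(W, x):
    """o_j = sgn(sum_i W[j, i] * x_i) for every output node j.

    W has one row of weights per node; column 0 holds the bias weights w_{j0}.
    """
    x_ext = np.concatenate(([1.0], x))           # x0 = 1
    net = W @ x_ext                              # one net value per node
    return np.where(net >= 0, 1, -1)

W = np.array([[-0.5, 1.0, 0.0],                  # node 1
              [-0.5, 0.0, 1.0]])                 # node 2
print(layer_outputs(W, np.array([1.0, 0.0])))    # [ 1 -1]
```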
Constructions
Frank Rosenblatt
1928-1971
Rosenblatt's bitter rival and professional nemesis was Marvin Minsky of MIT
Minsky despised Rosenblatt, hated the concept of the perceptron, and wrote several polemics against him
For years Minsky crusaded against Rosenblatt on a very nasty and personal level, including contacting every group who funded Rosenblatt's research to denounce him as a charlatan, hoping to ruin Rosenblatt professionally and to cut off all funding for his research in neural nets
XOR problem and Perceptron
By Minsky and Papert in the mid-1960s
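The figures for this slide did not survive the transcript; a short standard argument (not from the slides) for why a single threshold unit with 0/1 targets cannot compute XOR is:

```latex
% Suppose a single unit with weights w_0, w_1, w_2 computed XOR of x_1, x_2 \in \{0,1\}:
w_0 < 0                    % input (0,0) must give output 0
w_0 + w_1 \ge 0            % input (1,0) must give output 1
w_0 + w_2 \ge 0            % input (0,1) must give output 1
w_0 + w_1 + w_2 < 0        % input (1,1) must give output 0
% Adding the two middle inequalities gives 2w_0 + w_1 + w_2 \ge 0,
% hence w_0 + w_1 + w_2 \ge -w_0 > 0, contradicting the last line.
% Therefore no line w_0 + w_1 x_1 + w_2 x_2 = 0 separates the XOR patterns.
```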
Gradient Descent
To understand it, consider a simpler linear unit, where
$o = \sum_{i=0}^{n} w_i \cdot x_i$
Let's learn the $w_i$ that minimize the squared error over D = {(x1, t1), (x2, t2), ..., (xd, td), ..., (xm, tm)}
(t for target)
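The transcript does not show the error function explicitly; the squared error minimized by the update rule below is the usual one:

```latex
E(\vec{w}) \;=\; \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
```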
Error surface for different hypotheses, plotted over w0 and w1 (2 dimensions)
We want to move the weight vector in the direction that decreases E:
$w_i = w_i + \Delta w_i$, i.e. $\vec{w} = \vec{w} + \Delta \vec{w}$
The gradient:
$\nabla E(\vec{w}) = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$, with training rule $\Delta \vec{w} = -\eta\, \nabla E(\vec{w})$
Differentiating E:
$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{id})$
Update rule for gradient descent:
$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$
Gradient Descent
Gradient-Descent(training_examples, η)
Each training example is a pair of the form <(x1,…,xn), t>, where (x1,…,xn) is the vector of input values and t is the target output value; η is the learning rate (e.g. 0.1).
Initialize each wi to some small random value
Until the termination condition is met, Do
  Initialize each Δwi to zero
  For each <(x1,…,xn), t> in training_examples, Do
  • Input the instance (x1,…,xn) to the linear unit and compute the output o
  • For each linear unit weight wi, Do Δwi = Δwi + η (t - o) xi
  For each linear unit weight wi, Do
  • wi = wi + Δwi
$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$
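A direct Python transcription of the procedure above (the toy data, learning rate, and epoch-count termination condition are placeholders of my choosing):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.1, epochs=500):
    """Batch gradient descent for a linear unit o = sum_i w_i * x_i.

    training_examples: list of (x, t) pairs, x without the bias input.
    """
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n + 1)     # small random initial weights
    for _ in range(epochs):                       # termination: fixed epoch count
        delta_w = np.zeros_like(w)                # initialize each Δw_i to zero
        for x, t in training_examples:
            x_ext = np.concatenate(([1.0], x))    # x0 = 1
            o = np.dot(w, x_ext)                  # linear unit output
            delta_w += eta * (t - o) * x_ext      # Δw_i += η (t - o) x_i
        w += delta_w                              # w_i += Δw_i
    return w

# Toy data from t = 2 + 3*x1, so the learned weights should approach [2, 3]
examples = [(np.array([x]), 2.0 + 3.0 * x) for x in np.linspace(-1, 1, 11)]
print(gradient_descent(examples, eta=0.05))
```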
Stochastic Approximation to gradient descent
The gradient descent training rule updates the weights after summing over all the training examples in D.
Stochastic gradient descent approximates gradient descent by updating the weights incrementally, calculating the error for each example.
Known as the delta rule or LMS (least mean squares) weight update; also the Adaline rule, used for adaptive filters (Widrow and Hoff, 1960).
$\Delta w_i = \eta\, (t - o)\, x_i$
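The incremental version updates after every single example instead of after a full pass; a sketch under the same toy assumptions as above:

```python
import numpy as np

def delta_rule(training_examples, eta=0.05, epochs=200):
    """LMS / delta rule: Δw_i = η (t - o) x_i, applied per example."""
    n = len(training_examples[0][0])
    w = np.zeros(n + 1)
    for _ in range(epochs):
        for x, t in training_examples:
            x_ext = np.concatenate(([1.0], x))   # x0 = 1
            o = np.dot(w, x_ext)                 # linear unit output
            w += eta * (t - o) * x_ext           # incremental weight update
    return w

examples = [(np.array([x]), 2.0 + 3.0 * x) for x in np.linspace(-1, 1, 11)]
print(delta_rule(examples))                      # approaches [2, 3]
```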
LMS estimate of the weight vector:
No steepest descent
No well-defined trajectory in the weight space; instead a random trajectory (stochastic gradient descent)
Converges only asymptotically toward the minimum error
Can approximate gradient descent arbitrarily closely if η is made small enough
Summary
The perceptron training rule is guaranteed to succeed if:
the training examples are linearly separable
the learning rate η is sufficiently small
The linear unit training rule uses gradient descent or LMS and is guaranteed to converge to the hypothesis with minimum squared error:
given a sufficiently small learning rate η
even when the training data contains noise
even when the training data is not separable by H
Inner-product scalar Perceptron
Perceptron learning rule
XOR problem and linearly separable patterns
Gradient descent and stochastic approximation to gradient descent
LMS, Adaline
XOR? Multi-Layer Networks
input layer → hidden layer → output layer
output layer