Supervised Learning
Artificial Neural Networks and Single Layer Perceptron
Supervised Learning
• Learning from correct answers
Supervised Learning System
• Inputs → Outputs
• Training info: desired (target) outputs
Supervised Learning Methods
• Artificial neural networks
• Decision trees
• Gaussian process regression
• Naive Bayes classifier
• Nearest neighbor algorithm
• Support vector machines
• Random forests
• Ensembles of classifiers
• …
Applications of Supervised Learning
• Handwriting recognition
• Object recognition
• Optical character recognition
• Spam detection
• Pattern recognition
• Speech recognition
• Prediction models
• …
Inspiration: The Human Brain
• The brain doesn’t seem to have a CPU
• Instead, it has lots of simple, parallel, asynchronous units, called neurons
• There are about 10^11 (100 billion) neurons of about 20 types
• For comparison: the NVIDIA Tesla K80 GPU has 4992 cores
Neurons
• A neuron is a single cell that has a number of relatively short fibers, called dendrites, and one long fiber, called an axon
Synapses
• A synapse is a structure that permits a neuron (or nerve cell) to pass an electrical signal to another cell
Signal Generation: Action Potential
• Neurotransmitters move across the synapse and change the electrical potential of the cell body
From Real to Artificial Neurons
[Diagram of an artificial neuron: inputs u1, …, un are multiplied by weights w1, …, wn and summed into the activation a = Σi=1..n wi ui; an activation function with threshold θ turns the activation into the output v]
Artificial Neurons
• Threshold Logic Unit (TLU) proposed by Warren McCulloch and Walter Pitts in 1943
– Initially with binary inputs and outputs
– Heaviside function as threshold
• Perceptron developed by Frank Rosenblatt in 1957
– Arbitrary inputs and outputs
– Linear transfer function
Activation Functions
[Plots of output v against activation a for four activation functions: threshold, linear, piece-wise linear, sigmoid]
Threshold Logic Unit (TLU)
• Inputs u1, …, un with weights w1, …, wn
• Activation: a = Σi=1..n wi ui
• Output: v = 1 if a ≥ θ, v = 0 if a < θ
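As a sketch (the function name and structure are our own, not from the slides), the TLU's input-output behavior in Python:

```python
# Minimal sketch of a Threshold Logic Unit (TLU).
def tlu(u, w, theta):
    """Fire (return 1) iff the weighted sum of the inputs reaches the threshold."""
    a = sum(wi * ui for wi, ui in zip(w, u))  # activation a = sum_i w_i u_i
    return 1 if a >= theta else 0

print(tlu([1, 1], [1, 1], 1.5))  # activation 2 >= 1.5 -> 1
```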
Classification
• Identifying to which of a set of categories a new observation belongs, on the basis of a set of data containing observations whose category is known (Wikipedia)
[Scatter plot of Variable 2 against Variable 1 with three labeled classes (Class 1, Class 2, Class 3) and a new observation marked “? (1, 2 or 3)”]
Decision Surface of a TLU
[Plot in the (u1, u2) plane: the decision line w1 u1 + w2 u2 = θ separates the inputs with activation > θ (classified as 1) from those with activation < θ (classified as 0)]
Scalar Products and Projections
• w•u = |w||u| cos ϕ, where ϕ is the angle between w and u
• The projection of u onto w has length |uw| = |u| cos ϕ, so w•u = |w||uw|
• Hence w•u > 0 for ϕ < 90°, w•u = 0 for ϕ = 90°, and w•u < 0 for ϕ > 90°
Geometric Interpretation
[Plot: the weight vector w is perpendicular to the decision line w1 u1 + w2 u2 = θ; v = 1 on the side w points to, v = 0 on the other]
• On the decision line: w•u = |w||u| cos ϕ = |w||uw| = θ
• So the projection of any point on the line onto w has length |uw| = θ/|w|, the line’s distance from the origin
Decision Surface: Summary
• In n dimensions the relation w•u = θ defines an (n-1)-dimensional hyperplane, which is perpendicular to the weight vector w
• On one side of the hyperplane (w•u > θ) all patterns are classified by the TLU as “1”, while those that get classified as “0” lie on the other side of the hyperplane
Example: Logical AND
u1 u2 | a | v
 0  0 | 0 | 0
 0  1 | 1 | 0
 1  0 | 1 | 0
 1  1 | 2 | 1
Weights and threshold: w1 = 1, w2 = 1, θ = 1.5
Example: Logical OR
u1 u2 | a | v
 0  0 | 0 | 0
 0  1 | 1 | 1
 1  0 | 1 | 1
 1  1 | 2 | 1
Weights and threshold: w1 = 1, w2 = 1, θ = 0.5
Example: Logical NOT
u1 | a  | v
 0 |  0 | 1
 1 | -1 | 0
Weight and threshold: w1 = -1, θ = -0.5
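The weight/threshold choices in the three examples above can be checked with a small script (our own sketch, not from the slides):

```python
# Verify the slides' weight/threshold choices for AND, OR and NOT with a TLU.
def tlu(u, w, theta):
    a = sum(wi * ui for wi, ui in zip(w, u))
    return 1 if a >= theta else 0

for u1 in (0, 1):
    for u2 in (0, 1):
        assert tlu([u1, u2], [1, 1], 1.5) == (u1 and u2)  # AND: w=(1,1), theta=1.5
        assert tlu([u1, u2], [1, 1], 0.5) == (u1 or u2)   # OR:  w=(1,1), theta=0.5
for u1 in (0, 1):
    assert tlu([u1], [-1], -0.5) == 1 - u1                # NOT: w1=-1, theta=-0.5
print("all truth tables reproduced")
```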
Threshold as Weight
• Treat the threshold as an extra weight wn+1 = θ attached to a constant input un+1 = -1
• Activation: a = Σi=1..n+1 wi ui
• Output: v = 1 if a ≥ 0, v = 0 if a < 0
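The trick can be sketched as follows (function name is our own): the input vector is extended with a constant -1 and the activation is compared against 0:

```python
# Threshold-as-weight trick: append u_{n+1} = -1 and fold theta in as w_{n+1}.
def tlu_with_bias(u, w_aug):
    a = sum(wi * ui for wi, ui in zip(w_aug, u + [-1]))  # a = sum_{i=1}^{n+1} w_i u_i
    return 1 if a >= 0 else 0                            # threshold is now 0

# AND gate again: w = (1, 1), theta = 1.5 becomes w_aug = (1, 1, 1.5)
print([tlu_with_bias([u1, u2], [1, 1, 1.5]) for u1 in (0, 1) for u2 in (0, 1)])
# -> [0, 0, 0, 1]
```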
Geometric Interpretation
[Plot: with the threshold folded into the weights, the decision line becomes w•u = 0 and passes through the origin; v = 1 on the side w points to, v = 0 on the other]
Training ANNs
• Training set S of examples {u, vt}
– u is an input vector and
– vt the desired target output
– Example: Logical AND
S = {((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)}
• Iterative process
– present a training example u
– compute network output v
– compare output v with target vt
– adjust weights w and threshold θ
• Learning rule
– Specifies how to change the weights w and threshold θ of the network as a function of the inputs u, output v and target vt
Presentation of Training Samples
• Presenting all training examples once to the ANN is called an epoch
• Training examples can be presented in
– Fixed order (1,2,3,…,M) [on/off-line]
– Randomly permuted order (5,2,7,…,3) [off-line]
– Completely random order (4,1,7,1,5,4,…) [off-line]
Perceptron Learning Rule
• w' = w + µ (vt - v) u
– w'i = wi + ∆wi = wi + µ (vt - v) ui (i = 1..n+1), with wn+1 = θ and un+1 = -1
– The parameter µ is called the learning rate. It determines the magnitude of the weight updates ∆wi
– If the output is correct (vt = v) the weights are not changed (∆wi = 0)
– If the output is incorrect (vt ≠ v) the weights wi are changed such that the output of the TLU with the new weights w'i moves closer to the target vt for the input u
Adjusting the Weight Vector
• Target vt = 1 but output v = 0 (ϕ > 90°): w' = w + µu, i.e. move w in the direction of u
• Target vt = 0 but output v = 1 (ϕ < 90°): w' = w - µu, i.e. move w away from the direction of u
Perceptron Learning Rule (cont.)
• If we do not include the threshold as an input we use the following learning rule:
w' = w + µ (vt - v) u
and
θ' = θ - µ (vt - v)
Procedure of Perceptron Training
• While v ≠ vt for at least one training vector pair
– For each training vector pair (u, vt)
• calculate the output v of the TLU for input u
• if v ≠ vt
update weight vector w: w' = w + µ (vt - v) u
update threshold θ: θ' = θ - µ (vt - v) (if not included in the weights)
else
do nothing
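The procedure can be sketched in Python, with the threshold folded into the weights as an extra input -1; the training set and learning rate follow the Logical AND example (the value µ = 0.1 is our own choice):

```python
# Perceptron training sketch on the AND training set S.
mu = 0.1
S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0, 0.0]  # w1, w2 and w3 = theta (threshold as weight)

def output(u, w):
    a = sum(wi * ui for wi, ui in zip(w, list(u) + [-1]))
    return 1 if a >= 0 else 0

converged = False
while not converged:              # one pass over S is an epoch
    converged = True
    for u, vt in S:
        v = output(u, w)
        if v != vt:               # update only on mistakes
            converged = False
            w = [wi + mu * (vt - v) * ui for wi, ui in zip(w, list(u) + [-1])]

print([output(u, w) for u, _ in S])  # -> [0, 0, 0, 1]
```

The loop terminates because AND is linearly separable, so the perceptron convergence theorem applies.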
Demo: Perceptron with TLU
[Two scatter plots over the range -5..5 showing the training data and the decision line found by the perceptron]
TLU Convergence
• The algorithm converges to the correct classification
– if the training data is linearly separable
– given a sufficiently small learning rate µ
• The solution w is not unique: if w•u = 0 defines a hyperplane, then w' = k w defines the same hyperplane
Multiple TLUs
[Network with inputs u1, …, un feeding a layer of TLUs with outputs v1, …, vn; the weight wji connects ui with vj]
• Learning rule: w'ji = wji + µ (vtj - vj) ui
Example: Character Recognition
• 26 classes: A, B, C, …, Z
• Target output is a vector
– e.g., vtA = [1 0 0 … 0], vtB = [0 1 0 … 0], …
[Network with inputs u1, …, un and 26 outputs v1, …, v26, one per letter A, B, …, Z]
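A minimal sketch of the layer update rule w'ji = wji + µ (vtj - vj) ui, applied to a made-up two-input, two-class toy task with one-hot targets (not the 26-letter task itself):

```python
# A layer of TLUs sharing the same inputs; w[j][i] connects u_i to output v_j.
mu = 0.5
# Toy task (our own): inputs augmented with u_{n+1} = -1, one-hot targets.
S = [((1, 0, -1), (1, 0)), ((0, 1, -1), (0, 1))]
w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

def outputs(u, w):
    return [1 if sum(wji * ui for wji, ui in zip(wj, u)) >= 0 else 0 for wj in w]

for _ in range(10):                        # a few epochs suffice here
    for u, vt in S:
        v = outputs(u, w)
        for j in range(len(w)):            # w'_ji = w_ji + mu (vt_j - v_j) u_i
            for i in range(len(u)):
                w[j][i] += mu * (vt[j] - v[j]) * u[i]

print([outputs(u, w) for u, _ in S])  # -> [[1, 0], [0, 1]]
```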
Activation Functions
[Recap of the four activation functions: threshold, linear, piece-wise linear, sigmoid]
Perceptron with Linear Function
[Diagram: inputs u1, …, un with weights w1, …, wn]
• The output equals the activation: v = a = Σi=1..n wi ui
• Note: From here on we write t instead of vt for the target
Gradient Descent Learning Rule
• Consider a linear unit without threshold and with continuous output v (not just -1/1 or 0/1)
v = w1 u1 + … + wn un
• Train the wi such that they minimize the squared error
E[w1,…,wn] = 1/2 Σd∈D (td - vd)^2
– where D is the set of training examples and
– td the target output for example d
Gradient Descent Learning Rule (cont.)
[Plot of the error surface E[w1,w2] over (w1,w2), with one gradient step from (w1,w2) to (w1+∆w1, w2+∆w2)]
• Gradient: ∇E[w] = [∂E/∂w1, …, ∂E/∂wn]
• Update: ∆w = -µ ∇E[w]
• For a single example d with Ed[w] = 1/2 (td - vd)^2:
-1/µ ∆wi = ∂Ed/∂wi
= ∂/∂wi 1/2 (td - vd)^2
= ∂/∂wi 1/2 (td - Σj wj uj)^2
= (td - vd) (-ui)
• We get: ∆wi = µ (td - vd) ui, also known as the Delta rule
Incremental vs. Batch Gradient Descent
• Incremental (stochastic) mode: w' = w - µ ∇Ed[w], stepping over individual training examples d
Ed[w] = 1/2 (td - vd)^2
• Batch mode: w' = w - µ ∇ED[w], stepping over the entire data D
ED[w] = 1/2 Σd (td - vd)^2
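A sketch of the incremental mode, applying the delta rule ∆wi = µ (td - vd) ui to a linear unit on a made-up noiseless target t = 2·u1 + 1 (data, learning rate and epoch count are our own choices):

```python
# Incremental (stochastic) delta rule on a linear unit with bias input -1.
mu = 0.1
D = [((u1, -1), 2 * u1 + 1) for u1 in (0.0, 0.5, 1.0, 1.5, 2.0)]

w = [0.0, 0.0]
for _ in range(200):                                      # epochs
    for u, t in D:
        v = sum(wi * ui for wi, ui in zip(w, u))          # linear output v = w . u
        w = [wi + mu * (t - v) * ui for wi, ui in zip(w, u)]  # dw_i = mu (t - v) u_i

print([round(wi, 2) for wi in w])  # close to [2.0, -1.0], i.e. v = 2*u1 + 1
```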
Perceptron vs. Gradient Descent Rule
• Perceptron rule
w'i = wi + µ (vt - v) ui
– derived from manipulation of the decision surface
• Gradient descent rule
w'i = wi + µ (td - vd) ui
– derived from minimization of the error function
E[w1,…,wn] = 1/2 Σd (td - vd)^2
by means of gradient descent
Demo: Gradient Descent Learning Rule
[Two scatter plots over the range -5..5 showing the training data and the learned decision line]
Linear Regression
• Common statistical method
• Describes the relationship between variables
• Predicts one variable knowing the others
• We make the assumption that there is a linear relationship
– between an outcome (dependent variable, response variable) and a predictor (independent variable, explanatory variable, feature) or
– between one variable and several other variables
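As a sketch, simple linear regression with a single predictor can be fitted with the standard least-squares formulas; the data values here are made up for illustration:

```python
# Simple linear regression via least squares: fit y = slope * x + intercept.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]          # made-up data, roughly y = 2x

n = len(xs)
mx = sum(xs) / n                         # mean of the predictor
my = sum(ys) / n                         # mean of the outcome
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

print(round(slope, 2), round(intercept, 2))  # -> 1.97 0.11
```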
Describing Relationships
[Two scatter plots of iris measurements against sepal length (cm): sepal width (cm) shows a very weak relationship / very low correlation (cor. coef. = -0.1094), while petal width (cm) shows a strong relationship / high correlation (cor. coef. = 0.8180)]
Making Predictions
[Two scatter plots with unknown values marked “?”: son’s height (cm) against father’s height (cm) illustrates predicting one variable from another, and profit (Euro) against days illustrates predicting future outcomes]
Demo: Linear Regression
[Two plots of f(x) against x on [0, 1] showing data points and the fitted regression line]
Perceptron vs. Gradient Descent
• Perceptron (TLU)
– Guaranteed to succeed if
• training examples are linearly separable
• given a sufficiently small learning rate µ
– Robust against outliers
• Gradient descent (linear transfer function)
– Guaranteed to converge to the hypothesis with minimum squared error
• even when the training data are not separable by a hyperplane
• given a sufficiently small learning rate µ
– Very sensitive to outliers
Activation Functions
[Recap of the four activation functions: threshold, linear, piece-wise linear, sigmoid]
Perceptron with Sigmoid Function
[Diagram: inputs u1, …, un with weights w1, …, wn]
• Activation: a = Σi=1..n wi ui
• Output: v = σ(a) = 1/(1+e^-a)
Gradient Descent Rule for Sigmoid Activation Function
[Plots of σ(a) and its derivative σ'(a)]
• Gradient: ∇E[w] = [∂E/∂w1, …, ∂E/∂wn], update ∆w = -µ ∇E[w], with Ed[w] = 1/2 (td - vd)^2
• Sigmoid and its derivative:
vd = σ(a) = 1/(1+e^-a)
σ'(a) = e^-a/(1+e^-a)^2 = σ(a) (1-σ(a))
• Derivation:
-1/µ ∆wi = ∂E/∂wi
= ∂/∂wi 1/2 (td - vd)^2
= ∂/∂wi 1/2 (td - σ(Σj wj uj))^2
= (td - vd) σ'(Σj wj uj) (-ui)
• We get: ∆wi = µ vd (1-vd)(td - vd) ui
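The resulting update ∆wi = µ vd(1-vd)(td - vd) ui can be sketched as follows, here training a single sigmoid neuron on Logical OR (the training set, learning rate and epoch count are our own choices):

```python
import math

# Gradient descent with the sigmoid delta rule dw_i = mu v(1-v)(t-v) u_i,
# training one sigmoid neuron on OR; inputs augmented with -1 for the threshold.
mu = 1.0
S = [((0, 0, -1), 0), ((0, 1, -1), 1), ((1, 0, -1), 1), ((1, 1, -1), 1)]
w = [0.0, 0.0, 0.0]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

for _ in range(2000):                     # incremental mode, many epochs
    for u, t in S:
        v = sigmoid(sum(wi * ui for wi, ui in zip(w, u)))
        w = [wi + mu * v * (1 - v) * (t - v) * ui for wi, ui in zip(w, u)]

# Round the continuous outputs to get hard class labels.
print([round(sigmoid(sum(wi * ui for wi, ui in zip(w, u)))) for u, _ in S])  # -> [0, 1, 1, 1]
```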
Demo: Gradient Descent with Sigmoid Activation Function
[Two scatter plots over the range -5..5 showing the training data and the learned decision boundary]
Summary
• Threshold Logic Unit
– works only for linearly separable patterns
– robust to outliers
• Linear Neuron
– converges to minimal squared error even when patterns are not linearly separable
– very sensitive to outliers
• Sigmoid Neuron
– combination of TLU and linear neuron
– converges to minimal squared error even when patterns are not linearly separable
– robust to outliers