COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18
Lecture 2:
• Linear Regression
• Gradient Descent
• Non-linear basis functions
LINEAR REGRESSION
MOTIVATION
Why Linear Regression?
• Simplest machine learning algorithm for regression
• Widely used in biological, behavioural and social sciences to describe and to extract relationships between variables from data
• Prediction of real-valued outputs
• Easy to implement, fast to execute
• Benchmark algorithm for comparison with more complex algorithms
• Introduction to notation and concepts that we will need again later in the course
• Data format, vector & matrix notation
• Learning from data by minimizing a cost function
• Gradient descent
• Non-linear features and basis functions
• Preparation for neural networks
Applications of (linear) regression
• Brain computer interfaces
• https://www.youtube.com/watch?v=Ae6En8-eaww
• Neuroprosthetic control
• https://www.youtube.com/watch?v=X_AI4MiY6L4
LINEAR REGRESSION
WITH ONE INPUT
A regression problem
• We want to learn to predict a person's height based on their knee height and/or arm span
• This is useful for patients who are bed bound or in a wheelchair and cannot stand to take an accurate measurement of their height
Knee height [cm]   Arm span [cm]   Height [cm]
50                 166             171
56                 172             175
52                 174             168
…                  …               …
Linear regression with one input
[Diagram: training set → learning algorithm → hypothesis $h_\theta$ with parameters $\theta$; test input $x$ → $h_\theta$ → prediction]
[Plot: body height vs. knee height; the body height to be predicted for a new knee height is marked "?"]
Example Data
[Plots: body height vs. knee height; body height vs. arm span]
m=30 data points
Example Data
[3D plot: body height as a function of knee height and arm span]
Linear regression with one input
[Plot: body height vs. knee height]

Knee height [cm]   Height [cm]
50                 171
56                 175
52                 168
…                  …

Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$ = ?
Which hypothesis is better?
In what sense is it better?
Formalization of problem
• Given m training examples $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$
• Goal: learn parameters $\theta_0, \theta_1$ such that $h_\theta(x^{(i)}) \approx y^{(i)}$ for all training examples i = 1…30.
m = 30 data points
[Plots: body height vs. knee height]
Least Squares Objective
• Minimize the error between hypothesis and targets, measured by the cost function (mean squared error):
  $J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Example: $\theta_0 = 150$, $\theta_1 = 0.6$ gives $J = 10.77$
• Example: $\theta_0 = 140$, $\theta_1 = 0.75$ gives $J = 5.94$
[Plots: body height vs. knee height with the two candidate lines]
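The mean squared error cost can be sketched in a few lines of plain Python. This is only an illustration: it uses the three table rows visible in the transcript rather than the full 30 data points, so the printed values will not match the cost values quoted on the slides. The function names and the 1/m normalization are assumptions.

```python
# Mean squared error cost for a one-input linear hypothesis.
# Only the three table rows visible in the transcript are used here.
knee_height = [50.0, 56.0, 52.0]
height = [171.0, 175.0, 168.0]


def hypothesis(theta0, theta1, x):
    """Linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Mean of the squared errors over all training examples (assumed 1/m scaling)."""
    m = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / m


# The two parameter settings shown on the slides.
cost_a = cost(150.0, 0.6, knee_height, height)
cost_b = cost(140.0, 0.75, knee_height, height)
print("J(150, 0.6)  =", cost_a)
print("J(140, 0.75) =", cost_b)
```

On these three rows the second parameter setting also has the lower cost, which is the sense in which one hypothesis is "better" than the other.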
Cost function illustrated
Properties of cost function:
• Quadratic function
• Convex function
Unique local and global minimum (under "regular" conditions)
[Plots: cost function over the parameters; the two fits with $J = 10.77$ and $J = 5.94$]
Minimizing the cost
• Two ways to find the parameters $\theta$ minimizing $J(\theta)$:
• Gradient descent
• Direct analytical solution (setting derivatives = 0)
Recall: Functions of multiple variables
• Example:
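• Partial derivatives: $\frac{\partial f}{\partial x_j}$ is the derivative of $f(x_1, \dots, x_n)$ with respect to $x_j$, with all other variables held fixed
• The gradient vector is formed from the partial derivatives: $\nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^T$ (fundamental in lecture 2)
• Chain rule (fundamental for neural networks in lecture 4)
• Functions of multiple variables with vector-valued outputs
• The Jacobian matrix is formed from the partial derivatives of each output component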
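As a small illustration of partial derivatives and the gradient vector, the sketch below approximates the gradient numerically with central differences and compares it with the analytic result. The example function `f` is hypothetical (the function on the original slide is not recoverable), and the helper name `numerical_gradient` is an assumption.

```python
def f(x, y):
    """Hypothetical example function of two variables."""
    return x ** 2 + 3.0 * x * y


def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for k in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[k] += h    # vary only variable k, keep the others fixed
        minus[k] -= h
        grad.append((func(*plus) - func(*minus)) / (2.0 * h))
    return grad


grad = numerical_gradient(f, [1.0, 2.0])
# Analytic partial derivatives: df/dx = 2x + 3y, df/dy = 3x
analytic = [2.0 * 1.0 + 3.0 * 2.0, 3.0 * 1.0]
print(grad, analytic)
```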
GRADIENT DESCENT
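Descending in the steepest direction
Gradient descent on some arbitrary cost function $J(\theta_0, \theta_1)$ …
Gradient descent algorithm
• Repeat until convergence:
  $\theta_j := \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (simultaneously updating $\theta_0$ and $\theta_1$)
• $\eta$ … learning rate ("eta")
• $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ … partial derivative of $J$ with respect to $\theta_j$
• Negative gradient = descent
Gradient is orthogonal to contour lines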
[Plots: surface plot and contour plot of a cost function of two parameters]
A contour line is a line along which $J(\theta_0, \theta_1)$ = const
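The gradient descent loop with a simultaneous update of both parameters might look as follows. This is a minimal sketch on a hypothetical bowl-shaped cost with a known minimum. The cost `J`, the stopping tolerance and the starting point are assumptions chosen for illustration.

```python
def J(theta0, theta1):
    """Hypothetical bowl-shaped cost with its minimum at (1.0, -0.5)."""
    return (theta0 - 1.0) ** 2 + 2.0 * (theta1 + 0.5) ** 2


def grad_J(theta0, theta1):
    """Partial derivatives of J with respect to theta0 and theta1."""
    return 2.0 * (theta0 - 1.0), 4.0 * (theta1 + 0.5)


def gradient_descent(grad, theta0, theta1, eta=0.1, tol=1e-10, max_iters=10_000):
    """Step against the gradient until the parameters stop changing."""
    for _ in range(max_iters):
        g0, g1 = grad(theta0, theta1)
        # Compute both new values first, then assign: simultaneous update.
        new0 = theta0 - eta * g0
        new1 = theta1 - eta * g1
        converged = abs(new0 - theta0) < tol and abs(new1 - theta1) < tol
        theta0, theta1 = new0, new1
        if converged:
            break
    return theta0, theta1


theta0, theta1 = gradient_descent(grad_J, theta0=-2.0, theta1=2.0)
print(theta0, theta1, J(theta0, theta1))
```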
Potential issues with gradient descent
• May get stuck in local minima
• Learning rate too small: slow convergence
• Learning rate too large: oscillations, divergence
[Plots: gradient descent with a learning rate that is too small and with one that is too large]
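The effect of the learning rate can be tried out on a one-dimensional toy cost. The sketch below assumes the hypothetical cost J(θ) = θ², whose derivative is 2θ. The three learning rates are arbitrary illustrative choices. Local minima are not covered here, since this cost has only one minimum.

```python
def run(eta, steps=20, theta=1.0):
    """Gradient descent on the hypothetical cost J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - eta * 2.0 * theta  # dJ/dtheta = 2 * theta
    return theta


too_small = run(0.001)  # barely moves away from the start value 1.0
moderate = run(0.3)     # shrinks quickly towards the minimum at 0
too_large = run(1.1)    # each step overshoots and the magnitude grows

print("eta = 0.001:", too_small)
print("eta = 0.3:  ", moderate)
print("eta = 1.1:  ", too_large)
```

With `eta = 1.1` every update multiplies θ by -1.2, so the iterate flips sign each step (oscillation) while its magnitude grows (divergence).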
LINEAR REGRESSION
WITH GRADIENT DESCENT (ONE INPUT)
Application of gradient descent
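• Linear regression cost:
  $J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$ with $h_\theta(x) = \theta_0 + \theta_1 x$
• Gradient descent (simultaneous update):
  $\theta_0 := \theta_0 - \eta \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \eta \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
• $\eta$ is the "learning rate", $h_\theta(x^{(i)}) - y^{(i)}$ the "error" and $x^{(i)}$ the "input"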
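A sketch of gradient descent for linear regression with one input, using the three visible table rows. Two things here are assumptions that are not on the slides: the cost is scaled by 1/m (so the gradient carries a factor 2/m), and the knee heights are centered by subtracting their mean. With raw inputs of around 50 cm the intercept converges extremely slowly, so centering is swapped in to keep the example short.

```python
knee_height = [50.0, 56.0, 52.0]
height = [171.0, 175.0, 168.0]
m = len(knee_height)

# Assumption (not from the slides): center the input so that plain gradient
# descent converges within a few hundred iterations.
x_mean = sum(knee_height) / m
xs = [x - x_mean for x in knee_height]

theta0, theta1 = 0.0, 0.0
eta = 0.05  # learning rate, chosen by hand for this toy data
for _ in range(2000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, height)]  # "error"
    grad0 = 2.0 / m * sum(errors)
    grad1 = 2.0 / m * sum(e * x for e, x in zip(errors, xs))        # "error" * "input"
    # Simultaneous update of both parameters.
    theta0, theta1 = theta0 - eta * grad0, theta1 - eta * grad1

# Undo the centering: h(x) = theta0 + theta1 * (x - x_mean)
slope = theta1
intercept = theta0 - theta1 * x_mean
print("intercept:", intercept, "slope:", slope)
```

Because only three rows are used, the result differs from the fit reported on the slides for the full data set.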
Predicting height from knee height
• Optimal fit to training data: $\theta_0 = 137.4$, $\theta_1 = 0.8$
[Plot: body height vs. knee height with the fitted line]
LINEAR REGRESSION
MORE GENERAL FORMULATION: MULTIPLE FEATURES
Multiple inputs (features)
• Notation:
$m$ … number of training examples
$n$ … number of features
$x^{(i)}$ … input features of i'th training example (vector-valued)
$x_j^{(i)}$ … value of feature j in i'th training example

Knee height x1   Arm span x2   Age x3   Height y
50               166           32       171
56               172           17       175
52               174           62       168
…                …             …        …

Example: $n = 3$, $x^{(2)} = (56, 172, 17)^T$, $x_3^{(2)} = 17$
Linear hypothesis
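• Hypothesis (one input): $h_\theta(x) = \theta_0 + \theta_1 x$
• Hypothesis (multiple input features): $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
• More compact notation: $h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$
Example: h(x) = 50 + 0.5*kneeheight + 0.3*armspan + 0.1*age
Introduce $x_0 = 1$
Why? Notation convenience!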
Multiple inputs (features) revisited
• Notation:
$m$ … number of training examples
$n$ … number of features
$x^{(i)}$ … input features of i'th training example (vector-valued)
$x_j^{(i)}$ … value of feature j in i'th training example

x0   Knee height x1   Arm span x2   Age x3   Height y
1    50               166           32       171
1    56               172           17       175
1    52               174           62       168
1    …                …             …        …

Example: $n = 3$, $x^{(2)} = (1, 56, 172, 17)^T$, $x_3^{(2)} = 17$, $x_0^{(2)} = 1$
Matrix and vector notation
$x^{(i)}$ … features of i'th training example, (n+1) × 1
$X$ … design matrix, m × (n+1); row i is $(x^{(i)})^T$
$y$ … output/target vector, m × 1

For the three table rows shown:
$X = \begin{pmatrix} 1 & 50 & 166 & 32 \\ 1 & 56 & 172 & 17 \\ 1 & 52 & 174 & 62 \end{pmatrix}$, $y = \begin{pmatrix} 171 \\ 175 \\ 168 \end{pmatrix}$

Hypothesis values for all training examples at once: $H_\theta = X\theta$
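The design matrix and the matrix-vector product Xθ can be written out in plain Python lists, without any library. The sketch uses the three table rows from the slides and the example parameters from the hypothesis h(x) = 50 + 0.5*kneeheight + 0.3*armspan + 0.1*age. The helper name `mat_vec` is an assumption.

```python
# Raw features per training example: knee height, arm span, age.
features = [
    [50.0, 166.0, 32.0],
    [56.0, 172.0, 17.0],
    [52.0, 174.0, 62.0],
]
y = [171.0, 175.0, 168.0]  # output/target vector, m x 1

# Design matrix: prepend x0 = 1 to every row, giving shape m x (n+1).
X = [[1.0] + row for row in features]

# Parameter vector of the example hypothesis, (n+1) x 1.
theta = [50.0, 0.5, 0.3, 0.1]


def mat_vec(matrix, vector):
    """Matrix-vector product: one dot product per row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


H = mat_vec(X, theta)  # hypothesis value for every training example
m, n_plus_1 = len(X), len(X[0])
print("shape of X:", (m, n_plus_1))
print("X theta   :", H)
```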
LINEAR REGRESSION
WITH GRADIENT DESCENT (GENERAL FORMULATION)
Linear regression problem statement
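• Hypothesis: $h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j$
• Cost function: $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal is to find parameters $\theta$ which minimize the cost $J(\theta)$
$J(\theta)$ is a high-dimensional quadratic ("bowl"-shaped) function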
Gradient descent (multiple features)
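• With one input feature (simultaneous update):
  $\theta_0 := \theta_0 - \eta \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \eta \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
• With n input features (simultaneous update for j = 0…n):
  $\theta_j := \theta_j - \eta \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• For j = 0: define $x_0^{(i)} = 1$ for convenience
• $\eta$ is the "learning rate", $h_\theta(x^{(i)}) - y^{(i)}$ the "error" and $x_j^{(i)}$ the "input"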
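The update for n input features can be sketched as one loop over all parameters j = 0…n. The data below is hypothetical and noise-free, generated from y = 1 + 2·x1 - 3·x2, so the loop should recover those coefficients. The 1/m cost scaling (gradient factor 2/m), the learning rate and the iteration count are assumptions.

```python
# Hypothetical toy inputs (x1, x2) and noise-free targets y = 1 + 2*x1 - 3*x2.
raw_inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
X = [[1.0] + row for row in raw_inputs]  # x0 = 1 for every example
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in raw_inputs]
m, n_plus_1 = len(X), len(X[0])

theta = [0.0] * n_plus_1
eta = 0.1  # learning rate
for _ in range(5000):
    # "error" of the current hypothesis on each training example
    errors = [sum(t * xj for t, xj in zip(theta, row)) - target
              for row, target in zip(X, y)]
    # partial derivative of the cost for every j: 2/m * sum(error * input_j)
    gradient = [2.0 / m * sum(e * row[j] for e, row in zip(errors, X))
                for j in range(n_plus_1)]
    # building a new list updates all theta_j simultaneously
    theta = [t - eta * g for t, g in zip(theta, gradient)]

print("theta:", theta)
```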
LINEAR REGRESSION
ANALYTICAL SOLUTION
Analytical solution
$X$ … design matrix
$y$ … output/target vector
• Set all partial derivatives of cost function = 0
• Solving system of linear equations yields: $\theta = (X^T X)^{-1} X^T y$, where $(X^T X)^{-1} X^T$ is the Moore-Penrose pseudoinverse of $X$
• Note: This analytical solution requires that columns of $X$ are linearly independent ("regular" conditions)
Example: analytical solution applied to problem with one input

Knee height [cm]   Height [cm]
50                 171
56                 175
52                 168
…                  …

[Plot: body height vs. knee height]
• $X$ (a column of ones and the knee heights): 30 × 2; $y$: 30 × 1
• $X^T X$: 2 × 2; $(X^T X)^{-1}$: 2 × 2; $\theta = (X^T X)^{-1} X^T y$: 2 × 1
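For one input the analytical solution only needs the inverse of a 2 × 2 matrix, which can be written out by hand. The sketch below applies it to the three visible table rows, so the result differs from the fit the slides report for all 30 data points. Variable names are assumptions.

```python
knee_height = [50.0, 56.0, 52.0]
height = [171.0, 175.0, 168.0]
m = len(knee_height)

# Entries of X^T X (2 x 2) and X^T y (2 x 1) for X with rows (1, x_i).
s_1 = float(m)
s_x = sum(knee_height)
s_xx = sum(x * x for x in knee_height)
s_y = sum(height)
s_xy = sum(x * y for x, y in zip(knee_height, height))

# Inverse of the 2 x 2 matrix [[s_1, s_x], [s_x, s_xx]].
# det is non-zero as long as the knee heights are not all identical.
det = s_1 * s_xx - s_x * s_x
inv = [[s_xx / det, -s_x / det],
       [-s_x / det, s_1 / det]]

# theta = (X^T X)^{-1} X^T y
theta0 = inv[0][0] * s_y + inv[0][1] * s_xy
theta1 = inv[1][0] * s_y + inv[1][1] * s_xy
print("intercept:", theta0, "slope:", theta1)
```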
Predicting height from knee height
[Plot: body height vs. knee height with the fitted line]
• Resulting fit: $\theta_0 = 137.4$, $\theta_1 = 0.8$
Gradient descent vs. analytical solution
Gradient descent:
• Need to choose learning rate $\eta$
• Iterative algorithm (needs many iterations to converge)
• Works well even when number of input features $n$ is large
Analytical solution:
• No need to choose $\eta$
• Direct solution (no iteration)
• Slow if $n$ is too large (inverting an (n+1) × (n+1) matrix)
NON-LINEAR FEATURES
(NON-LINEAR BASIS FUNCTIONS)
Non-linear trends in data
x       y
0.01    -0.27
-1.22   2.63
0.17    -0.13
…       …

[Plots: y vs. x for data with a non-linear trend]
• How can we learn non-linear hypotheses?
Linear fit to this “non-linear” data
Standard design matrix: $X = \begin{pmatrix} 1 & 0.01 \\ 1 & -1.22 \\ 1 & 0.17 \\ \vdots & \vdots \end{pmatrix}$
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Optimal parameters: $\theta = (X^T X)^{-1} X^T y$
[Plot: data with the linear fit]
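Non-linear (quadratic) fit
Design matrix with non-linear features: $X = \begin{pmatrix} 1 & 0.01 & 0.01^2 \\ 1 & -1.22 & (-1.22)^2 \\ 1 & 0.17 & 0.17^2 \\ \vdots & \vdots & \vdots \end{pmatrix}$
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$
Optimal parameters: $\theta = (X^T X)^{-1} X^T y$
[Plot: data with the quadratic fit]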
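A quadratic fit uses the same least-squares machinery as before; only the design matrix changes. The sketch below builds rows (1, x, x²) and solves the normal equations with a small Gaussian elimination routine. The data is hypothetical and noise-free, generated from y = 0.5 - x + 2x², so the known coefficients should be recovered. The helper names are assumptions.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def least_squares(X, y):
    """Solve the normal equations (X^T X) theta = X^T y."""
    cols = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(cols)] for i in range(cols)]
    Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(cols)]
    return solve_linear_system(XtX, Xty)


# Hypothetical noise-free data from y = 0.5 - x + 2 x^2.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.5 - x + 2.0 * x * x for x in xs]

# Design matrix with the non-linear feature x^2 as a third column.
X = [[1.0, x, x * x] for x in xs]
theta = least_squares(X, ys)
print("theta:", theta)
```

The hypothesis is non-linear in x but still linear in the parameters, which is why the linear solver applies unchanged.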
Non-linear (sinusoidal) fit
Design matrix with non-linear features
Hypothesis: …
Optimal parameters: $\theta = (X^T X)^{-1} X^T y$
[Plot: data with the sinusoidal fit]
Non-linear input features (in general)
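• Feature 2 for each training example i is computed by applying a non-linear basis function: $x_2^{(i)} = \phi(x_1^{(i)})$, e.g. $\phi(x) = x^2$ or $\phi(x) = \sin(x)$
• Allows learning a variety of non-linear functions with the same technique(s): analytical solution or gradient descent
• In the design matrix, the first row holds all features of the 1st training example, and the column for $x_2$ holds feature 2 of all training examples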
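Building a design matrix from arbitrary basis functions can be sketched generically: each column applies one function to every input. The list of basis functions below, including the choice of sin as the non-linear feature, is an assumption for illustration. The inputs are the three x values visible in the slide table.

```python
import math

# One basis function per column of the design matrix (assumed choices).
basis_functions = [
    lambda x: 1.0,          # x0 = 1
    lambda x: x,            # x1: the raw input
    lambda x: math.sin(x),  # x2: a non-linear feature
]


def design_matrix(inputs, basis):
    """Row i holds all features of training example i."""
    return [[phi(x) for phi in basis] for x in inputs]


inputs = [0.01, -1.22, 0.17]
X = design_matrix(inputs, basis_functions)

first_example_features = X[0]                   # a row: all features of the 1st example
feature_2_all_examples = [row[2] for row in X]  # a column: feature 2 of all examples
print(first_example_features)
print(feature_2_all_examples)
```

Swapping the entries of `basis_functions` changes the hypothesis class without touching the fitting procedure.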
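Polynomial regression
• Features are powers of x: $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_n x^n$
• n = degree of the polynomial to be learned
[Plots: polynomial fits for n = 0, n = 1, n = 3 and n = 9]
What happened here?
Next lecture…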
Radial basis functions
• "Gaussian"-shaped RBFs (localized representation):
• Each basis function j has a center in the input space
• The width of the basis functions is determined by a width parameter
[Plots: Gaussian-shaped basis functions over x, with values between 0 and 1]
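A Gaussian-shaped radial basis function can be sketched as below. The exact formula and symbols on the slides are not recoverable, so the form exp(-(x - c)² / (2σ²)), the names `center` and `sigma`, and the specific centers are all assumptions. Other conventions for the width scaling exist.

```python
import math


def gaussian_rbf(x, center, sigma):
    """Assumed Gaussian RBF: equals 1 at the center and decays with distance."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


centers = [-2.0, 0.0, 2.0, 4.0]  # hypothetical centers in the input space
sigma = 1.0                      # hypothetical width, shared by all basis functions


def rbf_features(x):
    """Feature vector for one input: x0 = 1 followed by one value per RBF."""
    return [1.0] + [gaussian_rbf(x, c, sigma) for c in centers]


features_at_zero = rbf_features(0.0)
far_away = gaussian_rbf(0.0, 10.0, sigma)  # localized: almost 0 far from the center
print(features_at_zero)
print(far_away)
```

Stacking `rbf_features(x)` for every training input gives a design matrix that can be fitted with the same analytical or gradient descent techniques as before.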
Fitting a single RBF to data
[Plots: fit of a single RBF to the data; the basis function itself]
RBF with …
Fitting RBFs to data
[Plots: fits of several RBFs to the data]
RBFs with …
Image: JPEG = cosine-basis
Each block of 8x8 pixels is represented in a Fourier basis of cosine filters
Better representation of edges and corners and compresses the data
SUMMARY (QUESTIONS)
Some questions…
• Hypothesis for linear regression = ?
• Cost function for linear regression = ?
• How many local minima may the cost function for linear regression have (under regular conditions)?
• Name two ways to minimize the cost function.
• General gradient descent formula?
• How is linear regression solved with gradient descent?
• What issues can arise during gradient descent?
• What is the design matrix? What are its dimensions?
• Analytical solution for linear regression = ?
• What are the components of the solution?
• Pros and cons of gradient descent vs. analytical solution?
• How can one learn non-linear hypotheses with linear regression?
• What is polynomial regression?
• What are radial basis functions?
What is next?
• Classification with Logistic Regression
• Gradient descent tricks & more advanced optimization techniques
• Underfitting & Overfitting
• Model selection (training, validation and test set)