PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION


Page 1: Pattern Recognition  and  Machine Learning

PATTERN RECOGNITION AND MACHINE LEARNING, CHAPTER 3: LINEAR MODELS FOR REGRESSION

Page 2: Pattern Recognition  and  Machine Learning

Linear Basis Function Models (1)

Example: Polynomial Curve Fitting

Page 3: Pattern Recognition  and  Machine Learning

Sum-of-Squares Error Function

Page 4: Pattern Recognition  and  Machine Learning

0th Order Polynomial

Page 5: Pattern Recognition  and  Machine Learning

1st Order Polynomial

Page 6: Pattern Recognition  and  Machine Learning

3rd Order Polynomial

Page 7: Pattern Recognition  and  Machine Learning

9th Order Polynomial

Page 8: Pattern Recognition  and  Machine Learning

Linear Basis Function Models (2)

Generally,

y(x, w) = Σj wj φj(x) = wᵀφ(x),  j = 0, …, M−1

where the φj(x) are known as basis functions. Typically, φ0(x) = 1, so that w0 acts as a bias. In the simplest case, we use linear basis functions: φd(x) = xd.

Page 9: Pattern Recognition  and  Machine Learning

Linear Basis Function Models (3)

Polynomial basis functions:

φj(x) = x^j

These are global; a small change in x affects all basis functions.

Page 10: Pattern Recognition  and  Machine Learning

Linear Basis Function Models (4)

Gaussian basis functions:

φj(x) = exp( −(x − μj)² / (2s²) )

These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).

Page 11: Pattern Recognition  and  Machine Learning

Linear Basis Function Models (5)

Sigmoidal basis functions:

φj(x) = σ( (x − μj) / s )

where

σ(a) = 1 / (1 + exp(−a)).

These too are local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).
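A minimal NumPy sketch of these three basis families and of evaluating y(x, w) = wᵀφ(x) through the design matrix (function names and settings here are illustrative, not from the slides):

    import numpy as np

    def polynomial_basis(x, degree):
        # Global features: column j is x**j, with phi_0(x) = 1 acting as the bias.
        return np.vstack([x**j for j in range(degree + 1)]).T

    def gaussian_basis(x, centres, s):
        # Local bumps: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a bias column.
        phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
        return np.hstack([np.ones((len(x), 1)), phi])

    def sigmoid_basis(x, centres, s):
        # Local steps: phi_j(x) = sigma((x - mu_j) / s), plus a bias column.
        phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
        return np.hstack([np.ones((len(x), 1)), phi])

    x = np.linspace(0, 1, 10)
    Phi = gaussian_basis(x, centres=np.linspace(0, 1, 9), s=0.1)   # N x M design matrix
    w = np.zeros(Phi.shape[1])
    y = Phi @ w                                                    # y(x, w) = w^T phi(x) for each input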

Page 12: Pattern Recognition  and  Machine Learning

Maximum Likelihood and Least Squares (1)

Assume observations come from a deterministic function with added Gaussian noise:

t = y(x, w) + ε,  ε ~ N(0, β⁻¹)

which is the same as saying

p(t | x, w, β) = N( t | y(x, w), β⁻¹ ).

Given observed inputs, X = {x1, …, xN}, and targets, t = (t1, …, tN)ᵀ, we obtain the likelihood function

p(t | X, w, β) = ∏n N( tn | wᵀφ(xn), β⁻¹ )

where

y(x, w) = wᵀφ(x).

Page 13: Pattern Recognition  and  Machine Learning

Maximum Likelihood and Least Squares (2)

Taking the logarithm, we get

ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β ED(w)

where

ED(w) = (1/2) Σn ( tn − wᵀφ(xn) )²

is the sum-of-squares error.

Page 14: Pattern Recognition  and  Machine Learning

Maximum Likelihood and Least Squares (3)

Computing the gradient and setting it to zero yields

∇w ln p(t | w, β) = β Σn ( tn − wᵀφ(xn) ) φ(xn)ᵀ = 0.

Solving for w, we get

wML = ( ΦᵀΦ )⁻¹ Φᵀ t = Φ† t

where Φ is the N × M design matrix with elements Φnj = φj(xn), and Φ† = ( ΦᵀΦ )⁻¹ Φᵀ is the Moore-Penrose pseudo-inverse of Φ.
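A short NumPy sketch of this closed-form maximum likelihood fit on noisy sinusoidal data (the data set and polynomial degree are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 25)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.shape)   # noisy sinusoid, as in the running example

    # Design matrix for a 3rd-order polynomial basis: Phi[n, j] = x_n ** j
    Phi = np.vstack([x**j for j in range(4)]).T

    # Maximum likelihood solution w_ML = (Phi^T Phi)^{-1} Phi^T t via the pseudo-inverse
    w_ml = np.linalg.pinv(Phi) @ t

    # Equivalently (and more stably) via a least-squares solver
    w_ml_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)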

Page 15: Pattern Recognition  and  Machine Learning

Geometry of Least Squares

Consider

y = Φ wML.

The target vector t = (t1, …, tN)ᵀ lives in an N-dimensional space, and S is the M-dimensional subspace spanned by the columns of Φ, i.e. by the vectors φj = ( φj(x1), …, φj(xN) )ᵀ. wML minimizes the distance between t and its orthogonal projection onto S, which is y.

Page 16: Pattern Recognition  and  Machine Learning

Sequential Learning

Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:

w(τ+1) = w(τ) + η ( tn − w(τ)ᵀ φ(xn) ) φ(xn)

This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose the learning rate η?
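A minimal sketch of the LMS loop (the learning rate and the data below are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 200)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.shape)
    Phi = np.vstack([x**j for j in range(4)]).T        # polynomial basis, one row per data point

    eta = 0.1                                          # learning rate: too large diverges, too small is slow
    w = np.zeros(Phi.shape[1])
    for phi_n, t_n in zip(Phi, t):                     # one (x_n, t_n) pair at a time
        w += eta * (t_n - w @ phi_n) * phi_n           # LMS update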

Page 17: Pattern Recognition  and  Machine Learning

1st Order Polynomial

Page 18: Pattern Recognition  and  Machine Learning

3rd Order Polynomial

Page 19: Pattern Recognition  and  Machine Learning

9th Order Polynomial

Page 20: Pattern Recognition  and  Machine Learning

Over-fitting

Root-Mean-Square (RMS) Error:

ERMS = √( 2 E(w*) / N )

Page 21: Pattern Recognition  and  Machine Learning

Polynomial Coefficients

Page 22: Pattern Recognition  and  Machine Learning

Regularized Least Squares (1)

Consider the error function

ED(w) + λ EW(w)

(data term + regularization term), where λ is called the regularization coefficient. With the sum-of-squares error function and a quadratic regularizer, we get

(1/2) Σn ( tn − wᵀφ(xn) )² + (λ/2) wᵀw

which is minimized by

w = ( λI + ΦᵀΦ )⁻¹ Φᵀ t.
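A ridge-regression sketch of this regularized closed-form solution (the basis, data, and value of λ are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.shape)
    Phi = np.vstack([x**j for j in range(10)]).T       # 9th-order polynomial basis

    lam = np.exp(-18)                                  # regularization coefficient lambda
    M = Phi.shape[1]
    w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

Even a 9th-order polynomial stays well behaved on 10 points once the quadratic penalty shrinks the coefficients.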

Page 23: Pattern Recognition  and  Machine Learning

Regularized Least Squares (1)

Is it true that …, or …?

Page 24: Pattern Recognition  and  Machine Learning

Regularized Least Squares (2)

With a more general regularizer, we have

(1/2) Σn ( tn − wᵀφ(xn) )² + (λ/2) Σj |wj|^q

where q = 1 gives the lasso and q = 2 the quadratic regularizer.

Page 25: Pattern Recognition  and  Machine Learning

Regularized Least Squares (3)

Lasso tends to generate sparser solutions than a quadratic regularizer.

Page 26: Pattern Recognition  and  Machine Learning

Understanding Lasso Regularizer

• The lasso regularizer plays the role of thresholding: it can drive individual coefficients exactly to zero.
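One standard way to see the thresholding effect: when the design matrix is orthonormal and the regularizer is written as (λ/2) Σj |wj| as above, the lasso solution is a soft-thresholding of the unregularized least-squares coefficients. A small sketch of that operation (this orthonormal-design special case is a standard result, not taken from the slides):

    import numpy as np

    def soft_threshold(w_ls, lam):
        # Orthonormal-design lasso: shrink each least-squares coefficient by lam/2 and clip at zero
        return np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam / 2.0, 0.0)

    w_ls = np.array([2.0, -0.3, 0.05, -1.2])
    print(soft_threshold(w_ls, lam=1.0))   # small coefficients become exactly 0, giving sparsity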

Page 27: Pattern Recognition  and  Machine Learning

Multiple Outputs (1)

Analogously to the single-output case, we have

y(x, W) = Wᵀφ(x),  p(t | x, W, β) = N( t | Wᵀφ(x), β⁻¹ I ).

Given observed inputs, X = {x1, …, xN}, and targets, T = (t1, …, tN)ᵀ, we obtain the log likelihood function

ln p(T | X, W, β) = (NK/2) ln( β / 2π ) − (β/2) Σn ‖ tn − Wᵀφ(xn) ‖²

where K is the dimensionality of t.

Page 28: Pattern Recognition  and  Machine Learning

Multiple Outputs (2)

Maximizing with respect to W, we obtain

WML = ( ΦᵀΦ )⁻¹ Φᵀ T.

If we consider a single target variable, tk, we see that

wk = ( ΦᵀΦ )⁻¹ Φᵀ tk

where tk = ( t1k, …, tNk )ᵀ, which is identical to the single-output case.
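A quick NumPy check that the multi-output solution decouples into independent single-output fits, one per column of T (shapes and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    Phi = rng.normal(size=(50, 4))       # N x M design matrix
    T = rng.normal(size=(50, 3))         # N x K target matrix (K = 3 outputs)

    W_ml = np.linalg.pinv(Phi) @ T       # solves all K outputs at once
    w_k = np.linalg.pinv(Phi) @ T[:, 0]  # single-output solution for the first target
    print(np.allclose(W_ml[:, 0], w_k))  # True: each column is an independent least-squares fit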

Page 29: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (1)

Recall the expected squared loss,

E[L] = ∫ { y(x) − h(x) }² p(x) dx + ∫∫ { h(x) − t }² p(x, t) dx dt

where

h(x) = E[t | x] = ∫ t p(t | x) dt.

The second term of E[L] corresponds to the noise inherent in the random variable t.

What about the first term?

Page 30: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (2)

Suppose we were given multiple data sets, each of size N. Any particular data set, D, will give a particular function y(x; D). We then have

{ y(x; D) − h(x) }²
= { y(x; D) − ED[y(x; D)] + ED[y(x; D)] − h(x) }²
= { y(x; D) − ED[y(x; D)] }² + { ED[y(x; D)] − h(x) }² + 2 { y(x; D) − ED[y(x; D)] } { ED[y(x; D)] − h(x) }.

Page 31: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (3)

Taking the expectation over D yields

ED[ { y(x; D) − h(x) }² ] = { ED[y(x; D)] − h(x) }² + ED[ { y(x; D) − ED[y(x; D)] }² ]

i.e. (bias)² + variance.

Page 32: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (4)

Thus we can write

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ { ED[y(x; D)] − h(x) }² p(x) dx
variance = ∫ ED[ { y(x; D) − ED[y(x; D)] }² ] p(x) dx
noise = ∫∫ { h(x) − t }² p(x, t) dx dt.

Page 33: Pattern Recognition  and  Machine Learning

Bias-Variance Tradeoff

Page 34: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (5)

Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.

Page 35: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (6)

Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.

Page 36: Pattern Recognition  and  Machine Learning

The Bias-Variance Decomposition (7)

Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.

Page 37: Pattern Recognition  and  Machine Learning

The Bias-Variance Trade-off

From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
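A compact simulation in the spirit of these plots, estimating (bias)² and variance from repeated fits (the 25 data sets of 25 points, Gaussian basis functions, and λ values follow the book's running example, but the exact settings here are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    L, N = 25, 25                                  # number of data sets, points per set
    x_test = np.linspace(0, 1, 100)
    h = np.sin(2 * np.pi * x_test)                 # true regression function h(x)

    centres = np.linspace(0, 1, 9)
    def gauss_phi(x, s=0.1):
        phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
        return np.hstack([np.ones((len(x), 1)), phi])

    for lam in [np.exp(2.6), np.exp(-0.31), np.exp(-2.4)]:   # large, medium, small lambda
        preds = []
        for _ in range(L):
            x = rng.uniform(0, 1, N)
            t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
            Phi = gauss_phi(x)
            w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
            preds.append(gauss_phi(x_test) @ w)
        preds = np.array(preds)
        bias2 = np.mean((preds.mean(axis=0) - h)**2)
        variance = np.mean(preds.var(axis=0))
        print(f"ln lambda = {np.log(lam):5.2f}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")

Large λ should give a larger bias term and small variance; small λ the reverse.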

Page 38: Pattern Recognition  and  Machine Learning

Bayesian Linear Regression (1)

Define a conjugate prior over w,

p(w) = N( w | m0, S0 ).

Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior

p(w | t) = N( w | mN, SN )

where

mN = SN ( S0⁻¹ m0 + β Φᵀ t )
SN⁻¹ = S0⁻¹ + β ΦᵀΦ.

Page 39: Pattern Recognition  and  Machine Learning

Bayesian Linear Regression (2)

A common choice for the prior is

p(w | α) = N( w | 0, α⁻¹ I )

for which

mN = β SN Φᵀ t
SN⁻¹ = α I + β ΦᵀΦ.

Next we consider an example …

Page 40: Pattern Recognition  and  Machine Learning

Bayesian Linear Regression (3)

0 data points observed

(Figure panels: prior, data space)

Page 41: Pattern Recognition  and  Machine Learning

Bayesian Linear Regression (4)

1 data point observed

(Figure panels: likelihood, posterior, data space)

Page 42: Pattern Recognition  and  Machine Learning

Bayesian Linear Regression (5)

2 data points observed

(Figure panels: likelihood, posterior, data space)

Page 43: Pattern Recognition  and  Machine Learning

Bayesian Linear Regression (6)

20 data points observed

(Figure panels: likelihood, posterior, data space)

Page 44: Pattern Recognition  and  Machine Learning

Predictive Distribution (1)

Predict t for new values of x by integrating over w:

p(t | x, t, α, β) = ∫ p(t | x, w, β) p(w | t, α, β) dw = N( t | mNᵀ φ(x), σN²(x) )

where

σN²(x) = 1/β + φ(x)ᵀ SN φ(x).

Page 45: Pattern Recognition  and  Machine Learning

Predictive Distribution (1)

How to compute the predictive distribution?
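One way to compute it numerically for the sinusoidal example, directly from the formulas above (α = 2, β = 25, and 9 Gaussian basis functions plus a bias are illustrative settings in the spirit of the book's example):

    import numpy as np

    rng = np.random.default_rng(5)
    alpha, beta = 2.0, 25.0                         # prior precision and noise precision
    centres = np.linspace(0, 1, 9)

    def phi(x, s=0.1):
        g = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
        return np.hstack([np.ones((len(x), 1)), g])

    # Observed data
    x = rng.uniform(0, 1, 4)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.shape)
    Phi = phi(x)

    # Posterior p(w | t) = N(w | m_N, S_N)
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t

    # Predictive mean and variance at new inputs
    x_new = np.linspace(0, 1, 100)
    Phi_new = phi(x_new)
    mean = Phi_new @ m_N
    var = 1 / beta + np.sum(Phi_new @ S_N * Phi_new, axis=1)   # sigma_N^2(x) at each new x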

Page 46: Pattern Recognition  and  Machine Learning

Predictive Distribution (2)

Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point

Page 47: Pattern Recognition  and  Machine Learning

Predictive Distribution (3)

Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points

Page 48: Pattern Recognition  and  Machine Learning

Predictive Distribution (4)

Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points

Page 49: Pattern Recognition  and  Machine Learning

Predictive Distribution (5)

Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points

Page 50: Pattern Recognition  and  Machine Learning

Salesman’s Problem

Given
• a consumer a described by a vector x
• a product b to sell with base cost c
• the estimated price distribution of b in the mind of a, Pr(t | x, w) = N( xᵀw, σ² )

What price should we offer to a?
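One natural formulation, offered as an illustrative sketch rather than the slides' own answer: offer the price p that maximizes the expected profit (p − c) · Pr(t ≥ p), the margin times the probability that the consumer's valuation exceeds the offer.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize_scalar

    def best_price(mu, sigma, c):
        # Expected profit of offering price p when the valuation t ~ N(mu, sigma^2)
        expected_profit = lambda p: -(p - c) * norm.sf(p, loc=mu, scale=sigma)
        res = minimize_scalar(expected_profit, bounds=(c, mu + 5 * sigma), method="bounded")
        return res.x

    print(best_price(mu=10.0, sigma=2.0, c=6.0))   # optimal offer lies above cost, below the upper tail

Here mu would be the predictive mean xᵀw (or mNᵀφ(x) under the Bayesian treatment) and sigma the corresponding predictive standard deviation.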

Page 51: Pattern Recognition  and  Machine Learning

Salesman’s Problem

Page 52: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (1)

How do we choose the ‘right’ model?

Page 53: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (1)

How do we choose the ‘right’ model?

Assume we want to compare models Mi, i = 1, …, L, using data D; this requires computing the posterior

p(Mi | D) ∝ p(Mi) p(D | Mi)

where p(Mi) is the prior and p(D | Mi) is the model evidence (or marginal likelihood).


Page 55: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (3)

For a model with parameters w, we get the model evidence by marginalizing over w:

p(D | Mi) = ∫ p(D | w, Mi) p(w | Mi) dw.
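For the Bayesian linear regression model above this integral is Gaussian and has a closed form; the sketch below uses the standard log-evidence expression for that model (the expression and the α, β values are carried over from the book's treatment, not shown on these slides) to compare polynomial models of different order:

    import numpy as np

    def log_evidence(Phi, t, alpha, beta):
        # ln p(t | alpha, beta) for the linear-Gaussian model with prior N(w | 0, alpha^-1 I)
        N, M = Phi.shape
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        E_mN = beta / 2 * np.sum((t - Phi @ m_N)**2) + alpha / 2 * m_N @ m_N
        return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
                - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 1, 30)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.shape)
    for deg in [1, 3, 9]:
        Phi = np.vstack([x**j for j in range(deg + 1)]).T
        print(deg, log_evidence(Phi, t, alpha=0.005, beta=11.1))

The evidence typically favours an intermediate model order, which is the point of the following slides.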

Page 56: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (4)

For a given model with a single parameter, w, consider the approximation

p(D) = ∫ p(D | w) p(w) dw ≈ p(D | wMAP) · Δw_posterior / Δw_prior

where the prior is taken to be flat with width Δw_prior and the posterior is assumed to be sharply peaked with width Δw_posterior around the most probable value wMAP.


Page 58: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (5)

Taking logarithms, we obtain

ln p(D) ≈ ln p(D | wMAP) + ln( Δw_posterior / Δw_prior )

where the second term is negative (since Δw_posterior < Δw_prior). With M parameters, all assumed to have the same ratio Δw_posterior / Δw_prior, we get

ln p(D) ≈ ln p(D | wMAP) + M ln( Δw_posterior / Δw_prior )

where the second term is negative and linear in M.

Page 59: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (5)

The log evidence thus balances fit to the data against model complexity.

Page 60: Pattern Recognition  and  Machine Learning

Bayesian Model Comparison (6)

Matching data and model complexity

Page 61: Pattern Recognition  and  Machine Learning

What You Should Know

• Least squares and its maximum likelihood interpretation
• Sequential learning for regression
• Regularization and its effect
• Bayesian regression
• Predictive distribution
• Bias-variance tradeoff
• Bayesian model selection