TRANSCRIPT
Machine Learning
A. Supervised Learning
A.3. Regularization
Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL), Institute for Computer Science
University of Hildesheim, Germany
Syllabus
Fri. 21.10. (1) 0. Introduction

A. Supervised Learning
Fri. 28.10. (2) A.1 Linear Regression
Fri. 4.11. (3) A.2 Linear Classification
Fri. 11.11. (4) A.3 Regularization
Fri. 18.11. (5) A.4 High-dimensional Data
Fri. 25.11. (6) A.5 Nearest-Neighbor Models
Fri. 2.12. (7) A.6 Support Vector Machines
Fri. 9.12. (8) A.7 Decision Trees
Fri. 16.12. (9) A.8 A First Look at Bayesian and Markov Networks

B. Unsupervised Learning
Fri. 13.1. (10) B.1 Clustering
Fri. 20.1. (11) B.2 Dimensionality Reduction
Fri. 27.1. (12) B.3 Frequent Pattern Mining

C. Reinforcement Learning
Fri. 3.2. (13) C.1 State Space Models
(14) C.2 Markov Decision Processes
Outline
1. The Problem of Overfitting
2. Model Selection
3. Regularization
4. Hyperparameter Optimization
1. The Problem of Overfitting
Fitting of models
[Figure: braking distance vs. speed, fitted with three models of increasing complexity: a linear model (RSS = 11353.52), a quadratic model (RSS = 10824.72), and a higher-degree polynomial model (RSS = 7029.37). The training RSS decreases as the model gets more complex.]
Underfitting/Overfitting

Underfitting:
▸ the model is not complex enough to explain the data well.
▸ results in poor predictive performance.

Overfitting:
▸ the model is too complex; it describes the noise instead of the underlying relationship between target and predictors.
▸ also results in poor predictive performance.

Remark: Given N points (x_n, y_n) without repeated measurements (i.e. x_n ≠ x_m for n ≠ m), there exists a polynomial of degree N − 1 with RSS equal to 0.
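As an illustration of this remark (a minimal sketch, not the lecture's code), the following fits polynomials of increasing degree to a small made-up speed/distance data set and prints the training RSS, which can only decrease with the degree:

```python
# Minimal sketch: fit polynomials of increasing degree and watch the training
# RSS shrink. The data points are made-up placeholders, not the original data.
import numpy as np

speed = np.array([4, 7, 9, 12, 15, 18, 20, 23, 25], dtype=float)
dist = np.array([2, 13, 10, 26, 38, 58, 64, 92, 110], dtype=float)

for degree in (1, 2, 5, len(speed) - 1):
    coeffs = np.polyfit(speed, dist, deg=degree)            # least-squares fit
    rss = np.sum((dist - np.polyval(coeffs, speed)) ** 2)   # training RSS
    print(f"degree {degree}: RSS = {rss:.2f}")
# With N points and degree N-1 the RSS reaches (numerically almost) zero:
# the polynomial interpolates the data, i.e. it fits the noise perfectly.
```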
2. Model Selection
Model Selection Measures

Model selection means: we have a set of models, e.g.,

$Y = \sum_{i=0}^{p-1} \beta_i X_i$

indexed by p (i.e., one model for each value of p), and we want to choose the model that describes the data best.

If we just look at losses / fit measures such as RSS, then

  the larger p, the better the fit

or equivalently

  the larger p, the lower the loss,

since the model with p parameters can be reparametrized as a model with p′ > p parameters by setting

$\beta'_i = \begin{cases} \beta_i, & \text{for } i < p \\ 0, & \text{for } i \ge p \end{cases}$
Model Selection Measures

One uses model selection measures of the type

  model selection measure = fit − complexity

or equivalently

  model selection measure = loss + complexity

The smaller the loss (= lack of fit), the better the model.
The smaller the complexity, the simpler and thus better the model.
The model selection measure tries to find a trade-off between fit/loss and complexity.
Model Selection Measures
Akaike Information Criterion (AIC): (maximize)

$\mathrm{AIC} := \log L - p$

or (minimize)

$\mathrm{AIC} := -2 \log L + 2p$

Bayes Information Criterion (BIC) / Bayes–Schwarz Information Criterion: (maximize)

$\mathrm{BIC} := \log L - \frac{p}{2} \log N$

where L denotes the likelihood, p the number of parameters, and N the number of samples.
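As a small illustration (not from the slides), the "minimize" variants can be computed directly from a model's log-likelihood; the numbers below are made up:

```python
# Minimal sketch: compare two fitted models by AIC/BIC (minimize variants).
# loglik, p, and n are assumed to come from your fitted models.
import math

def aic(loglik: float, p: int) -> float:
    """AIC = -2 log L + 2 p (smaller is better)."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik: float, p: int, n: int) -> float:
    """BIC = -2 log L + p log N (smaller is better)."""
    return -2.0 * loglik + p * math.log(n)

# e.g. a 3-parameter model vs. an 8-parameter model on N = 50 samples
print(aic(-120.3, 3), aic(-118.9, 8))
print(bic(-120.3, 3, 50), bic(-118.9, 8, 50))
```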
Variable Backward Selection
[Figure: backward-selection search tree over models built from the variables {A, F, H, I, J, L, P}. The full model has AIC = 63.01. Each candidate in the next level removes one variable (marked with a cross), giving AIC values such as 63.87, 61.11, and 70.17; the best (lowest-AIC) candidate is expanded further, e.g. 61.88, 59.40, 68.70 at the next level and 63.23, 61.50, 66.71 at the level after that, greedily removing one variable per level.]
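A greedy backward-selection loop might look as follows. This is an illustrative sketch, not the lecture's code; `aic_of` is a hypothetical callback that fits a model on a given variable subset and returns its AIC.

```python
# Minimal sketch of greedy backward selection by AIC.
from typing import Callable, FrozenSet

def backward_selection(variables: FrozenSet[str],
                       aic_of: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    current = variables
    current_aic = aic_of(current)
    while len(current) > 1:
        # try removing each remaining variable once
        candidates = [(aic_of(current - {v}), current - {v}) for v in current]
        best_aic, best_subset = min(candidates, key=lambda c: c[0])
        if best_aic >= current_aic:      # no removal improves the AIC: stop
            return current
        current, current_aic = best_subset, best_aic
    return current
```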
3. Regularization
Shrinkage
Model selection operates by
▸ fitting models for a set of models with varying complexity and then picking the "best one" ex post,
▸ omitting some parameters completely (i.e. forcing them to be 0).

Shrinkage follows a similar idea:
▸ smaller parameters mean a simpler hypothesis / less complex model. Hence, small parameters should be preferred in general.
▸ a penalty term is added to the objective to penalize large parameters, instead of forcing them to be 0.
Shrinkage
There are various types of shrinkage techniques for different application domains.

L1 / Lasso regularization: $\lambda \sum_{j=1}^{p} |\beta_j| = \lambda \|\beta\|_1$

L2 / Tikhonov regularization: $\lambda \sum_{j=1}^{p} \beta_j^2 = \lambda \|\beta\|_2^2$

Elastic net: $\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$
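To make the three penalty terms concrete, here is a small illustrative computation (not from the slides), with made-up coefficients and regularization weights:

```python
# Minimal sketch: the three penalty terms for a given coefficient vector.
import numpy as np

beta = np.array([0.5, -2.0, 0.0, 3.0])
lam, lam1, lam2 = 0.1, 0.1, 0.05

l1_penalty = lam * np.sum(np.abs(beta))       # lambda * ||beta||_1
l2_penalty = lam * np.sum(beta ** 2)          # lambda * ||beta||_2^2
elastic_net = lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta ** 2)

print(l1_penalty, l2_penalty, elastic_net)
```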
Ridge Regression
Ridge regression is a combination of

$\underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{L2 loss}} + \lambda \underbrace{\sum_{j=1}^{p} \beta_j^2}_{\text{L2 regularization}}$
Ridge Regression (Closed Form)

Ridge regression: minimize

$RSS_\lambda(\beta) = RSS(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 = \langle y - X\beta,\, y - X\beta \rangle + \lambda \sum_{j=1}^{p} \beta_j^2$

$\Rightarrow \quad \hat{\beta} = \left( X^T X + \lambda I \right)^{-1} X^T y$

where I denotes the identity matrix and λ ≥ 0 is a complexity parameter / regularization parameter.

As the solutions of ridge regression are not equivariant under scaling of the predictors, the data is normalized before ridge regression:

$x'_{i,j} := \frac{x_{i,j} - \bar{x}_{\cdot,j}}{\sigma(x_{\cdot,j})}$
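A minimal NumPy sketch of this closed form (not the lecture's code), on made-up data; the standardization step mirrors the normalization above:

```python
# Minimal sketch: closed-form ridge regression with standardized predictors.
# X, y below are random placeholder data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
lam = 1.0

# standardize each predictor column (ridge solutions are not scale-equivariant)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# beta = (X^T X + lambda I)^{-1} X^T y, solved as a linear system;
# centering y stands in for an (unpenalized) intercept term
p = Xs.shape[1]
beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ (y - y.mean()))
print(beta)
```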
Ridge Regression (Gradient Descent)

1: procedure RIDGE-REGR-GD(ŷ : R^P → R, β^(0) ∈ R^(P+1), α, t_max ∈ N, X ∈ R^(N×P), y ∈ R^N)
2:   for t = 1, ..., t_max do
3:     β_0^(t) := β_0^(t−1) − α ( 2 ∑_{n=1}^{N} −(y_n − ŷ(x_n)) )
4:     for j = 1, ..., P do
5:       β_j^(t) := β_j^(t−1) − α ( 2 ∑_{n=1}^{N} −x_{n,j} (y_n − ŷ(x_n)) + 2λ β_j^(t−1) )
6:     if converged then
7:       return β^(t)

L2-Regularized Update Rule

$\beta_j^{(t)} := \underbrace{(1 - 2\alpha\lambda)}_{\text{shrinkage}}\; \beta_j^{(t-1)} - \alpha \left( 2 \sum_{n=1}^{N} -x_{n,j}\, (y_n - \hat{y}(x_n)) \right)$
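A minimal NumPy sketch of the same update rule (not the lecture's code); the intercept is updated without shrinkage, the remaining weights with the (1 − 2αλ) factor:

```python
# Minimal sketch of ridge gradient descent with the shrinkage-factor update.
import numpy as np

def ridge_gd(X, y, lam=1.0, alpha=1e-3, t_max=10_000, tol=1e-8):
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(t_max):
        resid = y - (beta0 + X @ beta)                # y_n - yhat(x_n)
        new_beta0 = beta0 + 2 * alpha * resid.sum()   # intercept: no shrinkage
        new_beta = (1 - 2 * alpha * lam) * beta + 2 * alpha * (X.T @ resid)
        if max(abs(new_beta0 - beta0), np.max(np.abs(new_beta - beta))) < tol:
            return new_beta0, new_beta                # converged
        beta0, beta = new_beta0, new_beta
    return beta0, beta
```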
Tikhonov Regularization Derivation (1/2)

Treat the true parameters θ_j as random variables Θ_j with the following distribution (prior):

$\Theta_j \sim N(0, \sigma_\Theta), \quad j = 1, \ldots, p$

Then the joint likelihood of the data and the parameters is

$L_{D,\Theta}(\theta) := \left( \prod_{i=1}^{n} p(x_i, y_i \mid \theta) \right) \prod_{j=1}^{p} p(\Theta_j = \theta_j)$

and the conditional joint log-likelihood of the data and the parameters is

$\log L^{\mathrm{cond}}_{D,\Theta}(\theta) := \left( \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \right) + \sum_{j=1}^{p} \log p(\Theta_j = \theta_j)$

and

$\log p(\Theta_j = \theta_j) = \log \frac{1}{\sqrt{2\pi}\,\sigma_\Theta}\, e^{-\frac{\theta_j^2}{2\sigma_\Theta^2}} = -\log(\sqrt{2\pi}\,\sigma_\Theta) - \frac{\theta_j^2}{2\sigma_\Theta^2}$
Tikhonov Regularization Derivation (2/2)

Dropping the terms that do not depend on θ_j yields:

$\log L^{\mathrm{cond}}_{D,\Theta}(\theta) := \left( \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \right) + \sum_{j=1}^{p} \log p(\Theta_j = \theta_j) \;\propto\; \left( \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \right) - \frac{1}{2\sigma_\Theta^2} \sum_{j=1}^{p} \theta_j^2$

This also gives a semantics to the complexity / regularization parameter λ:

$\lambda = \frac{1}{2\sigma_\Theta^2}$

but σ_Θ² is unknown. (We will see methods to estimate λ soon.)

The parameters θ that maximize the joint likelihood of the data and the parameters are called maximum a posteriori estimators (MAP estimators).
L2-Regularized Logistic Regression (Gradient Descent)

$\log L^{\mathrm{cond}}_D(\beta) = \sum_{i=1}^{n} y_i \langle x_i, \beta \rangle - \log\!\left(1 + e^{\langle x_i, \beta \rangle}\right) - \lambda \sum_{j=1}^{P} \beta_j^2$

1: procedure LOG-REGR-GA(L^cond_D : R^(P+1) → R, β^(0) ∈ R^(P+1), α, t_max ∈ N, ε ∈ R_+)
2:   for t = 1, ..., t_max do
3:     β_0^(t) := β_0^(t−1) + α ∑_{i=1}^{n} ( y_i − p(Y = 1 | X = x_i; β^(t−1)) )
4:     for j = 1, ..., P do
5:       β_j^(t) := β_j^(t−1) + α ( ∑_{i=1}^{n} x_{i,j} ( y_i − p(Y = 1 | X = x_i; β^(t−1)) ) − 2λ β_j^(t−1) )
6:     if L^cond_D(β^(t−1)) − L^cond_D(β^(t)) < ε then
7:       return β^(t)
8:   error "not converged in t_max iterations"
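A minimal NumPy sketch of this gradient ascent (not the lecture's code); `lam`, `alpha`, and `t_max` correspond to λ, α, and t_max above, and a fixed iteration count replaces the ε-based stopping test:

```python
# Minimal sketch: gradient ascent on the L2-penalized conditional
# log-likelihood of logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_logreg_ga(X, y, lam=0.1, alpha=0.01, t_max=5000):
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(t_max):
        prob = sigmoid(beta0 + X @ beta)               # p(Y=1 | x_i, beta)
        err = y - prob
        beta0 += alpha * err.sum()                     # intercept: not penalized
        beta += alpha * (X.T @ err - 2 * lam * beta)   # gradient + L2 shrinkage
    return beta0, beta
```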
L2-Regularized Logistic Regression (Newton)

Newton update rule:

$\beta^{(t)} := \beta^{(t-1)} + \alpha H^{-1} \nabla_\beta L^{\mathrm{cond}}_D(\beta^{(t-1)})$

with $p_i := p(Y = 1 \mid X = x_i; \beta^{(t-1)})$ and

$\nabla_\beta L^{\mathrm{cond}}_D = \begin{pmatrix} \sum_{i=1}^{n} -(y_i - p_i) \\ \sum_{i=1}^{n} -x_{i,1}(y_i - p_i) - 2\lambda\beta_1 \\ \vdots \\ \sum_{i=1}^{n} -x_{i,P}(y_i - p_i) - 2\lambda\beta_P \end{pmatrix}, \qquad H = \sum_{i=1}^{n} -p_i(1 - p_i)\, x_i x_i^T - 2\lambda I$
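A minimal NumPy sketch of these Newton steps (not the lecture's code), written in the standard maximization form β ← β − H⁻¹∇ log L, which matches the update above up to the gradient's sign convention; the intercept is omitted for brevity:

```python
# Minimal sketch: Newton steps for L2-regularized logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_logreg_newton(X, y, lam=0.1, t_max=25):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(t_max):
        prob = sigmoid(X @ beta)
        grad = X.T @ (y - prob) - 2 * lam * beta        # gradient of log L (ascent)
        W = prob * (1 - prob)
        hess = -(X.T * W) @ X - 2 * lam * np.eye(p)     # Hessian, negative definite
        beta = beta - np.linalg.solve(hess, grad)       # Newton step
    return beta
```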
4. Hyperparameter Optimization
What is Hyperparameter Optimization?
Many learning algorithms A_λ have hyperparameters λ (learning rate, regularization). After choosing them, A_λ can be used to map the training data D_train to a function ŷ by minimizing some loss L(x; ŷ).

Identifying good values for the hyperparameters λ is called hyperparameter optimization.

Hence, hyperparameter optimization is a second-level optimization:

$\operatorname*{arg\,min}_{\lambda \in \Lambda} \; \frac{1}{|D_{\mathrm{calib}}|} \sum_{x \in D_{\mathrm{calib}}} L\!\left(x; A_\lambda(D_{\mathrm{train}})\right) = \operatorname*{arg\,min}_{\lambda \in \Lambda} \; \Psi(\lambda)$

where Ψ is the hyperparameter response function and D_calib a calibration set.
Why Hyperparameter Optimization
▸ So far only model parameters were optimized.
▸ Hyperparameters (such as regularization λ and learning rate α) came "out of the blue".
▸ Hyperparameters can have a big impact on the prediction quality.
Grid Search

▸ Choose for each hyperparameter a set of candidate values Λ_1, ..., Λ_q.
▸ Λ = ∏_{i=1}^{q} Λ_i then contains all combinations of these hyperparameter values.
▸ Then choose the hyperparameter configuration λ ∈ Λ with the best performance on D_calib (see the sketch below).

[Figure: a regular grid of candidate (learning rate, regularization) configurations, both axes ranging from 0.00 to 0.10.]
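A minimal sketch of grid search (not the lecture's code); `evaluate` is a hypothetical stand-in for training A_λ on D_train and measuring the loss on D_calib:

```python
# Minimal sketch: exhaustive grid search over two hyperparameters.
from itertools import product

def grid_search(learning_rates, reg_weights, evaluate):
    best_config, best_loss = None, float("inf")
    for alpha, lam in product(learning_rates, reg_weights):   # all combinations
        loss = evaluate(alpha, lam)
        if loss < best_loss:
            best_config, best_loss = (alpha, lam), loss
    return best_config, best_loss

# usage (my_eval_fn is a hypothetical evaluation function):
# grid_search([0.001, 0.01, 0.1], [0.0, 0.1, 1.0], evaluate=my_eval_fn)
```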
Random Search

▸ Instead of choosing hyperparameters on a grid, sample random hyperparameter configurations λ from Λ (within a reasonable search space); see the sketch below.
▸ Provides better results than grid search when some of the hyperparameters are insensitive.

[Figure: randomly sampled (learning rate, regularization) configurations scattered over the search space.]
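The corresponding random-search sketch (again illustrative, with `evaluate` as a hypothetical stand-in); sampling on a log scale is one common choice for learning rate and regularization weight:

```python
# Minimal sketch: random search over the same two hyperparameters.
import random

def random_search(evaluate, n_trials=25, seed=0):
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        alpha = 10 ** rng.uniform(-4, 0)    # learning rate, sampled on a log scale
        lam = 10 ** rng.uniform(-4, 1)      # regularization weight, log scale
        loss = evaluate(alpha, lam)
        if loss < best_loss:
            best_config, best_loss = (alpha, lam), loss
    return best_config, best_loss
```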
What is the Calibration Data?

Whenever a learning process depends on a hyperparameter, the hyperparameter can be estimated by picking the value with the lowest error.

If this is done on the test data, one actually uses test data in the training process ("train on test"), thereby lessening its usefulness for estimating the test error.

Therefore, one splits the training data again into
▸ (proper) training data and
▸ calibration data.

The calibration data plays the role of test data during the training process.
Cross Validation

Instead of a single split into training data, (validation data,) and test data, cross validation splits the data into k parts of roughly equal size,

$D = D_1 \cup D_2 \cup \cdots \cup D_k, \quad D_i \text{ pairwise disjoint}$

and averages performance over the k learning problems

$D^{(i)}_{\mathrm{train}} = D \setminus D_i, \quad D^{(i)}_{\mathrm{test}} = D_i, \quad i = 1, \ldots, k$

Common are 5- and 10-fold cross validation.

n-fold cross validation is also known as leave-one-out.
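A minimal k-fold cross-validation sketch (not the lecture's code); `train` and `loss` are hypothetical stand-ins for the learner A_λ and the evaluation loss:

```python
# Minimal sketch: k-fold cross validation, averaging the loss over the folds.
import numpy as np

def k_fold_cv(X, y, train, loss, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)              # k parts of roughly equal size
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])             # fit on D \ D_i
        scores.append(loss(model, X[test_idx], y[test_idx]))  # evaluate on D_i
    return float(np.mean(scores))
```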
Cross Validation

How many folds to use in k-fold cross validation?

k = n (leave one out):
▸ approximately unbiased for the true prediction error.
▸ high variance, as the n training sets are very similar.
▸ in general computationally costly, as n different models have to be learnt.

k = 5:
▸ lower variance.
▸ bias could be a problem: due to the smaller training set size, the prediction error could be overestimated.
Summary

▸ The problem of underfitting can be overcome by using more complex models, e.g., having variable interactions as in polynomial models.
▸ The problem of overfitting can be overcome by
  ▸ model / variable selection as well as by
  ▸ (parameter) shrinkage.
▸ Applying L2 regularization to linear and logistic regression requires only a few changes in the learning algorithm.
▸ Shrinkage introduces a hyperparameter λ that cannot be learned by direct loss minimization.
▸ Estimating the best hyperparameters can be considered a meta-learning problem. They can be estimated, e.g., by
  ▸ grid search or
  ▸ random search,
  using validation data.
Further Readings
▸ [JWHT13, chapter 3], [Mur12, chapter 7], [HTFF05, chapter 3].
References
[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction, volume 27. 2005.

[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Springer, 2013.

[Mur12] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.