
Machine Learning
A. Supervised Learning
A.3. Regularization

Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Institute for Computer Science
University of Hildesheim, Germany


Syllabus

Fri. 21.10. (1)  0. Introduction

A. Supervised Learning
Fri. 28.10. (2)  A.1 Linear Regression
Fri. 4.11.  (3)  A.2 Linear Classification
Fri. 11.11. (4)  A.3 Regularization
Fri. 18.11. (5)  A.4 High-dimensional Data
Fri. 25.11. (6)  A.5 Nearest-Neighbor Models
Fri. 2.12.  (7)  A.6 Support Vector Machines
Fri. 9.12.  (8)  A.7 Decision Trees
Fri. 16.12. (9)  A.8 A First Look at Bayesian and Markov Networks

B. Unsupervised Learning
Fri. 13.1. (10)  B.1 Clustering
Fri. 20.1. (11)  B.2 Dimensionality Reduction
Fri. 27.1. (12)  B.3 Frequent Pattern Mining

C. Reinforcement Learning
Fri. 3.2.  (13)  C.1 State Space Models
           (14)  C.2 Markov Decision Processes


Outline

1. The Problem of Overfitting

2. Model Selection

3. Regularization

4. Hyperparameter Optimization


1. The Problem of Overfitting

Fitting of models

[Figure: breaking distance vs. speed, fitted with three models of increasing flexibility:
 Linear model (RSS = 11353.52), Quadratic model (RSS = 10824.72), Polynomial model (RSS = 7029.37).]


Underfitting/Overfitting

Underfitting:
- the model is not complex enough to explain the data well.
- results in poor predictive performance.

Overfitting:
- the model is too complex; it describes the noise instead of the underlying relationship between target and predictors.
- also results in poor predictive performance.

Remark: Given N points (x_n, y_n) without repeated measurements (i.e. x_n ≠ x_m for n ≠ m), there exists a polynomial of degree N − 1 with RSS equal to 0.
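To make the remark concrete, here is a minimal numpy sketch (not part of the original slides): a polynomial of degree N − 1 fitted through N points with distinct x-values drives the RSS to (numerically) zero. The data values are made up for illustration.

```python
import numpy as np

# N points with distinct x-values (no repeated measurements)
rng = np.random.default_rng(0)
x = np.array([5.0, 8.0, 12.0, 17.0, 21.0, 25.0])
y = rng.normal(40.0, 20.0, size=x.shape)        # arbitrary targets

N = len(x)
coeffs = np.polyfit(x, y, deg=N - 1)            # degree N-1 polynomial
rss = np.sum((y - np.polyval(coeffs, x)) ** 2)

print(rss)   # ~0 up to floating-point error: the polynomial interpolates all points
```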


2. Model Selection

Model Selection Measures

Model selection means: we have a set of models, e.g.,

    Y = ∑_{i=0}^{p−1} β_i X_i

indexed by p (i.e., one model for each value of p), and we have to choose which model describes the data best.

If we just look at losses / fit measures such as RSS, then

the larger p, the better the fit

or equivalently

the larger p, the lower the loss

as the model with p parameters can be reparametrized as a model with p′ > p parameters by setting

    β′_i = β_i   for i ≤ p,
    β′_i = 0     for i > p.


Model Selection Measures

One uses model selection measures of the type

model selection measure = fit − complexity

or equivalently

model selection measure = loss + complexity

The smaller the loss (= lack of fit), the better the model.

The smaller the complexity, the simpler and thus better the model.

The model selection measure tries to find a trade-off between fit/loss and complexity.


Model Selection Measures

Akaike Information Criterion (AIC): (maximize)

    AIC := log L − p

or (minimize)

    AIC := −2 log L + 2p

Bayes Information Criterion (BIC) / Bayes–Schwarz Information Criterion: (maximize)

    BIC := log L − (p/2) log N

where L denotes the likelihood, N the number of samples.
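As a small illustration (not from the slides), both criteria in their "minimize" convention can be computed directly from a model's maximized log-likelihood; note that multiplying the "maximize" forms above by −2 gives exactly these expressions.

```python
import numpy as np

def aic(log_likelihood, p):
    """AIC in the 'minimize' convention: -2 log L + 2 p."""
    return -2.0 * log_likelihood + 2.0 * p

def bic(log_likelihood, p, n):
    """BIC in the 'minimize' convention: -2 log L + p log N."""
    return -2.0 * log_likelihood + p * np.log(n)
```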


Variable Backward Selection

[Figure: backward selection search tree. Start from the full variable set { A, F, H, I, J, L, P } with AIC = 63.01.
 Each child candidate removes one variable (marked X); their AICs are, e.g., 63.87, 61.11, 70.17, ...
 The best candidate is expanded further (AICs 61.88, 59.40, 68.70, ...; then 63.23, 61.50, 66.71, ...),
 removing one more variable at each level.]
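A minimal sketch of the greedy search shown in the figure, assuming the "minimize" AIC convention and plain least-squares fits; `ols_aic`, the variable names, and the stopping rule are illustrative, not the lecture's reference implementation.

```python
import numpy as np

def ols_aic(X, y):
    """AIC (minimize form) of an OLS fit with Gaussian errors."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n
    log_l = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * log_l + 2 * (p + 1)          # +1 parameter for the noise variance

def backward_selection(X, y, names):
    """Repeatedly drop the single variable whose removal lowers the AIC most."""
    selected = list(range(X.shape[1]))
    best_aic = ols_aic(X[:, selected], y)
    while len(selected) > 1:
        scores = [(ols_aic(X[:, [k for k in selected if k != j]], y), j) for j in selected]
        cand_aic, drop = min(scores)
        if cand_aic >= best_aic:             # no removal improves the AIC: stop
            break
        best_aic = cand_aic
        selected.remove(drop)
    return [names[k] for k in selected], best_aic
```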


3. Regularization

Shrinkage

Model selection operates by
- fitting models for a set of models with varying complexity and then picking the "best one" ex post,
- omitting some parameters completely (i.e., forcing them to be 0).

Shrinkage follows a similar idea:
- smaller parameters mean a simpler hypothesis / less complex model. Hence, small parameters should be preferred in general.
- instead of forcing parameters to be 0, a penalty term is added to the optimization objective to penalize large parameters.


Shrinkage

There are various types of shrinkage techniques for different application domains.

L1 / Lasso regularization:      λ ∑_{j=1}^{p} |β_j|  =  λ ‖β‖_1

L2 / Tikhonov regularization:   λ ∑_{j=1}^{p} β_j²   =  λ ‖β‖_2²

Elastic Net:                    λ_1 ‖β‖_1 + λ_2 ‖β‖_2²
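A tiny numpy illustration (with made-up parameter values) of how the three penalty terms are computed for a given parameter vector:

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.0, 3.0])      # illustrative parameter vector
lam, lam1, lam2 = 0.1, 0.1, 0.01            # illustrative regularization weights

l1_penalty  = lam * np.sum(np.abs(beta))                              # λ ‖β‖_1
l2_penalty  = lam * np.sum(beta ** 2)                                 # λ ‖β‖_2²
elastic_net = lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta ** 2)  # λ_1 ‖β‖_1 + λ_2 ‖β‖_2²
```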


Ridge Regression

Ridge regression is a combination of

    ∑_{i=1}^{n} (y_i − ŷ_i)²   +   λ ∑_{j=1}^{p} β_j²
       (L2 loss)                   (λ · L2 regularization)


Ridge Regression (Closed Form)

Ridge regression: minimize

    RSS_λ(β) = RSS(β) + λ ∑_{j=1}^{p} β_j²  =  ⟨y − Xβ, y − Xβ⟩ + λ ∑_{j=1}^{p} β_j²

    ⇒  β̂ = (XᵀX + λI)⁻¹ Xᵀy,    I the p×p identity matrix,

with λ ≥ 0 a complexity parameter / regularization parameter.

As solutions of ridge regression are not equivariant under scaling of the predictors, the data is normalized before ridge regression:

    x′_{i,j} := (x_{i,j} − x̄_{·,j}) / σ(x_{·,j})
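A minimal numpy sketch of the closed form, with the predictors standardized as on the slide; handling of the intercept is omitted and the function name is illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of <y - X b, y - X b> + lam * ||b||_2^2 via the normal equations."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors (not equivariant under scaling)
    p = Xn.shape[1]
    return np.linalg.solve(Xn.T @ Xn + lam * np.eye(p), Xn.T @ y)
```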


Ridge Regression (Gradient Descent)

1: procedure Ridge-Regr-GD(ŷ : R^P → R, β^(0) ∈ R^{P+1}, α, t_max ∈ N, X ∈ R^{N×P}, y ∈ R^N)
2:   for t = 1, ..., t_max do
3:     β_0^(t) := β_0^(t−1) − α ( 2 ∑_{n=1}^{N} −(y_n − ŷ(X_n)) )
4:     for j = 1, ..., P do
5:       β_j^(t) := β_j^(t−1) − α ( 2 ∑_{n=1}^{N} −X_{n,j} (y_n − ŷ(X_n)) + 2λ β_j^(t−1) )
6:     if converged then
7:       return β^(t)

L2-Regularized Update Rule

    β_j^(t) := (1 − 2αλ) β_j^(t−1) − α ( 2 ∑_{n=1}^{N} −X_{n,j} (y_n − ŷ(X_n)) )

where the factor (1 − 2αλ) is the shrinkage: it pulls the parameter towards 0 in every step.
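A compact numpy version of the L2-regularized update rule above (simultaneous updates from the previous iterate, intercept left unpenalized; step size, tolerance, and names are illustrative):

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, alpha=1e-4, t_max=10000, tol=1e-8):
    """Batch gradient descent for ridge regression; the intercept beta0 is not penalized."""
    beta0, beta = 0.0, np.zeros(X.shape[1])
    prev_loss = np.inf
    for _ in range(t_max):
        resid = y - (beta0 + X @ beta)                     # y_n - yhat(X_n)
        beta0 = beta0 - alpha * 2 * np.sum(-resid)         # plain gradient step for the intercept
        beta = (1 - 2 * alpha * lam) * beta - alpha * 2 * (-(X.T @ resid))   # shrink, then gradient step
        loss = np.sum(resid ** 2) + lam * np.sum(beta ** 2)
        if abs(prev_loss - loss) < tol:                    # simple convergence check
            break
        prev_loss = loss
    return beta0, beta
```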


Tikhonov Regularization Derivation (1/2)

Treat the true parameters θ_j as random variables Θ_j with the following distribution (prior):

    Θ_j ∼ N(0, σ_Θ),   j = 1, ..., p

Then the joint likelihood of the data and the parameters is

    L_{D,Θ}(θ) := ( ∏_{i=1}^{n} p(x_i, y_i | θ) ) · ∏_{j=1}^{p} p(Θ_j = θ_j)

and the conditional joint log likelihood of the data and the parameters is

    log L^cond_{D,Θ}(θ) := ( ∑_{i=1}^{n} log p(y_i | x_i, θ) ) + ∑_{j=1}^{p} log p(Θ_j = θ_j)

and

    log p(Θ_j = θ_j) = log ( 1/(√(2π) σ_Θ) · e^{−θ_j² / (2σ_Θ²)} ) = −log(√(2π) σ_Θ) − θ_j² / (2σ_Θ²)


Tikhonov Regularization Derivation (2/2)

Dropping the terms that do not depend on θ_j yields:

    log L^cond_{D,Θ}(θ) := ( ∑_{i=1}^{n} log p(y_i | x_i, θ) ) + ∑_{j=1}^{p} log p(Θ_j = θ_j)

                         = ( ∑_{i=1}^{n} log p(y_i | x_i, θ) ) − 1/(2σ_Θ²) ∑_{j=1}^{p} θ_j²  + const.

This also gives a semantics to the complexity / regularization parameter λ:

    λ = 1 / (2σ_Θ²)

but σ_Θ² is unknown. (We will see methods to estimate λ soon.)

The parameters θ that maximize the joint likelihood of the data and the parameters are called Maximum A Posteriori estimators (MAP estimators).


L2-Regularized Logistic Regression (Gradient Descent)

    log L^cond_D(β) = ∑_{i=1}^{n} ( y_i ⟨x_i, β⟩ − log(1 + e^{⟨x_i,β⟩}) ) − λ ∑_{j=1}^{P} β_j²

1: procedure Log-Regr-GA(L^cond_D : R^{P+1} → R, β^(0) ∈ R^{P+1}, α, t_max ∈ N, ε ∈ R_+)
2:   for t = 1, ..., t_max do
3:     β_0^(t) := β_0^(t−1) + α ∑_{i=1}^{n} ( y_i − p(Y = 1 | X = x_i; β^(t−1)) )
4:     for j = 1, ..., P do
5:       β_j^(t) := β_j^(t−1) + α ( ∑_{i=1}^{n} x_{i,j} ( y_i − p(Y = 1 | X = x_i; β^(t−1)) ) − 2λ β_j^(t−1) )
6:     if L^cond_D(β^(t)) − L^cond_D(β^(t−1)) < ε then
7:       return β^(t)
8:   error "not converged in t_max iterations"
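A compact numpy sketch of this gradient ascent (the intercept is left unpenalized; names, step size, and stopping rule are illustrative):

```python
import numpy as np

def logreg_l2_ga(X, y, lam=0.1, alpha=1e-3, t_max=10000, eps=1e-8):
    """Gradient ascent on the L2-penalized conditional log-likelihood."""
    beta0, beta = 0.0, np.zeros(X.shape[1])

    def loglik(b0, b):
        z = b0 + X @ b
        return np.sum(y * z - np.logaddexp(0.0, z)) - lam * np.sum(b ** 2)

    prev = loglik(beta0, beta)
    for _ in range(t_max):
        p = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # p(Y=1 | x_i; beta)
        resid = y - p
        beta0 = beta0 + alpha * np.sum(resid)
        beta = beta + alpha * (X.T @ resid - 2.0 * lam * beta)
        cur = loglik(beta0, beta)
        if cur - prev < eps:                             # improvement smaller than epsilon: converged
            return beta0, beta
        prev = cur
    raise RuntimeError("not converged in t_max iterations")
```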


L2-Regularized Logistic Regression (Newton)

Newton update rule:

    β^(t) := β^(t−1) + α H⁻¹ ∇_β L^cond_D(β^(t−1)),    p_i := p(Y = 1 | X = x_i; β^(t−1))

    ∇_β L^cond_D = ( ∑_{i=1}^{n} −(y_i − p_i),
                     ∑_{i=1}^{n} −x_{i,1}(y_i − p_i) − 2λβ_1,
                     ...,
                     ∑_{i=1}^{n} −x_{i,P}(y_i − p_i) − 2λβ_P )

    H = ∑_{i=1}^{n} −p_i (1 − p_i) x_i x_iᵀ − 2λI
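For comparison, a single damped Newton step in numpy, written in the standard sign convention for maximizing the penalized log-likelihood (a sketch under that convention; the lecture's exact sign bookkeeping may differ, and the function name is illustrative):

```python
import numpy as np

def newton_step(X, y, beta, lam=0.1, alpha=1.0):
    """One damped Newton step for max_b sum_i [ y_i<x_i,b> - log(1+e^{<x_i,b>}) ] - lam*||b||^2."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))                                # p_i
    grad = X.T @ (y - p) - 2.0 * lam * beta                              # gradient of the penalized log-likelihood
    H = -(X.T * (p * (1.0 - p))) @ X - 2.0 * lam * np.eye(X.shape[1])    # Hessian (negative definite)
    return beta - alpha * np.linalg.solve(H, grad)                       # beta - alpha * H^{-1} grad
```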


4. Hyperparameter Optimization

What is Hyperparameter Optimization?

Many learning algorithms A_λ have hyperparameters λ (learning rate, regularization). After choosing them, A_λ can be used to map the training data D_train to a function ŷ by minimizing some loss L(x; ŷ).

Identifying good values for the hyperparameters λ is called hyperparameter optimization.

Hence, hyperparameter optimization is a second-level optimization

    argmin_{λ∈Λ}  1/|D_calib|  ∑_{x ∈ D_calib} L(x; A_λ(D_train))  =  argmin_{λ∈Λ} Ψ(λ)

where Ψ is the hyperparameter response function and D_calib a calibration set.


Why Hyperparameter Optimization

- So far, only model parameters were optimized.
- Hyperparameters (such as the regularization λ and the learning rate α) came "out of the blue".
- Hyperparameters can have a big impact on the prediction quality.


Grid Search

- Choose for each hyperparameter a set of candidate values Λ_1, ..., Λ_q.
- Λ = ∏_{i=1}^{q} Λ_i is then the set of all combinations of these candidate values.
- Then choose the hyperparameter λ ∈ Λ with the best performance on D_calib (see the code sketch after the figure).

[Figure: grid of evaluated hyperparameter configurations; x-axis: learning rate, y-axis: regularization, both on a regular grid over 0.00–0.10.]
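A small sketch of grid search over two hyperparameters; `fit` and `evaluate` are hypothetical callables standing in for A_λ(D_train) and the calibration loss Ψ(λ), and the candidate sets are made up:

```python
import itertools
import numpy as np

# Hypothetical candidate sets Λ_i for two hyperparameters
learning_rates  = [0.001, 0.01, 0.1]
regularizations = [0.0, 0.01, 0.1, 1.0]

def grid_search(train, calib, fit, evaluate):
    """Try every combination λ ∈ Λ = Λ_1 × Λ_2 and keep the one with the best calibration loss."""
    best_lambda, best_loss = None, np.inf
    for alpha, lam in itertools.product(learning_rates, regularizations):
        model = fit(train, alpha=alpha, lam=lam)    # A_λ(D_train)
        loss = evaluate(model, calib)               # estimate of Ψ(λ) on D_calib
        if loss < best_loss:
            best_lambda, best_loss = (alpha, lam), loss
    return best_lambda, best_loss
```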


Random Search

- Instead of choosing hyperparameters on a grid, sample random hyperparameters λ from Λ (within a reasonable space).
- Provides better results than grid search when some hyperparameters are insensitive (see the code sketch after the figure).

[Figure: randomly sampled hyperparameter configurations; x-axis: learning rate, y-axis: regularization.]
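The same idea with random sampling instead of a grid; the search ranges and the `fit`/`evaluate` callables are illustrative assumptions:

```python
import numpy as np

def random_search(train, calib, fit, evaluate, n_trials=20, seed=0):
    """Sample λ = (learning rate, regularization) at random and keep the best on D_calib."""
    rng = np.random.default_rng(seed)
    best_lambda, best_loss = None, np.inf
    for _ in range(n_trials):
        alpha = 10 ** rng.uniform(-4, -1)     # log-uniform learning rate in [1e-4, 1e-1]
        lam = 10 ** rng.uniform(-3, 1)        # log-uniform regularization in [1e-3, 1e1]
        model = fit(train, alpha=alpha, lam=lam)
        loss = evaluate(model, calib)
        if loss < best_loss:
            best_lambda, best_loss = (alpha, lam), loss
    return best_lambda, best_loss
```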


What is the Calibration Data?

Whenever a learning process depends on a hyperparameter, the hyperparameter can be estimated by picking the value with the lowest error.

If this is done on the test data, one actually uses test data in the training process ("train on test"), thereby lessening its usefulness for estimating the test error.

Therefore, one splits the training data again into
- (proper) training data and
- calibration data.

The calibration data acts as test data during the training process.


Cross Validation

Instead of a single split into

    training data, (validation data,) and test data,

cross validation splits the data into k parts (of roughly equal size)

    D = D_1 ∪ D_2 ∪ · · · ∪ D_k,    D_i pairwise disjoint,

and averages performance over k learning problems

    D^(i)_train = D \ D_i,   D^(i)_test = D_i,   i = 1, ..., k.

Common choices are 5- and 10-fold cross validation.

n-fold cross validation is also known as leave-one-out (see the sketch below).
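A minimal sketch of the k-fold splits and the averaged evaluation; `fit` and `evaluate` are again hypothetical callables:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal, pairwise disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, k, fit, evaluate):
    """Average the evaluation score over the k problems D \ D_i (train), D_i (test)."""
    folds = kfold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```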


Cross Validation

How many folds to use in k-fold cross validation?

k = n / leave-one-out:
- approximately unbiased for the true prediction error.
- high variance, as the n training sets are very similar.
- in general computationally costly, as n different models have to be learnt.

k = 5:
- lower variance.
- bias could be a problem: due to the smaller training set size, the prediction error could be overestimated.


Summary

- The problem of underfitting can be overcome by using more complex models, e.g., having variable interactions as in polynomial models.
- The problem of overfitting can be overcome by model / variable selection as well as by (parameter) shrinkage.
- Applying L2 regularization to linear and logistic regression requires only a few changes in the learning algorithm.
- Shrinkage introduces a hyperparameter λ that cannot be learned by direct loss minimization.
- Estimating the best hyperparameters can be considered a meta-learning problem. They can be estimated, e.g., by
  - Grid Search or
  - Random Search
  using validation data.


Further Readings

- [JWHT13, chapter 3], [Mur12, chapter 7], [HTFF05, chapter 3].


References

[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction, volume 27. 2005.

[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Springer, 2013.

[Mur12] Kevin P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.
