
Dimension Reduction Methods and Bayesian Machine Learning

Marek Petrik

2/28


Previously in Machine Learning

- How to choose the right features if we have (too) many options
- Methods:
  1. Subset selection
  2. Regularization (shrinkage)
  3. Dimensionality reduction (next class)


Best Subset Selection

- Want to find a subset of the p features
- The subset should be small and predict well
- Example: credit ∼ rating + income + student + limit

M_0 ← null model (no features)
for k = 1, 2, ..., p do
    Fit all (p choose k) models that contain exactly k features
    M_k ← best of the (p choose k) models according to a metric (CV error, R², etc.)
end
return the best of M_0, M_1, ..., M_p according to the metric above

Algorithm 1: Best Subset Selection
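For concreteness, here is a minimal Python sketch of the exhaustive search. It is only feasible for small p (the search visits 2^p − 1 subsets), and the synthetic data, the use of linear regression, and cross-validated MSE as the metric are illustrative choices; it also collapses the two stages of Algorithm 1 into a single CV-scored search.

from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_subset_selection(X, y, cv=5):
    # Exhaustive search over all nonempty feature subsets,
    # scored by cross-validated MSE (higher neg-MSE is better).
    n, p = X.shape
    best_score, best_subset = -np.inf, ()
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            score = cross_val_score(LinearRegression(), X[:, cols], y,
                                    cv=cv, scoring="neg_mean_squared_error").mean()
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, -best_score   # chosen features and their CV MSE

# Example with a small p (the search is exponential in p):
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=60)
print(best_subset_selection(X, y))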


Achieving Scalability

- Complexity of Best Subset Selection?
- Examine all possible subsets? How many?
- O(2^p)!
- Heuristic approaches:
  1. Stepwise selection: solve the problem approximately (greedy)
  2. Regularization: solve a different (easier) problem (relaxation)


Which Metric to Use?

(Best Subset Selection, Algorithm 1, repeated for reference: for each k, pick the best of the (p choose k) models with k features, then pick the best of M_0, M_1, ..., M_p.)

1. Direct error estimate: cross-validation; precise but computationally intensive
2. Indirect error estimate: Mallows's Cp,

   Cp = (RSS + 2 d σ̂²) / n,   where σ̂² ≈ Var[ε]

   Also the Akaike information criterion (AIC), BIC, and many others; these have theoretical foundations.
3. Interpretability penalty: what is the cost of extra features?
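A short sketch of computing Mallows's Cp. Estimating σ̂² from the full model containing all p features is a common convention and an assumption here, not something specified on the slide.

import numpy as np

def mallows_cp(rss, d, n, sigma2_hat):
    # C_p = (RSS + 2 d sigma_hat^2) / n, with sigma_hat^2 estimating Var[eps]
    return (rss + 2 * d * sigma2_hat) / n

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

A = np.c_[np.ones(n), X]                      # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
rss_full = np.sum((y - A @ beta) ** 2)
sigma2_hat = rss_full / (n - p - 1)           # noise estimate from the full model
print(mallows_cp(rss_full, d=p, n=n, sigma2_hat=sigma2_hat))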


Regularization

1. Stepwise selection: solve the problem approximately
2. Regularization: solve a different (easier) problem (relaxation)
   - Solve the machine learning problem, but penalize solutions that use “too much” of the features


Regularization

- Ridge regression (parameter λ), ℓ2 penalty:

  min_β RSS(β) + λ ∑_j β_j²  =  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )² + λ ∑_j β_j²

- Lasso (parameter λ), ℓ1 penalty:

  min_β RSS(β) + λ ∑_j |β_j|  =  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )² + λ ∑_j |β_j|

- Both are approximations to the ℓ0 (best subset) solution
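A minimal sketch of the two penalized fits with scikit-learn. The alpha argument plays the role of λ (libraries differ in how they scale the penalty), and the data are synthetic.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]              # only three relevant features
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)            # l2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)            # l1 penalty: sets many coefficients to 0
print(np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))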


Why Lasso Works

- Bias-variance trade-off
- Increasing λ increases bias
- Example: all features relevant

[Figure: mean squared error as a function of λ (left) and of R² on the training data (right); purple: test MSE, black: bias, green: variance; dotted lines: ridge regression.]


Why Lasso Works

- Bias-variance trade-off
- Increasing λ increases bias
- Example: only some features relevant

[Figure: mean squared error as a function of λ (left) and of R² on the training data (right); purple: test MSE, black: bias, green: variance; dotted lines: ridge regression.]



Regularization: Constrained Formulation

- Ridge regression (budget parameter s), ℓ2 constraint:

  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )²   subject to   ∑_j β_j² ≤ s

- Lasso (budget parameter s), ℓ1 constraint:

  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )²   subject to   ∑_j |β_j| ≤ s

- Both are approximations to the ℓ0 (best subset) solution
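A sketch of the constrained (budget) form of the lasso, using the standard split of each coefficient into nonnegative positive and negative parts so the ℓ1 constraint becomes linear. The data and the budget s are illustrative; in practice one would use a dedicated solver such as the coordinate-descent lasso shown earlier.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=80)

# Center so the (unpenalized) intercept can be dropped from the problem
Xc, yc = X - X.mean(axis=0), y - y.mean()
p = Xc.shape[1]
s = 4.0                                     # budget on sum_j |beta_j|

def rss(b):
    beta = b[:p] - b[p:]                    # beta_j = b+_j - b-_j, both >= 0
    return np.sum((yc - Xc @ beta) ** 2)

cons = [{"type": "ineq", "fun": lambda b: s - b.sum()}]   # sum(b+ + b-) <= s
res = minimize(rss, x0=np.zeros(2 * p), method="SLSQP",
               bounds=[(0, None)] * (2 * p), constraints=cons)
print(np.round(res.x[:p] - res.x[p:], 2))   # constrained-lasso coefficients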


Lasso Solutions are Sparse

Constrained lasso (left) vs. constrained ridge regression (right)

Constraint regions are blue; red ellipses are contours of the RSS objective


Today

- Dimension reduction methods
  - Principal component regression
  - Partial least squares
- Interpretation in high dimensions
- Bayesian view of ridge regression and lasso


Dimensionality Reduction Methods

- A different approach to model selection
- We have many features: X_1, X_2, ..., X_p
- Transform the features into a smaller number Z_1, ..., Z_M
- Find constants φ_jm so that the new features Z_m are linear combinations of the X_j:

  Z_m = ∑_{j=1}^{p} φ_jm X_j

- Dimension reduction: M is much smaller than p


Using Transformed Features

- New features Z_m are linear combinations of the X_j:

  Z_m = ∑_{j=1}^{p} φ_jm X_j

- Fit a linear regression model in the transformed features:

  y_i = θ_0 + ∑_{m=1}^{M} θ_m z_im + ε_i

- Or run plain linear regression, logistic regression, LDA, or anything else on the transformed features
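A minimal sketch of regression on transformed features: given a p × M matrix of loadings φ_jm, form Z = XΦ and fit ordinary least squares on Z. Here Φ is a random placeholder purely to keep the example self-contained; in practice it comes from PCA or PLS, introduced below.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, M = 200, 30, 3
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

Phi = rng.normal(size=(p, M))           # loadings phi_jm (random placeholder)
Xc = X - X.mean(axis=0)                 # mean-center the original features
Z = Xc @ Phi                            # z_im = sum_j phi_jm * x_ij
model = LinearRegression().fit(Z, y)    # estimates theta_0, theta_1, ..., theta_M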


Recovering Coefficients for Original Features

- Prediction using transformed features:

  y_i = θ_0 + ∑_{m=1}^{M} θ_m z_im + ε_i

- New features Z_m are linear combinations of the X_j:

  Z_m = ∑_{j=1}^{p} φ_jm X_j

- Consider the prediction for data point i:

  ∑_{m=1}^{M} θ_m z_im = ∑_{m=1}^{M} θ_m ∑_{j=1}^{p} φ_jm x_ij = ∑_{j=1}^{p} ( ∑_{m=1}^{M} θ_m φ_jm ) x_ij = ∑_{j=1}^{p} β_j x_ij

  so the implied coefficients on the original features are β_j = ∑_{m=1}^{M} θ_m φ_jm.
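A sketch that recovers β = Φθ and checks that predictions from (Z, θ) and (X, β) agree. It repeats the hypothetical setup from the previous sketch so that it runs on its own.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, M = 200, 30, 3
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

Phi = rng.normal(size=(p, M))           # loadings phi_jm (random placeholder)
Xc = X - X.mean(axis=0)
Z = Xc @ Phi
model = LinearRegression().fit(Z, y)

theta = model.coef_                     # theta_1, ..., theta_M
beta = Phi @ theta                      # beta_j = sum_m theta_m * phi_jm
# Same predictions whether we use the transformed or the original features
assert np.allclose(Z @ theta, Xc @ beta)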


Dimension Reduction

1. Reduce dimensions: compute new features Z from X

2. Fit prediction model to compute θ

3. Compute weights for the original features β


How (and Why) Reduce Features?

1. Principal Component Analysis (PCA)

2. Partial least squares

3. Also: many other non-linear dimensionality reduction methods


Principal Component Analysis

- Unsupervised dimensionality reduction method
- Works with the n × p data matrix X alone (no labels)
- Example: two correlated features, pop (population) and ad (ad spending)

[Figure: scatter plot of Ad Spending versus Population showing that the two features are correlated.]
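A hedged sketch of PCA on two correlated features via the SVD of the mean-centered data matrix; the pop/ad values are synthetic stand-ins for the advertising-style data.

import numpy as np

rng = np.random.default_rng(0)
pop = rng.uniform(10, 70, size=100)
ad = 0.5 * pop + rng.normal(scale=3.0, size=100)   # correlated with pop
X = np.column_stack([pop, ad])

Xc = X - X.mean(axis=0)                  # PCA needs mean-centered features
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt.T                               # columns are principal component loadings
Z = Xc @ phi                             # principal component scores Z_1, Z_2
print(np.round(phi[:, 0], 3))            # direction of largest variance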


1st Principal Component

[Figure: scatter of Ad Spending versus Population with the first principal component direction drawn through the data.]

- 1st principal component: the direction with the largest variance,

  Z_1 = 0.839 × (pop − mean(pop)) + 0.544 × (ad − mean(ad))

- Is this a linear combination of the features? Yes, after mean centering.


1st Principal Component

[Figure: left, Ad Spending versus Population with the first principal component shown as a green line; right, the same data plotted in the coordinates of the 1st and 2nd principal components.]

Green line: the 1st principal component minimizes the (perpendicular) distances to all points.

Is this the same as linear regression? No; it is like total least squares, which minimizes perpendicular rather than vertical distances.


2nd Principal Component

[Figure: scatter of Ad Spending versus Population with the principal component directions drawn.]

- 2nd principal component: orthogonal to the 1st component, with the largest remaining variance,

  Z_2 = 0.544 × (pop − mean(pop)) − 0.839 × (ad − mean(ad))


1st Principal Component

[Figure: four panels plotting Population and Ad Spending against the 1st principal component and against the 2nd principal component.]


Properties of PCA

- There are no more principal components than features
- Principal component directions are mutually perpendicular
- Principal component directions are eigenvectors of XᵀX (the eigenvalues give the component variances)
- Assumes approximately normal data; can break with heavy tails
- PCA depends on the scale of the features
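A small numerical check of the eigenvector characterization above: the loadings from the SVD of the centered data matrix match the eigenvectors of XᵀX up to sign. The data are synthetic.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                        # mean-centered features

# Loadings from the SVD of the centered data matrix
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt.T

# Eigenvectors of Xc^T Xc, ordered by decreasing eigenvalue
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
evecs = evecs[:, np.argsort(evals)[::-1]]

for k in range(phi.shape[1]):                  # equal up to a sign flip
    assert np.allclose(np.abs(evecs[:, k]), np.abs(phi[:, k]))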


Principal Component Regression

1. Use PCA to reduce the features to a small number of principal components
2. Fit the regression using the principal components

[Figure: squared bias, test MSE, and variance as functions of the number of components (two panels).]
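A minimal PCR sketch: standardize, apply PCA, then fit least squares on the leading components, with the number of components chosen by cross-validation. The data and the grid of component counts are illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + rng.normal(size=200)

pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("ols", LinearRegression())])
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, 21))},
                      scoring="neg_mean_squared_error", cv=5).fit(X, y)
print(search.best_params_)      # CV-selected number of components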


PCR vs Ridge Regression & Lasso

[Figure: squared bias, test MSE, and variance for PCR as a function of the number of components (left), and for ridge regression and the lasso as a function of the shrinkage factor (right).]

- PCR selects combinations of all of the features (it is not feature selection)
- PCR is closely related to ridge regression


PCR Application

[Figure: on the Credit data, standardized coefficients for Income, Limit, Rating, and Student as a function of the number of components (left), and cross-validation MSE as a function of the number of components (right).]


Standardizing Features

- Regularization and PCR depend on the scales of the features
- Good practice: standardize the features so they have the same variance,

  x̃_ij = x_ij / sqrt( (1/n) ∑_{i=1}^{n} (x_ij − x̄_j)² )

- Do not standardize the features when they are already in the same units
- PCA needs mean-centered features:

  x̃_ij = x_ij − x̄_j
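A short numpy sketch of both transformations on synthetic two-column data; scikit-learn's StandardScaler performs the same standardization (and mean-centers by default).

import numpy as np

# Two features on very different scales (synthetic example)
X = np.random.default_rng(0).normal(loc=5.0, scale=[1.0, 10.0], size=(100, 2))

X_centered = X - X.mean(axis=0)                # mean-centering, for PCA
X_standardized = X_centered / X.std(axis=0)    # unit variance per feature (1/n convention)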


Partial Least Squares

- Supervised version of PCR: the directions are chosen using the response y, not just X

[Figure: Ad Spending versus Population with the first partial least squares direction.]
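A minimal partial least squares sketch with scikit-learn's PLSRegression; n_components is illustrative and would normally be chosen by cross-validation, and the data are synthetic.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = X[:, :5].sum(axis=1) + rng.normal(size=200)

pls = PLSRegression(n_components=3).fit(X, y)   # directions use both X and y
y_hat = pls.predict(X).ravel()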


High-dimensional Data

1. Predict blood pressure from DNA: n = 200, p = 500 000

2. Predicting user behavior online: n = 10 000, p = 200 000


Problem With High Dimensions

- Computational complexity
- Overfitting is a problem

[Figure: two scatter plots of Y versus X with least squares fits, illustrating overfitting when observations are scarce.]


Overfitting with Many Variables

[Figure: R² on the training data, training MSE, and test MSE as functions of the number of variables; the training fit improves while test MSE grows.]


Interpreting Feature Selection

1. Solutions may not be unique

2. Must be careful about how we report solutions

3. Just because one combination of features predicts well does not mean others will not