
Dimension Reduction Methods and Bayesian Machine Learning

Marek Petrik

2/28


Previously in Machine Learning

- How to choose the right features if we have (too) many options
- Methods:
  1. Subset selection
  2. Regularization (shrinkage)
  3. Dimensionality reduction (next class)


Best Subset Selection

- Want to find a subset of the p features
- The subset should be small and predict well
- Example: credit ∼ rating + income + student + limit

M_0 ← null model (no features)
for k = 1, 2, ..., p do
    Fit all (p choose k) models that contain exactly k features
    M_k ← best of the (p choose k) models according to a metric (CV error, R², etc.)
end
return the best of M_0, M_1, ..., M_p according to the metric above

Algorithm 1: Best Subset Selection
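For concreteness, here is a minimal Python sketch of the exhaustive search. It is only feasible for small p (the search visits 2^p − 1 subsets), and the synthetic data, the use of linear regression, and cross-validated MSE as the metric are illustrative choices; it also collapses the two stages of Algorithm 1 into a single CV-scored search.

from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_subset_selection(X, y, cv=5):
    # Exhaustive search over all nonempty feature subsets,
    # scored by cross-validated MSE (higher neg-MSE is better).
    n, p = X.shape
    best_score, best_subset = -np.inf, ()
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            score = cross_val_score(LinearRegression(), X[:, cols], y,
                                    cv=cv, scoring="neg_mean_squared_error").mean()
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, -best_score   # chosen features and their CV MSE

# Example with a small p (the search is exponential in p):
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=60)
print(best_subset_selection(X, y))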


Achieving Scalability

- Complexity of Best Subset Selection?
- Examine all possible subsets? How many?
- O(2^p)!
- Heuristic approaches:
  1. Stepwise selection: solve the problem approximately (greedy)
  2. Regularization: solve a different (easier) problem (relaxation)


Which Metric to Use?

(Best Subset Selection, Algorithm 1, repeated for reference: for each k, pick the best of the (p choose k) models with k features, then pick the best of M_0, M_1, ..., M_p.)

1. Direct error estimate: cross-validation; precise but computationally intensive
2. Indirect error estimate: Mallows's Cp,

   Cp = (RSS + 2 d σ̂²) / n,   where σ̂² ≈ Var[ε]

   Also the Akaike information criterion (AIC), BIC, and many others; these have theoretical foundations.
3. Interpretability penalty: what is the cost of extra features?
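A short sketch of computing Mallows's Cp. Estimating σ̂² from the full model containing all p features is a common convention and an assumption here, not something specified on the slide.

import numpy as np

def mallows_cp(rss, d, n, sigma2_hat):
    # C_p = (RSS + 2 d sigma_hat^2) / n, with sigma_hat^2 estimating Var[eps]
    return (rss + 2 * d * sigma2_hat) / n

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

A = np.c_[np.ones(n), X]                      # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
rss_full = np.sum((y - A @ beta) ** 2)
sigma2_hat = rss_full / (n - p - 1)           # noise estimate from the full model
print(mallows_cp(rss_full, d=p, n=n, sigma2_hat=sigma2_hat))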


Regularization

1. Stepwise selection: solve the problem approximately
2. Regularization: solve a different (easier) problem (relaxation)
   - Solve the machine learning problem, but penalize solutions that use “too much” of the features


Regularization

- Ridge regression (parameter λ), ℓ2 penalty:

  min_β RSS(β) + λ ∑_j β_j²  =  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )² + λ ∑_j β_j²

- Lasso (parameter λ), ℓ1 penalty:

  min_β RSS(β) + λ ∑_j |β_j|  =  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )² + λ ∑_j |β_j|

- Both are approximations to the ℓ0 (best subset) solution
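A minimal sketch of the two penalized fits with scikit-learn. The alpha argument plays the role of λ (libraries differ in how they scale the penalty), and the data are synthetic.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]              # only three relevant features
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)            # l2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)            # l1 penalty: sets many coefficients to 0
print(np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))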


Why Lasso Works

- Bias-variance trade-off
- Increasing λ increases bias
- Example: all features relevant

[Figure: mean squared error as a function of λ (left) and of R² on the training data (right); purple: test MSE, black: bias, green: variance; dotted lines: ridge regression.]


Why Lasso Works

- Bias-variance trade-off
- Increasing λ increases bias
- Example: only some features relevant

[Figure: mean squared error as a function of λ (left) and of R² on the training data (right); purple: test MSE, black: bias, green: variance; dotted lines: ridge regression.]



Regularization: Constrained Formulation

- Ridge regression (budget parameter s), ℓ2 constraint:

  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )²   subject to   ∑_j β_j² ≤ s

- Lasso (budget parameter s), ℓ1 constraint:

  min_β ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{p} β_j x_ij )²   subject to   ∑_j |β_j| ≤ s

- Both are approximations to the ℓ0 (best subset) solution
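A sketch of the constrained (budget) form of the lasso, using the standard split of each coefficient into nonnegative positive and negative parts so the ℓ1 constraint becomes linear. The data and the budget s are illustrative; in practice one would use a dedicated solver such as the coordinate-descent lasso shown earlier.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=80)

# Center so the (unpenalized) intercept can be dropped from the problem
Xc, yc = X - X.mean(axis=0), y - y.mean()
p = Xc.shape[1]
s = 4.0                                     # budget on sum_j |beta_j|

def rss(b):
    beta = b[:p] - b[p:]                    # beta_j = b+_j - b-_j, both >= 0
    return np.sum((yc - Xc @ beta) ** 2)

cons = [{"type": "ineq", "fun": lambda b: s - b.sum()}]   # sum(b+ + b-) <= s
res = minimize(rss, x0=np.zeros(2 * p), method="SLSQP",
               bounds=[(0, None)] * (2 * p), constraints=cons)
print(np.round(res.x[:p] - res.x[p:], 2))   # constrained-lasso coefficients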


Lasso Solutions are Sparse

Constrained lasso (left) vs. constrained ridge regression (right)

Constraint regions are blue; red ellipses are contours of the RSS objective


Today

- Dimension reduction methods
  - Principal component regression
  - Partial least squares
- Interpretation in high dimensions
- Bayesian view of ridge regression and lasso


Dimensionality Reduction Methods

- A different approach to model selection
- We have many features: X_1, X_2, ..., X_p
- Transform the features into a smaller number Z_1, ..., Z_M
- Find constants φ_jm so that the new features Z_m are linear combinations of the X_j:

  Z_m = ∑_{j=1}^{p} φ_jm X_j

- Dimension reduction: M is much smaller than p


Using Transformed Features

- New features Z_m are linear combinations of the X_j:

  Z_m = ∑_{j=1}^{p} φ_jm X_j

- Fit a linear regression model in the transformed features:

  y_i = θ_0 + ∑_{m=1}^{M} θ_m z_im + ε_i

- Or run plain linear regression, logistic regression, LDA, or anything else on the transformed features
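A minimal sketch of regression on transformed features: given a p × M matrix of loadings φ_jm, form Z = XΦ and fit ordinary least squares on Z. Here Φ is a random placeholder purely to keep the example self-contained; in practice it comes from PCA or PLS, introduced below.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, M = 200, 30, 3
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

Phi = rng.normal(size=(p, M))           # loadings phi_jm (random placeholder)
Xc = X - X.mean(axis=0)                 # mean-center the original features
Z = Xc @ Phi                            # z_im = sum_j phi_jm * x_ij
model = LinearRegression().fit(Z, y)    # estimates theta_0, theta_1, ..., theta_M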


Recovering Coefficients for Original Features

- Prediction using transformed features:

  y_i = θ_0 + ∑_{m=1}^{M} θ_m z_im + ε_i

- New features Z_m are linear combinations of the X_j:

  Z_m = ∑_{j=1}^{p} φ_jm X_j

- Consider the prediction for data point i:

  ∑_{m=1}^{M} θ_m z_im = ∑_{m=1}^{M} θ_m ∑_{j=1}^{p} φ_jm x_ij = ∑_{j=1}^{p} ( ∑_{m=1}^{M} θ_m φ_jm ) x_ij = ∑_{j=1}^{p} β_j x_ij

  so the implied coefficients on the original features are β_j = ∑_{m=1}^{M} θ_m φ_jm.
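A sketch that recovers β = Φθ and checks that predictions from (Z, θ) and (X, β) agree. It repeats the hypothetical setup from the previous sketch so that it runs on its own.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, M = 200, 30, 3
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

Phi = rng.normal(size=(p, M))           # loadings phi_jm (random placeholder)
Xc = X - X.mean(axis=0)
Z = Xc @ Phi
model = LinearRegression().fit(Z, y)

theta = model.coef_                     # theta_1, ..., theta_M
beta = Phi @ theta                      # beta_j = sum_m theta_m * phi_jm
# Same predictions whether we use the transformed or the original features
assert np.allclose(Z @ theta, Xc @ beta)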


Dimension Reduction

1. Reduce dimensions: compute new features Z from X

2. Fit prediction model to compute θ

3. Compute weights for the original features β


How (and Why) Reduce Features?

1. Principal Component Analysis (PCA)

2. Partial least squares

3. Also: many other non-linear dimensionality reduction methods


Principal Component Analysis

- Unsupervised dimensionality reduction method
- Works with the n × p data matrix X alone (no labels)
- Example: two correlated features, pop (population) and ad (ad spending)

[Figure: scatter plot of Ad Spending versus Population showing that the two features are correlated.]
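A hedged sketch of PCA on two correlated features via the SVD of the mean-centered data matrix; the pop/ad values are synthetic stand-ins for the advertising-style data.

import numpy as np

rng = np.random.default_rng(0)
pop = rng.uniform(10, 70, size=100)
ad = 0.5 * pop + rng.normal(scale=3.0, size=100)   # correlated with pop
X = np.column_stack([pop, ad])

Xc = X - X.mean(axis=0)                  # PCA needs mean-centered features
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt.T                               # columns are principal component loadings
Z = Xc @ phi                             # principal component scores Z_1, Z_2
print(np.round(phi[:, 0], 3))            # direction of largest variance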


1st Principal Component

[Figure: scatter of Ad Spending versus Population with the first principal component direction drawn through the data.]

- 1st principal component: the direction with the largest variance,

  Z_1 = 0.839 × (pop − mean(pop)) + 0.544 × (ad − mean(ad))

- Is this a linear combination of the features? Yes, after mean centering.


1st Principal Component

[Figure: left, Ad Spending versus Population with the first principal component shown as a green line; right, the same data plotted in the coordinates of the 1st and 2nd principal components.]

Green line: the 1st principal component minimizes the (perpendicular) distances to all points.

Is this the same as linear regression? No; it is like total least squares, which minimizes perpendicular rather than vertical distances.


2nd Principal Component

[Figure: scatter of Ad Spending versus Population with the principal component directions drawn.]

- 2nd principal component: orthogonal to the 1st component, with the largest remaining variance,

  Z_2 = 0.544 × (pop − mean(pop)) − 0.839 × (ad − mean(ad))


1st Principal Component

[Figure: four panels plotting Population and Ad Spending against the 1st principal component and against the 2nd principal component.]


Properties of PCA

- There are no more principal components than features
- Principal component directions are mutually perpendicular
- Principal component directions are eigenvectors of XᵀX (the eigenvalues give the component variances)
- Assumes approximately normal data; can break with heavy tails
- PCA depends on the scale of the features
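A small numerical check of the eigenvector characterization above: the loadings from the SVD of the centered data matrix match the eigenvectors of XᵀX up to sign. The data are synthetic.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                        # mean-centered features

# Loadings from the SVD of the centered data matrix
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt.T

# Eigenvectors of Xc^T Xc, ordered by decreasing eigenvalue
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
evecs = evecs[:, np.argsort(evals)[::-1]]

for k in range(phi.shape[1]):                  # equal up to a sign flip
    assert np.allclose(np.abs(evecs[:, k]), np.abs(phi[:, k]))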


Principal Component Regression

1. Use PCA to reduce the features to a small number of principal components
2. Fit the regression using the principal components

[Figure: squared bias, test MSE, and variance as functions of the number of components (two panels).]
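A minimal PCR sketch: standardize, apply PCA, then fit least squares on the leading components, with the number of components chosen by cross-validation. The data and the grid of component counts are illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + rng.normal(size=200)

pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("ols", LinearRegression())])
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, 21))},
                      scoring="neg_mean_squared_error", cv=5).fit(X, y)
print(search.best_params_)      # CV-selected number of components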


PCR vs Ridge Regression & Lasso

[Figure: squared bias, test MSE, and variance for PCR as a function of the number of components (left), and for ridge regression and the lasso as a function of the shrinkage factor (right).]

- PCR selects combinations of all of the features (it is not feature selection)
- PCR is closely related to ridge regression


PCR Application

[Figure: on the Credit data, standardized coefficients for Income, Limit, Rating, and Student as a function of the number of components (left), and cross-validation MSE as a function of the number of components (right).]


Standardizing Features

- Regularization and PCR depend on the scales of the features
- Good practice: standardize the features so they have the same variance,

  x̃_ij = x_ij / sqrt( (1/n) ∑_{i=1}^{n} (x_ij − x̄_j)² )

- Do not standardize the features when they are already in the same units
- PCA needs mean-centered features:

  x̃_ij = x_ij − x̄_j
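A short numpy sketch of both transformations on synthetic two-column data; scikit-learn's StandardScaler performs the same standardization (and mean-centers by default).

import numpy as np

# Two features on very different scales (synthetic example)
X = np.random.default_rng(0).normal(loc=5.0, scale=[1.0, 10.0], size=(100, 2))

X_centered = X - X.mean(axis=0)                # mean-centering, for PCA
X_standardized = X_centered / X.std(axis=0)    # unit variance per feature (1/n convention)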


Partial Least Squares

- Supervised version of PCR: the directions are chosen using the response y, not just X

[Figure: Ad Spending versus Population with the first partial least squares direction.]
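A minimal partial least squares sketch with scikit-learn's PLSRegression; n_components is illustrative and would normally be chosen by cross-validation, and the data are synthetic.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = X[:, :5].sum(axis=1) + rng.normal(size=200)

pls = PLSRegression(n_components=3).fit(X, y)   # directions use both X and y
y_hat = pls.predict(X).ravel()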


High-dimensional Data

1. Predict blood pressure from DNA: n = 200, p = 500 000

2. Predicting user behavior online: n = 10 000, p = 200 000


Problem With High Dimensions

- Computational complexity
- Overfitting is a problem

[Figure: two scatter plots of Y versus X with least squares fits, illustrating overfitting when observations are scarce.]


Overfitting with Many Variables

[Figure: R² on the training data, training MSE, and test MSE as functions of the number of variables; the training fit improves while test MSE grows.]


Interpreting Feature Selection

1. Solutions may not be unique

2. Must be careful about how we report solutions

3. Just because one combination of features predicts well does not mean others will not