
Page 1: Final Overview - Introduction to ML

Marek Petrik

4/25/2017

Page 2: This Course: Introduction to Machine Learning

- Build a foundation for practice and research in ML
- Basic machine learning concepts: maximum likelihood, cross-validation
- Fundamental machine learning techniques: regression, model selection, deep learning
- Educational goals:
  1. How to apply basic methods
  2. Reveal what happens inside
  3. What are the pitfalls
  4. Expand understanding of linear algebra, statistics, and optimization

Page 3: What is Machine Learning

- Discover an unknown function f:

  Y = f(X)

- X = set of features, or inputs
- Y = target, or response

[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]

  Sales = f(TV, Radio, Newspaper)

Page 4: Statistical View of Machine Learning

- Probability space Ω: the set of all adults
- Random variable X(ω) ∈ ℝ: years of education
- Random variable Y(ω) ∈ ℝ: salary

[Figure: Income versus Years of Education, raw data (left) and a fitted curve (right)]

Page 5: How Good are Predictions?

- Learned function f
- Test data: (x1, y1), (x2, y2), ...
- Mean Squared Error (MSE):

  MSE = (1/n) ∑_{i=1}^n (yi − f(xi))²

- This is an estimate of:

  MSE = E[(Y − f(X))²] = (1/|Ω|) ∑_{ω∈Ω} (Y(ω) − f(X(ω)))²

- Important: the samples xi are i.i.d.
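A minimal Python sketch (not from the slides) of estimating the test MSE from held-out data; the predictor f_hat and the toy data are made up for illustration:

```python
import numpy as np

def test_mse(f_hat, X_test, y_test):
    """Average squared error of f_hat over the i.i.d. test samples."""
    preds = np.array([f_hat(x) for x in X_test])
    return float(np.mean((np.asarray(y_test) - preds) ** 2))

f_hat = lambda x: 2.0 * x + 1.0          # hypothetical learned function
X_test = np.array([0.0, 1.0, 2.0])       # toy held-out inputs
y_test = np.array([1.1, 2.9, 5.2])       # toy held-out targets
print(test_mse(f_hat, X_test, y_test))
```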

Page 6: KNN: K-Nearest Neighbors

- Idea: use similar training points when making predictions

[Figure: 2-D scatter of training points with a KNN decision boundary]

- Non-parametric method (unlike regression)
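A minimal k-nearest-neighbors regression sketch, assuming Euclidean distance and numpy arrays; the prediction is the average target of the k closest training points:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return y_train[nearest].mean()                # average their targets
```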

Page 7: Bias-Variance Decomposition

  Y = f(X) + ε

The mean squared error of a fitted model f̂ can be decomposed as:

  MSE = E[(Y − f̂(X))²] = Var(f̂(X)) [variance] + (Bias f̂(X))² [squared bias] + Var(ε) [irreducible error]

- Bias: how well the method would work with infinite data
- Variance: how much the output changes with different data sets

Page 8: R² Statistic

  R² = 1 − RSS/TSS = 1 − (∑_{i=1}^n (yi − ŷi)²) / (∑_{i=1}^n (yi − ȳ)²)

- RSS: residual sum of squares; TSS: total sum of squares
- R² measures the goodness of the fit as a proportion
- Proportion of the data variance explained by the model
- Extreme values:
  - 0: the model does not explain the data
  - 1: the model explains the data perfectly

Page 9: Correlation Coefficient

- Measures dependence between two random variables X and Y:

  r = Cov(X, Y) / (√Var(X) · √Var(Y))

- The correlation coefficient r lies in [−1, 1]:
  - 0: variables are not related
  - 1: variables are perfectly related (move together)
  - −1: variables are perfectly negatively related (move in opposite directions)
- In simple linear regression, R² = r²
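A quick numerical check (illustrative data only) that the correlation coefficient r and the R² of a simple one-feature least-squares fit satisfy R² = r²:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)

r = np.corrcoef(x, y)[0, 1]                       # sample correlation coefficient

slope, intercept = np.polyfit(x, y, deg=1)        # simple linear regression fit
y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # 1 - RSS/TSS

print(r ** 2, r2)   # the two values agree up to floating-point error
```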


Page 11: Qualitative Features: Many Values, the Right Way

- Predict salary as a function of state
- Feature statei ∈ {MA, NH, ME}
- Introduce 2 indicator variables xi, zi:

  xi = 1 if statei = MA, 0 otherwise        zi = 1 if statei = NH, 0 otherwise

- Predict salary as:

  salary = β0 + β1·xi + β2·zi = β0 + β1  if statei = MA
                                β0 + β2  if statei = NH
                                β0       if statei = ME

- Do we need an indicator variable for ME? Why not? (hint: linear independence)
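A minimal sketch of the indicator (dummy-variable) encoding above; the state values and the coefficients beta0, beta1, beta2 are hypothetical:

```python
import numpy as np

states = ["MA", "NH", "ME", "MA", "ME"]
x = np.array([1.0 if s == "MA" else 0.0 for s in states])   # indicator for MA
z = np.array([1.0 if s == "NH" else 0.0 for s in states])   # indicator for NH
# ME is the baseline level: both indicators are 0, so its prediction is just beta0.

beta0, beta1, beta2 = 50.0, 5.0, 3.0     # hypothetical fitted coefficients
salary = beta0 + beta1 * x + beta2 * z
print(salary)                            # [55. 53. 50. 55. 50.]
```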


Page 14: Outlier Data Points

- A data point that is far away from the others
- Causes: measurement failure, sensor failure, missing data
- Can seriously influence prediction quality

[Figure: regression fit with outlier 20; residuals and studentized residuals plotted against fitted values]

Page 15: Points with High Leverage

- Points with an unusual value of xi
- A single data point can have a significant impact on the prediction
- R and other packages can compute leverages of data points

[Figure: high-leverage point 41 and outlier 20; studentized residuals plotted against leverage]

- It is good to remove points with both high leverage and a high residual

Page 16: Best Subset Selection

- Want to find a subset of the p features
- The subset should be small and predict well
- Example: credit ∼ rating + income + student + limit

  M0 ← null model (no features)
  for k = 1, 2, ..., p do
      Fit all (p choose k) models that contain exactly k features
      Mk ← best of these (p choose k) models according to a metric (CV error, R², etc.)
  end
  return the best of M0, M1, ..., Mp according to the metric above

  Algorithm 1: Best Subset Selection
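A minimal Python sketch of the inner loop of Algorithm 1, assuming plain least squares and training RSS as the per-size metric; the comparison across sizes would use CV error or a penalized criterion, as the slide notes:

```python
from itertools import combinations
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.c_[np.ones(len(y)), X]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset_per_size(X, y):
    """M_k for each k: the feature subset of size k with the smallest training RSS."""
    p = X.shape[1]
    models = {}
    for k in range(1, p + 1):
        models[k] = min(combinations(range(p), k), key=lambda s: rss(X[:, list(s)], y))
    # Choosing among M_1, ..., M_p should use CV error or a penalized metric,
    # since training RSS always favors the largest model.
    return models
```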

Page 17: Regularization

- Ridge regression (parameter λ), ℓ2 penalty:

  min_β RSS(β) + λ ∑_j βj²  =  min_β ∑_{i=1}^n (yi − β0 − ∑_{j=1}^p βj xij)² + λ ∑_j βj²

- Lasso (parameter λ), ℓ1 penalty:

  min_β RSS(β) + λ ∑_j |βj|  =  min_β ∑_{i=1}^n (yi − β0 − ∑_{j=1}^p βj xij)² + λ ∑_j |βj|

- Both are approximations to the ℓ0 (best-subset) solution
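A minimal sketch using scikit-learn's Ridge and Lasso (an assumption on my part that sklearn is available; its alpha parameter plays the role of λ, and the toy data is made up):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)    # ℓ2 penalty: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)    # ℓ1 penalty: drives some coefficients exactly to 0
print(ridge.coef_)
print(lasso.coef_)
```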

Page 18: Logistic Regression

- Predict the probability of a class: p(X)
- Example: p(balance) = probability of default for a person with a given balance
- Linear regression:

  p(X) = β0 + β1 X

- Logistic regression:

  p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

- Which is the same as:

  log( p(X) / (1 − p(X)) ) = β0 + β1 X

- Odds: p(X) / (1 − p(X))
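A minimal sketch of fitting this model with scikit-learn; the "balance" data and the true coefficients used to simulate it are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=500)
p_true = 1 / (1 + np.exp(-(-6.0 + 0.004 * balance)))          # hypothetical true p(balance)
default = (rng.uniform(size=500) < p_true).astype(int)        # simulated default labels

model = LogisticRegression().fit(balance.reshape(-1, 1), default)
print(model.intercept_, model.coef_)                          # estimates of β0, β1
print(model.predict_proba([[2000.0]])[0, 1])                  # estimated p(balance = 2000)
```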

Page 19: Logistic Function

  y = e^x / (1 + e^x)

[Figure: the logistic (sigmoid) curve, rising from 0 to 1 as x ranges over [−10, 10]]

  p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

Page 20: Logit Function

  log( p(X) / (1 − p(X)) )

[Figure: the logit curve as a function of p(x) ∈ (0, 1)]

  log( p(X) / (1 − p(X)) ) = β0 + β1 X

Page 21: Estimating Coefficients: Maximum Likelihood

- Likelihood: the probability that the data is generated from a model

  ℓ(model) = Pr[data | model]

- Find the most likely model:

  max_model ℓ(model) = max_model Pr[data | model]

- The likelihood function is difficult to maximize directly
- Transform it using log (strictly increasing):

  max_model log ℓ(model)

- A strictly increasing transformation does not change where the maximum is attained
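A tiny numerical illustration (assumption: a Bernoulli model with unknown p and made-up coin-flip data) that maximizing the likelihood and the log-likelihood picks the same parameter:

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # observed coin flips
p_grid = np.linspace(0.01, 0.99, 99)

likelihood = np.array([np.prod(p ** data * (1 - p) ** (1 - data)) for p in p_grid])
log_likelihood = np.array([np.sum(data * np.log(p) + (1 - data) * np.log(1 - p)) for p in p_grid])

# Both maximizers land on the same grid point (near the sample mean 6/8).
print(p_grid[likelihood.argmax()], p_grid[log_likelihood.argmax()])
```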

Page 22: Discriminative vs. Generative Models

- Discriminative models:
  - Estimate the conditional model Pr[Y | X]
  - Linear regression
  - Logistic regression
- Generative models:
  - Estimate the joint probability Pr[Y, X] = Pr[Y | X] Pr[X]
  - Estimate not only the probability of the labels but also of the features
  - Once the model is fit, it can be used to generate data
  - LDA, QDA, Naive Bayes

Page 23: LDA: Linear Discriminant Analysis

- Generative model: capture the probability of the predictors for each label

[Figure: two class-conditional normal densities and the resulting decision boundary]

- Predict:
  1. Pr[balance | default = yes] and Pr[default = yes]
  2. Pr[balance | default = no] and Pr[default = no]
- Classes are assumed normal: Pr[balance | default = yes] is a Gaussian density


Page 27: QDA: Quadratic Discriminant Analysis

- Generalizes LDA
- LDA: the class covariance matrices are equal, Σk = Σ
- QDA: the class covariance matrices Σk can differ
- Which of LDA and QDA has the smaller training error on the same data?
- What about the test error?


Page 30: QDA: Quadratic Discriminant Analysis

[Figure: LDA and QDA decision boundaries on two simulated data sets in (X1, X2)]

Page 31: Naive Bayes

- Simple Bayes-net classification
- With normal distributions over the features X1, ..., Xk: a special case of QDA with diagonal Σ
- Generalizes to non-normal distributions and discrete variables
- More on it later ...

Page 32: Maximum Margin Hyperplane

[Figure: two linearly separable classes in (X1, X2) with the maximum-margin separating hyperplane]

Page 33: Introducing Slack Variables

- Maximum margin classifier:

  max_{β,M}  M
  s.t.  yi (βᵀxi) ≥ M
        ‖β‖2 = 1

- Support vector classifier, a.k.a. linear SVM:

  max_{β,M,ε≥0}  M
  s.t.  yi (βᵀxi) ≥ M − εi
        ‖β‖2 = 1
        ‖ε‖1 ≤ C

- Slack variables: ε
- Parameter: C. What if C = 0?


Page 35: Kernelized SVM

- Dual quadratic program (usually stated as a max-min; not here):

  max_{α≥0}  ∑_{l=1}^M αl − (1/2) ∑_{j,k=1}^M αj αk yj yk k(xj, xk)
  s.t.  ∑_{l=1}^M αl yl = 0

- Representer theorem (classification test):

  f(z) = ∑_{l=1}^M αl yl k(z, xl) > 0

Page 36: Kernels

- Polynomial kernel:

  k(x1, x2) = (1 + x1ᵀ x2)^d

- Radial kernel:

  k(x1, x2) = exp(−γ ‖x1 − x2‖2²)

- Many, many more
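A minimal sketch (assuming scikit-learn and made-up data with a circular class boundary) of SVMs with the two kernels above; note that sklearn's C penalizes slack, which is the inverse role of the slack budget C on the earlier slide:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linear (circular) boundary

poly_svm = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)   # polynomial kernel, d = 3
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)    # radial kernel with parameter γ
print(poly_svm.score(X, y), rbf_svm.score(X, y))           # training accuracies
```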

Page 37: Polynomial and Radial Kernels

[Figure: SVM decision boundaries with a polynomial kernel (left) and a radial kernel (right) in (X1, X2)]

Page 38: Regression Trees

- Predict baseball Salary based on Years played and Hits
- Example:

[Figure: regression tree with a first split on Years < 4.5 and a second split on Hits < 117.5; leaf predictions 5.11, 6.00, and 6.74]

Page 39: CART: Recursive Binary Splitting

- Greedy top-to-bottom approach
- Recursively divide regions to minimize RSS:

  ∑_{i: xi ∈ R1} (yi − ŷ_R1)² + ∑_{i: xi ∈ R2} (yi − ŷ_R2)²

- Prune the tree afterwards
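A minimal sketch of one greedy split on a single feature, scoring candidate thresholds by the two-region RSS above (illustrative only; a full CART recurses on both regions):

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on x that minimizes the two-region RSS."""
    best_t, best_rss = None, np.inf
    for t in np.unique(x):
        left, right = y[x < t], y[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue                                       # skip degenerate splits
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t, best_rss
```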

Page 40: Trees vs. KNN

- Trees do not require a distance metric
- Trees work well with categorical predictors
- Trees work well in high dimensions
- KNN is better in low-dimensional problems with complex decision boundaries

Page 41: Bagging

- Stands for "Bootstrap Aggregating"
- Construct multiple bootstrapped training sets: T1, T2, ..., TB
- Fit a tree to each one: f1, f2, ..., fB
- Make predictions by averaging the individual tree predictions:

  f(x) = (1/B) ∑_{b=1}^B fb(x)

- Large values of B are not likely to overfit; B ≈ 100 is a good choice
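A minimal bagging sketch, assuming scikit-learn decision trees as the base learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample (with replacement)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)  # average the B tree predictions
```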

Page 42: Random Forests

- Many trees in bagging will be similar
- The algorithm tends to choose the same features to split on
- Random forests help to address this similarity:
  - At each split, choose only from m randomly sampled features
- A good empirical choice is m = √p

[Figure: test classification error versus number of trees for m = p, m = p/2, and m = √p]
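A minimal sketch (assuming scikit-learn) contrasting the bagging setting m = p with the random-forest default m = √p via the max_features parameter:

```python
from sklearn.ensemble import RandomForestClassifier

bagging_like = RandomForestClassifier(n_estimators=300, max_features=None)    # consider all p features at each split
random_forest = RandomForestClassifier(n_estimators=300, max_features="sqrt") # consider √p features at each split
# Fit either with .fit(X, y) on any classification data and compare test error.
```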

Page 43: Gradient Boosting (Regression)

- Boosting uses all of the data, not a random subset (usually)
- It also builds trees f1, f2, ...
- ... and weights λ1, λ2, ...
- Combined prediction:

  f(x) = ∑_i λi fi(x)

- Assume we already have trees and weights 1, ..., m; what is the next best tree?

Page 44: Gradient Boosting (Regression)

- Just use gradient descent
- The objective is to minimize (1/2 of) the RSS:

  (1/2) ∑_{i=1}^n (yi − f(xi))²

- Objective with the new tree m + 1:

  (1/2) ∑_{i=1}^n ( yi − ∑_{j=1}^m fj(xi) − f_{m+1}(xi) )²

- Greatest reduction in RSS: fit the new tree to the residuals (the negative gradient of the objective):

  f_{m+1}(xi) ≈ yi − ∑_{j=1}^m fj(xi)
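A minimal gradient-boosting sketch for regression, assuming shallow scikit-learn trees and a constant learning rate lam in place of the weights λi:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, lam=0.1, max_depth=2):
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        residual = y - pred                               # negative gradient of (1/2) * RSS
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred += lam * tree.predict(X)                     # add the new tree with weight lam
    return trees
# Prediction for new data: sum(lam * t.predict(X_new) for t in trees)
```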

Page 45: ROC Curve

Confusion matrix:

                         Reality
                         Positive          Negative
  Predicted   Positive   True Positive     False Positive
              Negative   False Negative    True Negative

[Figure: ROC curve, true positive rate versus false positive rate, both from 0.0 to 1.0]

Page 46: Area Under the ROC Curve

[Figure: ROC curve, true positive rate versus false positive rate]

- A larger area is better
- There are many other ways to measure classifier performance, such as the F1 score
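A minimal sketch of computing an ROC curve and its area with scikit-learn's metrics; the labels and classifier scores below are made up:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points on the ROC curve
print(roc_auc_score(y_true, scores))               # area under the ROC curve
```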

Page 47: Evaluation Method 1: Validation Set

- Just evaluate how well the method works on a held-out set
- Randomly split the data into:
  1. Training set: about half of all data
  2. Validation set (a.k.a. hold-out set): the remaining half

[Figure: the n observations split into a training half and a validation half]

- Choose the number of features / the representation by minimizing the error on the validation set


Page 49: Evaluation Method 2: Leave-One-Out

- Addresses problems with the validation-set approach
- Split the data set into 2 parts:
  1. Training: size n − 1
  2. Validation: size 1
- Repeat n times to get n learning problems

[Figure: n splits of the data, each leaving out a single observation for validation]

Page 50: Evaluation Method 3: k-fold Cross-Validation

- A hybrid between the validation set and leave-one-out
- Split the training set into k subsets:
  1. Training set: n − n/k observations
  2. Test set: n/k observations
- k learning problems

[Figure: k splits of the data, each using one fold for validation and the rest for training]

- Cross-validation error:

  CV(k) = (1/k) ∑_{i=1}^k MSEi
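A minimal k-fold cross-validation sketch, assuming scikit-learn's KFold for the splits and a plain least-squares fit (X is an (n, p) numpy array); any learner could be plugged in:

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_mse(X, y, k=5):
    fold_mse = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        A = np.c_[np.ones(len(train_idx)), X[train_idx]]          # fit on the training fold
        beta, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        pred = np.c_[np.ones(len(test_idx)), X[test_idx]] @ beta  # predict on the held-out fold
        fold_mse.append(np.mean((y[test_idx] - pred) ** 2))
    return np.mean(fold_mse)            # CV(k): average of the k fold MSEs
```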

Page 51: Bootstrap

- Goal: understand the confidence in the learned parameters
- Most useful in inference
- How confident are we in the learned values of β?

  mpg = β0 + β1 · power

- Approach: run the learning algorithm multiple times with different data sets:
  - Create each new data set by sampling with replacement from the original one
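A minimal bootstrap sketch, assuming the simple linear model mpg ~ power fit by least squares; the spread of the resampled slope estimates gives a bootstrap standard error:

```python
import numpy as np

def bootstrap_slope_se(x, y, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    slopes, n = [], len(x)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # resample with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        slopes.append(slope)
    return np.std(slopes)                             # bootstrap estimate of SE(β1)
```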


Page 54: Principal Component Analysis

[Figure: Ad Spending versus Population with the first principal component direction overlaid]

- 1st principal component: the direction with the largest variance

  Z1 = 0.839 × (pop − mean(pop)) + 0.544 × (ad − mean(ad))

- Is this linear? Yes, after mean centering.
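A minimal PCA sketch using only numpy: the first principal component is the top right singular vector of the mean-centered data matrix (on the advertising data its loadings would be roughly the 0.839 and 0.544 above):

```python
import numpy as np

def first_principal_component(X):
    Xc = X - X.mean(axis=0)                     # mean centering makes the projection linear
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]                                # loading vector of the 1st principal component

# Scores (projections onto the 1st PC): Z1 = (X - X.mean(axis=0)) @ first_principal_component(X)
```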


Page 57: K-Means Algorithm

Heuristic solution to the minimization problem:

1. Randomly assign cluster numbers to the observations
2. Iterate while the clusters change:
   2.1 For each cluster, compute the centroid
   2.2 Assign each observation to the cluster with the closest centroid

Note that:

  (1/|Ck|) ∑_{i,i'∈Ck} ∑_{j=1}^p (xij − xi'j)² = 2 ∑_{i∈Ck} ∑_{j=1}^p (xij − x̄kj)²
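A minimal k-means sketch following the two steps above (assumptions: numpy only, a fixed number of iterations instead of a convergence check, and no handling of empty clusters):

```python
import numpy as np

def kmeans(X, K=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))          # step 1: random assignment
    for _ in range(iters):
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])   # step 2.1
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                 # step 2.2: closest centroid
    return labels, centroids
```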

Page 58: K-Means Illustration

[Figure: panels showing the data, the random initial assignment (Step 1), alternating Steps 2a and 2b over iterations, and the final clustering]

Page 59: Dendrogram: Similarity Tree

[Figure: dendrograms (heights 0 to 10) produced by hierarchical clustering]

Page 60: What Next?

- This course: how to use machine learning. More tools to explore:
  - Gaussian processes
  - Time series models
  - Domain-specific models (e.g., natural language processing)
- Doing ML research:
  - Musts: linear algebra, statistics, convex optimization
  - Important: Probably Approximately Correct (PAC) learning
