Linear Regression (continued)

Professor Ameet Talwalkar · CS260 Machine Learning Algorithms · February 6, 2017


TRANSCRIPT


Outline

1 Administration

2 Review of last lecture

3 Linear regression

4 Nonlinear basis functions


Announcements

HW2 will be returned in section on Friday

HW3 due in class next Monday

Midterm is next Wednesday (will review in more detail next class)


Outline

1 Administration

2 Review of last lecture
  Perceptron
  Linear regression introduction

3 Linear regression

4 Nonlinear basis functions


Perceptron

Main idea

Consider a linear model for binary classification

$w^T x$

We use this model to distinguish between two classes {−1,+1}.

One goal

$\varepsilon = \sum_n \mathbb{I}\left[y_n \neq \mathrm{sign}(w^T x_n)\right]$

i.e., to minimize errors on the training dataset.


Hard, but easy if we have only one training example

How can we change w such that

$y_n = \mathrm{sign}(w^T x_n)$?

Two cases

If $y_n = \mathrm{sign}(w^T x_n)$, do nothing.

If $y_n \neq \mathrm{sign}(w^T x_n)$, $w^{\text{new}} \leftarrow w^{\text{old}} + y_n x_n$

Guaranteed to make progress, i.e., to get us closer to $y_n(w^T x_n) > 0$

What does update do?

$y_n\left[(w + y_n x_n)^T x_n\right] = y_n w^T x_n + y_n^2\, x_n^T x_n$

We are adding a positive number, so it’s possible that $y_n\left((w^{\text{new}})^T x_n\right) > 0$


Perceptron algorithm

Iteratively solving one case at a time

REPEAT

Pick a data point xn (can be a fixed order of the training instances)

Make a prediction $y = \mathrm{sign}(w^T x_n)$ using the current $w$

If y = yn, do nothing. Else,

$w \leftarrow w + y_n x_n$

UNTIL converged.

Properties

This is an online algorithm.

If the training data is linearly separable, the algorithm stops in a finite number of steps (we proved this).

The parameter vector is always a linear combination of training instances (requires initialization of $w^{(0)} = 0$).
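To make the loop concrete, here is a minimal NumPy sketch of the perceptron as described above (an illustration, not the course's reference code; it assumes labels in {-1, +1} and that X already includes a constant-1 column if a bias is wanted):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron sketch: labels y in {-1, +1}, rows of X are examples."""
    w = np.zeros(X.shape[1])            # w(0) = 0, so w stays a linear
    for _ in range(max_epochs):         # combination of training instances
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:    # misclassified (or on the boundary)
                w += y_n * x_n          # the update: w <- w + y_n * x_n
                mistakes += 1
        if mistakes == 0:               # a full error-free pass: converged
            return w
    return w                            # may not converge if not separable
```

If the data is linearly separable, the outer loop stops early, matching the finite-step guarantee above; otherwise max_epochs caps the run.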


Regression

Predicting a continuous outcome variable

Predicting shoe size from height, weight and gender

Predicting song year from audio features

Key difference from classification

We can measure ’closeness’ of prediction and labels
  Predicting shoe size: better to be off by one size than by 5 sizes
  Predicting song year: better to be off by one year than by 20 years

As opposed to 0-1 classification error, we will focus on squared difference, i.e., $(\hat{y} - y)^2$


1D example: predicting the sale price of a house

Sale price ≈ price per sqft × square footage + fixed expense


Minimize squared errors

Our model

Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data

sqft    sale price    prediction    error    squared error
2000    810K          720K          90K      8100
2100    907K          800K          107K     107²
1100    312K          350K          38K      38²
5500    2,600K        2,600K        0        0
···

Total: 8100 + 107² + 38² + 0 + ···

Aim: Adjust price per sqft and fixed expense such that the sum of the squared errors is minimized — i.e., the unexplainable stuff is minimized.
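As a concrete check, a short NumPy sketch that performs this fit on the four rows above (an illustration; np.polyfit returns the least-squares slope and intercept, which play the roles of price per sqft and fixed expense):

```python
import numpy as np

# The four training rows from the table (sqft, sale price in $K).
sqft  = np.array([2000, 2100, 1100, 5500])
price = np.array([810, 907, 312, 2600])

# Least-squares line: price ~ price_per_sqft * sqft + fixed_expense.
price_per_sqft, fixed_expense = np.polyfit(sqft, price, deg=1)
residuals = price - (price_per_sqft * sqft + fixed_expense)
print(price_per_sqft, fixed_expense, np.sum(residuals ** 2))
```

The fitted line minimizes the total squared error on these points; the predictions shown in the table come from a particular parameter setting, not necessarily the optimum.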


Linear regression

Setup

Input: $x \in \mathbb{R}^D$ (covariates, predictors, features, etc.)

Output: $y \in \mathbb{R}$ (responses, targets, outcomes, outputs, etc.)

Model: $f: x \to y$, with $f(x) = w_0 + \sum_d w_d x_d = w_0 + w^T x$

$w = [w_1\ w_2\ \cdots\ w_D]^T$: weights, parameters, or parameter vector

$w_0$ is called the bias

We also sometimes call $w = [w_0\ w_1\ w_2\ \cdots\ w_D]^T$ the parameters too

Training data: $\mathcal{D} = \{(x_n, y_n),\ n = 1, 2, \ldots, N\}$

Least Mean Squares (LMS) Objective: Minimize squared difference on training data (or residual sum of squares)

$\mathrm{RSS}(w) = \sum_n \left[y_n - f(x_n)\right]^2 = \sum_n \left[y_n - \left(w_0 + \sum_d w_d x_{nd}\right)\right]^2$

1D Solution: Identify stationary points by taking derivatives with respect to the parameters and setting them to zero, yielding the ‘normal equations’
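Spelling out the 1D case (a standard derivation sketch, not worked on the slide itself, with $\bar{x}$ and $\bar{y}$ denoting the sample means):

```latex
\frac{\partial \mathrm{RSS}}{\partial w_0} = -2\sum_n \left[y_n - (w_0 + w_1 x_n)\right] = 0,
\qquad
\frac{\partial \mathrm{RSS}}{\partial w_1} = -2\sum_n x_n \left[y_n - (w_0 + w_1 x_n)\right] = 0
\;\Rightarrow\;
w_1 = \frac{\sum_n (x_n - \bar{x})(y_n - \bar{y})}{\sum_n (x_n - \bar{x})^2},
\qquad
w_0 = \bar{y} - w_1 \bar{x}
```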


Probabilistic interpretation

Noisy observation model

$Y = w_0 + w_1 X + \eta$

where $\eta \sim N(0, \sigma^2)$ is a Gaussian random variable

Likelihood of one training sample $(x_n, y_n)$:

$p(y_n \mid x_n) = N(w_0 + w_1 x_n,\ \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left[y_n - (w_0 + w_1 x_n)\right]^2}{2\sigma^2}}$


Maximum likelihood estimation

Maximize over $w_0$ and $w_1$:

$\max \log P(\mathcal{D}) \Leftrightarrow \min \sum_n \left[y_n - (w_0 + w_1 x_n)\right]^2 \quad \leftarrow \text{that is } \mathrm{RSS}(w)!$

Maximize over $s = \sigma^2$:

$\frac{\partial \log P(\mathcal{D})}{\partial s} = -\frac{1}{2}\left\{ -\frac{1}{s^2} \sum_n \left[y_n - (w_0 + w_1 x_n)\right]^2 + N\,\frac{1}{s} \right\} = 0$

$\Rightarrow\ \sigma^{*2} = s^* = \frac{1}{N} \sum_n \left[y_n - (w_0 + w_1 x_n)\right]^2$
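A quick numerical sanity check of both results on synthetic data (an illustration only; np.polyfit returns the RSS-minimizing, hence maximum-likelihood, line):

```python
import numpy as np

rng = np.random.default_rng(0)
w0_true, w1_true, sigma_true = 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=10_000)
y = w0_true + w1_true * x + rng.normal(0.0, sigma_true, size=x.size)

w1_hat, w0_hat = np.polyfit(x, y, deg=1)                 # ML estimates of w1, w0
sigma2_hat = np.mean((y - (w0_hat + w1_hat * x)) ** 2)   # sigma*^2 = RSS / N
print(w0_hat, w1_hat, sigma2_hat)   # close to 1.0, 2.0, 0.25 for large N
```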


How does this probabilistic interpretation help us?

It gives a solid footing to our intuition: minimizing $\mathrm{RSS}(w)$ is a sensible thing to do based on reasonable modeling assumptions

Estimating $\sigma^*$ tells us how much noise there could be in our predictions. For example, it allows us to place confidence intervals around our predictions.


Outline

1 Administration

2 Review of last lecture

3 Linear regression
  Multivariate solution in matrix form
  Computational and numerical optimization
  Ridge regression

4 Nonlinear basis functions


LMS when x is D-dimensional

RSS(w) in matrix form

$\mathrm{RSS}(w) = \sum_n \left[y_n - \left(w_0 + \sum_d w_d x_{nd}\right)\right]^2 = \sum_n \left[y_n - w^T x_n\right]^2$

where we have redefined some variables (by augmenting)

$x \leftarrow [1\ x_1\ x_2\ \ldots\ x_D]^T, \qquad w \leftarrow [w_0\ w_1\ w_2\ \ldots\ w_D]^T$

which leads to

$\mathrm{RSS}(w) = \sum_n (y_n - w^T x_n)(y_n - x_n^T w)$

$= \sum_n \left[ w^T x_n x_n^T w - 2 y_n x_n^T w \right] + \text{const.}$

$= \left\{ w^T \left( \sum_n x_n x_n^T \right) w - 2 \left( \sum_n y_n x_n^T \right) w \right\} + \text{const.}$


Matrix Multiplication via Inner Products

$\begin{bmatrix} 9 & 3 & 5 \\ 4 & 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & -5 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 28 & 18 \\ 11 & 9 \end{bmatrix}$

e.g., for the (1,1) entry: $9 \cdot 1 + 3 \cdot 3 + 5 \cdot 2 = 28$

Each entry of the output matrix is the result of an inner product between a row of the first input matrix and a column of the second.


Matrix Multiplication via Outer Products

$\begin{bmatrix} 9 & 3 & 5 \\ 4 & 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & -5 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 28 & 18 \\ 11 & 9 \end{bmatrix} = \begin{bmatrix} 9 & 18 \\ 4 & 8 \end{bmatrix} + \begin{bmatrix} 9 & -15 \\ 3 & -5 \end{bmatrix} + \begin{bmatrix} 10 & 15 \\ 4 & 6 \end{bmatrix}$

The output matrix is the sum of outer products between corresponding columns of the first input matrix and rows of the second.
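Both views are easy to verify numerically; a minimal NumPy sketch using the example matrices above:

```python
import numpy as np

A = np.array([[9, 3, 5],
              [4, 1, 2]])
B = np.array([[1, 2],
              [3, -5],
              [2, 3]])

# Inner-product view: entry (i, j) is the dot product of row i of A
# with column j of B.
inner = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                  for i in range(A.shape[0])])

# Outer-product view: sum over k of outer(column k of A, row k of B).
outer = sum(np.outer(A[:, k], B[k]) for k in range(A.shape[1]))

print(inner)                                                  # [[28 18], [11 9]]
print(np.allclose(inner, A @ B), np.allclose(outer, A @ B))   # True True
```

The outer-product view is exactly why $\sum_n x_n x_n^T$ from the previous derivation can be written as $X^T X$.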


RSS(w) in new notation

From the previous slide

$\mathrm{RSS}(w) = \left\{ w^T \left( \sum_n x_n x_n^T \right) w - 2 \left( \sum_n y_n x_n^T \right) w \right\} + \text{const.}$

Design matrix and target vector

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \in \mathbb{R}^{N \times (D+1)}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Compact expression

$\mathrm{RSS}(w) = \|Xw - y\|_2^2 = \left\{ w^T X^T X w - 2 (X^T y)^T w \right\} + \text{const}$


Solution in matrix form

Compact expression

$\mathrm{RSS}(w) = \|Xw - y\|_2^2 = \left\{ w^T X^T X w - 2 (X^T y)^T w \right\} + \text{const}$

Gradients of Linear and Quadratic Functions

$\nabla_x\, b^T x = b$

$\nabla_x\, x^T A x = 2Ax$ (symmetric $A$)

Normal equation

$\nabla_w \mathrm{RSS}(w) \propto X^T X w - X^T y = 0$

This leads to the least-mean-square (LMS) solution

$w^{\mathrm{LMS}} = \left(X^T X\right)^{-1} X^T y$
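A minimal NumPy sketch of this closed-form solution (an illustration with synthetic data; solving the linear system $X^T X w = X^T y$ is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

def lms_fit(X_raw, y):
    """Augment with a column of 1s for the bias, then solve the
    normal equations X^T X w = X^T y."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = 1.0 + X_raw @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
print(lms_fit(X_raw, y))   # roughly [1.0, 2.0, -1.0, 0.5]
```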


Mini-Summary

Linear regression is a linear combination of features: $f: x \to y$, with $f(x) = w_0 + \sum_d w_d x_d = w_0 + w^T x$

If we minimize residual sum of squares as our learning objective, we get a closed-form solution for the parameters

Probabilistic interpretation: maximum likelihood, if we assume the residual is Gaussian distributed


Computational complexity

Bottleneck of computing the solution?

$w = \left(X^T X\right)^{-1} X^T y$

Matrix multiply to form $X^T X \in \mathbb{R}^{(D+1) \times (D+1)}$

Inverting the matrix $X^T X$

How many operations do we need?

$O(ND^2)$ for matrix multiplication

$O(D^3)$ (e.g., using Gauss-Jordan elimination) or $O(D^{2.373})$ (recent theoretical advances) for matrix inversion

Impractical for very large D or N


Alternative method: an example of using numerical optimization

(Batch) Gradient descent

Initialize $w$ to $w^{(0)}$ (e.g., randomly); set $t = 0$; choose $\eta > 0$

Loop until convergence:
1. Compute the gradient: $\nabla \mathrm{RSS}(w) = X^T X w^{(t)} - X^T y$
2. Update the parameters: $w^{(t+1)} = w^{(t)} - \eta \nabla \mathrm{RSS}(w)$
3. $t \leftarrow t + 1$

What is the complexity of each iteration?
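A sketch of this loop in NumPy (an illustration; the step size and iteration count are arbitrary choices, not from the slides). Writing the gradient as $X^T(Xw - y)$ uses two matrix-vector products, which is where the per-iteration cost discussed on the next slide comes from:

```python
import numpy as np

def gd_fit(X, y, eta=1e-3, num_iters=1000):
    """Batch gradient descent for RSS; X assumed already augmented."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ w - y)   # = X^T X w - X^T y, the full gradient
        w = w - eta * grad         # step against the gradient
    return w
```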


Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion.

This is because RSS(w) is a convex function in its parameters w

Hessian of RSS

$\mathrm{RSS}(w) = w^T X^T X w - 2 (X^T y)^T w + \text{const}$

$\Rightarrow\ \frac{\partial^2 \mathrm{RSS}(w)}{\partial w\, \partial w^T} = 2 X^T X$

$X^T X$ is positive semidefinite, because for any $v$

$v^T X^T X v = \|Xv\|_2^2 \geq 0$


Stochastic gradient descent

Widrow-Hoff rule: update parameters using one example at a time

Initialize $w$ to some $w^{(0)}$; set $t = 0$; choose $\eta > 0$

Loop until convergence:
1. Randomly choose a training sample $x_t$
2. Compute its contribution to the gradient: $g_t = \left(x_t^T w^{(t)} - y_t\right) x_t$
3. Update the parameters: $w^{(t+1)} = w^{(t)} - \eta g_t$
4. $t \leftarrow t + 1$

How does the complexity per iteration compare with gradient descent?

O(ND) for gradient descent versus O(D) for SGD
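A matching NumPy sketch of the stochastic update (an illustration; the fixed iteration count stands in for a real convergence test):

```python
import numpy as np

def sgd_fit(X, y, eta=1e-3, num_iters=10_000, seed=0):
    """Widrow-Hoff / LMS rule: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        t = rng.integers(len(y))        # randomly choose a training sample
        g = (X[t] @ w - y[t]) * X[t]    # g_t = (x_t^T w - y_t) x_t, O(D) work
        w = w - eta * g
    return w
```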


Mini-summary

Batch gradient descent computes the exact gradient.

Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.

Mini-batch variant: a trade-off between the accuracy of the gradient estimate and the computational cost

Similar ideas extend to other ML optimization problems.
  For large-scale problems, stochastic gradient descent often works well.


What if $X^T X$ is not invertible?

Why might this happen?

Answer 1: N < D. Intuitively, not enough data to estimate all parameters.

Answer 2: Columns of X are not linearly independent, e.g., some features are perfectly correlated. In this case, the solution is not unique.


Ridge regression

Intuition: what does a non-invertible $X^T X$ mean? Consider the SVD of this matrix:

$X^T X = U\, \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r, 0, \ldots, 0)\, U^T$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$ and $r < D$.

Fix the problem by ensuring all singular values are non-zero

$X^T X + \lambda I = U\, \mathrm{diag}(\lambda_1 + \lambda,\ \lambda_2 + \lambda,\ \cdots,\ \lambda)\, U^T$

where $\lambda > 0$ and $I$ is the identity matrix


Regularized least square (ridge regression)

Solution

$w = \left(X^T X + \lambda I\right)^{-1} X^T y$

This is equivalent to adding an extra term to RSS(w)

$\underbrace{\tfrac{1}{2}\left\{ w^T X^T X w - 2 (X^T y)^T w \right\}}_{\mathrm{RSS}(w)} + \underbrace{\tfrac{1}{2}\,\lambda \|w\|_2^2}_{\text{regularization}}$

Benefits

Numerically more stable, invertible matrix

Prevent overfitting — more on this later
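A one-line change to the earlier normal-equation sketch gives the ridge solution (an illustration; whether the bias coordinate is regularized is a convention, and this version regularizes everything for simplicity):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: w = (X^T X + lambda I)^{-1} X^T y."""
    D = X.shape[1]
    # lam > 0 shifts every eigenvalue up by lam, making the matrix invertible.
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```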


How to choose λ?

$\lambda$ is referred to as a hyperparameter

In contrast, $w$ is the parameter vector

Use a validation set or cross-validation to find a good choice of $\lambda$
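A minimal hold-out validation sketch (choose_lambda is a hypothetical helper, reusing ridge_fit from the sketch above and an assumed grid of candidate values):

```python
import numpy as np

def choose_lambda(X_tr, y_tr, X_val, y_val, lams=(0.01, 0.1, 1.0, 10.0)):
    """Pick the lambda whose ridge fit has the lowest validation error."""
    def val_error(lam):
        w = ridge_fit(X_tr, y_tr, lam)             # fit on training data
        return np.mean((X_val @ w - y_val) ** 2)   # score on held-out data
    return min(lams, key=val_error)
```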


Outline

1 Administration

2 Review of last lecture

3 Linear regression

4 Nonlinear basis functions


Is a linear modeling assumption always a good idea?

Example of nonlinear classification

[Figure: a 2D dataset whose two classes cannot be separated by a line]

Example of nonlinear regression

[Figure: data (x, t) following a curved, sine-like trend]


Nonlinear basis for classification

Transform the input/feature

$\phi(x): x \in \mathbb{R}^2 \to z = x_1 \cdot x_2$

Transformed training data: linearly separable!

[Figures: the original 2D data (left) and the transformed data $z = x_1 x_2$ (right), which is linearly separable]


Another example

[Figure: a 2D dataset that is not linearly separable]

How to transform the input/feature?

$\phi(x): x \in \mathbb{R}^2 \to z = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} \in \mathbb{R}^3$

Transformed training data: linearly separable

[Figure: the transformed training data in $\mathbb{R}^3$, now linearly separable]


General nonlinear basis functions

We can use a nonlinear mapping

$\phi(x): x \in \mathbb{R}^D \to z \in \mathbb{R}^M$

where M is the dimensionality of the new feature/input z (or φ(x)).

M could be greater than, less than, or equal to D

With the new features, we can apply our learning techniques to minimize our errors on the transformed training data

linear methods: prediction is based on $w^T \phi(x)$

other methods: nearest neighbors, decision trees, etc


Regression with nonlinear basis

Residual sum of squares

$\sum_n \left[ w^T \phi(x_n) - y_n \right]^2$

where $w \in \mathbb{R}^M$ has the same dimensionality as the transformed features $\phi(x)$.

The LMS solution can be formulated with the new design matrix

$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix} \in \mathbb{R}^{N \times M}, \qquad w^{\mathrm{LMS}} = \left(\Phi^T \Phi\right)^{-1} \Phi^T y$


Example with regression

Polynomial basis functions

$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^M \end{bmatrix} \Rightarrow f(x) = w_0 + \sum_{m=1}^{M} w_m x^m$

Fitting samples from a sine function: underfitting, as f(x) is too simple

[Figures: fits with M = 0 and M = 1; both are too simple to capture the sine curve]
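A compact sketch of this polynomial fit on synthetic sine data (an illustration; np.vander builds $\Phi$, and np.linalg.lstsq is a numerically safer stand-in for the closed-form $(\Phi^T \Phi)^{-1} \Phi^T y$):

```python
import numpy as np

def poly_features(x, M):
    """Design matrix Phi: row n is [1, x_n, x_n^2, ..., x_n^M]."""
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)                            # 10 training inputs
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)     # noisy sine targets

for M in (0, 1, 3, 9):
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # least-squares fit
    print(M, np.sum((Phi @ w - t) ** 2))           # training RSS shrinks with M
```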


Adding high-order terms

M = 3

[Figure: the degree-3 polynomial fit follows the sine-shaped data closely]

M = 9: overfitting

[Figure: the degree-9 polynomial passes through every training point but oscillates wildly between them]

More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!


Overfitting

Parameters for higher-order polynomials are very large

      M = 0    M = 1    M = 3      M = 9
w0    0.19     0.82     0.31       0.35
w1             -1.27    7.99       232.37
w2                      -25.43     -5321.83
w3                      17.37      48568.31
w4                                 -231639.30
w5                                 640042.26
w6                                 -1061800.52
w7                                 1042400.18
w8                                 -557682.99
w9                                 125201.43


Overfitting can be quite disastrous

Fitting the housing price data with large M

Predicted price goes to zero (and is ultimately negative) if you buy a big enough house!

This is called poor generalization/overfitting.


Detecting overfitting

Plot model complexity versus objective function

X axis: model complexity, e.g., M

Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

[Figure: $E_{\mathrm{RMS}}$ versus $M$ (ticks at 0, 3, 6, 9), with Training and Test curves]

As a model increases in complexity:

Training error keeps improving

Test error may first improve but eventually will deteriorate
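Continuing the polynomial sketch from the nonlinear-basis section (same rng, x, t, and poly_features, all assumptions of that illustration), the training/test comparison can be produced directly:

```python
import numpy as np

# Fresh test data from the same noisy sine process.
x_test = rng.uniform(0.0, 1.0, size=100)
t_test = np.sin(2 * np.pi * x_test) + 0.1 * rng.normal(size=x_test.size)

for M in (0, 3, 6, 9):
    w, *_ = np.linalg.lstsq(poly_features(x, M), t, rcond=None)
    def rms(xs, ts):
        return np.sqrt(np.mean((poly_features(xs, M) @ w - ts) ** 2))
    # Training RMS keeps falling; test RMS improves, then deteriorates.
    print(M, rms(x, t), rms(x_test, t_test))
```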
