Machine Learning and Data Mining: Linear Regression
(adapted from) Prof. Alexander Ihler


Page 1: Machine Learning and Data Mining Linear regression

Machine Learning and Data Mining

Linear regression

(adapted from) Prof. Alexander Ihler


Page 2: Machine Learning and Data Mining Linear regression

Supervised learning

• Notation
  – Features x
  – Targets y
  – Predictions ŷ
  – Parameters θ

[Diagram: Training data (examples) provide features and feedback / target values. The program ("Learner"), characterized by some "parameters" θ, is a procedure (using θ) that outputs a prediction. The learning algorithm scores performance (the "cost function") and changes θ to improve performance.]

Page 3: Machine Learning and Data Mining Linear regression

Linear regression

• Define form of function f(x) explicitly
• Find a good f(x) within that family

[Plot: target y versus feature x, with a fitted line]

"Predictor": evaluate the line, ŷ = θ0 + θ1 x, and return the result r

Page 4: Machine Learning and Data Mining Linear regression

More dimensions?

[3-D plots: target y as a function of two features x1 and x2]

Page 5: Machine Learning and Data Mining Linear regression

Notation

Define "feature" x0 = 1 (constant). Then

  ŷ = θ0 x0 + θ1 x1 + … + θn xn = θ · x
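For concreteness, here is a minimal Matlab sketch of this predictor; the data and parameter values below are made up for illustration, not taken from the slides:

(Matlab)
% Minimal sketch (made-up values): add the constant feature x0 = 1 and
% evaluate the linear predictor for every example.
x    = [3 ; 5 ; 8];            % m x 1 raw feature values
X    = [ones(size(x)) x];      % m x 2 feature matrix, first column is x0 = 1
th   = [1.5 0.8];              % parameter row vector [theta0 theta1]
yhat = th * X';                % 1 x m row of predictions, one per example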

Page 6: Machine Learning and Data Mining Linear regression

Measuring error

[Plot: observations, the fitted prediction line, and the error or "residual" between each observation and its prediction]

Page 7: Machine Learning and Data Mining Linear regression

Mean squared error

• How can we quantify the error?

  J(θ) = (1/m) Σⱼ ( y⁽ʲ⁾ - θ · x⁽ʲ⁾ )²    (sum over the m training examples)

• Could choose something else, of course…
  – Computationally convenient (more later)
  – Measures the variance of the residuals
  – Corresponds to the likelihood under a Gaussian model of the "noise"

Page 8: Machine Learning and Data Mining Linear regression

MSE cost function

• Rewrite using matrix form:

  J(θ) = (1/m) (yᵀ - θ Xᵀ)(yᵀ - θ Xᵀ)ᵀ

(Matlab) >> e = y' - th*X';  J = e*e'/m;
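As a quick sanity check, a self-contained snippet (with made-up data and parameter values, my own example) that evaluates this cost the same way:

(Matlab)
% Made-up data: m = 3 examples, n = 2 features (first column is x0 = 1)
X  = [1 2 ; 1 4 ; 1 6];     % m x n feature matrix
y  = [3 ; 5 ; 8];           % m x 1 targets
th = [0.5 1.2];             % 1 x n parameter row vector
m  = size(X,1);
e  = y' - th*X';            % 1 x m residuals
J  = e*e'/m;                % mean squared error (a scalar)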

Page 9: Machine Learning and Data Mining Linear regression

Visualizing the cost function

[Plots: the cost surface J(θ) and its contours as a function of θ0 and θ1]

Page 10: Machine Learning and Data Mining Linear regression

Supervised learning

• Notation
  – Features x
  – Targets y
  – Predictions ŷ
  – Parameters θ

[Diagram: Training data (examples) provide features and feedback / target values. The program ("Learner"), characterized by some "parameters" θ, is a procedure (using θ) that outputs a prediction. The learning algorithm scores performance (the "cost function") and changes θ to improve performance.]

Page 11: Machine Learning and Data Mining Linear regression

Finding good parameters

• Want to find parameters which minimize our error…
• Think of a cost "surface": error residual for that θ…

Page 12: Machine Learning and Data Mining Linear regression

Machine Learning and Data Mining

Linear regression: direct minimization

(adapted from) Prof. Alexander Ihler


Page 13: Machine Learning and Data Mining Linear regression

MSE Minimum

• Consider a simple problem
  – One feature, two data points
  – Two unknowns: θ0, θ1
  – Two equations:  y(1) = θ0 + θ1 x(1)  and  y(2) = θ0 + θ1 x(2)

• Can solve this system directly:  θ = yᵀ (Xᵀ)⁻¹

• However, most of the time, m > n
  – There may be no linear function that hits all the data exactly
  – Instead, solve directly for the minimum of the MSE function

Page 14: Machine Learning and Data Mining Linear regression

SSE Minimum

• Reordering, we have  θ = yᵀ X (Xᵀ X)⁻¹

• X (Xᵀ X)⁻¹ is called the "pseudo-inverse"

• If Xᵀ is square and independent, this is the inverse
• If m > n: overdetermined; gives the minimum-MSE fit

Page 15: Machine Learning and Data Mining Linear regression

Matlab SSE

• This is easy to solve in Matlab…

% y = [y1 ; … ; ym]
% X = [x1_0 … x1_n ; x2_0 … x2_n ; …]   (one row per example; x_0 = 1)

% Solution 1: "manual"
th = y' * X * inv(X' * X);

% Solution 2: "mrdivide"
th = y' / X';    % th*X' = y'  =>  th = y' / X'
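A self-contained sketch (made-up data, my own example) checking that the two solutions agree:

(Matlab)
% Made-up data: a noisy line y ≈ 2 + 3x
x    = (1:5)';
X    = [ones(5,1) x];            % add the constant feature x0 = 1
y    = 2 + 3*x + 0.1*randn(5,1);
th1  = y' * X * inv(X' * X);     % "manual" normal-equation solution
th2  = y' / X';                  % "mrdivide" solution (same answer, better conditioned)
yhat = th2 * X';                 % predictions at the training points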

Page 16: Machine Learning and Data Mining Linear regression

Effects of MSE choice

• Sensitivity to outliers
  – 16² cost for this one datum
  – Heavy penalty for large errors

[Plots: a line fit pulled toward a single outlying point, and the quadratic cost as a function of the error]

Page 17: Machine Learning and Data Mining Linear regression

L1 error

[Plot comparing fits: L2 on the original data, L1 on the original data, and L1 on the data with an outlier]
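To make the outlier comparison concrete, here is a hedged Matlab sketch (made-up data, not from the slides) that fits both an L2 and an L1 cost; fminsearch stands in for whatever solver one prefers:

(Matlab)
% Made-up data on a line, with the last point turned into an outlier
x = (1:10)';  y = 1 + 0.5*x;  y(10) = 20;
X = [ones(10,1) x];
thL2 = y' / X';                         % least-squares (L2 / MSE) fit
mae  = @(th) mean(abs(y' - th*X'));     % L1 (mean absolute error) cost
thL1 = fminsearch(mae, thL2);           % numeric minimization of the L1 cost
% thL1 stays close to the underlying line; thL2 is pulled toward the outlier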

Page 18: Machine Learning and Data Mining Linear regression

Cost functions for regression

  J(θ) = (1/m) Σⱼ ( y⁽ʲ⁾ - θ · x⁽ʲ⁾ )²    (MSE)
  J(θ) = (1/m) Σⱼ | y⁽ʲ⁾ - θ · x⁽ʲ⁾ |    (MAE)
  … something else entirely …             (???)

"Arbitrary" functions can't be solved in closed form…
  – use gradient descent
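Gradient descent on the MSE cost is straightforward; a rough sketch follows (the data, step size, and iteration count are my own choices for illustration, not the slides' code):

(Matlab)
% Gradient descent on J(theta) = mean((y' - th*X').^2)
X     = [ones(5,1) (1:5)'];                 % made-up features, x0 = 1
y     = [2.9 ; 5.1 ; 7.0 ; 9.2 ; 11.1];     % made-up targets
th    = zeros(1, size(X,2));                % start at theta = 0
alpha = 0.01;                               % step size, chosen by hand
for iter = 1:5000
    e  = th*X' - y';                        % 1 x m residuals
    g  = (2/length(y)) * e * X;             % gradient of the MSE
    th = th - alpha * g;                    % descent step
end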

Page 19: Machine Learning and Data Mining Linear regression

Machine Learning and Data Mining

Linear regression: nonlinear features

(adapted from) Prof. Alexander Ihler


Page 20: Machine Learning and Data Mining Linear regression

Nonlinear functions

• What if our hypotheses are not lines?
  – Ex: higher-order polynomials

[Plots: data fit with an order-1 polynomial and with an order-3 polynomial]

Page 21: Machine Learning and Data Mining Linear regression

Nonlinear functions

• Single feature x, predict target y:  ŷ = θ0 + θ1 x + θ2 x² + …
• Sometimes useful to think of a "feature transform" Φ(x)
  – Add features: Φ(x) = [1, x, x², x³, …]
  – Linear regression in the new features: ŷ = θ · Φ(x)
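A small Matlab sketch of this idea (the data and the cubic transform are my own illustration):

(Matlab)
% Fit a cubic by ordinary linear regression on transformed features
x    = linspace(0, 20, 15)';                     % raw scalar feature (made up)
y    = 3 + 0.5*x - 0.02*x.^2 + 0.3*randn(15,1);  % made-up targets
Phi  = [ones(size(x)) x x.^2 x.^3];              % Phi(x) = [1, x, x^2, x^3]
th   = y' / Phi';                                % same least-squares solve as before
yhat = th * Phi';                                % predictions, still linear in theta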

Page 22: Machine Learning and Data Mining Linear regression

Higher-order polynomials

• Fit in the same way
• More "features"

[Plots: the same data fit with order-1, order-2, and order-3 polynomials]

Page 23: Machine Learning and Data Mining Linear regression

Features

• In general, can use any features we think are useful
• Other information about the problem
  – Sq. footage, location, age, …
• Polynomial functions
  – Features [1, x, x², x³, …]
• Other functions
  – 1/x, sqrt(x), x1 * x2, …
• "Linear regression" = linear in the parameters
  – Features we can make as complex as we want!

Page 24: Machine Learning and Data Mining Linear regression

Higher-order polynomials

• Are more features better?
• "Nested" hypotheses
  – 2nd order is more general than 1st,
  – 3rd order is more general than 2nd, …
• Fits the observed data better

Page 25: Machine Learning and Data Mining Linear regression

Overfitting and complexity

• More complex models will always fit the training data better
• But they may "overfit" the training data, learning complex relationships that are not really present

[Plots: the same (x, y) data fit by a complex model and by a simple model]

Page 26: Machine Learning and Data Mining Linear regression

Test data

• After training the model
• Go out and get more data from the world
  – New observations (x, y)
• How well does our model perform?

[Plot: the fitted model, the training data, and the new "test" data]

Page 27: Machine Learning and Data Mining Linear regression

Training versus test error

• Plot MSE as a function of model complexity
  – Polynomial order
• Training error decreases
  – A more complex function fits the training data better
• What about new, "test" data?
  – 0th to 1st order: error decreases (underfitting)
  – Higher order: error increases (overfitting)

[Plot: mean squared error on the training data and on new "test" data versus polynomial order]
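As a rough illustration of how such a curve can be produced (entirely made-up data and my own loop, not the slides' code):

(Matlab)
% Hedged sketch: training vs. test MSE as a function of polynomial order
xtr = linspace(0, 20, 15)';  ytr = 2 + 0.5*xtr + 2*randn(15,1);   % made-up training data
xte = linspace(0, 20, 50)';  yte = 2 + 0.5*xte + 2*randn(50,1);   % made-up "test" data
for k = 1:5
    Ptr = bsxfun(@power, xtr, 0:k);       % features [1, x, ..., x^k]
    Pte = bsxfun(@power, xte, 0:k);
    th  = ytr' / Ptr';                    % least-squares fit on the training data
    mse_tr(k) = mean((ytr' - th*Ptr').^2);
    mse_te(k) = mean((yte' - th*Pte').^2);
end
% plot(1:5, mse_tr, 1:5, mse_te)  % training error keeps falling; test error turns up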