
Supervised Learning
Dan Lizotte
2017-03-05

Relationships between variables

Random vectors or vector-valued random variables.
Variables that occur together in some meaningful sense.

Joint distribution

library(knitr); kable(head(faithful, 10))

eruptions  waiting
    3.600       79
    1.800       54
    3.333       74
    2.283       62
    4.533       85
    2.883       55
    4.700       88
    3.600       85
    1.950       51
    4.350       85

Correlation (JWHT 2.3, 3.1.3)

Pearson Correlation

ρ_{X,Y} = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

Pearson Correlation: “Plugin” Estimate

r_{x,y} = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) / ( √(∑_{i=1}^n (x_i − x̄)²) · √(∑_{i=1}^n (y_i − ȳ)²) )


Sample Correlation

##           eruptions   waiting
## eruptions 1.0000000 0.9008112
## waiting   0.9008112 1.0000000
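The matrix above is the sample correlation of the faithful data; a minimal sketch of how it can be computed, assuming the output came from cor() (the generating code is not shown on the slide):

# Plug-in estimate computed by hand, then the built-in equivalent.
x <- faithful$eruptions; y <- faithful$waiting
r <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
r              # 0.9008112
cor(faithful)  # the full correlation matrix shown above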

Correlation Gotchas


[Figure: grid of scatterplots labelled with their correlations — top row 1, 0.8, 0.4, 0, −0.4, −0.8, −1; middle row all ±1 regardless of slope; bottom row all 0 despite strong nonlinear structure.]

Joint distribution - Density

Marginal distributions - Densities


Marginal distributions - Rug Plot

Conditional distributions

[Figures: a sequence of conditional distributions of waiting time given eruption time.]

Independence

Two random variables X and Y that are part of a random vector are independent iff:

F_{X,Y}(x, y) = F_X(x) F_Y(y)

Consequences:

Pr(X = x|Y = y) = Pr(X = x)

Pr(Y = y|X = x) = Pr(Y = y)

Any conditional distribution of X is the same as the marginal distribution of X. Knowing about Y provides no information about X.

Independence vs. Correlation


##          x        y
## x 1.000000 0.704671
## y 0.704671 1.000000

Independence vs. Correlation


##              x            y
## x  1.000000000 -0.008239221
## y -0.008239221  1.000000000

Independence vs. Correlation

##              x            y
## x 1.0000000000 0.0005942638
## y 0.0005942638 1.0000000000

Independence vs. Correlation


##           x         y
## x 1.0000000 0.0160426
## y 0.0160426 1.0000000
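The figures for these slides are not preserved; a minimal sketch of the kind of data they illustrate — strongly dependent yet nearly uncorrelated — assuming a quadratic relationship (the slides' actual simulations are not shown):

set.seed(0)
x <- runif(10000, -1, 1)
y <- x^2 + rnorm(10000, sd = 0.05)  # y is almost a deterministic function of x
cor(x, y)                           # near 0: no *linear* association
# ...yet knowing x tells us nearly everything about y, so x and y are far from independent.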

Predicting Waiting Time


## Mean: 70.90

Conditional predictions

If I know eruption time, can I do better?


## Mean: 55.60

Conditional predictions

If I know eruption time, can I do better?

## Mean: 81.33
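These predictions are conditional means; a sketch of the computation, with the caveat that the slides' exact conditioning ranges for eruption time are not shown (the 3-minute split below is an assumption):

mean(faithful$waiting)                           # 70.90, the unconditional prediction
mean(faithful$waiting[faithful$eruptions < 3])   # conditional mean, short eruptions (mid-50s)
mean(faithful$waiting[faithful$eruptions >= 3])  # conditional mean, long eruptions (around 80)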

Conditional predictions?

If I know eruption time, can I do better?


Supervised Learning Framework (HTF 2, JWHT 2)

Training experience: a set of labeled examples of the form

⟨x_1, x_2, …, x_p, y⟩,

where the x_j are feature values and y is the output.

Task: Given a new x_1, x_2, …, x_p, predict y.

What to learn: A function f : 𝒳_1 × 𝒳_2 × ⋯ × 𝒳_p → 𝒴, which maps the features into the output domain.

Goal: Make accurate future predictions (on unseen data).
Plan: Learn to make accurate predictions on the training data.

Wisconsin Breast Cancer Prognostic Data

(http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Prognostic))

Cell samples were taken from tumors in breast cancer patients before surgery and imaged; tumors were excised; patients were followed to determine whether or not the cancer recurred, and how long until recurrence (or how long they remained disease-free).


Wisconsin data (continued)

198 instances, 32 features for prediction
Outcome (R = recurrence, N = non-recurrence)
Time (until recurrence, for R; time healthy, for N)

Radius.Mean  Texture.Mean  Perimeter.Mean  …  Outcome  Time
      18.02         27.60          117.50  …        N    31
      17.99         10.38          122.80  …        N    61
      21.37         17.44          137.50  …        N   116
      11.42         20.38           77.58  …        N   123
      20.29         14.34          135.10  …        R    27
      12.75         15.29           84.60  …        R    77
          …             …               …  …        …     …

Terminology

Radius.Mean  Texture.Mean  Perimeter.Mean  …  Outcome  Time
      18.02         27.60          117.50  …        N    31
      17.99         10.38          122.80  …        N    61
      21.37         17.44          137.50  …        N   116
      11.42         20.38           77.58  …        N   123
      20.29         14.34          135.10  …        R    27
      12.75         15.29           84.60  …        R    77
          …             …               …  …        …     …

Columns are called input variables or features or attributes.
The outcome and time (which we are trying to predict) are called labels or output variables or targets.
A row in the table is called a training example or instance.
The whole table is called the (training) data set.

Prediction problems

Radius.Mean  Texture.Mean  Perimeter.Mean  …  Outcome  Time
      18.02         27.60          117.50  …        N    31
      17.99         10.38          122.80  …        N    61
      21.37         17.44          137.50  …        N   116
      11.42         20.38           77.58  …        N   123
      20.29         14.34          135.10  …        R    27
      12.75         15.29           84.60  …        R    77
          …             …               …  …        …     …

The problem of predicting the recurrence is called (binary) classification.

The problem of predicting the time is called regression.

More formally

The i-th training example has the form ⟨x_{1,i}, …, x_{p,i}, y_i⟩, where p is the number of features (32 in our case).

Notation: x_i denotes a column vector with elements x_{1,i}, …, x_{p,i}.

The training set D consists of n training examples.

We denote the n × p matrix of features by X and the size-n column vector of outputs from the data set by y.

In statistics, X is called the data matrix or the design matrix.

𝒳 denotes the space of input values.

𝒴 denotes the space of output values.

Supervised learning problem

Given a data set D ⊂ (𝒳 × 𝒴)^n, find a function:

h : 𝒳 → 𝒴

such that h(x) is a “good predictor” for the value of y.

h is called a predictive model or hypothesis.

Problems are categorized by the type of output domain:

If 𝒴 = ℝ, this problem is called regression.


If 𝒴 is a finite discrete set, the problem is called classification.

If 𝒴 has 2 elements, the problem is called binary classification.

Steps to solving a supervised learning problem

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs.

This defines the input space 𝒳 and the output space 𝒴.

(We will discuss this in detail later.)

3. Choose a model space/hypothesis class ℋ.

4. …

Example: Choosing a model space

Linear hypothesis (HTF 3, JWHT 3)

Suppose y was a linear function of x:

h_w(x) = w_0 + w_1 x_1 + w_2 x_2 + ⋯

The w_i are called parameters or weights (often β_i in stats books).

Typically we include an attribute x_0 = 1 (also called the bias term or intercept term) so that the number of weights is p + 1. We then write:

h_w(x) = ∑_{i=0}^p w_i x_i = w^⊤ x

where w and x are column vectors of size p + 1.


The design matrix X is now n by p + 1.

Example: Design matrix with bias term

x0     x1      y
 1   0.86   2.49
 1   0.09   0.83
 1  −0.85  −0.25
 1   0.87   3.10
 1  −0.44   0.87
 1  −0.43   0.02
 1  −1.10  −0.12
 1   0.40   1.81
 1  −0.96  −0.83
 1   0.17   0.43

Models will be of the form

h_w(x) = x_0 w_0 + x_1 w_1 = w_0 + x_1 w_1

How should we pick w?

Error minimization

Intuitively, w should make the predictions of h_w close to the true values y_i on the training data.

Define an error function or cost function to measure how much our prediction differs from the “true” answer on the training data.

Pick w such that the error function is minimized.

Hopefully, new examples are somehow “similar” to the training examples, and will also have small error.

How should we choose the error function?

Least mean squares (LMS)

Main idea: try to make h_w(x) close to y on the examples in the training set.

We define a sum-of-squares error function

J(w) = (1/2) ∑_{i=1}^n (h_w(x_i) − y_i)²

(the 1/2 is just for convenience)

We will choose w so as to minimize J(w).

One way to do it: compute w such that:

∂J(w)/∂w_j = 0, ∀j = 0 … p

Example: w_0 = 0.9, w_1 = −0.4

## SSE: 21.510
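That SSE can be reproduced directly; a minimal sketch, reconstructing the example data from the design-matrix table above (the data frame name exb matches the one used in the slides' later code):

# Example data copied from the design-matrix slide.
exb <- data.frame(
  x1 = c(0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17),
  y  = c(2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43)
)
w0 <- 0.9; w1 <- -0.4      # the candidate weights from the slide
pred <- w0 + w1 * exb$x1   # h_w(x) on each training example
sum((pred - exb$y)^2)      # SSE: 21.510 (J(w) is half this)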


OLS Fit to Example Data

mod <- lm(y ~ x1, data=exb); print(mod$coefficients)

## (Intercept)       x1 
##    1.058813 1.610168

## SSE: 2.240


Solving a supervised learning problem

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs.

This defines the input space 𝒳 and the output space 𝒴.

3. Choose a class of models/hypotheses ℋ.

4. Choose an error function (cost function) to define the best model in the class.

5. Choose an algorithm for searching efficiently through the space of models to find the best.

Recurrence Time from Tumor Radius

library(dplyr)  # needed for %>% and filter(); bc holds the Wisconsin data
mod <- lm(Time ~ Radius.Mean, data = bc %>% filter(Outcome == 'R')); print(mod$coefficients)

## (Intercept) Radius.Mean 
##   83.161238   -3.156896


Notation reminder

Consider a function J(u_1, u_2, …, u_p) : ℝ^p ↦ ℝ (for us, this will usually be an error function).

The gradient ∇J(u_1, u_2, …, u_p) : ℝ^p ↦ ℝ^p is a function which outputs a vector containing the partial derivatives. That is:

∇J = ⟨ ∂J/∂u_1, ∂J/∂u_2, …, ∂J/∂u_p ⟩

If J is differentiable and convex, we can find the global minimum of J by solving ∇J = 0.

The partial derivative ∂J/∂u_i is the derivative along the u_i axis, keeping all other variables fixed.

The Least Squares Solution (HTF 2.6, 3.2, JWHT 3.1)

Recalling some multivariate calculus:

∇_w J = ∇_w (Xw − y)^⊤ (Xw − y)
      = ∇_w (w^⊤X^⊤ − y^⊤)(Xw − y)
      = ∇_w (w^⊤X^⊤Xw − y^⊤Xw − w^⊤X^⊤y + y^⊤y)
      = ∇_w (w^⊤X^⊤Xw − 2y^⊤Xw + y^⊤y)
      = 2X^⊤Xw − 2X^⊤y

Setting the gradient equal to zero:

2X^⊤Xw − 2X^⊤y = 0  ⇒  X^⊤Xw = X^⊤y  ⇒  w = (X^⊤X)^{−1} X^⊤y

The inverse exists if the columns of X are linearly independent.
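A sketch of this closed-form solution on the example data (the exb data frame reconstructed in the earlier sketch):

X <- cbind(1, exb$x1)                   # design matrix with the bias column x0 = 1
w <- solve(t(X) %*% X, t(X) %*% exb$y)  # solves X^T X w = X^T y, i.e. w = (X^T X)^{-1} X^T y
w                                       # approx. (1.06, 1.61), matching lm() above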


Example of linear regression

x0     x1      y
 1   0.86   2.49
 1   0.09   0.83
 1  −0.85  −0.25
 1   0.87   3.10
 1  −0.44   0.87
 1  −0.43   0.02
 1  −1.10  −0.12
 1   0.40   1.81
 1  −0.96  −0.83
 1   0.17   0.43

h_w(x) = 1.06 + 1.61 x_1

Data matrices

X = [ 1   0.86
      1   0.09
      1  −0.85
      1   0.87
      1  −0.44
      1  −0.43
      1  −1.10
      1   0.40
      1  −0.96
      1   0.17 ]        y = [ 2.49, 0.83, −0.25, 3.10, 0.87, 0.02, −0.12, 1.81, −0.83, 0.43 ]^⊤

X^⊤X = [ 10     −1.39
         −1.39   4.95 ]

X^⊤y = [ 8.34
         6.49 ]

Solving for w

w = (X^⊤X)^{−1} X^⊤y
  = [ 10 −1.39; −1.39 4.95 ]^{−1} [ 8.34; 6.49 ]
  = [ 1.06; 1.61 ]

So the best fit line is y = 1.06 + 1.61x.

Linear regression summary

The optimal solution (minimizing sum-squared-error) can be computed in polynomial time in the size of the data set.

The solution is w = (X^⊤X)^{−1} X^⊤y, where X is the data matrix augmented with a column of ones, and y is the column vector of target outputs.

A very rare case in which an analytical, exact solution is possible.

Is linear regression enough?

Linear regression should be the first thing you try for real-valued outputs!

…but it is sometimes not expressive enough.

Two possible solutions:

1. Explicitly transform the data, i.e. create additional features

Add cross-terms, higher-order terms

More generally, apply a transformation of the inputs from 𝒳 to some other space 𝒳′, then do linear regression in the transformed space

2. Use a different model space/hypothesis class

Idea (1) and idea (2) are two views of the same strategy. Today we focus on the first approach.

Polynomial fits (HTF 2.6, JWHT 7.1)

Suppose we want to fit a higher-degree polynomial to the data. (E.g., y = w_0 + w_1 x_1 + w_2 x_1².)


Suppose for now that there is a single input variable x_{1,i} per training sample.

How do we do it?

Answer: Polynomial regression

Given data: (x_{1,1}, y_1), (x_{1,2}, y_2), …, (x_{1,n}, y_n).

Suppose we want a degree-d polynomial fit.

Let y be as before and let

X = [ 1   x_{1,1}   x_{1,1}²   ⋯   x_{1,1}^d
      1   x_{1,2}   x_{1,2}²   ⋯   x_{1,2}^d
      ⋮      ⋮         ⋮               ⋮
      1   x_{1,n}   x_{1,n}²   ⋯   x_{1,n}^d ]

We are making up features to add to our design matrix.

Solve the linear regression Xw ≈ y.

Example of quadratic regression: Data matrices

X = [ 1   0.86   0.75
      1   0.09   0.01
      1  −0.85   0.73
      1   0.87   0.76
      1  −0.44   0.19
      1  −0.43   0.18
      1  −1.10   1.22
      1   0.40   0.16
      1  −0.96   0.93
      1   0.17   0.03 ]        y = [ 2.49, 0.83, −0.25, 3.10, 0.87, 0.02, −0.12, 1.81, −0.83, 0.43 ]^⊤

X^⊤X = [ 10     −1.39   4.95
         −1.39   4.95   1.64
          4.95   1.64   4.11 ]

X^⊤y = [ 8.34
         6.49
         3.60 ]

Solving for w

w = (X^⊤X)^{−1} X^⊤y
  = [ 10 −1.39 4.95; −1.39 4.95 1.64; 4.95 1.64 4.11 ]^{−1} [ 8.34; 6.49; 3.60 ]
  = [ 0.74; 1.75; 0.69 ]

So the best order-2 polynomial is y = 0.74 + 1.75x + 0.69x².

Data and linear fit

## (Intercept)   x 
##         1.1 1.6

Data and quadratic fit

## (Intercept)      x I(x^2) 
##        0.74   1.75   0.69

Is this a better fit to the data?
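The same quadratic fit can be obtained via lm(); a sketch using the exb data frame reconstructed earlier (the slides' own data frame for these fits appears to use a variable named x rather than x1):

quad <- lm(y ~ x1 + I(x1^2), data = exb)  # I(x1^2) adds the made-up squared feature
round(coef(quad), 2)                      # approx. 0.74 1.75 0.69, as above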

Order-3 fit

## (Intercept)    x I(x^2) I(x^3) 
##        0.71 1.39   0.80   0.46


Is this a better fit to the data?

Order-4 fit

## (Intercept)     x I(x^2) I(x^3) I(x^4) 
##       0.795 1.128 -0.039  0.905  0.898


Is this a better fit to the data?

Order-5 fit

## (Intercept)    x I(x^2) I(x^3) I(x^4) I(x^5) 
##        0.47 0.62   4.86   6.75  -5.25  -6.72


Is this a better fit to the data?

Order-6 fit

## (Intercept)    x I(x^2) I(x^3) I(x^4) I(x^5) I(x^6) 
##        0.13 3.13   8.99 -11.11 -23.83  12.52  18.38


Is this a better fit to the data?

Order-7 fit

## (Intercept)     x  I(x^2)  I(x^3)  I(x^4) I(x^5) I(x^6) I(x^7) 
##       0.096 3.207  10.193 -11.078 -30.742  8.263 25.527  5.483


Is this a better fit to the data?

Order-8 fit

## (Intercept)    x I(x^2) I(x^3) I(x^4) I(x^5) I(x^6) I(x^7) I(x^8) 
##         1.3 -5.9   -5.1   69.9   48.8 -172.0 -131.9  123.3  101.2


Is this a better fit to the data?

Order-9 fit

## (Intercept)    x I(x^2) I(x^3) I(x^4) I(x^5)  I(x^6)  I(x^7) I(x^8) I(x^9) 
##        -1.1 34.8 -127.9 -379.9 1186.9 1604.8 -2475.4 -2627.6 1499.6 1448.1


Is this a better fit to the data?

Evaluating Performance

Which do you prefer and why?


Performance of a Fixed Hypothesis (HTF 7.1–7.4, JWHT 2.2, 5)

Assume data (x, y) are drawn from some fixed distribution.

Given a model h (which could have come from anywhere), its generalization error is:

J*_h = E[L(h(X), Y)]

Given a set of m data points from the same distribution, we can compute the empirical error

Ĵ_h = (1/m) ∑_{i=1}^m L(h(x_i), y_i)

Ĵ_h is an unbiased estimate of J*_h so long as the data did not influence the choice of h.

Can use Ĵ_h with the CLT or the bootstrap to get a C.I. for J*_h.

Test Error: The Gold Standard

Ĵ_h = (1/n) ∑_{i=1}^n L(h(x_i), y_i)

Ĵ_h is an unbiased estimate of J*_h so long as the (x_i, y_i) do not influence h. Can use it to get a confidence interval for J*_h.

Gives a strong statistical guarantee about the true performance of our system, if we didn't use the test data to choose h.

We can write the “training error” for model class ℋ on a given data set as

Ĵ_ℋ = min_{h′ ∈ ℋ} (1/n) ∑_{i=1}^n L(h′(x_i), y_i)

Let the corresponding learned hypothesis be

h* = argmin_{h′ ∈ ℋ} (1/n) ∑_{i=1}^n L(h′(x_i), y_i)

Obviously, for any data set, Ĵ_ℋ ≤ Ĵ_h for any h ∈ ℋ.

Model Selection and Performance

1. We would like to estimate the generalization error J*_{h*} of our resulting predictor h*.

2. We would like to choose the best model space (e.g. linear, quadratic, …).

Problem 1: Estimating Generalization Error

Training error systematically underestimates generalization error for the learned hypothesis h*.

Problem 2: Model Selection

The more complex the model, the smaller the training error.


Training error of the degree-9 polynomial is 0.

In fact, the training error of a degree-9 polynomial on any set of 10 points is 0.

Problem 2: Model Selection

Smaller training error does not mean smaller generalization error.

Suppose ℋ_1 is the space of all linear functions and ℋ_2 is the space of all quadratic functions. Note ℋ_1 ⊂ ℋ_2.

Fix a data set.

Let h*_1 = argmin_{h′ ∈ ℋ_1} Ĵ_{h′} and h*_2 = argmin_{h′ ∈ ℋ_2} Ĵ_{h′}, both computed using the same dataset.

We must have Ĵ_{h*_2} ≤ Ĵ_{h*_1}, but we may have J*_{h*_2} > J*_{h*_1}.

Problem 2: Model Selection


Training error is no good for choosing the model space.

Fixing Problem 1: Generalization Error

Training error Ĵ_ℋ underestimates generalization error J*_{h*}.

If you really want a good estimate of J*_h, you need a separate test set.

(But new stat methods can produce a CI using training error.)

Could report test error, then deploy whatever you train on the whole data. (Probably won't be worse.)

Fixing Problem 2: Model Selection

Smaller training error does not mean smaller generalization error.

Small training error, large generalization error is known as overfitting.

A separate validation set can be used for model selection:

Train on the training set using each proposed model space.

Evaluate each on the validation set; choose the one with the lowest validation error.

Training, Model Selection, and Error Estimation

A general procedure for estimating the true error of a specific learned model chosen using model selection:

The data is randomly partitioned into three disjoint subsets:

A training set used only to find the parameters w.

A validation set used to find the right model space (e.g., the degree of the polynomial).

A test set used to estimate the generalization error of the resulting model.

Can generate standard confidence intervals for the generalization error of the learned model.
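A minimal sketch of the whole train/validation/test procedure on synthetic data (all names and the data-generating process here are illustrative, not from the slides):

set.seed(1)
n <- 300
d <- data.frame(x = runif(n, -1, 1))
d$y <- sin(2 * d$x) + rnorm(n, sd = 0.3)
idx <- sample(rep(c("train", "valid", "test"), each = n / 3))  # random 3-way partition

mse <- function(fit, sub) mean((predict(fit, d[idx == sub, ]) - d$y[idx == sub])^2)

# Model selection: fit each polynomial degree on the training set,
# score it on the validation set.
val_err <- sapply(1:9, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = d[idx == "train", ])
  mse(fit, "valid")
})
best <- which.min(val_err)
# Error estimation: refit on train+validation, report error on the untouched test set.
final <- lm(y ~ poly(x, best), data = d[idx != "test", ])
c(degree = best, test_mse = mse(final, "test"))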


Problems with the Single-Partition Approach

Pros:

Measures what we want: performance of the actual learned model.

Cons:

Smaller effective training sets make performance more variable.

Small validation sets can give poor model selection.

Small test sets can give poor estimates of performance.

For a test set of size 100 with 60 correct classifications, the 95% C.I. for actual accuracy is (0.497, 0.698).
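That interval can be checked with an exact binomial test (whether the slide used this exact method is an assumption):

binom.test(60, 100)$conf.int  # Clopper-Pearson 95% CI: approx. (0.497, 0.697), matching up to rounding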

k-fold cross-validation (HTF 7.10, JWHT 5.1)

Divide the instances into k disjoint partitions or folds of size n/k.

Loop through the partitions i = 1 … k:

Partition i is for evaluation (i.e., estimating the performance of the algorithm after learning is done).

The rest are used for training (i.e., choosing the specific model within the space).

“Cross-Validation Error” is the average error on the evaluation partitions. Has lower variance than error on one partition.

This is the main CV idea; CV is used for different purposes though.
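A sketch of the basic k-fold loop, reusing the synthetic d from the earlier sketch and a fixed model space (degree-2 polynomials, an arbitrary choice):

k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))  # assign each instance to a fold
fold_err <- sapply(1:k, function(i) {
  fit <- lm(y ~ poly(x, 2), data = d[fold != i, ])           # train on the other folds
  mean((predict(fit, d[fold == i, ]) - d$y[fold == i])^2)    # evaluate on fold i
})
mean(fold_err)  # the cross-validation error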

Misuse of CV (HTF 7.10.2)

n = 50 examples, binary classification, balanced classes.

p = 5000 features, all statistically independent of y.

Use model selection to find the best 100 features by correlation on the entire dataset.

Use cross-validation with these p = 100 features to estimate error.

CV-based error rate was 3%. (Since every feature is independent of y, the true error rate of any classifier here is 50%.)

k-fold cross-validation model selection (HTF 7.10, JWHT 5.1)

Divide the instances into k folds of size n/k.

Loop over model spaces 1 … m:

Loop over the folds i = 1 … k:

Fold i is for validation (i.e., estimating the performance of the algorithm after learning is done).

The rest are used for training (i.e., choosing the specific model within the space).

For each model space, report average error over folds, and standard error.

CV for Model Selection

[Figures: cross-validation error, with standard-error bars, for each candidate model space.]


CV for Model Selection

Typically select the “most parsimonious model whose error is no more than one standard error above the error of the best model.” (HTF)


Estimating “which is best” vs. “performance of best”


Estimated errors using 290 model spaces. (http://bioinformatics.oxfordjournals.org/content/30/22/3152.long)

Nested CV for Model Evaluation

Divide the n instances into k “outer” folds of size n/k.

Loop over the outer folds i = 1 … k:
  Fold i is for testing; all others are for training.
  Divide the training instances into k′ “inner” folds of size (n − n/k)/k′.
  Loop over model spaces 1 … m:
    Loop over the inner folds j = 1 … k′:
      Fold j is for validation.
      The rest are used for training.
  Use average error over the inner folds and SE to choose the model space.
  Train on all inner folds.
  Test the model on outer test fold i.

Nested CV for Model Evaluation

Generalization Error for degree 3 model
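A compact sketch of the nested procedure, again on the synthetic d from the earlier sketches (fold counts and candidate degrees are illustrative):

k <- 5; k2 <- 5; degrees <- 1:9
outer <- sample(rep(1:k, length.out = nrow(d)))
outer_err <- sapply(1:k, function(i) {
  train <- d[outer != i, ]                    # outer fold i held out for testing
  inner <- sample(rep(1:k2, length.out = nrow(train)))
  cv_err <- sapply(degrees, function(deg) {   # inner CV chooses the model space
    mean(sapply(1:k2, function(j) {
      fit <- lm(y ~ poly(x, deg), data = train[inner != j, ])
      mean((predict(fit, train[inner == j, ]) - train$y[inner == j])^2)
    }))
  })
  fit <- lm(y ~ poly(x, degrees[which.min(cv_err)]), data = train)  # retrain on all inner folds
  mean((predict(fit, d[outer == i, ]) - d$y[outer == i])^2)         # test on outer fold i
})
mean(outer_err)  # nested-CV estimate of generalization error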


Minimum-CV Estimate: 128.48, Nested CV Estimate: 149.91

Bias-correction for the CV Procedure

Cawley, Talbot. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR v.11, 2010. (http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf)

Tibshirani, Tibshirani. A bias correction for the minimum error rate in cross-validation. arXiv. 2009. (http://arxiv.org/abs/0908.2904)

Ding et al. Bias correction for selecting the minimal-error classifier from many machine learning models. Bioinformatics 30 (22). 2014. (http://bioinformatics.oxfordjournals.org/content/30/22/3152.long)

Summary

The training error decreases with the degree of the polynomial M, i.e. the complexity (size) of the model space.

Generalization error decreases at first, then starts increasing.

Setting aside a validation set helps us find a good model space.

We can then report an unbiased error estimate, using a test set untouched during both parameter training and validation.

Cross-validation is a lower-variance but possibly biased version of this approach. It is standard.

If you have lots of data, just use held-out validation and test sets.