
Page 1

An Overview of Statistical Learning

Hugh Chipman

Acadia University

January 12, 2015

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Page 2

Outline

Introduction

Cartoon examples of supervised learning

Cross-validation and computational methods for inference

Classification

Big ideas in statistical learning

Some thoughts

Page 3

2014 AARMS summer school class (Sunny Wang, Statistical Learning co-teacher, 2nd from right, first row)

Page 4

Primary source: Introduction to Statistical Learning with Applications in R by James, Witten, Hastie and Tibshirani

Some other resources:

- Statistical Learning and Data Mining, Hastie, Tibshirani and Friedman

- Pattern Recognition and Machine Learning, Bishop

- Bayesian Methods for Nonlinear Classification and Regression, Denison, Holmes, Mallick and Smith.

Page 5

Some examples of statistical learning

Wage data: Predict salary using demographic variables

[Figure: scatterplots of Wage vs. Age, Wage vs. Year, and Wage vs. Education Level]

Plots show dependence of wage on individual predictors

Page 6

Some examples of statistical learning

Drug discovery:

- Identify compounds with desirable effect on biological target

- Response variable: Activity (inactive/active)

- Explanatory variables: Molecular descriptors

- Use high-throughput screening to test thousands of compounds, then build a model to predict activity for other compounds.

Page 7

Some examples of statistical learning

Page 8

Some examples of statistical learning: Supervised Learning

The wage and drug discovery problems are examples of Supervised Learning.

- We seek to predict a response Y using predictors X.

- We have available a training sample of (X, Y) pairs.

- Continuous response (wage) ⇒ "regression"

- Categorical response (drug discovery) ⇒ "classification"

Although not the focus of this overview, there are also methods for unsupervised learning

- Discover structure in X without an observed Y.

- Clustering, principal component analysis, graphical models, ...

Page 9

An unsupervised learning example

- Engine assembly process.

- Steel valve seats force-fitted into cylinder head.

- Data: force profile vs. time for each insertion.

- Problem: some insertions bad, but we can't tell which ones.

Page 10

An unsupervised learning example

[Figure: "Mean Curves by Cluster (February)" — force vs. time curves; cluster sizes m1 = 164, m2 = 465, m3 = 36]

- Each observation is a curve

- We have thousands of curves

- Try to group together curves and identify anomalous insertions

- "grouping" = "clustering"

Page 11

Outline

Introduction

Cartoon examples of supervised learning

Cross-validation and computational methods for inference

Classification

Big ideas in statistical learning

Some thoughts

Page 12

Regression

y = f(x) + ε

- y = response variable

- x = predictor variable(s)

- f(x) is an unknown function we wish to estimate ("learn")

- ε is a random error

y = signal + noise

Statistical learning typically focuses on estimation of the "signal", with minimal attention given to the "noise".

Page 13

A one-dimensional regression example

- One dataset ("training data") and 3 different regression models.

- Polynomial regression: y = β0 + β1x + β2x² + ... + βdx^d + ε.

- Objectives: 1) choose flexibility (d), 2) estimate parameters (β's)

- Prediction model: f(x) = β0 + β1x + β2x² + ... + βdx^d

[Figure: the training data with three fits — linear, 4th order polynomial, and 21st order polynomial — y vs. x]

Page 14

How to choose a suitable flexibility?

One very general approach: use a test set.

- A set of data points not used to estimate the parameters.

- Plot below: errors on training and test sets.

[Figure: training set predictions and test set predictions — y vs. x with the fitted curve]

Page 15

How to choose a suitable flexibility?

In this case (an order d = 21 polynomial), test set errors are larger.

This suggests our model may be too flexible and a smaller d should be used.

[Figure: histograms of absolute training errors and test errors]

Page 16

Training and test errors as a function of flexibility

Returning to the 3 different models (left panel), we can compute the mean squared error for a training or a test set.

MSE = (1/n) Σ_{i=1}^{n} (yi − f(xi))²

MSE will vary as a function of flexibility (right panel):

[Figure: left — the data and fitted models; right — Mean Squared Error vs. Flexibility for training and test sets]

- monotone decreasing shape for training set

- "U" shape for test set
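As a rough illustration of these two shapes (not code from the talk), the following sketch evaluates the same kind of polynomial fits on a held-out test set; the true function, noise level and degrees are all assumptions.

```python
# Illustrative sketch (not from the talk): training MSE keeps falling as the
# degree grows, while test MSE traces out the U shape described above.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = np.sort(rng.uniform(0, 100, n))
    y = 6 + 2 * np.sin(x / 15) + rng.normal(scale=0.5, size=n)  # assumed truth
    return x, y

x_tr, y_tr = make_data(50)    # training set
x_te, y_te = make_data(200)   # test set

for d in (1, 2, 4, 8, 15, 21):
    beta = np.polyfit(x_tr, y_tr, deg=d)
    mse_tr = np.mean((y_tr - np.polyval(beta, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(beta, x_te)) ** 2)
    print(f"degree {d:2d}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```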

Page 17

The bias-variance trade-off

- The linear model is not flexible enough: biased.

- The order 21 polynomial is too flexible: variable.

This is the bias-variance trade-off.

[Figure: the three fits again — linear, 4th order polynomial, 21st order polynomial]

Page 18

The bias-variance trade-off

[Figure: linear, 4th order polynomial and 21st order polynomial fits, each refit to three different training sets (Training set 1, 2, 3); legend: fitted vs. true]

Page 19

The bias-variance trade-off

Combine the 3 fits in a single plot:

[Figure: the fits from all three training sets overlaid, one panel per model — linear, 4th order polynomial, 21st order polynomial]

Linear: high bias, low variance. 4th order: just right. 21st order: low bias, high variance.

Page 20

Other one-dimensional examples

True function is nearly linear, noise level is high (previous example was nonlinear, high noise)

[Figure: left — data and fits; right — Mean Squared Error vs. Flexibility]

Page 21

Other one-dimensional examples

True function is nonlinear, noise level is low. What's the best flexibility? It depends!

[Figure: left — data and fits; right — Mean Squared Error vs. Flexibility]

Page 22

Supervised learning

In the three examples, we can break down the MSE into bias and variance:

[Figure: three panels — nonlinear/high noise, linear/high noise, nonlinear/low noise — each showing MSE, Bias and Var as a function of Flexibility]
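A hedged sketch of how such a decomposition can be approximated by simulation (this is not the talk's code): refit each model on many simulated training sets and look at the centering and spread of the predictions at a single point x0. The true function, noise level and sample size below are made up.

```python
# Illustrative sketch (not from the talk): estimate squared bias and variance
# of polynomial fits at one point x0 by simulating many training sets.
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: 6 + 2 * np.sin(x / 15)   # assumed true function
x0, sigma, n, reps = 50.0, 0.5, 50, 500

for d in (1, 4, 21):
    preds = np.empty(reps)
    for r in range(reps):
        x = np.sort(rng.uniform(0, 100, n))
        y = true_f(x) + rng.normal(scale=sigma, size=n)
        preds[r] = np.polyval(np.polyfit(x, y, deg=d), x0)  # prediction at x0
    bias2 = (preds.mean() - true_f(x0)) ** 2
    var = preds.var()
    print(f"degree {d:2d}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```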

Page 23

K-nearest neighbours with K=10

[Figure: training data, y vs. x]

Page 24

K-nearest neighbours with K=10

[Figure: training data, y vs. x]

We want to predict y at x = 50.

Page 25

K-nearest neighbours with K=10

[Figure: training data, y vs. x, with the neighbourhood around x = 50 highlighted]

Use the 10 nearest neighbours.

Page 26

K-nearest neighbours with K=10

[Figure: training data, y vs. x, with the KNN prediction at x = 50]

Prediction is the average y of the 10 nearest neighbours.
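A minimal sketch of the KNN prediction rule just described, on simulated data (the data-generating function is an assumption, not the talk's):

```python
# Illustrative sketch (not from the talk): the KNN prediction at x = 50 is
# just the average y over the K = 10 nearest training points.
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 100, 50))
y = 6 + 2 * np.sin(x / 15) + rng.normal(scale=0.5, size=50)  # simulated data

def knn_predict(x0, x, y, k=10):
    nearest = np.argsort(np.abs(x - x0))[:k]   # indices of the k closest x's
    return y[nearest].mean()                   # average their responses

print("prediction at x = 50:", knn_predict(50.0, x, y, k=10))
```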

Page 27

KNN results for K=1 and K=9

- Predictions are piecewise constant

- K = 1: high variance, low bias

- K = 9: higher bias, lower variance

Page 28

KNN vs. linear regression: Rounds 1 and 2

[Figure: two rows — true function with the KNN fit (left) and test Mean Squared Error vs. 1/K (right)]

Near-linear: KNN can do as well as regression.

Non-linear: KNN beats regression.

Note use of 1/K as the "flexibility" axis: small K ⇒ more flexibility.

Page 29

KNN vs. linear regression: Round 3

True function = function of x1 only, with additional irrelevant predictors.

Below: MSE vs. flexibility (1/K) as dimension p increases.

[Figure: test Mean Squared Error vs. 1/K for p = 1, 2, 3, 4, 10, 20 predictors]

KNN fails with many irrelevant predictors... This is the curse of dimensionality.

Page 30

The curse of dimensionality

[Figure: scatterplot of points in p = 2 dimensions, and histograms of interpoint distances for p = 2, 10 and 100 dimensions]

- Simulate independent N(0,1) data in p dimensions.

- Calculate all interpoint distances.

- In high dimensions, all points are far apart.
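The simulation on this slide is easy to reproduce; here is a hedged numpy sketch (the sample size n = 100 is an arbitrary choice):

```python
# Illustrative sketch (not from the talk): simulate N(0,1) data in p dimensions
# and look at the distribution of pairwise Euclidean distances.
import numpy as np

rng = np.random.default_rng(4)
n = 100
for p in (2, 10, 100):
    X = rng.standard_normal((n, p))
    diff = X[:, None, :] - X[None, :, :]          # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # n x n distance matrix
    upper = dist[np.triu_indices(n, k=1)]         # each pair counted once
    print(f"p = {p:3d}: mean distance {upper.mean():.2f}, min {upper.min():.2f}")
```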

Page 31

KNN vs. linear regression

Remember the basic model

y = f(x) + ε

Linear regression:

- Makes strong assumptions about f(x): linearity, additivity

- Also assumes a probability model for error ε.

- Has "flexibility parameter(s)" (e.g., polynomial degree)

KNN:

- Makes no assumptions about f(x) or error ε.

- Has a "flexibility parameter" (k neighbours).

Page 32

Choosing model flexibility

What model is best? What flexibility parameter to choose? It depends on...

- True function f(x)

- Noise level

- Training set sample size

- Dimensionality of the input space

- ...

How do you choose?

- Our "test set" in examples was stylized

- Shouldn't extra observations be used to train the model?

- Related and more realistic approach: Cross-validation.

- For models that make stronger assumptions, inferential methods are available.

Page 33

Interlude

Before discussing cross-validation, I’ll answer the unasked question:

Hugh, have you no shame? 50 points with a single predictor is not "big data" or "statistical learning"! And I think I learned KNN in preschool!

Maybe not, but:

- The bias-variance trade-off is central to statistical learning

- Most models use some combination of strong assumptions (linear model) and local modelling (KNN)

- Can I send my kids to your preschool?

- By the way, I lied about using "polynomial regression". Smoothing splines were actually used.

Page 34

Outline

Introduction

Cartoon examples of supervised learning

Cross-validation and computational methods for inference

Classification

Big ideas in statistical learning

Some thoughts

Page 35

Cross-Validation

A problem with the "test set" idea described earlier: it's wasteful to not use all your data to train a model.

Idea #1: Train on 80%, test on 20%

- 80% of the data will resemble the full dataset.

Another problem: randomness of data splitting and a small test set lead to noisy results.

Idea #2: Repeat idea #1 for different splits of the data.

- Repetition reduces variation due to random splitting.

- This is 5-fold cross-validation.

Page 36

Picture of 5-fold CV

[Figure: schematic of 5-fold cross-validation — the data matrix split into 5 folds, each fold in turn held out as the test set]

- White box = (sideways) data matrix with n observations.

- In each of 5 folds (coloured rows) ...

  - Train on blue 80%
  - Test on beige 20%

- ... Then average the results over the 5 "folds".

- ... Once you've chosen your flexibility parameter (e.g. k in KNN), use 100% of the data to retrain and make predictions.
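A minimal sketch of this recipe using scikit-learn (an assumption; the talk itself works in R), with KNN regression on simulated data: cross_val_score handles the 5 folds, and the final model is refit on all the data.

```python
# Illustrative sketch (not from the talk): 5-fold CV to choose K for KNN
# regression, then retrain on 100% of the data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 100, (100, 1))
y = 6 + 2 * np.sin(X[:, 0] / 15) + rng.normal(scale=0.5, size=100)

cv_mse = {}
for k in (1, 3, 5, 10, 20, 40):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    cv_mse[k] = -scores.mean()          # average test MSE over the 5 folds

best_k = min(cv_mse, key=cv_mse.get)
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X, y)  # retrain on all data
print("CV MSE by K:", cv_mse, "-> chosen K =", best_k)
```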

Page 37

Cross-Validation approximates the test error

- The actual test error can only be known with an infinite number of test observations.

- CV approximates this.

- For the 1-dimensional polynomial regression problems, the CV curve is a decent approximation to the true (blue) curve.

[Figure: Mean Squared Error vs. Flexibility for the three examples — true test error and the cross-validation estimate]

Page 38

But what about statistical inference?

Remember the basic model

y = f(x) + ε

- CV helps us find a good estimate of f(x).

- But all we get is a point estimate. We don't get uncertainty (e.g. prediction intervals).

- Inferential methods in Statistics can effectively provide uncertainty quantification.

- Easiest for simple models, in which parameter estimates are linear functions of the data (e.g. linear regression).

Page 39

Inference for complex models: Bootstrap

- (Frequentist) Inference: under repeated sampling of training sets from the population, how does my estimator behave?

- If we could sample multiple training sets, we could directly calculate an estimator's distribution.

- But we can't.

- Bootstrap: pretend the training sample is the population. Resample with replacement a pseudo-training-sample of the same size, and apply your estimator to it. Repeat.
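A hedged sketch of this resampling recipe, not from the talk: bootstrap the slope of a least-squares line on simulated data. The estimator, the data and the number of resamples are all illustrative choices.

```python
# Illustrative sketch (not from the talk): bootstrap the sampling distribution
# of a simple estimator (here the slope of a least-squares line).
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)   # simulated training sample

def slope(x, y):
    return np.polyfit(x, y, deg=1)[0]               # least-squares slope

boot = np.empty(1000)
for b in range(1000):
    idx = rng.integers(0, n, n)          # resample rows with replacement
    boot[b] = slope(x[idx], y[idx])      # re-apply the estimator

print(f"slope = {slope(x, y):.3f}, bootstrap SE = {boot.std(ddof=1):.3f}")
```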

Page 40

Inference for complex models: Bootstrap

[Figure: schematic of the bootstrap. The original data Z (Obs 1: X = 4.3, Y = 2.4; Obs 2: X = 2.1, Y = 1.1; Obs 3: X = 5.3, Y = 2.8) is resampled with replacement to form bootstrap datasets Z*1, Z*2, ..., Z*B, and the estimator α is recomputed on each, giving α*1, α*2, ..., α*B.]

Page 41

Inference for complex models: Bootstrap

Big data: if we can't analyze the full data, how can we analyze hundreds of similar-sized bootstrap resamplings?

- "Bag of little bootstraps" by Kleiner, Talwalkar, Sarkar and Jordan (JRSS-B 2014)

- Approximates the bootstrap using faster computation (subsampling and reweighting).

Page 42

Inference for complex models: Bayes

- Bayesian methods treat all unknown parameters as random variables.

- Convenient mechanism to quantify uncertainty for "tuning parameters", such as order of polynomial, k in KNN, etc.

- Posterior distributions combine data (likelihood) and prior belief, giving full inference.

- Computation typically carried out by simulation (Markov chain Monte Carlo, MCMC).

- MCMC makes it easy to compute inferential statements for arbitrary functions of parameters.

- As with the Bootstrap, big data is challenging (see "Consensus Bayes" talk by Steve Scott).

Page 43

Outline

Introduction

Cartoon examples of supervised learning

Cross-validation and computational methods for inference

Classification

Big ideas in statistical learning

Some thoughts

Page 44

Classification

Y is a category (e.g. 2 categories: orange / purple). Example with two-dimensional input x = (x1, x2):

[Figure: scatterplot of the two classes (orange / purple) in the (X1, X2) plane]

Pr(Y = orange | X) is a function like f(x), and includes a random error model.

Page 45

Classification

KNN with K = 10 does quite well:

[Figure: "KNN: K=10" — the KNN decision boundary between the two classes in the (X1, X2) plane]

Page 46

Classification

A test set or CV can be used to choose flexibility (e.g. K).

- Similar bias/variance issues.

[Figure: training and test error rates vs. 1/K]
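An illustrative sketch (not from the slides) of choosing K for a KNN classifier by cross-validated error rate, assuming scikit-learn and simulated two-class data:

```python
# Illustrative sketch (not from the talk): KNN classification on simulated
# two-class data, with K chosen by cross-validated error rate.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 2))
# class probability varies smoothly over (x1, x2); labels are noisy
p_orange = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))
y = (rng.uniform(size=n) < p_orange).astype(int)

for k in (1, 5, 10, 50, 100):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K = {k:3d}: CV error rate = {1 - acc:.3f}")
```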

Page 47

Outline

Introduction

Cartoon examples of supervised learning

Cross-validation and computational methods for inference

Classification

Big ideas in statistical learning

Some thoughts

Page 48

Big ideas: additive models

Strong assumption of linear regression: the effect of varying x1 does not depend on the value of the other x's.

Y = β0 + β1x1 + β2x2 + ... + βpxp + ε

Generalize to an additive model with univariate functions:

Y = β0 + g1(x1) + g2(x2) + ... + gp(xp) + ε

- Retains ease of interpretation.

- Estimation of p separate univariate functions is much easier than estimation of a single f(x1, x2, ..., xp).

- Extension: allow some low-order interactions
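As a stand-in sketch (my assumption, not the talk's method): an additive fit in which each gj is a cubic polynomial in xj alone, estimated by ordinary least squares on a per-variable basis expansion. This is only one simple way to realize the additive form above.

```python
# Illustrative sketch (not from the talk): an additive model where each g_j is
# a cubic polynomial in x_j alone (no interaction terms).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n, p = 300, 3
X = rng.uniform(-2, 2, (n, p))
y = 1 + np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

# per-variable polynomial basis: [x_j, x_j^2, x_j^3] for each j, no cross terms
basis = np.hstack([X ** d for d in (1, 2, 3)])
additive_fit = LinearRegression().fit(basis, y)
print("R^2 of additive fit:", round(additive_fit.score(basis, y), 3))
```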

Page 49

Big ideas: variable selection

With many predictors, we may expect many βj = 0. But which ones?

Y = β0 + β1x1 + β2x2 + ... + βpxp + ε

Replace the usual least squares criterion

minimize Σ_{i=1}^{n} (yi − ŷi)² over β0, ..., βp

with a penalized version (Lasso, Tibshirani 1996)

minimize Σ_{i=1}^{n} (yi − ŷi)² + λ Σ_{j=1}^{p} |βj| over β0, ..., βp

The second term constrains the β's to be small or zero. See Richard Lockhart's talk on inference....
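A minimal lasso sketch assuming scikit-learn (where the penalty weight λ is called alpha); the simulated data have only 3 truly nonzero coefficients out of 20:

```python
# Illustrative sketch (not from the talk): the lasso penalty drives many
# coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # only 3 of 20 predictors matter
y = X @ beta + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.1).fit(X, y)            # larger alpha -> more zeros
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```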

Page 50

Big ideas: dimension reduction

y = f(g(x)) + ε

- The function g maps a high-dimensional input vector x to a lower-dimensional space.

- What's the point? Isn't f(g(x)) just another function h(x)?

- Idea is to estimate g without over-training.

- Principal component analysis seeks projections α1^T x, α2^T x, ... with maximal variance. These are estimated without using Y (i.e. unsupervised learning).

- Example: digit recognition, x1 = intensity of (1,1) pixel of the image, etc. Functions g(x) of the pixels should capture structure of the handwritten digits.

- Similar approach in "deep learning": estimate functions of inputs without using the response until the final learning step.
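A hedged sketch of the two-stage idea — estimate g by PCA without the response, then fit f on the reduced inputs — assuming scikit-learn and simulated data with low-dimensional structure:

```python
# Illustrative sketch (not from the talk): PCA estimates g without using Y,
# then a regression on the leading components plays the role of f.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(10)
n, p = 300, 50
latent = rng.standard_normal((n, 2))                      # low-dimensional structure
X = latent @ rng.standard_normal((2, p)) + 0.1 * rng.standard_normal((n, p))
y = 2 * latent[:, 0] - latent[:, 1] + rng.normal(scale=0.2, size=n)

g = PCA(n_components=2).fit(X)        # unsupervised: Y is not used here
Z = g.transform(X)                    # g(x): projections with maximal variance
fit = LinearRegression().fit(Z, y)    # f: supervised step on the reduced inputs
print("R^2 using 2 principal components:", round(fit.score(Z, y), 3))
```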

Page 51

Big ideas: neural nets

Nonlinear models with linear regressions at their core...

They have the functional form

f(x) = Ψ( α0 + Σ_i αi Φ(βi0 + Σ_j βij xj) )

with Ψ, Φ known, nonlinear functions.

- We seek to estimate the coefficients (β's and α's).

- Nonlinear regression with many parameters.

A linear combination of... a nonlinear transformation of... a linear combination of... the original variables.
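The formula translates directly into a forward pass; a minimal numpy sketch with assumed choices Φ = tanh and Ψ = identity, and random (untrained) coefficients:

```python
# Illustrative sketch (not from the talk): the neural-net formula above as a
# forward pass; in practice the α's and β's would be estimated from data.
import numpy as np

rng = np.random.default_rng(11)
p, hidden = 4, 3                         # 4 inputs, 3 hidden units
beta0 = rng.standard_normal(hidden)      # beta_i0
beta = rng.standard_normal((hidden, p))  # beta_ij
alpha0 = rng.standard_normal()
alpha = rng.standard_normal(hidden)

def f(x, Phi=np.tanh, Psi=lambda z: z):
    hidden_units = Phi(beta0 + beta @ x)        # nonlinear transform of linear combos
    return Psi(alpha0 + alpha @ hidden_units)   # linear combo of the hidden units

print(f(np.array([0.5, -1.0, 2.0, 0.0])))
```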

Page 52

Big ideas: decision trees

Recursively partition the X space into rectangular regions.

Example: Predict (log) Salary of a baseball player, given Years in the major leagues and Hits made last year.

[Figure: decision tree — split on Years < 4.5, then Hits < 117.5, with terminal node predictions 5.11, 6.00, 6.74 — and the corresponding rectangular regions R1, R2, R3 in the (Years, Hits) plane]

- Notice the "local structure" like KNN (in some dimensions).

- We must learn the tree topology (variables used, split values, etc.) and outputs from training data.
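An illustrative sketch, not the talk's analysis: fit a depth-2 regression tree with scikit-learn to simulated data whose structure mimics the Years/Hits example above (the data themselves are made up).

```python
# Illustrative sketch (not from the talk): a small regression tree learned from
# simulated (Years, Hits, log-Salary) data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(12)
n = 200
years = rng.integers(1, 25, n)
hits = rng.integers(1, 239, n)
# simulated signal chosen to mimic the slide's example splits
log_salary = (np.where(years < 4.5, 5.1, np.where(hits < 117.5, 6.0, 6.7))
              + rng.normal(scale=0.3, size=n))

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2).fit(X, log_salary)
print(export_text(tree, feature_names=["Years", "Hits"]))
```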

Page 53

Big ideas: decision trees

Decision trees are interpretable, flexible, good at detecting interactions, and automatically select variables.

[Figure: the same decision tree and the corresponding partition of the (Years, Hits) plane]

But they're sensitive to noise and terrible at representing additive structure (try fitting y = β0 + β1x1 + β2x2 + β3x3 with a tree).

Page 54

Big ideas: ensemble models

Overcome the limitations of a single tree by fitting a "sum of trees" model.

- Let (T1, M1), ..., (Tm, Mm) identify a set of m trees and their terminal node µ's.

Y = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm) + ε

- For an input value x, each g(x; Ti, Mi) outputs a corresponding µ

- The prediction is the sum of the µ's

- Random Forests (Breiman 2001) and Boosting (Freund & Schapire 1997) are two algorithms for building this model.

Page 55

Big ideas: ensemble models

Breiman's random forests (2001) use randomized search and the bootstrap to perturb individual trees.

- Uses the noise sensitivity of trees to build a stable model.

Freund and Schapire's boosting algorithm (1997) encourages each tree to fit structure not captured by the other trees.

- Enables an additive model to be fit.

- Friedman (2001) presents a more statistically motivated boosting algorithm.

The model

Y = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm) + ε

also forms the basis for Bayesian Additive Regression Trees (BART; Chipman, George and McCulloch 2010).

- Full Bayesian inference + extensible error models.
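A hedged sketch comparing the two ensemble ideas using scikit-learn's implementations (RandomForestRegressor and GradientBoostingRegressor) on simulated data; the settings are arbitrary:

```python
# Illustrative sketch (not from the talk): random forests and boosting as
# "sum of trees" models evaluated by cross-validation on the same data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(13)
n, p = 500, 5
X = rng.uniform(-2, 2, (n, p))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

models = [("random forest", RandomForestRegressor(n_estimators=200, random_state=0)),
          ("boosting", GradientBoostingRegressor(n_estimators=200, random_state=0))]
for name, model in models:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.3f}")
```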

Page 56

Big ideas: support vector machines

Originated as a 2-class classification problem (Vapnik, 1996). Approach: find a hyperplane that separates the input space into two regions, maximally separating the two classes.

[Figure: two classes in the (X1, X2) plane with a separating hyperplane]

Page 57

Big ideas: support vector machines

Two other key ideas:

1. Allow some misclassifications (amount is a tuning parameter).

2. Transform the input vector X into a higher-dimensional space where a hyperplane is more likely to separate the classes (often a parametrized transformation).

Comments on point 2:

- A "kernel trick" avoids the need to actually compute the high-dimensional mapping.

- Expensive algorithm: O(n²) for n observations.

SVM is one of many Kernel methods for learning.
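A minimal SVM sketch assuming scikit-learn: an RBF kernel handles a nonlinear boundary, and C is the tuning parameter governing how many misclassifications are tolerated. The data are simulated.

```python
# Illustrative sketch (not from the talk): an SVM classifier with an RBF kernel
# on a two-class problem that is not linearly separable in the original space.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(14)
n = 300
X = rng.normal(size=(n, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2).astype(int)   # circular class boundary

for C in (0.1, 1, 10):
    acc = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5).mean()
    print(f"C = {C:4}: CV accuracy = {acc:.3f}")
```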

Page 58

Outline

Introduction

Cartoon examples of supervised learning

Cross-validation and computational methods for inference

Classification

Big ideas in statistical learning

Some thoughts

Page 59

Some thoughts

Rich error distributions: A soon-to-be big idea?

y = f(x) + ε

We've focused mostly on estimating f(x). "Traditional" statistics puts more into the error model:

- time series and spatial data have correlated errors

- mixed models have multilevel error structure, including longitudinal data

- survey sampling has variances induced by the sampling plan

Page 60

Some thoughts

Uncertainty quantification

Michael Jordan: "We have to have error bars around all our predictions. That is something that's missing in much of the current machine learning literature."

Huh? With big data, won't all your error bars be 0?

Not necessarily:

- Complexity often grows with sample size: with thousands of variables, there will still be uncertainty.

- As large samples drive down sampling variation, other sources of error gain prominence: biased sampling, correlated errors, etc.

Page 61

Some thoughts: Summary

Key ideas:

- Bias/variance trade-off

- Cross-validation to choose flexibility

- Inference is possible (and under-appreciated)

- Fancy methods try to introduce assumptions in a way that keeps them flexible:

  - variable selection / dimension reduction
  - additivity and low-dimensional functions
  - transformations

- There's a lot of room to insert statistical thinking into statistical and machine learning.