introduction to machine learning - amazon s3 · 2016-05-04 · introduction to machine learning...


Upload: dangkiet

Post on 12-Jun-2018


Page 1: INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04 · Introduction to Machine Learning Limits of accuracy Classifying very rare heart disease Classify all as negative (not

INTRODUCTION TO MACHINE LEARNING

Measuring model performance or error

Is our model any good?

● Context of task
    ● Accuracy
    ● Computation time
    ● Interpretability
● 3 types of tasks
    ● Classification
    ● Regression
    ● Clustering

Classification

● Accuracy and Error
● System is right or wrong
● Accuracy goes up when Error goes down

Accuracy = (correctly classified instances) / (total number of classified instances)

Error = 1 - Accuracy
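The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the example labels are made up here and are not part of the slides.

```python
def accuracy(truth, predicted):
    """Fraction of instances whose predicted label equals the true label."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("need at least one classified instance")
    correct = sum(1 for t, p in zip(truth, predicted) if t == p)
    return correct / len(truth)


def error(truth, predicted):
    """Error is the complement of accuracy."""
    return 1 - accuracy(truth, predicted)


# Hypothetical labels (1 = positive, 0 = negative), chosen only for illustration.
truth = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
print(accuracy(truth, predicted), error(truth, predicted))
```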

Example

● Squares with 2 features: small/big and solid/dotted

● Label: colored/not colored

● Binary classification problem

Example

[Figure: five squares shown with their true labels (Truth) and the classifier's labels (Predicted); three predictions are correct (✔) and two are wrong (✘)]

Accuracy = 3/5 = 60%

Limits of accuracy

● Classifying very rare heart disease

● Classify all as negative (not sick)

● Predict 99 correct (not sick) and miss 1

● Accuracy: 99%

● Bogus… you miss every positive case!

Confusion matrix

● Rows and columns contain all available labels

● Each cell contains frequency of instances that are classified in a certain way

Confusion matrix

● Binary classifier: positive or negative (1 or 0)

              Prediction
               P     N
Truth    p    TP    FN
         n    FP    TN

● True Positives (TP): Prediction: P, Truth: P
● True Negatives (TN): Prediction: N, Truth: N
● False Negatives (FN): Prediction: N, Truth: P
● False Positives (FP): Prediction: P, Truth: N

Ratios in the confusion matrix

● Accuracy
● Precision: TP/(TP+FP)
● Recall: TP/(TP+FN)

Back to the squares

[Figure: the five squares with their Truth and Predicted labels]

              Prediction
               P     N
Truth    p     1     1
         n     1     2

● Accuracy: (TP+TN)/(TP+FP+FN+TN) = (1+2)/(1+1+1+2) = 60%
● Precision: TP/(TP+FP) = 1/(1+1) = 50%
● Recall: TP/(TP+FN) = 1/(1+1) = 50%
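As a quick check of the arithmetic above, here is a minimal Python sketch that computes the three ratios from the four cells of a confusion matrix. The function name `confusion_ratios` is hypothetical; returning `None` for an undefined ratio is a design choice made here, not something the slides prescribe.

```python
def confusion_ratios(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing was predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    # Recall is undefined when there are no truly positive instances.
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return accuracy, precision, recall


# Counts from the squares example: TP=1, FN=1, FP=1, TN=2.
print(confusion_ratios(tp=1, fn=1, fp=1, tn=2))
```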

Rare heart disease

● Accuracy: 99/(99+1) = 99%
● Recall: 0/1 = 0%
● Precision: undefined (no positive predictions)

              Prediction
               P     N
Truth    p     0     1
         n     0    99
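The same numbers can be reproduced from raw label vectors. The sketch below assumes 100 patients of whom exactly one is sick (label 1) and a classifier that predicts "not sick" (label 0) for everyone; the helper `confusion_counts` is a hypothetical name introduced for illustration.

```python
from collections import Counter


def confusion_counts(truth, predicted, positive=1):
    """Count TP, FN, FP and TN for a binary classifier."""
    counts = Counter()
    for t, p in zip(truth, predicted):
        if t == positive and p == positive:
            counts["TP"] += 1
        elif t == positive:
            counts["FN"] += 1
        elif p == positive:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


truth = [1] + [0] * 99   # one sick patient, 99 healthy ones
predicted = [0] * 100    # classify everyone as negative (not sick)

c = confusion_counts(truth, predicted)
accuracy = (c["TP"] + c["TN"]) / len(truth)
recall = c["TP"] / (c["TP"] + c["FN"])
predicted_positive = c["TP"] + c["FP"]
# With no positive predictions the precision ratio would be 0/0.
precision = c["TP"] / predicted_positive if predicted_positive else None
print(dict(c), accuracy, recall, precision)
```

High accuracy together with zero recall is the signal that the classifier misses every positive case.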

Regression: RMSE

● Root Mean Squared Error (RMSE)
● Square root of the mean squared distance between the observed points and the regression line (roughly, the mean distance)

[Figure: scatter plot with X1 on the horizontal axis and X2 on the vertical axis, both running from 6 to 12]
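A minimal Python sketch of RMSE follows. The function name and the example values are illustrative assumptions, not data from the slides.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observations and model predictions.
actual = [7.0, 8.0, 10.0, 11.0]
predicted = [7.5, 8.5, 9.5, 10.5]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.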

Clustering

● No label information

● Need distance metric between points

Clustering

● Performance measure consists of 2 elements

● Similarity within each cluster

● Similarity between clusters

Within cluster similarity

[Figure: scatter plot of clustered points, X1 on the horizontal axis and X2 on the vertical axis, both running from −5 to 10]

● Within sum of squares (WSS)
● Diameter
● Minimize

Between cluster similarity

[Figure: the same scatter plot of clustered points, X1 against X2, both axes from −5 to 10]

● Between cluster sum of squares (BSS)
● Intercluster distance
● Maximize
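The slides name WSS and BSS without giving formulas. The sketch below assumes the usual definitions from k-means: WSS sums the squared distances from each point to its own cluster centroid, and BSS sums, over clusters, the cluster size times the squared distance from the cluster centroid to the overall centroid. All function names and the toy clusters are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wss(clusters):
    """Within sum of squares: small when clusters are compact."""
    return sum(squared_distance(p, centroid(c)) for c in clusters for p in c)


def bss(clusters):
    """Between cluster sum of squares: large when clusters are far apart."""
    all_points = [p for c in clusters for p in c]
    overall = centroid(all_points)
    return sum(len(c) * squared_distance(centroid(c), overall) for c in clusters)


# Two toy clusters, ten units apart.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(wss(clusters), bss(clusters))
```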

Dunn’s index

[Figure: the same scatter plot of clustered points, X1 against X2, both axes from −5 to 10]

Dunn’s index = (minimal intercluster distance) / (maximal diameter)
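A small Python sketch of this ratio is given below. The slides do not say how intercluster distance and diameter are measured, so two common choices are assumed here: the distance between two clusters is the distance between their closest pair of points, and the diameter of a cluster is the distance between its two farthest points. Function names and data are illustrative.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest distance between two points of the same cluster."""
    return max((math.dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)


def intercluster_distance(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(math.dist(a, b) for a in c1 for b in c2)


def dunn_index(clusters):
    """Minimal intercluster distance divided by maximal diameter."""
    min_between = min(
        intercluster_distance(a, b) for a, b in combinations(clusters, 2)
    )
    max_diameter = max(diameter(c) for c in clusters)
    if max_diameter == 0:
        raise ValueError("every cluster is a single point; the ratio is undefined")
    return min_between / max_diameter


# Toy clusters (math.dist needs Python 3.8 or later).
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(dunn_index(clusters))
```

A higher value means compact clusters that are well separated.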

Let’s practice!

Training set and test set

Machine learning vs. statistics

● Predictive power vs. descriptive power
● Supervised learning: model must predict unseen observations
● Classical statistics: model must fit the data, to explain or describe it

Predictive model

● Training: not on the complete dataset, only on the training set
● Test set to evaluate performance of model
● Sets are disjoint: NO OVERLAP
● Model tested on unseen observations -> Generalization!

Split the dataset

● N instances (=observations): X
● K features: F
● Class labels: y

          f1        f2        …    fK        y
x1        x1,1      x1,2      …    x1,K      y1
x2        x2,1      x2,2      …    x2,K      y2
…         …         …         …    …         …
xr        xr,1      xr,2      …    xr,K      yr
xr+1      xr+1,1    xr+1,2    …    xr+1,K    yr+1
xr+2      xr+2,1    xr+2,2    …    xr+2,K    yr+2
…         …         …         …    …         …
xN        xN,1      xN,2      …    xN,K      yN

● Training set: rows x1 to xr
● Test set: rows xr+1 to xN
● Use the features of the test set to predict y: ŷ
● Compare ŷ with the real y

When to use training/test set?

● Supervised learning
● Not for unsupervised learning (clustering): data not labeled

Predictive power of model

[Diagram: the training set is used to train the model; the model is then used on the test set to test it; the resulting performance measure indicates the model's predictive power]

How to split the sets?

● Which observations go where?
● Training set larger than test set
● Typically a ratio of about 3/1
● Quite arbitrary
● Generally: more data = better model
● Test set not too small

Distribution of the sets

● Classification
    ● classes must have similar distributions
    ● avoid a class not being available in a set
● Classification & regression
    ● shuffle dataset before splitting
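Shuffling followed by a roughly 3/1 split can be sketched as below. The function name `shuffle_split`, the default fraction, and the seed are illustrative assumptions. The sketch does not balance class distributions between the two sets; that would need a stratified split, which is not shown.

```python
import random


def shuffle_split(rows, train_fraction=0.75, seed=None):
    """Shuffle the rows, then cut them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(20))  # stand-in for 20 observations
train, test = shuffle_split(rows, seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models on the same test set.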

Effect of sampling

● Sampling can affect performance measures

● Add robustness to these measures: cross-validation

● Idea: sample multiple times, with different separations

Cross-validation

● E.g.: 4-fold cross-validation

[Figure: the dataset split four different ways; each split uses a different part as the test set and the rest as the training set]

● Aggregate results for robust measure

n-fold cross-validation

● Fold test set over dataset n times

● Each test set is 1/n size of total dataset
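The folding can be sketched with plain index arithmetic. In the sketch below `n_fold_splits` and `cross_validate` are hypothetical names, and `evaluate` stands for any user-supplied function that trains on the training indices and returns a score on the test indices. It assumes the rows were already shuffled, and it averages the fold scores as one possible way to aggregate.

```python
def n_fold_splits(n_rows, n_folds):
    """Yield (train_indices, test_indices) once for every fold."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds get one extra row when sizes are uneven.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


def cross_validate(n_rows, n_folds, evaluate):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in n_fold_splits(n_rows, n_folds)]
    return sum(scores) / len(scores)


for train, test in n_fold_splits(8, 4):
    print("train:", train, "test:", test)
```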

Let’s practice!

Bias and Variance

What have you learned?

● Accuracy and other performance measures

● Training and test set

Knitting it all together

● Effect of splitting dataset (train/test) on accuracy
● Over- and underfitting

Introducing: BIAS and VARIANCE

Bias and Variance

● Main goal of supervised learning: prediction

● Prediction error ~ reducible + irreducible error

Irreducible and reducible error

● Irreducible: noise, which cannot be minimized
● Reducible: error due to an unfit model, which should be minimized
● Reducible error is split into bias and variance
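For squared-error loss, this split is commonly written as the decomposition below. The notation is not from the slides and is introduced here only for illustration: y = f(x) + ε with noise of mean zero and variance σ², and f̂ is the model learned from a randomly sampled training set, so the expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first two terms together form the reducible error; the last term is the noise that no model can remove.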

Bias

● Error due to bias: wrong assumptions
● Difference between predictions and truth, using models trained by a specific learning algorithm

Example

● Quadratic data
● Assumption: data is linear, so use linear regression
● Error due to bias is high: more restrictions on model

Bias

● Complexity of model

● More restrictions lead to high bias

Variance

● Error due to variance: error due to the sampling of the training set
● Model with high variance fits training set closely

Example

● Quadratic data
● Few restrictions: fit polynomial perfectly through training set
● If you change training set, model will change completely
● High variance: generalizes badly to the test set
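The two examples (a straight line on quadratic data, and a polynomial forced through every training point) can be compared in a small experiment. Everything below is an illustrative assumption: the data are simulated as y = x² plus Gaussian noise, the restricted model is a least-squares line, and the unrestricted model is the Lagrange interpolating polynomial.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = a + b*x: a heavily restricted model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)


def sample(xs):
    """Simulated quadratic data with noise."""
    return [x * x + rng.gauss(0, 1) for x in xs]


train_x = [-3 + 0.75 * i for i in range(9)]      # -3, -2.25, ..., 3
test_x = [-2.625 + 0.75 * i for i in range(8)]   # midpoints between training x
train_y, test_y = sample(train_x), sample(test_x)

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)
for name, model in [("line", line), ("polynomial", poly)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The line cannot follow the curve, so its error stays high even on the training set (bias). The interpolating polynomial has zero training error but changes with every new training sample, so its test error is typically far above its training error (variance).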

Bias-variance tradeoff

● Low bias - high variance
● Low variance - high bias

Overfitting

● Accuracy will depend on dataset split (train/test)
● A model with high variance will depend heavily on the split
● Overfitting = model fits training set a lot better than test set
● Too specific

Underfitting

● Restricting your model too much

● High bias

● Too general

Example - spam or not?

[Figure: emails in the training set plotted by capital letters and exclamation marks; one exception with 50 capital letters and 30 exclamation marks is no spam]

Truth

A lot of capital letters?
    no: no spam
    yes: A lot of exclamation marks?
        no: no spam
        yes: spam

Example - spam or not?

[Figure: emails in the training set plotted by capital letters and exclamation marks; one exception with 50 capital letters and 30 exclamation marks is no spam]

Overfit

A lot of capital letters?
    no: no spam
    yes: A lot of exclamation marks?
        no: no spam
        yes: 50 capital letters?
            no: spam
            yes: 30 exclamation marks?
                no: spam
                yes: no spam

too specific!

Example - spam or not?

[Figure: emails in the training set plotted by capital letters and exclamation marks]

Underfit

More than 10 capital letters?
    no: no spam
    yes: spam

too general!

Let’s practice!