TRANSCRIPT
INTRODUCTION TO MACHINE LEARNING
Measuring model performance or error
Is our model any good?
● Context of task
● Accuracy
● Computation time
● Interpretability
● 3 types of tasks
● Classification
● Regression
● Clustering
Classification
● Accuracy and Error
● System is right or wrong
● Accuracy goes up when Error goes down
Accuracy = (correctly classified instances) / (total number of classified instances)
Error = 1 - Accuracy
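As a minimal sketch (not part of the slides; the two label lists are made-up illustrative data), accuracy and error can be computed directly from true and predicted labels:

```python
# Minimal sketch: accuracy and error from true vs. predicted labels.
truth     = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]

correct = sum(t == p for t, p in zip(truth, predicted))
accuracy = correct / len(truth)   # correctly classified / total classified
error = 1 - accuracy

print(f"Accuracy: {accuracy:.0%}, Error: {error:.0%}")  # Accuracy: 60%, Error: 40%
```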
Example
● Squares with 2 features: small/big and solid/dotted
● Label: colored/not colored
● Binary classification problem
Example
[Figure: truth vs. predicted labels for 5 squares; 3 classified correctly (✔), 2 misclassified (✘)]
Accuracy = 3/5 = 60%
Limits of accuracy
● Classifying a very rare heart disease
● Classify all as negative (not sick)
● Predict 99 correct (not sick) and miss 1
● Accuracy: 99%
● Bogus… you miss every positive case!
Confusion matrix
● Rows and columns contain all available labels
● Each cell contains the frequency of instances classified in a certain way
Confusion matrix
● Binary classifier: positive or negative (1 or 0)

             Prediction
               P     N
Truth    p    TP    FN
         n    FP    TN
In the matrix:
● True Positives (TP): prediction P, truth P
● True Negatives (TN): prediction N, truth N
● False Negatives (FN): prediction N, truth P
● False Positives (FP): prediction P, truth N
Ratios in the confusion matrix
● Accuracy = (TP + TN) / (TP + FP + FN + TN)
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)
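As a minimal sketch (not part of the slides), these ratios can be computed from the four raw counts; the example call uses the counts from the squares example that follows:

```python
# Sketch: the three ratios, computed from the four confusion-matrix counts.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)  # among positive predictions, fraction that are right

def recall(tp, fn):
    return tp / (tp + fn)  # among real positives, fraction that are found

# Counts from the squares example below: TP=1, FP=1, FN=1, TN=2
print(accuracy(1, 1, 1, 2), precision(1, 1), recall(1, 1))  # 0.6 0.5 0.5
```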
Back to the squares
[Figure: truth vs. predicted labels for the 5 squares, marked ✔/✘]

             Prediction
               P     N
Truth    p     1     1
         n     1     2

● Accuracy: (TP+TN)/(TP+FP+FN+TN) = (1+2)/(1+2+1+1) = 60%
● Precision: TP/(TP+FP) = 1/(1+1) = 50%
● Recall: TP/(TP+FN) = 1/(1+1) = 50%
Rare heart disease

             Prediction
               P     N
Truth    p     0     1
         n     0    99

● Accuracy: 99/(99+1) = 99%
● Recall: 0/1 = 0%
● Precision: undefined — no positive predictions
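To make the trap concrete, a quick check with the matrix above (plain arithmetic, no library):

```python
# Rare-disease confusion matrix from the slide: TP=0, FP=0, FN=1, TN=99.
tp, fp, fn, tn = 0, 0, 1, 99

print((tp + tn) / (tp + fp + fn + tn))  # 0.99 -> 99% accuracy, looks great
print(tp / (tp + fn))                   # 0.0  -> 0% recall: every sick patient missed
# precision = tp / (tp + fp) would divide 0 by 0: undefined, no positive predictions
```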
Regression: RMSE
● Root Mean Squared Error (RMSE)
● Square root of the mean squared distance between the estimates (regression line) and the true values
[Figure: scatterplot of X2 against X1 with a fitted regression line]
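A minimal sketch of the computation (the two vectors are made-up data): RMSE = sqrt((1/N) * Σ(yᵢ − ŷᵢ)²).

```python
import math

# Sketch: RMSE between true responses and model estimates (made-up data).
def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

print(rmse([7.0, 8.5, 10.2], [6.8, 9.0, 10.0]))  # ~0.33
```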
Clustering
● No label information
● Need distance metric between points
● Performance measure consists of 2 elements:
● Similarity within each cluster
● Similarity between clusters
[Figure: scatterplot of X2 against X1 with the points grouped into clusters]
Within cluster similarity
● Within sum of squares (WSS)
● Diameter
● Minimize
[Figure: the clustered scatterplot of X2 against X1, illustrating within-cluster similarity]
Between cluster similarity
● Between cluster sum of squares (BSS)
● Intercluster distance
● Maximize
Dunn’s index

Dunn’s index = minimal intercluster distance / maximal diameter

[Figure: clustered scatterplot of X2 against X1, annotated with the minimal intercluster distance and the maximal cluster diameter]
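A sketch under stated assumptions (mine, not the slides’): intercluster distance is taken between cluster centroids, and diameter as the largest pairwise distance within a cluster; other conventions (e.g. minimum pointwise distance between clusters) also exist.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diameter(cluster):
    # Largest distance between any two points of the same cluster.
    return max(dist(a, b) for a in cluster for b in cluster)

def dunn_index(clusters):
    # Centroid of each cluster, coordinate by coordinate.
    centroids = [[sum(xs) / len(xs) for xs in zip(*c)] for c in clusters]
    min_between = min(dist(centroids[i], centroids[j])
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
    max_diameter = max(diameter(c) for c in clusters)
    return min_between / max_diameter  # maximize separation / minimize spread

clusters = [[(0, 0), (1, 0), (0, 1)], [(8, 8), (9, 8), (8, 9)]]
print(dunn_index(clusters))  # high value: compact, well-separated clusters
```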
INTRODUCTION TO MACHINE LEARNING
Let’s practice!
INTRODUCTION TO MACHINE LEARNING
Training set and test set
Machine learning vs. statistics
● Predictive power vs. descriptive power
● Supervised learning: the model must predict unseen observations
● Classical statistics: the model must fit the data, to explain or describe them
Predictive model
● Training: not on the complete dataset, but on a training set
● Test set to evaluate the performance of the model
● Sets are disjoint: NO OVERLAP
● Model tested on unseen observations -> generalization!
Split the dataset
● N instances (= observations): X
● K features: F
● Class labels: y

          f1        f2        …    fK        y
  x1      x1,1      x1,2      …    x1,K      y1    ┐
  x2      x2,1      x2,2      …    x2,K      y2    │ Training set
  …       …         …         …    …         …     │
  xr      xr,1      xr,2      …    xr,K      yr    ┘
  xr+1    xr+1,1    xr+1,2    …    xr+1,K    yr+1  ┐
  xr+2    xr+2,1    xr+2,2    …    xr+2,K    yr+2  │ Test set
  …       …         …         …    …         …     │
  xN      xN,1      xN,2      …    xN,K      yN    ┘

● Train the model on the training set (x1 … xr)
● Use the test set features (xr+1 … xN) to predict y: ŷ
● Compare the predicted ŷ with the real y
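A minimal sketch of such a split (the dataset here is fabricated; 0.75 gives the roughly 3/1 ratio the slides recommend, after shuffling as they also recommend):

```python
import random

# Sketch: shuffle, then split into disjoint training and test sets (~3/1).
# `dataset` is a fabricated list of (features, label) pairs.
dataset = [([i, 2 * i], i % 2) for i in range(100)]

random.seed(1)           # reproducible shuffle
random.shuffle(dataset)  # shuffle before splitting

r = int(0.75 * len(dataset))
training_set, test_set = dataset[:r], dataset[r:]
print(len(training_set), len(test_set))  # 75 25; disjoint by construction
```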
When to use a training/test set?
● Supervised learning
● Not for unsupervised learning (clustering)
● the data are not labeled
Predictive power of the model
[Diagram: train the model on the training set; test it on the test set; the performance measure gives the predictive power; then use the model]
How to split the sets?
● Which observations go where?
● Training set larger than the test set
● Typically about 3:1
● Quite arbitrary
● Generally: more data = better model
● Test set not too small
Distribution of the sets
● Classification
● classes must have similar distributions
● avoid a class not being available in a set
● Classification & regression
● shuffle the dataset before splitting
Effect of sampling
● Sampling can affect performance measures
● Add robustness to these measures: cross-validation
● Idea: sample multiple times, with different separations
Cross-validation
● E.g.: 4-fold cross-validation

[Diagram: the dataset divided into 4 folds; in each of 4 runs a different fold serves as the test set while the remaining 3 folds form the training set]

● Aggregate the results for a robust performance measure
n-fold cross-validation
● Fold the test set over the dataset n times
● Each test set is 1/n the size of the total dataset
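A minimal sketch of the fold bookkeeping (indices only, no model; assumes n divides the dataset size evenly):

```python
# Sketch: generate train/test index splits for n-fold cross-validation.
def n_fold_splits(n_instances, n_folds):
    fold_size = n_instances // n_folds
    indices = list(range(n_instances))
    for i in range(n_folds):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# 4-fold CV over 8 instances: each test fold is 1/4 of the data.
for train, test in n_fold_splits(8, 4):
    print("test:", test)  # [0, 1], then [2, 3], [4, 5], [6, 7]
```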
INTRODUCTION TO MACHINE LEARNING
Let’s practice!
INTRODUCTION TO MACHINE LEARNING
Bias and Variance
What you’ve learned
● Accuracy and other performance measures
● Training and test set
Knitting it all together
● Effect of splitting the dataset (train/test) on accuracy
● Over- and underfitting
Introducing
BIAS VARIANCE
Bias and Variance
● Main goal of supervised learning: prediction
● Prediction error ~ reducible + irreducible error
Irreducible vs. reducible error
● Irreducible: noise — don’t minimize
● Reducible: error due to unfit model — minimize
● Reducible error is split into bias and variance
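One standard way to write this decomposition (textbook statistics, not shown verbatim on the slides), for a fitted model f̂ and noise variance σ²:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\mathrm{Bias}\!\left[\hat{f}(x)\right]^2
    + \mathrm{Var}\!\left[\hat{f}(x)\right]}_{\text{reducible}}
  + \underbrace{\sigma^2}_{\text{irreducible}}
```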
Bias
● Error due to bias: wrong assumptions
● Difference between predictions and truth,
● using models trained by a specific learning algorithm
Example
● Quadratic data
● Assumption: data is linear — use linear regression
● Error due to bias is high: more restrictions on the model
Bias
● Complexity of the model
● More restrictions lead to high bias
Variance
● Error due to variance: error due to the sampling of the training set
● Model with high variance fits training set closely
Example
● Quadratic data
● Few restrictions: fit a polynomial perfectly through the training set
● If you change the training set, the model will change completely
● High variance: generalizes badly to the test set
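A sketch of the effect using NumPy’s polynomial fitting (my choice of tool, not the course’s): degree 1 underfits the quadratic data, while a needlessly high degree chases the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated quadratic data with noise, as in the example.
x = np.linspace(-3, 3, 20)
y = x ** 2 + rng.normal(0, 1, x.size)

for degree in (1, 2, 12):
    coeffs = np.polyfit(x, y, degree)          # polynomial of the given degree
    y_hat = np.polyval(coeffs, x)
    train_rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    print(degree, round(float(train_rmse), 2))
# Degree 1: high bias (underfit). Degree 12: low training error but high
# variance; refit on a new sample and the curve changes completely (overfit).
```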
Bias-variance tradeoff
[Diagram: BIAS weighed against VARIANCE]
● low variance - high bias <-> low bias - high variance
Overfitting
● Accuracy will depend on the dataset split (train/test)
● With high variance, accuracy will heavily depend on the split
● Overfitting = model fits the training set a lot better than the test set
● Too specific
Underfitting
● Restricting your model too much
● High bias
● Too general
Example - spam or not?
● Emails training set: features are capital letters and exclamation marks
● Exception: one email with 50 capital letters and 30 exclamation marks is no spam

Truth:
  A lot of capital letters?
    no  -> no spam
    yes -> A lot of exclamation marks?
      no  -> no spam
      yes -> spam
Example - spam or not?
Overfit - too specific!
  A lot of capital letters?
    no  -> no spam
    yes -> A lot of exclamation marks?
      no  -> no spam
      yes -> 50 capital letters?
        no  -> spam
        yes -> 30 exclamation marks?
          no  -> spam
          yes -> no spam
Example - spam or not?
Underfit - too general!
  More than 10 capital letters?
    no  -> no spam
    yes -> spam
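For concreteness, the two rule sets written as plain functions; the 10-letter threshold comes from the underfit slide, while the thresholds for “a lot of” in the overfit rules are assumptions for illustration:

```python
# Sketch: the underfit (too general) and overfit (too specific) rule sets.
def underfit(caps, exclamations):
    return "spam" if caps > 10 else "no spam"   # one crude rule

def overfit(caps, exclamations):
    if caps > 20 and exclamations > 10:         # "a lot of" both (assumed thresholds)
        if caps == 50 and exclamations == 30:   # the single training exception
            return "no spam"
        return "spam"
    return "no spam"
```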
INTRODUCTION TO MACHINE LEARNING
Let’s practice!