Lecture 8: Machine Learning in Practice (1)
TRANSCRIPT
Machine Learning for Language Technology 2015
http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm
Machine Learning in Practice (1)
Marina Santini
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Autumn 2015
Acknowledgements
• Weka's slides
• Witten et al. (2011): Ch. 5, pp. 156–180
• Daumé III (2015): Ch. 4, pp. 65–67
Lecture 8 ML in Practice (1) 2
Outline
• Comparing schemes: the t-test
• Predicting probabilities
• Cost-sensitive measures
• Occam's razor
Comparing data mining schemes
• Frequent question: which of two learning schemes performs better?
• Note: this is domain dependent!
• Obvious way: compare 10-fold CV estimates
• Generally sufficient in applications (we don't lose much if the chosen method is not truly better)
• However, what about machine learning research?
♦ Need to show convincingly that a particular method works better
Comparing schemes II
• Want to show that scheme A is better than scheme B in a particular domain
♦ For a given amount of training data
♦ On average, across all possible training sets
• Let's assume we have an infinite amount of data from the domain:
♦ Sample infinitely many datasets of the specified size
♦ Obtain a cross-validation estimate on each dataset for each scheme
♦ Check if the mean accuracy for scheme A is better than the mean accuracy for scheme B
Paired t-test
• In practice we have limited data and a limited number of estimates for computing the mean
• Student's t-test tells us whether the means of two samples are significantly different
• In our case the samples are cross-validation estimates for different datasets from the domain
• Use a paired t-test because the individual samples are paired
♦ The same CV is applied twice
William Gosset (born 1876 in Canterbury; died 1937 in Beaconsfield, England) obtained a post as a chemist in the Guinness brewery in Dublin in 1899. He invented the t-test to handle small samples for quality control in brewing, and wrote under the name "Student".
Distribution of the means
• Let x1 x2 … xk and y1 y2 … yk be the two sets of cross-validation estimates
• mx and my are their means
• With enough samples, the mean of a set of independent samples is normally distributed
• The estimated variances of the means are σx²/k and σy²/k
• If μx and μy are the true means, then

(mx − μx) / √(σx²/k)   and   (my − μy) / √(σy²/k)

are approximately normally distributed with mean 0, variance 1
Student's distribution
• With small samples (k < 100) the mean follows Student's distribution with k–1 degrees of freedom
• Confidence limits, assuming we have 10 estimates (i.e. 9 degrees of freedom), compared with the normal distribution:

9 degrees of freedom:

| Pr[X ≥ z] | z |
|---|---|
| 0.1% | 4.30 |
| 0.5% | 3.25 |
| 1% | 2.82 |
| 5% | 1.83 |
| 10% | 1.38 |
| 20% | 0.88 |

Normal distribution:

| Pr[X ≥ z] | z |
|---|---|
| 0.1% | 3.09 |
| 0.5% | 2.58 |
| 1% | 2.33 |
| 5% | 1.65 |
| 10% | 1.28 |
| 20% | 0.84 |
Distribution of the differences
• Let md = mx − my
• The difference of the means (md) also has a Student's distribution with k–1 degrees of freedom
• Let σd² be the variance of the difference samples
• The standardized version of md is called the t-statistic:

t = md / √(σd²/k)

• We use t to perform the t-test
Performing the test
• Fix a significance level
♦ If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
♦ i.e. the true difference can be positive or negative
• Look up the value for z that corresponds to α/2
• If t ≤ −z or t ≥ z, then the difference is significant
♦ i.e. the null hypothesis (that the difference is zero) can be rejected
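The recipe above can be put together as a small sketch, using the critical values from the 9-degrees-of-freedom table earlier in the lecture (the dictionary and function names are my own, for illustration only):

```python
# Critical values for Student's distribution with 9 degrees of freedom,
# keyed by Pr[X >= z] (copied from the table in the lecture).
T9 = {0.20: 0.88, 0.10: 1.38, 0.05: 1.83,
      0.01: 2.82, 0.005: 3.25, 0.001: 4.30}

def significant(t, alpha=0.01):
    """Two-tailed test at significance level alpha: look up the z value
    for alpha/2 and reject the null hypothesis if |t| is at least z."""
    z = T9[alpha / 2]
    return t <= -z or t >= z
```

For example, a t-statistic of 5.1 with 9 degrees of freedom is significant at the 1% level (z = 3.25 for α/2 = 0.5%), while t = 1.0 is not.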
Unpaired observations
• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other)
• Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
• The estimate of the variance of the difference of the means becomes:

σx²/k + σy²/j
Predicting probabilities
• Performance measure so far: success rate
• Also called the 0-1 loss function:

Σᵢ { 0 if prediction i is correct, 1 if prediction i is incorrect }

• Most classifiers produce class probabilities
• Depending on the application, we might want to check the accuracy of the probability estimates
• 0-1 loss is not the right thing to use in those cases
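A minimal sketch of the 0-1 loss averaged over instances (the function name and class labels are illustrative, not from the lecture):

```python
def zero_one_loss(predicted, actual):
    """Fraction of incorrect predictions: the 0-1 loss averaged
    over all instances (1 minus the success rate)."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(predicted)
```

For three instances with one mistake, the loss is 1/3, i.e. a 2/3 success rate.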
Quadratic loss function
• p1 … pk are probability estimates for an instance
• c is the index of the instance's actual class
• a1 … ak = 0, except for ac, which is 1
• The quadratic loss is:

Σⱼ (pⱼ − aⱼ)²

• We want to minimize its expected value, E[Σⱼ (pⱼ − aⱼ)²]; this is achieved when the pⱼ are the true class probabilities
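The quadratic loss for a single instance can be sketched directly from the definition above (illustrative function name; the probability vector below is made up):

```python
def quadratic_loss(p, c):
    """Quadratic loss for one instance: sum over classes j of
    (p_j - a_j)^2, where a_c = 1 for the actual class c and
    a_j = 0 otherwise."""
    return sum((pj - (1 if j == c else 0)) ** 2
               for j, pj in enumerate(p))

# e.g. three classes, actual class is index 0:
loss = quadratic_loss([0.7, 0.2, 0.1], 0)   # (0.3)^2 + (0.2)^2 + (0.1)^2
```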
Informational loss function
• The informational loss function is −log₂(pc), where c is the index of the instance's actual class
• Let p1* … pk* be the true class probabilities
• Then the expected value of the loss function is:

−p1* log₂ p1 − … − pk* log₂ pk
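The per-instance informational loss is a one-liner (illustrative function name; the example probabilities are made up):

```python
import math

def informational_loss(p, c):
    """Informational loss for one instance: -log2 of the probability
    the classifier assigned to the actual class c."""
    return -math.log2(p[c])

# A 50/50 estimate on the actual class costs exactly 1 bit:
loss = informational_loss([0.5, 0.5], 0)
```

Note that the loss grows without bound as the probability assigned to the actual class approaches zero.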
Discussion
• Which loss function should we choose?
♦ The quadratic loss function takes into account all class probability estimates for an instance; it is bounded by 1 + Σⱼ pⱼ² and can never exceed 2
♦ Informational loss focuses only on the probability estimate for the actual class
The kappa statistic
• Two confusion matrices for a 3-class problem: actual predictions (left) vs. random predictions (right)
• Number of successes: sum of the entries on the diagonal (D)
• The kappa statistic measures the relative improvement over random predictions:

κ = (D_observed − D_random) / (D_perfect − D_random)
Kappa statistic: calculations
• Proportion of class "a" = 0.5 (i.e. 100 instances out of 200 → 50% → 0.5)
• Proportion of class "b" = 0.3 (i.e. 60 instances out of 200 → 30% → 0.3)
• Proportion of class "c" = 0.2 (i.e. 40 instances out of 200 → 20% → 0.2)
Both classifiers (see below) return 120 a's, 60 b's and 20 c's, but one classifier is random. How much does the actual classifier improve on the random classifier?
A classifier guessing at random would return the predictions in the table on the RHS: 0.5 × 120 = 60; 0.3 × 60 = 18; 0.2 × 20 = 4 → 60 + 18 + 4 = 82
The actual classifier returns the predictions in the table on the LHS: 140 correct predictions (see the diagonal), i.e. a 70% success rate. However:
kappa statistic = (140 − 82) / (200 − 82) = 58/118 = 0.49 = 49%
• So the actual success rate of 70% represents an improvement of 49% over random guessing!
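The slide's arithmetic can be checked with a short sketch (the counts come from the slide; the function name is my own):

```python
def kappa(d_observed, d_random, d_perfect):
    """Kappa statistic: relative improvement of the observed number
    of successes over the number expected from random guessing."""
    return (d_observed - d_random) / (d_perfect - d_random)

# Expected successes of a random classifier with the same
# prediction counts (120 a's, 60 b's, 20 c's):
d_random = 0.5 * 120 + 0.3 * 60 + 0.2 * 20   # = 82.0

k = kappa(140, d_random, 200)                # = 58/118, about 0.49
```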
In summary
• A kappa statistic of 100% (or 1) implies a perfect classifier.
• A kappa statistic of 0 implies that the classifier provides no information and behaves as if it were guessing randomly.
• The kappa statistic measures the agreement between the predicted and observed categorizations of a dataset, correcting for agreement that occurs by chance.
• Weka reports the kappa statistic to assess the success rate beyond chance.
Quiz 1: kappa statistic
Our classifier predicts Red 41 times, Green 29 times and Blue 30 times. The actual numbers for the sample are: 40 Red, 30 Green and 30 Blue. Overall, our classifier is right 70% of the time.
Suppose these predictions had been random guesses. Our classifier would have been right by chance: 0.4 × 41 + 0.3 × 29 + 0.3 × 30 = 34.1 (random guessing)
So the actual success rate of 70% represents an improvement of 35.9% over random guessing.
What is the kappa statistic for our classifier?
1. 0.54
2. 0.60
3. 0.70
Counting the cost
• In practice, different types of classification errors often incur different costs
• Examples:
♦ Promotional mailing
♦ Terrorist profiling
- "Not a terrorist" is correct 99.99% of the time, but if you miss the 0.01% the cost will be very high
♦ Loan decisions
♦ etc.
• There are many other types of cost!
♦ E.g.: the cost of collecting training data
Counting the cost
• The confusion matrix:

|  | Predicted: No | Predicted: Yes |
|---|---|---|
| Actual: No | True negative | False positive |
| Actual: Yes | False negative | True positive |
Classification with costs
• Two cost matrices:
• Success rate is replaced by average cost per prediction
♦ The cost is given by the appropriate entry in the cost matrix
Cost-sensitive classification
• We can take costs into account when making predictions
♦ Basic idea: only predict the high-cost class when very confident about the prediction
• Given: predicted class probabilities
♦ Normally we just predict the most likely class
♦ Here, we should make the prediction that minimizes the expected cost
• Expected cost: dot product of the vector of class probabilities and the appropriate column in the cost matrix
• Choose the column (class) that minimizes the expected cost
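The expected-cost rule can be sketched as follows. The cost matrix below is hypothetical (rows = actual class, columns = predicted class; a false negative is assumed ten times as costly as a false positive):

```python
def min_cost_class(probs, cost_matrix):
    """Return the index of the class whose column in the cost matrix
    gives the lowest expected cost, i.e. the lowest dot product with
    the class-probability vector."""
    n = len(probs)
    expected = [sum(probs[i] * cost_matrix[i][j] for i in range(n))
                for j in range(n)]
    return min(range(n), key=expected.__getitem__)

# Classes: 0 = no, 1 = yes. Missing a "yes" costs 10, a false alarm costs 1.
costs = [[0, 1],
         [10, 0]]
```

With probs = [0.8, 0.2] the most likely class is "no", but the expected-cost rule predicts "yes" (expected cost 0.8 vs 2.0), illustrating the "only predict the high-cost class when confident" idea in reverse: here confidence of 80% in "no" is not enough to risk the costly miss.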
Cost-sensitive learning
• So far we haven't taken costs into account at training time
• Most learning schemes do not perform cost-sensitive learning
♦ They generate the same classifier no matter what costs are assigned to the different classes
♦ Example: a standard decision tree learner
• Simple methods for cost-sensitive learning:
♦ Resampling of instances according to costs
♦ Weighting of instances according to costs
• Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
Lift charts
• In practice, costs are rarely known
• Decisions are usually made by comparing possible scenarios
• Example: promotional mailout to 1,000,000 households
♦ Mail to all; 0.1% respond (1000)
♦ A data mining tool identifies the subset of 100,000 most promising households, of which 0.4% respond (400): 40% of the responses for 10% of the cost may pay off
♦ Identify the subset of 400,000 most promising households, of which 0.2% respond (800)
• A lift chart allows a visual comparison
Data for a lift chart
Generating a lift chart
• Sort instances according to their predicted probability of being positive:

| Rank | Predicted probability | Actual class |
|---|---|---|
| 1 | 0.95 | Yes |
| 2 | 0.93 | Yes |
| 3 | 0.93 | No |
| 4 | 0.88 | Yes |
| … | … | … |

• x axis is sample size; y axis is number of true positives
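Given actual classes already sorted by decreasing predicted probability, the lift-chart points can be accumulated with a small sketch (illustrative function name; "yes" marks a positive instance):

```python
def lift_points(actuals_sorted):
    """(sample size, cumulative true positives) points for a lift chart,
    taking instances in order of decreasing predicted probability."""
    tp = 0
    points = []
    for i, actual in enumerate(actuals_sorted, start=1):
        tp += (actual == "yes")
        points.append((i, tp))
    return points
```

For the four rows in the table above this yields (1, 1), (2, 2), (3, 2), (4, 3).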
A hypothetical lift chart
(Chart annotations: 40% of responses for 10% of cost; 80% of responses for 40% of cost)
ROC curves
• ROC curves are similar to lift charts
♦ ROC stands for "receiver operating characteristic"
♦ Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
• Differences to lift charts:
♦ y axis shows the percentage of true positives in the sample rather than the absolute number
♦ x axis shows the percentage of false positives in the sample rather than the sample size
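The same ranked list used for a lift chart yields ROC points once both axes are turned into percentages (a minimal sketch with my own function name, assuming the total positive and negative counts are known):

```python
def roc_points(actuals_sorted, n_pos, n_neg):
    """(%false positives, %true positives) points as the decision
    threshold sweeps down a list ranked by predicted probability."""
    tp = fp = 0
    points = [(0.0, 0.0)]
    for actual in actuals_sorted:
        if actual == "yes":
            tp += 1
        else:
            fp += 1
        points.append((100 * fp / n_neg, 100 * tp / n_pos))
    return points
```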
A sample ROC curve
• Jagged curve: one set of test data
• Smooth curve: use cross-validation
Cross-validation and ROC curves
• Simple method of getting an ROC curve using cross-validation:
♦ Collect probabilities for instances in the test folds
♦ Sort instances according to probabilities
• This method is implemented in WEKA
• However, this is just one possibility
♦ Another possibility is to generate an ROC curve for each fold and average them
ROC curves for two schemes
• For a small, focused sample, use method A
• For a larger one, use method B
• In between, choose between A and B with appropriate probabilities
Recall-Precision Curves
• Percentage of retrieved documents that are relevant: precision = TP/(TP+FP)
• Percentage of relevant documents that are returned: recall = TP/(TP+FN)
• Precision/recall curves have a hyperbolic shape
• Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
• F-measure = (2 × recall × precision)/(recall + precision)
• sensitivity × specificity = (TP/(TP+FN)) × (TN/(FP+TN))
• Area under the ROC curve (AUC): the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one
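The formulas above translate directly into code (illustrative function name; the confusion-matrix counts in the example are made up):

```python
def pr_metrics(tp, fp, fn, tn):
    """Precision, recall (sensitivity), F-measure and specificity
    from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    specificity = tn / (fp + tn)
    return precision, recall, f_measure, specificity

# e.g. 40 true positives, 10 false positives, 20 false negatives,
# 30 true negatives:
p, r, f, s = pr_metrics(40, 10, 20, 30)
```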
Model selection criteria
• Model selection criteria attempt to find a good compromise between:
♦ the complexity of a model
♦ its prediction accuracy on the training data
• Reasoning: a good model is a simple model that achieves high accuracy on the given data
• Also known as Occam's razor: the best theory is the smallest one that describes all the facts

William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
Elegance vs. errors
• Model 1: a very simple, elegant model that accounts for the data almost perfectly
• Model 2: a significantly more complex model that reproduces the data without mistakes
• Model 1 is probably preferable.
The End