Machine Learning in Practice, Lecture 10
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
http://www.theallusionist.com/wordpress/wp-content/uploads/gambling8.jpg
Plan for the Day
- Announcements
- Questions? Quiz answer key posted
- Today's Data Set: Prevalence of Gambling
- Exploring the Concept of Cost
http://www.casino-gambling-dictionary.com/
Quiz Notes
Leave-One-Out Cross Validation
- On each fold, train on all but 1 data point, test on that 1 data point
- Pro: Maximizes the amount of training data used on each fold
- Con: Not stratified
- Con: Takes a long time on large sets
- Best used only when you have a very small amount of training data; only needed when 10-fold cross validation is not feasible because of lack of data
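Leave-one-out cross validation can be sketched in a few lines of Python. The 1-nearest-neighbour classifier and the toy 1-D data set below are hypothetical stand-ins for whatever learner and data you are actually evaluating.

```python
# Leave-one-out cross validation sketch (pure Python, made-up toy data).
# On each fold we train on all but one point and test on the held-out point.

def nn_predict(train, query):
    """1-nearest-neighbour prediction: label of the closest training point."""
    closest = min(train, key=lambda xy: abs(xy[0] - query))
    return closest[1]

def loocv_accuracy(data):
    """Run one fold per data point and return the fraction predicted correctly."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # train on all but one point
        if nn_predict(train, x) == y:     # test on the single held-out point
            correct += 1
    return correct / len(data)

# Hypothetical 1-D data set: (feature value, class label)
data = [(0.1, "low"), (0.2, "low"), (0.3, "low"),
        (0.8, "high"), (0.9, "high"), (1.0, "high")]
print(loocv_accuracy(data))
```

With n data points this runs n training passes, which is why it gets expensive on large sets.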
632 Bootstrap
- A method for estimating performance when you have a small data set
- Consider it an alternative to leave-one-out cross validation
- Sample n times with replacement to create the training set
  - Some instances will be repeated
  - Some will be left out – this will be your test set
  - About 63% of the instances in the original set will end up in the training set
632 Bootstrap
- Estimating error over the training set gives an optimistic estimate of performance, because you trained on those examples
- Estimating error over the test set gives a pessimistic estimate of the error, because the 63/37 split gives you less training data than a 90/10 split
- Estimate error by combining the optimistic and pessimistic estimates: .632*pessimistic_estimate + .368*optimistic_estimate
- Iterate several times and average the performance estimates
Prevalence of Gambling
Gambling Prevalence
- Goal is to predict how often people who fit in a particular demographic group (e.g., male versus female; white versus black versus Hispanic versus other) are classified as having a particular level of gambling risk
- At risk, problem, or pathological
- Either during one specific year or in their lifetime
Gambling Prevalence
* Risk is the most predictive feature.
Gambling Prevalence
* Demographic is the least predictive feature.
Which algorithm will perform best?
http://www.albanycitizenscouncil.org/Pictures/Gambling2.jpg
Which algorithm will perform best?
- Decision Trees: .26 Kappa
- Naïve Bayes: .31 Kappa
- SMO: .53 Kappa
Decision Trees
* What’s it ignoring and why?
With Binary Splits – Kappa .41
What was different with SMO?
- Trained a model for each pair of classes (all pairs)
- The features that were important for one pairwise distinction were different from those for other pairwise distinctions
  - Characteristic=Black was most important for High versus Low (ignored by decision trees)
  - When and Risk were most important for High versus Medium
- Decision Trees pay attention to all distinctions at once, so they totally ignored a feature that was important for some pairwise distinctions
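The all-pairs scheme can be sketched as one binary classifier per pair of classes, with the final label chosen by voting. The pairwise "classifiers" and the features below (`risk`, `black`) are hypothetical stand-ins, loosely echoing the pairwise distinctions described above.

```python
# Sketch of the all-pairs (one-vs-one) scheme used for multi-class problems:
# one binary classifier per pair of classes, final label by majority vote.
from collections import Counter
from itertools import combinations

def one_vs_one_predict(pairwise, classes, instance):
    """Ask each pairwise classifier for a winner, return the majority vote."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](instance)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical pairwise models; each relies on different features, just as the
# SMO result on this data set did for its different pairwise distinctions.
pairwise = {
    ("High", "Medium"): lambda x: "High" if x["risk"] > 0.5 else "Medium",
    ("High", "Low"):    lambda x: "High" if x["black"] else "Low",
    ("Medium", "Low"):  lambda x: "Medium" if x["risk"] > 0.2 else "Low",
}
classes = ["High", "Medium", "Low"]
print(one_vs_one_predict(pairwise, classes, {"risk": 0.9, "black": True}))
```

Because each pairwise model is trained separately, a feature that only matters for one distinction still gets used, which a single decision tree over all classes may not do.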
What was wrong with Naïve Bayes?
- Probably just learned noisy probabilities because the data set is small
- Hard to distinguish Low and Medium
Back to Chapter 5
Thinking About the Cost of an Error – A Theoretical Foundation for Machine Learning Cost
- Making the right choice doesn't cost you anything
- Making an error comes with a cost, and some errors cost more than others
- Rather than evaluating your model in terms of accuracy, which treats every error as though it were the same, you can think about average cost
- The real cost is determined by your application
Unified Framework
- Connection between optimization techniques and evaluation methods
- Think about what function you are optimizing – that's what learning is
- Evaluation measures how well you did that optimization, so it makes sense for there to be a deep connection between the learning technique and the evaluation
- New machine learning algorithms are often motivated by modifications to the conceptualization of the cost of an error
What’s the cost of a gambling mistake?
http://imagecache2.allposters.com/images/pic/PTGPOD/321587~Pile-of-American-Money-Posters.jpg
Thinking About the Practical Cost of an Error
- In document retrieval, precision is more important than recall
- You're picking from the whole web, so if you miss some relevant documents it's not a big deal
- You don't want to have to slog through lots of irrelevant stuff
Thinking About the Practical Cost of an Error
- What if you are trying to predict whether someone will be late?
- Is it worse to predict someone will be late when they won't, or to miss someone who will be late?
Thinking About the Practical Cost of an Error
- What if you're trying to predict whether a message will get a response or not?
Thinking About the Practical Cost of an Error
- Let's say you are picking out errors in student essays
- If you detect an error, you offer the student a correction for their error
- What are the implications of missing an error?
- What are the implications of imagining an error that doesn't exist?
Cost Sensitive Classification
- An example of the connection between the notion of the cost of an error and the training method
- Say you manipulate the cost of different types of errors
- The cost of a decision is computed based on the expected cost
- That affects the function the algorithm is "trying" to optimize: minimize expected cost rather than maximize accuracy
Cost Sensitive Classification
- Cost sensitive classifiers work in two ways:
  - Manipulate the composition of the training data (by either changing the weight of some instances or by artificially boosting the number of instances of some types by strategically including some duplicates)
  - Manipulate the way predictions are made: select the option that minimizes cost rather than the most likely choice
- In practice it's hard to use cost-sensitive classification in a useful way
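The first approach – changing the composition of the training data – can be sketched by duplicating instances of classes whose errors are more expensive. The class weights and toy instances below are illustrative assumptions, not values from the lecture.

```python
# Sketch of cost-sensitive training via resampling: duplicate instances of
# classes whose errors are more expensive, so the learner "sees" them more.
def reweight_by_duplication(instances, class_weight):
    """Repeat each (features, label) pair class_weight[label] times."""
    resampled = []
    for features, label in instances:
        resampled.extend([(features, label)] * class_weight.get(label, 1))
    return resampled

# Hypothetical toy data; mistakes on class C are treated as 10x more expensive.
data = [([0.1], "A"), ([0.2], "B"), ([0.3], "C")]
weights = {"C": 10}
resampled = reweight_by_duplication(data, weights)
print(len(resampled))  # 1 + 1 + 10 = 12 instances
```

The same effect can often be achieved without duplication by passing per-instance weights directly, when the learner supports them.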
Cost Sensitive Classification
- What if it's 10 times more expensive to make a mistake when selecting Class C?
- Expected cost of predicting class j: Σ_i C_ij * p_i
- The cost of predicting class C_j is computed by multiplying the j column of the cost matrix by the corresponding probabilities
- The expected cost of selecting C, if probabilities are computed at A=75%, B=10%, C=15%, is .75*10 + .1*1 = 7.6

Cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1  10
    B   1   0   1
    C   1   1   0
Cost Sensitive Classification
- The expected cost of selecting B, if probabilities are computed at A=75%, B=10%, C=15%, is .75*1 + .15*1 = .9
- If A is selected, expected cost is .1*1 + .15*1 = .25
- You can make a choice by minimizing the expected cost of an error
- So in this case, the class with the highest probability (A) also has the lowest expected cost

Cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1  10
    B   1   0   1
    C   1   1   0
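The expected-cost calculation from these slides can be sketched directly, using the slide's cost matrix and probabilities.

```python
# Expected cost of predicting each class, using the slide's cost matrix
# (rows = actual class, columns = predicted class; predicting C when the
# actual class is A costs 10, all other errors cost 1).
cost = {  # cost[actual][predicted]
    "A": {"A": 0, "B": 1, "C": 10},
    "B": {"A": 1, "B": 0, "C": 1},
    "C": {"A": 1, "B": 1, "C": 0},
}
probs = {"A": 0.75, "B": 0.10, "C": 0.15}

def expected_cost(predicted):
    """Sum over actual classes: P(actual) * cost(actual, predicted)."""
    return sum(p * cost[actual][predicted] for actual, p in probs.items())

for c in "ABC":
    print(c, expected_cost(c))
# Predicting A minimizes expected cost here (0.25 vs 0.90 vs 7.60).
```

A cost-sensitive classifier would output `min("ABC", key=expected_cost)` instead of the most probable class; with this matrix both rules pick A, but a large enough penalty elsewhere could split them.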
Using Cost Sensitive Classification
* Set up the cost matrix
* Assign a high penalty to the largest error cell
Results
- Without cost sensitive classification: .53 Kappa; with cost sensitive classification, performance increased to .55
- Tiny difference, and not statistically significant
- SMO with default settings normally predicts one class with confidence 1 and the others with confidence 0, so cost sensitive classification does not have a big effect
What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j – a_j)²
  - The cost of a decision; j iterates over the classes (A, B, C)
- Penalizes you for putting high confidence on a wrong prediction and/or low confidence on a right prediction

Uniform cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1   1
    B   1   0   1
    C   1   1   0
What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j – a_j)²
  - The cost of a decision; j iterates over the classes (A, B, C)
- If C is right and you say A=75%, B=10%, C=15%: (.75 – 0)² + (.1 – 0)² + (.15 – 1)² = 1.295
- If A is right and you say A=75%, B=10%, C=15%: (.75 – 1)² + (.1 – 0)² + (.15 – 0)² = .095
- Lower cost if the highest probability is on the correct choice
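The worked examples on this slide can be reproduced in a few lines, treating the actual vector a as 1 for the true class and 0 elsewhere.

```python
# Quadratic loss sketch: sum over classes of (predicted probability - actual)^2,
# where the actual vector is 1 for the true class and 0 elsewhere.
def quadratic_loss(probs, true_class):
    return sum((p - (1.0 if c == true_class else 0.0)) ** 2
               for c, p in probs.items())

probs = {"A": 0.75, "B": 0.10, "C": 0.15}
print(round(quadratic_loss(probs, "C"), 3))  # 1.295: confident but wrong
print(round(quadratic_loss(probs, "A"), 3))  # 0.095: confident and right
```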
What is the cost of an error?
- Assume all errors have the same cost
- Informational loss: –log_k(p_i)
  - k is the number of classes
  - i is the correct class
  - p_i is the probability placed on class i
- If C is right and you say A=75%, B=10%, C=15%: –log₃(.15) ≈ 1.73
- If A is right and you say A=75%, B=10%, C=15%: –log₃(.75) ≈ .26
- Lower cost if the highest probability is on the correct choice
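This calculation can be sketched as follows, using the slide's base-k logarithm (k = number of classes); note that informational loss only looks at the probability placed on the correct class.

```python
# Informational loss sketch with a base-k logarithm, where k is the number of
# classes; only the probability placed on the correct class matters.
import math

def informational_loss(probs, true_class):
    k = len(probs)  # number of classes
    return -math.log(probs[true_class], k)

probs = {"A": 0.75, "B": 0.10, "C": 0.15}
print(round(informational_loss(probs, "C"), 2))  # 1.73
print(round(informational_loss(probs, "A"), 2))  # 0.26
```

Unlike quadratic loss, this value grows without bound as the probability on the correct class approaches zero.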
Trade Offs Between Quadratic Loss and Informational Loss
- Quadratic loss pays attention to the probabilities placed on all classes
  - So you can get partial credit if you put really low probabilities on some of the wrong choices
  - Bounded (max value is 2)
- Informational loss only pays attention to how you treated the correct prediction
  - More like gambling
  - Not bounded
Minimum Description Length Principle
- Another way of viewing the connection between optimization and evaluation; based on information theory
- Training minimizes how much information you encode in the model
  - How much information does it take to determine what class an instance belongs to?
  - Information is encoded in your feature space
- Evaluation measures how much information is lost in the classification
- Tension between the complexity of the model at training time and information loss at testing time
Take Home Message
- Different types of errors have different costs
  - Costs associated with cells in the confusion matrix
  - Costs may also be associated with the level of confidence with which decisions are made
- Connection between the concept of the cost of an error and the learning method
  - Machine learning algorithms are optimizing a cost function
  - The cost function should reflect the real cost in the world
- In cost sensitive classification, the notion of which types of errors cost more can influence classification performance