Machine Learning in Practice, Lecture 10
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
http://www.theallusionist.com/wordpress/wp-content/uploads/gambling8.jpg
Plan for the Day
- Announcements
- Questions? Quiz answer key posted
- Today's Data Set: Prevalence of Gambling
- Exploring the Concept of Cost
http://www.casino-gambling-dictionary.com/
Quiz Notes
Leave-One-Out Cross Validation
- On each fold, train on all but 1 data point, test on that 1 data point
- Pro: Maximizes the amount of training data used on each fold
- Con: Not stratified
- Con: Takes a long time on large sets
- Best used only when you have a very small amount of training data; only needed when 10-fold cross validation is not feasible because of lack of data
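Leave-one-out cross validation can be sketched in a few lines of Python. The 1-nearest-neighbour classifier and the toy 1-D data set below are hypothetical stand-ins for whatever learner and data you are actually evaluating.

```python
# Leave-one-out cross validation sketch (pure Python, made-up toy data).
# On each fold we train on all but one point and test on the held-out point.

def nn_predict(train, query):
    """1-nearest-neighbour prediction: label of the closest training point."""
    closest = min(train, key=lambda xy: abs(xy[0] - query))
    return closest[1]

def loocv_accuracy(data):
    """Run one fold per data point and return the fraction predicted correctly."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # train on all but one point
        if nn_predict(train, x) == y:     # test on the single held-out point
            correct += 1
    return correct / len(data)

# Hypothetical 1-D data set: (feature value, class label)
data = [(0.1, "low"), (0.2, "low"), (0.3, "low"),
        (0.8, "high"), (0.9, "high"), (1.0, "high")]
print(loocv_accuracy(data))
```

With n data points this runs n training passes, which is why it gets expensive on large sets.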
632 Bootstrap
- A method for estimating performance when you have a small data set
- Consider it an alternative to leave-one-out cross validation
- Sample n times with replacement to create the training set
  - Some instances will be repeated
  - Some will be left out – this will be your test set
  - About 63% of the instances in the original set will end up in the training set
632 Bootstrap
- Estimating error over the training set gives an optimistic estimate of performance, because you trained on those examples
- Estimating error over the test set gives a pessimistic estimate of the error, because the 63/37 split gives you less training data than a 90/10 split
- Estimate error by combining the optimistic and pessimistic estimates: .632*pessimistic_estimate + .368*optimistic_estimate
- Iterate several times and average the performance estimates
Prevalence of Gambling
Gambling Prevalence
- Goal is to predict how often people who fit in a particular demographic group (e.g., male versus female; white versus black versus Hispanic versus other) are classified as having a particular level of gambling risk
- At risk, problem, or pathological
- Either during one specific year or in their lifetime
Gambling Prevalence
* Risk is the most predictive feature.
Gambling Prevalence
* Demographic is the least predictive feature.
Which algorithm will perform best?
http://www.albanycitizenscouncil.org/Pictures/Gambling2.jpg
Which algorithm will perform best?
- Decision Trees: .26 Kappa
- Naïve Bayes: .31 Kappa
- SMO: .53 Kappa
Decision Trees
* What’s it ignoring and why?
With Binary Splits – Kappa .41
What was different with SMO?
- Trained a model for each pair of classes (all pairs)
- The features that were important for one pairwise distinction were different from those for other pairwise distinctions
  - Characteristic=Black was most important for High versus Low (ignored by decision trees)
  - When and Risk were most important for High versus Medium
- Decision Trees pay attention to all distinctions at once, so they totally ignored a feature that was important for some pairwise distinctions
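The all-pairs scheme can be sketched as one binary classifier per pair of classes, with the final label chosen by voting. The pairwise "classifiers" and the features below (`risk`, `black`) are hypothetical stand-ins, loosely echoing the pairwise distinctions described above.

```python
# Sketch of the all-pairs (one-vs-one) scheme used for multi-class problems:
# one binary classifier per pair of classes, final label by majority vote.
from collections import Counter
from itertools import combinations

def one_vs_one_predict(pairwise, classes, instance):
    """Ask each pairwise classifier for a winner, return the majority vote."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](instance)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical pairwise models; each relies on different features, just as the
# SMO result on this data set did for its different pairwise distinctions.
pairwise = {
    ("High", "Medium"): lambda x: "High" if x["risk"] > 0.5 else "Medium",
    ("High", "Low"):    lambda x: "High" if x["black"] else "Low",
    ("Medium", "Low"):  lambda x: "Medium" if x["risk"] > 0.2 else "Low",
}
classes = ["High", "Medium", "Low"]
print(one_vs_one_predict(pairwise, classes, {"risk": 0.9, "black": True}))
```

Because each pairwise model is trained separately, a feature that only matters for one distinction still gets used, which a single decision tree over all classes may not do.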
What was wrong with Naïve Bayes?
- Probably just learned noisy probabilities because the data set is small
- Hard to distinguish Low and Medium
Back to Chapter 5
Thinking About the Cost of an Error – A Theoretical Foundation for Machine Learning Cost
- Making the right choice doesn't cost you anything
- Making an error comes with a cost, and some errors cost more than others
- Rather than evaluating your model in terms of accuracy, which treats every error as though it were the same, you can think about average cost
- The real cost is determined by your application
Unified Framework
- Connection between optimization techniques and evaluation methods
- Think about what function you are optimizing – that's what learning is
- Evaluation measures how well you did that optimization, so it makes sense for there to be a deep connection between the learning technique and the evaluation
- New machine learning algorithms are often motivated by modifications to the conceptualization of the cost of an error
What’s the cost of a gambling mistake?
http://imagecache2.allposters.com/images/pic/PTGPOD/321587~Pile-of-American-Money-Posters.jpg
Thinking About the Practical Cost of an Error
- In document retrieval, precision is more important than recall
- You're picking from the whole web, so if you miss some relevant documents it's not a big deal
- You don't want to have to slog through lots of irrelevant stuff
Thinking About the Practical Cost of an Error
- What if you are trying to predict whether someone will be late?
- Is it worse to predict someone will be late when they won't, or to miss someone who will be late?
Thinking About the Practical Cost of an Error
- What if you're trying to predict whether a message will get a response or not?
Thinking About the Practical Cost of an Error
- Let's say you are picking out errors in student essays
- If you detect an error, you offer the student a correction for their error
- What are the implications of missing an error?
- What are the implications of imagining an error that doesn't exist?
Cost Sensitive Classification
- An example of the connection between the notion of the cost of an error and the training method
- Say you manipulate the cost of different types of errors
- The cost of a decision is computed based on the expected cost
- That affects the function the algorithm is "trying" to optimize: minimize expected cost rather than maximize accuracy
Cost Sensitive Classification
- Cost sensitive classifiers work in two ways:
  - Manipulate the composition of the training data (by either changing the weight of some instances or by artificially boosting the number of instances of some types by strategically including some duplicates)
  - Manipulate the way predictions are made: select the option that minimizes cost rather than the most likely choice
- In practice it's hard to use cost-sensitive classification in a useful way
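The first approach – changing the composition of the training data – can be sketched by duplicating instances of classes whose errors are more expensive. The class weights and toy instances below are illustrative assumptions, not values from the lecture.

```python
# Sketch of cost-sensitive training via resampling: duplicate instances of
# classes whose errors are more expensive, so the learner "sees" them more.
def reweight_by_duplication(instances, class_weight):
    """Repeat each (features, label) pair class_weight[label] times."""
    resampled = []
    for features, label in instances:
        resampled.extend([(features, label)] * class_weight.get(label, 1))
    return resampled

# Hypothetical toy data; mistakes on class C are treated as 10x more expensive.
data = [([0.1], "A"), ([0.2], "B"), ([0.3], "C")]
weights = {"C": 10}
resampled = reweight_by_duplication(data, weights)
print(len(resampled))  # 1 + 1 + 10 = 12 instances
```

The same effect can often be achieved without duplication by passing per-instance weights directly, when the learner supports them.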
Cost Sensitive Classification
- What if it's 10 times more expensive to make a mistake when selecting Class C?
- Expected cost of predicting class j: Σ_i C_ij * p_i
- The cost of predicting class C_j is computed by multiplying the j column of the cost matrix by the corresponding probabilities
- The expected cost of selecting C, if probabilities are computed at A=75%, B=10%, C=15%, is .75*10 + .1*1 = 7.6

Cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1  10
    B   1   0   1
    C   1   1   0
Cost Sensitive Classification
- The expected cost of selecting B, if probabilities are computed at A=75%, B=10%, C=15%, is .75*1 + .15*1 = .9
- If A is selected, expected cost is .1*1 + .15*1 = .25
- You can make a choice by minimizing the expected cost of an error
- So in this case, the class with the highest probability (A) also has the lowest expected cost

Cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1  10
    B   1   0   1
    C   1   1   0
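The expected-cost calculation from these slides can be sketched directly, using the slide's cost matrix and probabilities.

```python
# Expected cost of predicting each class, using the slide's cost matrix
# (rows = actual class, columns = predicted class; predicting C when the
# actual class is A costs 10, all other errors cost 1).
cost = {  # cost[actual][predicted]
    "A": {"A": 0, "B": 1, "C": 10},
    "B": {"A": 1, "B": 0, "C": 1},
    "C": {"A": 1, "B": 1, "C": 0},
}
probs = {"A": 0.75, "B": 0.10, "C": 0.15}

def expected_cost(predicted):
    """Sum over actual classes: P(actual) * cost(actual, predicted)."""
    return sum(p * cost[actual][predicted] for actual, p in probs.items())

for c in "ABC":
    print(c, expected_cost(c))
# Predicting A minimizes expected cost here (0.25 vs 0.90 vs 7.60).
```

A cost-sensitive classifier would output `min("ABC", key=expected_cost)` instead of the most probable class; with this matrix both rules pick A, but a large enough penalty elsewhere could split them.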
Using Cost Sensitive Classification
* Set up the cost matrix
* Assign a high penalty to the largest error cell
Results
- Without cost sensitive classification: .53 Kappa; with cost sensitive classification, performance increased to .55
- Tiny difference, and not statistically significant
- SMO with default settings normally predicts one class with confidence 1 and the others with confidence 0, so cost sensitive classification does not have a big effect
What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j – a_j)²
  - The cost of a decision; j iterates over the classes (A, B, C)
- Penalizes you for putting high confidence on a wrong prediction and/or low confidence on a right prediction

Uniform cost matrix (rows = actual class, columns = predicted class):

        A   B   C
    A   0   1   1
    B   1   0   1
    C   1   1   0
What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j – a_j)²
  - The cost of a decision; j iterates over the classes (A, B, C)
- If C is right and you say A=75%, B=10%, C=15%: (.75 – 0)² + (.1 – 0)² + (.15 – 1)² = 1.295
- If A is right and you say A=75%, B=10%, C=15%: (.75 – 1)² + (.1 – 0)² + (.15 – 0)² = .095
- Lower cost if the highest probability is on the correct choice
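The worked examples on this slide can be reproduced in a few lines, treating the actual vector a as 1 for the true class and 0 elsewhere.

```python
# Quadratic loss sketch: sum over classes of (predicted probability - actual)^2,
# where the actual vector is 1 for the true class and 0 elsewhere.
def quadratic_loss(probs, true_class):
    return sum((p - (1.0 if c == true_class else 0.0)) ** 2
               for c, p in probs.items())

probs = {"A": 0.75, "B": 0.10, "C": 0.15}
print(round(quadratic_loss(probs, "C"), 3))  # 1.295: confident but wrong
print(round(quadratic_loss(probs, "A"), 3))  # 0.095: confident and right
```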
What is the cost of an error?
- Assume all errors have the same cost
- Informational loss: –log_k(p_i)
  - k is the number of classes
  - i is the correct class
  - p_i is the probability placed on class i
- If C is right and you say A=75%, B=10%, C=15%: –log₃(.15) ≈ 1.73
- If A is right and you say A=75%, B=10%, C=15%: –log₃(.75) ≈ .26
- Lower cost if the highest probability is on the correct choice
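This calculation can be sketched as follows, using the slide's base-k logarithm (k = number of classes); note that informational loss only looks at the probability placed on the correct class.

```python
# Informational loss sketch with a base-k logarithm, where k is the number of
# classes; only the probability placed on the correct class matters.
import math

def informational_loss(probs, true_class):
    k = len(probs)  # number of classes
    return -math.log(probs[true_class], k)

probs = {"A": 0.75, "B": 0.10, "C": 0.15}
print(round(informational_loss(probs, "C"), 2))  # 1.73
print(round(informational_loss(probs, "A"), 2))  # 0.26
```

Unlike quadratic loss, this value grows without bound as the probability on the correct class approaches zero.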
Trade Offs Between Quadratic Loss and Informational Loss
- Quadratic loss pays attention to the probabilities placed on all classes
  - So you can get partial credit if you put really low probabilities on some of the wrong choices
  - Bounded (max value is 2)
- Informational loss only pays attention to how you treated the correct prediction
  - More like gambling
  - Not bounded
Minimum Description Length Principle
- Another way of viewing the connection between optimization and evaluation; based on information theory
- Training minimizes how much information you encode in the model
  - How much information does it take to determine what class an instance belongs to?
  - Information is encoded in your feature space
- Evaluation measures how much information is lost in the classification
- Tension between the complexity of the model at training time and information loss at testing time
Take Home Message
- Different types of errors have different costs
  - Costs associated with cells in the confusion matrix
  - Costs may also be associated with the level of confidence with which decisions are made
- Connection between the concept of the cost of an error and the learning method
  - Machine learning algorithms are optimizing a cost function
  - The cost function should reflect the real cost in the world
- In cost sensitive classification, the notion of which types of errors cost more can influence classification performance