Building a Naive Bayes Classifier

Eric Wilson, Search Engineer, Manta Media


Slides for a talk at Wittenberg to undergraduates introducing the concept of a Naive Bayes Classifier.


Page 1: Naive Bayes

Building a Naive Bayes Classifier

Eric Wilson
Search Engineer

Manta Media

Page 2: Naive Bayes

The problem: Undesirable Content

Recommended by 3 people:

Bob Perkins: It is a pleasure to work with Kim! Her work is beautiful and she is professional, communicative, and friendly.

Fred: She lied and stole my money, STAY AWAY!!!!!

Jane Robinson: Very Quick Turn Around as asked - Synced up Perfectly Great Help!

Page 3: Naive Bayes

Possible solutions

● First approach: manually remove undesired content.

● Attempt to filter based on lists of banned words.

● Use a machine learning algorithm to identify undesirable content based on a small set of manually classified examples.

Page 4: Naive Bayes

Using Naive Bayes isn't too hard!

● We'll need a bit of probability, including the concept of conditional probability.

● A few natural language processing ideas will be necessary.

● Facility with any modern programming language.

● Persistence with many details.

Page 5: Naive Bayes

Probability 101

Suppose we choose a number from the set:

U = {1,2,3,4,5,6,7,8,9,10}

Let A be the event that the number is even, and B be the event that the number is prime.

Compute P(A), P(B), P(A|B), and P(B|A), where P(A|B) is the probability of A given B.

Page 6: Naive Bayes

Just count!

[Venn diagram: A = {2, 4, 6, 8, 10} (even), B = {2, 3, 5, 7} (prime), A ∩ B = {2}; 1 and 9 lie outside both sets]

P(A) = 5/10 = 1/2
P(B) = 4/10 = 2/5
P(A|B) = 1/4
P(B|A) = 1/5

Page 7: Naive Bayes

Bayes Theorem

P(A|B) = P(AB)/P(B)

P(B)P(A|B) = P(AB)

P(B)P(A|B) = P(A)P(B|A)

P(A|B) = P(A)P(B|A)/P(B)

Page 8: Naive Bayes

A simplistic language model

Consider each document to be a set of words, along with frequencies.

For example: “The premium quality for the discount price” is viewed as:

{'the': 2, 'premium': 1, 'quality': 1, 'for': 1, 'discount': 1, 'price': 1}

Same as “The discount quality for the premium price,” since we don't care about order.
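As an illustration (not part of the original slides), here is a minimal Python sketch of this bag-of-words view; the lower-casing and whitespace split are assumed preprocessing choices:

from collections import Counter

def bag_of_words(text):
    """View a document as word counts, ignoring order."""
    return Counter(text.lower().split())

print(dict(bag_of_words("The premium quality for the discount price")))
# {'the': 2, 'premium': 1, 'quality': 1, 'for': 1, 'discount': 1, 'price': 1}

# Word order is ignored, so the two phrases look identical:
print(bag_of_words("The discount quality for the premium price")
      == bag_of_words("The premium quality for the discount price"))  # True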

Page 9: Naive Bayes

That seems … foolish

● English is so complicated that we won't have any real hope of understanding semantics.

● In many real-life scenarios, text that you want to classify is not exactly subtle.

● If necessary, we can improve our language model later.

Page 10: Naive Bayes

An example:

Type     | Text                       | Class
Training | Good happy good            | Positive
Training | Good good service          | Positive
Training | Good friendly              | Positive
Training | Lousy good cheat           | Negative
Test     | Good good good cheat lousy | ??

So that we can carry out all the calculations by hand, we will use an example with extremely small documents.

Page 11: Naive Bayes

What was the question?

We are trying to determine whether the last recommendation was positive or negative.

We want to compute:

P(Pos|good good good lousy cheat)

By Bayes Theorem, this is equal to:

P(Pos) P(good good good lousy cheat|Pos) / P(good good good lousy cheat)

Page 12: Naive Bayes

What do we know?

P(Pos) = 3/4

P(good|Pos), P(cheat|Pos), and P(lousy|Pos) are all easily computed by counting over the training set.

Which is almost what we want ...

Page 13: Naive Bayes

Wouldn't it be nice ...

Maybe we have all we need? Isn't

P(good good good lousy cheat|Pos) = P(good|Pos)^3 P(lousy|Pos) P(cheat|Pos) ?

Well, yes, if these are independent events, which almost certainly doesn't hold.

The “naive” assumption is that we can consider these events independent.

Page 14: Naive Bayes

The Naive Bayes Algorithm

If C1, C2, ..., Cn are classes, and an instance has features F1, F2, ..., Fm, then the most likely class for this instance is the one that maximizes the following:

P(Ci) P(F1|Ci) P(F2|Ci) ... P(Fm|Ci)
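A minimal sketch of this decision rule in Python (not from the talk; the priors/likelihoods structures and their names are assumptions):

def most_likely_class(features, priors, likelihoods):
    """Pick the class C maximizing P(C) * P(F1|C) * ... * P(Fm|C).

    priors:      {class: P(C)}
    likelihoods: {class: {feature: P(F|C)}}
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for f in features:
            # Unseen features get probability 0 here; smoothing (a later slide)
            # is what keeps this from zeroing out the whole product.
            score *= likelihoods[c].get(f, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class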

Page 15: Naive Bayes

Wasn't there a denominator?

If our goal was to compute the probability of the most likely class, we should divide by:

P(F1) P(F2) ... P(Fm)

We can ignore this denominator because we only care about which class has the highest probability, and this term is the same for each class.

Page 16: Naive Bayes

Interesting theory but …

Won't this break as soon as we encounter a word that isn't in our training set?

For example, if “goood” does not occur in our training set but appears in our test set, then P(goood|C) = 0 for every class C, so our product is zero for all classes.

We need nonzero probabilities for all words, even words that don't exist.

Page 17: Naive Bayes

Plus-one smoothing

Just count every word one time more than it actually occurs.

Since we are only concerned with relative probabilities, this inaccuracy should be of no concern.

P(word|C) = (count(word|C) + 1) / (count(C) + V)

(V is the size of the vocabulary, so that our probabilities sum to 1.)
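In code, the smoothed estimate could look like this sketch (the argument names are illustrative, not from the slides):

def smoothed_prob(word_count_in_class, total_words_in_class, vocab_size):
    """Plus-one (Laplace) smoothed estimate of P(word|C)."""
    return (word_count_in_class + 1) / (total_words_in_class + vocab_size)

# Matches the worked example on the next slide:
# P(good|Pos) = (5 + 1) / (8 + 6) = 3/7
print(smoothed_prob(5, 8, 6))  # 0.428571...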

Page 18: Naive Bayes

Let's try it out:

Type     | Text                       | Class
Training | Good happy good            | Positive
Training | Good good service          | Positive
Training | Good friendly              | Positive
Training | Lousy good cheat           | Negative
Test     | Good good good cheat lousy | ??

P(Pos) = 3/4
P(Neg) = 1/4

P(good|Pos) = (5+1)/(8+6) = 3/7
P(cheat|Pos) = (0+1)/(8+6) = 1/14
P(lousy|Pos) = (0+1)/(8+6) = 1/14

P(good|Neg) = (1+1)/(3+6) = 2/9
P(cheat|Neg) = (1+1)/(3+6) = 2/9
P(lousy|Neg) = (1+1)/(3+6) = 2/9

P(Pos|D5) ~ 3/4 * (3/7)^3 * (1/14) * (1/14) ≈ 0.0003
P(Neg|D5) ~ 1/4 * (2/9)^3 * (2/9) * (2/9) ≈ 0.0001
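As a check, the slide's arithmetic can be reproduced exactly with Python fractions:

from fractions import Fraction as F

# Smoothed probabilities from above.
p_pos = {'good': F(6, 14), 'cheat': F(1, 14), 'lousy': F(1, 14)}
p_neg = {'good': F(2, 9),  'cheat': F(2, 9),  'lousy': F(2, 9)}

doc = "good good good cheat lousy".split()

pos_score = F(3, 4)
neg_score = F(1, 4)
for word in doc:
    pos_score *= p_pos[word]
    neg_score *= p_neg[word]

print(float(pos_score))  # ~0.0003 -> Positive wins
print(float(neg_score))  # ~0.0001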

Page 19: Naive Bayes

Training the classifier

● Count instances of classes, store counts in a map.
● Store counts of all words in a nested map:

{'pos': {'good': 5, 'friendly': 1, 'service': 1, 'happy': 1},
 'neg': {'cheat': 1, 'lousy': 1, 'good': 1}}

● Should be easy to compute probabilities.
● Should be efficient (training time and memory); see the sketch below.
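A minimal training sketch along these lines (the function and variable names here are mine, not from the talk):

from collections import defaultdict

def train(examples):
    """examples: iterable of (list_of_words, class_label) pairs.

    Returns class counts and the nested word-count map described above.
    """
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, label in examples:
        class_counts[label] += 1
        for word in words:
            word_counts[label][word] += 1
    return class_counts, word_counts

training_data = [
    ("good happy good".split(), "pos"),
    ("good good service".split(), "pos"),
    ("good friendly".split(), "pos"),
    ("lousy good cheat".split(), "neg"),
]
class_counts, word_counts = train(training_data)
print(dict(word_counts["pos"]))
# {'good': 5, 'happy': 1, 'service': 1, 'friendly': 1}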

Page 20: Naive Bayes

Some practical problems

● Tokenization
● Arithmetic
● How to evaluate results?

Page 21: Naive Bayes

Tokenization

● Use whitespace?
– “food”, “food.”, “food,” and “food!” are all different tokens.

● Use whitespace and punctuation?
– “won't” is tokenized to “won” and “t”.

● What about emails? URLs? Phone numbers? What about the things we haven't thought about yet?

● Use a library. Lucene is a good choice; a rough sketch of a simpler approach follows below.
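Lucene is a Java library; purely for illustration, a crude regex tokenizer in Python might look like this (not a substitute for a real analyzer):

import re

# Letters and digits, optionally with an internal apostrophe so "won't" survives.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(tokenize("Won't buy food, food. FOOD!"))
# ["won't", 'buy', 'food', 'food', 'food']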

Page 22: Naive Bayes

Arithmetic

What happens when you multiply together a large number of small numbers?

To prevent underflow, use sums of logs instead of products of true probabilities.

Key properties of log:
● log(AB) = log(A) + log(B)
● x > y => log(x) > log(y)
● Turns very small numbers into manageable negative numbers
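A sketch of scoring in log space (assuming smoothed word probabilities as above; the names are illustrative):

import math

def log_score(words, prior, word_probs):
    """log P(C) + sum of log P(word|C), instead of the raw product."""
    score = math.log(prior)
    for w in words:
        score += math.log(word_probs[w])
    return score

# The raw product 3/4 * (3/7)^3 * (1/14) * (1/14) is about 0.0003;
# its log is a much more manageable -8.1 or so.
probs = {'good': 3 / 7, 'cheat': 1 / 14, 'lousy': 1 / 14}
print(log_score("good good good cheat lousy".split(), 3 / 4, probs))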

Page 23: Naive Bayes

Evaluating a classifier

● Precision and recall
● Confusion matrix
● Cross-validation: divide the training set into ten “folds”, train the classifier on nine of them, and check the accuracy of classifying the tenth fold
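A minimal sketch of this ten-fold cross-validation (the train_fn/classify_fn interface is hypothetical):

def cross_validate(examples, train_fn, classify_fn, k=10):
    """Hold out each of k folds in turn, train on the rest, and average accuracy.

    examples: list of (words, label) pairs.
    """
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_set)
        correct = sum(classify_fn(model, words) == label for words, label in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k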

Page 24: Naive Bayes

Experiment

● Tokenization strategies
– Stop words

– Capitalization

– Stemming

● Language model
– Ignore multiplicities

– Smoothing

Page 25: Naive Bayes

Contact me

● [email protected]

● @wilsonericn

● http://wilsonericn.wordpress.com