naΓ―ve bayes classifiers - courses.cs.washington.edu...using this one weird trick! meet singles near...
TRANSCRIPT
NaΓ―ve Bayes Classifiers
Jonathan Lee and Varun Mahadevan
Independence
Recap:
Definition: Two events X and Y are independent if
π(ππ) = π(π)π(π),
and if π π > 0, then π π π = π(π)
Conditional Independence
Conditional Independence tells us that: Two events A and B are conditionally independent given C if
π π΄ β© π΅ πΆ = π π΄ πΆ π(π΅|πΆ),
and if P(B) > 0 and P(C) > 0, then π π΄ π΅πΆ = π π΄ πΆ
Example:
Randomly choose a day of the week
A = { It is not a Monday }
B = { It is a Saturday }
C = { It is the weekend }
A and B are dependent events
P(A) = 6/7, P(B) = 1/7, P(AB) = 1/7
Now condition both A and B on C:
P(A|C) = 1, P(B|C) = Β½, P(AB|C) = Β½
P(AB|C) = P(A|C)P(B|C) => A|C and B|C are independent
Programming Project: Spam Filter
Due: Check the Calendar
β’ Implement a Naive Bayes classifier for classifying emails as either spam or ham.
β’ You may use C, Java, Python, or R; ask if you have a different preference. Weβve provided starter code in Java, Python and R.
β’ Read Jonathanβs notes on the website, start early, and ask for help if you get stuck!
Spam vs. Ham
β’ (at least in the past) the bane of any email userβs existence
β’ You know it when you see it!
β’ Easy for humans to identify, but not necessarily easy for computers
β’ Less of a problem for consumers now, because spam filters have gotten really goodβ¦
The spam classification problem
β’ Input: collection of emails, already labelled spam or ham β’ Someone has to label these by hand!
β’ Usually called the training data
β’ Use this data to train a model that βunderstandsβ what makes an email spam or ham β’ Weβre using a NaΓ―ve Bayes classifier, but there are other
approaches
β’ This is a Machine Learning problem (take 446 for more!)
β’ Test your model on emails whose label isnβt known to the model, and see how well it does β’ Usually called the test data
NaΓ―ve Bayes in the real world
β’ One of the oldest, simplest models for classification
β’ Still, very powerful and used all the time in the real world/industry β’ Identifying credit card fraud
β’ Identifying fake Amazon reviews
β’ Identifying vandalism on Wikipedia
β’ Still used (with modifications) by Gmail to prevent spam
β’ Facial recognition
β’ Categorizing Google News articles
β’ Even used for medical diagnosis!
How do we represent an email?
SUBJECT: Top Secret Business Venture Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secretβ¦
(top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virture, of, its, nature, as, being, utterly, confidencial, and)
β’ Thereβs a lot of different things about emails that might give a computer a hint about whether or not itβs spam β’ Possible features: words in body, subject line, sender, message header, time
sentβ¦
β’ For this assignment, we choose to represent an email just as the set of distinct words (π₯1, π₯2, β¦ , π₯π) in the subject and body
Notice that there are no duplicate words!
NaΓ―ve Bayes in theory
The concepts behind NaΓ―ve Bayes are nothing new to you -- weβll be using what weβve learned in the past few weeks.
Specifically
β’ Bayes Theorem
π π΄ π΅ =π π΅|π΄ π π΄
π(π΅)
β’ Law of Total Probability
π(π΄) = π π΄ π΅π π(π΅π)π
β’ Chain Rule π π΄1, β¦ , π΄π =π π΄1 π π΄2 π΄1 β¦π π΄π π΄πβ1β¦π΄1
β’ Conditional Independence
π π΄ β© π΅ πΆ = π π΄ πΆ π(π΅|πΆ)
β’ Conditional Probability
π π΄|π΅ =π π΄β©π΅
π(π΅)
Programming Project
β’ Take the set of distinct words (π₯1, π₯2, β¦ , π₯π) to represent the text in an email.
β’ We are trying to compute
π ππππ π₯1, π₯2, β¦ , π₯π = ? ? ?
β’ By applying Bayes Theorem, we can reverse the conditioning. Itβs easier to find the probability of a word appearing in a spam email than the reverse.
π ππππ π₯1, π₯2, β¦ , π₯π =π π₯1, π₯2, β¦ π₯π ππππ π(ππππ)
π(π₯1, π₯2, β¦ , π₯π)
=π π₯1, π₯2, β¦ π₯π ππππ π(ππππ)
π π₯1, π₯2, β¦ π₯π ππππ π ππππ + π π₯1, π₯2, β¦ π₯π π»ππ π(π»ππ)
Programming Project
β’ Letβs take a look at the numerator and apply the rule for Conditional Probability
π π₯1, π₯2, β¦ π₯π ππππ π ππππ = π(π₯1, π₯2, β¦ , π₯π, ππππ)
β’ And now letβs use the Chain Rule to decompose this
π π₯1, π₯2, β¦ , π₯π, ππππ = π π₯1 π₯2, β¦ , π₯π, ππππ π(π₯2, β¦ , π₯π, ππππ)
= π π₯1 π₯2, β¦ , π₯π, ππππ π(π₯2|π₯3, β¦ , π₯π, ππππ)π(π₯3, β¦ , π₯π, ππππ)
= β―
= π π₯1 π₯2, β¦ , π₯π, ππππ β¦π π₯πβ1 π₯π, ππππ π π₯π ππππ π(ππππ)
But this is still hard to compute.
β’ Letβs simplify the problem with an assumption.
β’ We will assume that the words in the email are conditionally independent of each other, given that we know whether or not the email is spam.
β’ This is why we call this NaΓ―ve Bayes. This isnβt true irl!
β’ So how does this help?
π π₯1, π₯2, β¦ , π₯π, ππππ= π π₯1 π₯2, β¦ , π₯π, ππππ β¦π π₯πβ1 π₯π, ππππ π π₯π ππππ π(ππππ)
β π π₯1 ππππ π(π₯2|ππππ)β¦π π₯πβ1 ππππ π π₯π ππππ π(ππππ)
π(π₯1, π₯2, β¦ , π₯π, ππππ) β π(ππππ) π(π₯π|ππππ)
π
π=1
Programming Project
β’ So we know that
π(π₯1, π₯2, β¦ , π₯π, ππππ) β π(ππππ) π(π₯π|ππππ)
π
π=1
β’ Similarly
π(π₯1, π₯2, β¦ , π₯π, π»ππ) β π(π»ππ) π(π₯π|π»ππ)
π
π=1
β’ Putting it all together
π ππππ π₯1, π₯2, β¦ , π₯π βπ(ππππ) π(π₯π|ππππ)
ππ=1
π ππππ π π₯π ππππππ=1 + π(π»ππ) π(π₯π|π»ππ)
ππ=1
How spammy is a word?
β’ Have a nice formula for email spam probability, using conditional probabilities of words given ham/spam
β’ π ππππ and π(π»ππ) are just the proportion of total emails that are spam and ham
β’ What is π(π£πππππ|ππππ) asking?
β’ Would be easy to just count up how many spam emails have this word in them, so
π π€ ππππ = ππ’ππππ ππ π πππ ππππππ ππππ‘ππππππ π€
π‘ππ‘ππ ππ’ππππ ππ π πππ ππππππ (maybe)
β’ This seems reasonable, but thereβs a problemβ¦
β’ Suppose the word Pokemon only ever showed up in ham emails in the training data, never in spam
π πππππππ ππππ = 0
β’ Since the overall spam probability is the product of a bunch of individual probabilities, if any of those is 0, the whole thing is 0
β’ Any email with the word Pokemon would be assigned a spam probability of 0
β’ What can we do?
SUBJECT: Get out of debt! Cheap prescription pills! Earn fast cash using this one weird trick! Meet singles near you and get preapproved for a low interest credit card! Pokemon
definitely not spam, right?
Laplace smoothing
β’ Crazy idea: what if we pretend weβve seen every outcome once already?
β’ Pretend weβve seen one more spam email with π€, one more without π€
π π€ ππππ = ππ’ππππ ππ π πππ ππππππ ππππ‘ππππππ π€ + 1
π‘ππ‘ππ ππ’ππππ ππ π πππ ππππππ + 2
β’ Then, π πππππππ ππππ > 0
β’ No one word can βpoisonβ the overall probability too much
β’ General technique to avoid assuming that unseen events will never happen