Python & Web Mining
Old Dominion University, Department of Computer Science
CS 495 – Python & Web Mining, Fall 2012
Lecture 5
Hany SalahEldeen Khalil [email protected]
10-03-12
Presented & Prepared by: Justin F. [email protected]
Document Filtering
In a nutshell: classifying documents based on their content. The classification can be binary (good/bad, spam/not-spam) or n-ary (school-related emails, work-related, commercials, etc.).
Why do we need Document filtering?
• Eliminating spam.
• Removing unrelated comments in forums and public message boards.
• Classifying social/work-related emails automatically.
• Forwarding information-request emails to the expert who is most capable of answering them.
Spam Filtering
• First it was rule-based classifiers, which flagged patterns such as:
• Overuse of capital letters
• Words related to pharmaceutical products
• Garish HTML colors
Cons of using Rule-based classifiers
• Easy to trick by just avoiding the known patterns (capital letters, etc.).
• What is considered spam varies from one person to another.
• Ex: the inbox of a medical rep vs. the email of a housewife.
Solution
• Develop programs that learn.
• Teach them the differences and how to recognize each class by providing examples of each class.
Features
• We need to extract features from documents in order to classify them.
• Feature: anything that you can determine as being either present or absent in the item.
Definitions
• item = document
• feature = word
• classification = {good|bad}
Dictionary Building
• Remember:
• Removing capital letters (lowercasing) reduces the total number of features by folding the SHOUTING style into the normal one.
• The size of the features is also crucial (using the entire email as one feature vs. each letter as a feature); a sketch of a word-level extractor follows.
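As a concrete illustration, here is a minimal sketch of what a getwords-style feature extractor (the function the docclass sessions later pass into the classifier) might look like; the exact regex and length bounds are assumptions:

import re

def getwords(doc):
    # Split on non-word characters and lowercase everything,
    # folding the SHOUTING style into the normal one.
    splitter = re.compile(r'\W+')
    words = [w.lower() for w in splitter.split(doc)
             if 2 < len(w) < 20]  # drop tiny and huge tokens
    # A feature is simply present or absent in the item.
    return dict((w, 1) for w in words)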
Classifier Training
• The classifier is designed to start off very uncertain.
• It increases its certainty as it learns features; a sketch of the training logic follows.
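A minimal sketch of the training side, consistent with how docclass is used in the sessions below (the internal structure is an assumption): the classifier starts with empty counts, i.e., maximal uncertainty, and each call to train() sharpens its per-category feature counts.

class classifier:
    def __init__(self, getfeatures):
        self.getfeatures = getfeatures  # e.g., getwords above
        self.fc = {}  # feature -> {category: count}
        self.cc = {}  # category -> number of documents seen

    def incf(self, f, cat):
        # Increment the count of a (feature, category) pair.
        self.fc.setdefault(f, {}).setdefault(cat, 0)
        self.fc[f][cat] += 1

    def incc(self, cat):
        # Increment the document count for a category.
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def train(self, item, cat):
        # Record every feature of this item under the given category.
        for f in self.getfeatures(item):
            self.incf(f, cat)
        self.incc(cat)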
Probabilities
• A probability is a number between 0 and 1 indicating how likely an event is.
Probabilities
• Ex: ‘quick’ appeared in 2 documents classified as good, and the total number of good documents is 3.
Conditional Probabilities
Pr(A|B) = “probability of A given B”
fprob(quick|good) = “probability of quick given good”
= (quick classified as good) / (total good items) = 2 / 3
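This is the quantity fprob() computes in the sessions below; a sketch, written as methods of the classifier sketched earlier (fcount() also appears later in the session; catcount() is an assumed helper name):

    def fcount(self, f, cat):
        # How many documents of this category contained the feature.
        return float(self.fc.get(f, {}).get(cat, 0))

    def catcount(self, cat):
        # How many documents of this category we have seen.
        return float(self.cc.get(cat, 0))

    def fprob(self, f, cat):
        # Pr(feature|category): fraction of this category's
        # documents containing the feature, e.g., 2/3 above.
        if self.catcount(cat) == 0:
            return 0
        return self.fcount(f, cat) / self.catcount(cat)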
Starting with a Reasonable Guess
• Using only the information we have seen so far makes the classifier extremely sensitive in the early training stages.
• Ex: “money”
• “money” appeared in the casino training document, which is classified as bad.
• It therefore appears with probability 0 for good, which is not right!
Solution: Start with assumed probability
• Start, for instance, with a probability of 0.5 for each feature.
• Also decide on the weight you will give this assumed probability.
Assumed Probability
>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0

We have data for bad, but should we start with a probability of 0 for money given good?
>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
0.16666666666666666
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5
Define an assumed probability of 0.5; weightedprob() then returns the weighted mean of fprob() and the assumed probability.
weightedprob(money, good) = (weight * assumed + count * fprob()) / (count + weight)
= (1*0.5 + 1*0) / (1+1) = 0.5 / 2 = 0.25

(double the training:)
= (1*0.5 + 2*0) / (2+1) = 0.5 / 3 ≈ 0.167

Pr(money|bad) remains (1*0.5 + 3*0.5) / (3+1) = 0.5
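A sketch of weightedprob() matching the arithmetic above; here the count is the number of times the feature appears across all categories, and the defaults (weight 1, assumed probability 0.5) follow the slides:

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        # The current, possibly overconfident, probability.
        basicprob = prf(f, cat)
        # How often this feature appeared in any category.
        totals = sum(self.fcount(f, c) for c in self.cc)
        # Weighted mean of the assumed and observed probabilities.
        return ((weight * ap) + (totals * basicprob)) / (weight + totals)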
Naïve Bayesian Classifier
• Move from terms to documents (see the sketch after this list):
Pr(document|category) = Pr(term1|category) * Pr(term2|category) * … * Pr(termn|category)
• Naïve because we assume all terms occur independently.
• We know this is a simplifying assumption; it is naïve to think all terms have equal probability of completing this phrase: “Shave and a hair cut ___ ____”
• Bayesian because we use Bayes’ Theorem to invert the conditional probabilities.
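In code, moving from terms to documents is just a product over the extracted features; a sketch (the docprob() name and its use of weightedprob() are assumptions consistent with the sessions below):

    def docprob(self, item, cat):
        # Pr(document|category): naive product of the per-term
        # weighted probabilities.
        p = 1.0
        for f in self.getfeatures(item):
            p *= self.weightedprob(f, cat, self.fprob)
        return p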
Bayes’ Theorem
• Given our training data, we know: Pr(feature|classification)
• What we really want to know is: Pr(classification|feature)
• Bayes’ Theorem*: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)

Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc)
• Pr(doc|good): we know how to calculate this (the naïve product above)
• Pr(good): #good / #total
• Pr(doc): we skip this, since it is the same for each classification

A prob() sketch follows.

* http://en.wikipedia.org/wiki/Bayes%27_theorem
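Putting it together, the prob() method used in the sessions below can be sketched as multiplying Pr(doc|category) by the category prior and skipping the shared denominator Pr(doc):

    def prob(self, item, cat):
        # Pr(category): #docs in this category / #docs total.
        catprob = self.catcount(cat) / sum(self.cc.values())
        # Pr(category|doc) ~ Pr(doc|category) * Pr(category),
        # with Pr(doc) dropped (same for every category).
        return self.docprob(item, cat) * catprob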
Our Bayesian Classifier
>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
0.095486111111111091
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
0.0083333333333333332
We use these values only for comparison, not as “real” probabilities.
Bayesian Classifier
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Testing
Classification Thresholds
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
0.09375
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Only classify something as bad if it is 3X more likely to be bad than good (a classify() sketch follows).
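A sketch of how classify() with thresholds might work (the thresholds dictionary filled in by setthreshold(), defaulting to 1.0, is an assumption):

    def classify(self, item, default=None):
        probs = dict((c, self.prob(item, c)) for c in self.cc)
        best = max(probs, key=probs.get)
        # The winner must beat every other category by its
        # threshold; otherwise fall back to the default label.
        for cat in probs:
            if cat == best:
                continue
            if probs[cat] * self.thresholds.get(best, 1.0) > probs[best]:
                return default
        return best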
Classification Thresholds (cont.)
>>> for i in range(10): docclass.sampletrain(cl)
...
>>> cl.prob('quick money','good')
quick money
0.016544117647058824
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
0.13786764705882351
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.0083333333333333332
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Fisher Method
• Normalize the frequencies for each category.
• e.g., we might have far more “bad” training data than good, so the net cast by the bad data will be “wider” than we’d like.
• Calculate the normalized Bayesian probability, then fit the result to an inverse chi-square function to see what the probability is that a random document of that classification would have those features (i.e., terms). A sketch of both pieces follows.
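A sketch of these two steps, consistent with the cprob() and fisherprob() calls in the sessions below; invchi2() is the standard series expansion of the inverse chi-square function for even degrees of freedom:

import math

class fisherclassifier(classifier):
    def cprob(self, f, cat):
        # Frequency of the feature in this category, normalized by
        # its frequency across all categories; this evens out
        # corpora of different sizes.
        clf = self.fprob(f, cat)
        if clf == 0:
            return 0
        return clf / sum(self.fprob(f, c) for c in self.cc)

    def invchi2(self, chi, df):
        # Probability that a chi-square variable with df degrees
        # of freedom exceeds chi.
        m = chi / 2.0
        s = term = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            s += term
        return min(s, 1.0)

    def fisherprob(self, item, cat):
        # Multiply the normalized probabilities, then fit
        # -2*ln(product) to the inverse chi-square function.
        features = self.getfeatures(item)
        p = 1.0
        for f in features:
            p *= self.weightedprob(f, cat, self.cprob)
        fscore = -2 * math.log(p)
        return self.invchi2(fscore, len(features) * 2)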
Fisher Example
>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.fisherprob('quick','good')
quick
0.5535714285714286
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.cprob('quick','bad')
0.4285714285714286
Fisher Example
>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
0.23578679513998632
>>> cl.fisherprob('money buy','bad')
money buy
0.8861423315082535
>>> cl.fisherprob('money quick','good')
money quick
0.41208671548422637
>>> cl.fisherprob('money quick','bad')
money quick
0.70116895256207468
Classification with Inverse Chi-Square
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
0.41208671548422637
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money
This version of the classifier does not print “unknown” as a classification.
In practice, we’ll tolerate false positives for “good” more than false negatives for “good”: we’d rather see a message that is spam than lose a message that is not spam. (A sketch of this classify() follows.)
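A sketch of the Fisher classifier's classify() with per-category minimums (the minimums dictionary filled in by setminimum(), defaulting to 0, is an assumption). It also explains the silent last line of the session above: when no category clears its minimum, the method returns the default, None, and the interpreter prints nothing.

    def classify(self, item, default=None):
        best, max_p = default, 0.0
        for c in self.cc:
            p = self.fisherprob(item, c)
            # A category only wins if it clears its own minimum
            # and beats the best score so far.
            if p > self.minimums.get(c, 0) and p > max_p:
                best, max_p = c, p
        return best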
Fisher -- Simplified
• Reduces the signal-to-noise ratio.
• Assumes documents occur with a normal distribution.
• Estimates differences in corpus size with chi-squared.
• “Chi”-squared is a “goodness-of-fit” test between an observed distribution and a theoretical distribution.
• Utilizes confidence-interval & standard-deviation estimations for a corpus.
• http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg&page=1