Using Bins to Empirically Estimate Term Weights
for Text Categorization
Carl Sable (Columbia University)
Kenneth W. Church (AT&T)
Binning Overview
I. Task and Corpus: Multimedia news documents
II. Related Work:
– Naïve Bayes
– Smoothing & Speech Recognition
– Binning in Information Retrieval
III. Our Proposal:
– Use bins for Text Categorization
IV. Results and Evaluation:
– Binning: rarely hurts, sometimes helps
V. Reuters:
– Standard benchmark evaluation
VI. Conclusions: Robust version of Naïve Bayes
[Figure: example news photographs labeled Outdoor and Indoor]
Clues for Indoor/Outdoor: Text (as opposed to Vision)
Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21.
Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh.
Event Categories
Politics, Struggle, Disaster, Crime, Other
Manual Categorization Tool
Related Work
• Naïve Bayes
• Jelinek, 1998
– Smoothing techniques for Speech Recognition
– Deleted Interpolation (binning)
• Umemura and Church, 2000
– Applied binning to Information Retrieval
Naïve Bayes decision rule:
$$c = \arg\max_{c_j \in C} P(c_j) \prod_i P(w_i \mid c_j)$$
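The Naïve Bayes rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy documents, and the add-one smoothing are all assumptions introduced here so that unseen words do not zero out the product.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (category, list_of_words) pairs.
    Returns category priors and per-category word probabilities."""
    cat_counts = Counter(cat for cat, _ in docs)
    word_counts = {cat: Counter() for cat in cat_counts}
    vocab = set()
    for cat, words in docs:
        word_counts[cat].update(words)
        vocab.update(words)
    priors = {cat: n / len(docs) for cat, n in cat_counts.items()}
    likelihoods = {}
    for cat in cat_counts:
        # Add-one smoothing (an assumption of this sketch, not from the slides).
        total = sum(word_counts[cat].values()) + len(vocab)
        likelihoods[cat] = {w: (word_counts[cat][w] + 1) / total for w in vocab}
    return priors, likelihoods

def classify(words, priors, likelihoods):
    """Return argmax over categories of log P(c) + sum_i log P(w_i | c).
    Words outside the training vocabulary are skipped."""
    def score(cat):
        return math.log(priors[cat]) + sum(
            math.log(likelihoods[cat][w]) for w in words if w in likelihoods[cat]
        )
    return max(priors, key=score)

# Hypothetical toy captions, loosely modeled on the indoor/outdoor task.
toy_docs = [
    ("indoor", ["conference", "library", "meeting"]),
    ("indoor", ["bed", "conference"]),
    ("outdoor", ["airplane", "crash", "villagers"]),
    ("outdoor", ["earthquake", "villagers"]),
]
priors, likelihoods = train_naive_bayes(toy_docs)
print(classify(["conference", "meeting"], priors, likelihoods))
print(classify(["airplane"], priors, likelihoods))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.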
Bin System: Naïve Bayes + Smoothing
• Binning: based on smoothing in speech recognition
• Not enough training data to estimate weights (log likelihood ratios) for each word
– But there would be enough training data if we group words with similar “features” into a common “bin”
• Estimate a single weight for each bin
– This weight is assigned to all words in the bin
• Credible estimates even for small counts (zeros)
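| Intuition | Word | Indoor Freq | Outdoor Freq | IDF | Burstiness |
|---|---|---|---|---|---|
| Clearly Indoor | conference | 15 | 0 | 2.5 | 0 |
| Clearly Indoor | bed | 1 | 0 | 4.5 | 0 |
| Clearly Outdoor | airplane | 0 | 2 | 5.4 | 1 |
| Clearly Outdoor | earthquake | 0 | 4 | 4.6 | 1 |
| Unclear | Gore | 1 | 1 | 4.5 | 1 |
| Unclear | ceremony | 5 | 6 | 3.9 | 0 |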
“airplane”
• Sparse data
• First half of training set:
– “airplane” appears in
  • 2 outdoor documents
  • 0 indoor documents
• Infinitely more likely to be outdoor???
• Assign “airplane” to bins of words with similar features (e.g., IDF, burstiness, counts)
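Lambdas: Weights
• First half of training set: Assign words to bins
• Second half of training set: Calibrate
– Average weights over words in bin
$$P(\text{obs} \mid \text{bin}) = \frac{1}{|\text{bin}|} \sum_{\text{word} \in \text{bin}} \frac{DF(\text{word})}{|\text{docs}|}$$
$$\lambda_{\text{bin}} = \log_2 P(\text{obs} \mid \text{bin})$$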
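The calibration step can be sketched as follows, assuming each bin is a list of words and that document frequencies are counted separately over the indoor and outdoor calibration documents. The function names, the example bin, and all counts are hypothetical. The weight is reported as a log ratio of the indoor and outdoor bin probabilities, which is how the slides use it for “airplane”.

```python
import math

def bin_probability(bin_words, doc_freq, num_docs):
    """Average, over the words in a bin, of the fraction of calibration
    documents that contain each word: (1/|bin|) * sum DF(word)/|docs|."""
    return sum(doc_freq.get(w, 0) / num_docs for w in bin_words) / len(bin_words)

def bin_weight(bin_words, df_indoor, n_indoor, df_outdoor, n_outdoor):
    """log2 of P(obs | indoor bin) / P(obs | outdoor bin).
    Assumes the bin is large enough that both probabilities are nonzero."""
    p_in = bin_probability(bin_words, df_indoor, n_indoor)
    p_out = bin_probability(bin_words, df_outdoor, n_outdoor)
    return math.log2(p_in / p_out)

# Hypothetical bin of words that share features with "airplane".
example_bin = ["airplane", "helicopter", "volcano", "tornado"]
# Hypothetical document frequencies in the second (calibration) half.
df_indoor = {"helicopter": 1}
df_outdoor = {"airplane": 3, "helicopter": 2, "volcano": 2, "tornado": 1}

weight = bin_weight(example_bin, df_indoor, 1000, df_outdoor, 1000)
print(round(weight, 2))  # about -3: the bin leans outdoor by roughly a factor of 8
```

Because the probability is averaged over every word in the bin, a word with a zero count in one category still receives a finite weight, as long as some other word in its bin was observed there.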
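Lambdas for “airplane”: 14 times more likely to be outdoor than indoor
$$P(\text{obs} \mid \text{indoor bin}) = 2.11 \times 10^{-4}$$
$$P(\text{obs} \mid \text{outdoor bin}) = 2.90 \times 10^{-3}$$
$$\log_2 \frac{P(\text{obs} \mid \text{indoor bin})}{P(\text{obs} \mid \text{outdoor bin})} = -3.78$$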
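The arithmetic on this slide can be checked directly, taking the two bin probabilities as 2.11 × 10⁻⁴ (indoor) and 2.90 × 10⁻³ (outdoor). The variable names are illustrative only.

```python
import math

p_indoor_bin = 2.11e-4   # P(obs | indoor bin), as read from the slide
p_outdoor_bin = 2.90e-3  # P(obs | outdoor bin), as read from the slide

# Log-likelihood ratio (lambda) for the bin that contains "airplane".
lam = math.log2(p_indoor_bin / p_outdoor_bin)

# How many times more likely the word is to be outdoor than indoor.
outdoor_odds = p_outdoor_bin / p_indoor_bin

print(round(lam, 2), round(outdoor_odds, 1))  # -3.78 13.7
```

The ratio of about 13.7 is what the slide title rounds to "14 times", and the negative sign of lambda marks the bin as leaning outdoor.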
Binning Credible Log Likelihood Ratios
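| Intuition | Word | Lambda | Indoor Freq | Outdoor Freq | IDF | Burstiness |
|---|---|---|---|---|---|---|
| Clearly Indoor | conference | 5.9 | 15 | 0 | 2.5 | 0 |
| Clearly Indoor | bed | 4.6 | 1 | 0 | 4.5 | 0 |
| Clearly Outdoor | airplane | -3.8 | 0 | 2 | 5.4 | 1 |
| Clearly Outdoor | earthquake | -4.9 | 0 | 4 | 4.6 | 1 |
| Unclear | Gore | 0.7 | 1 | 1 | 4.5 | 1 |
| Unclear | ceremony | -0.3 | 5 | 6 | 3.9 | 0 |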
Evaluation
• Mutually exclusive categories
• Performance measured by overall accuracy:
$$\text{Accuracy} = \frac{\#\ \text{correct predictions}}{\#\ \text{total predictions}}$$
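For mutually exclusive categories, overall accuracy reduces to a single count. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(predicted, gold):
    """Fraction of documents whose predicted category equals the gold category."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical predictions for four documents.
predicted = ["indoor", "outdoor", "outdoor", "indoor"]
gold = ["indoor", "outdoor", "indoor", "indoor"]
print(accuracy(predicted, gold))  # 0.75
```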
Bins: Robust Version of Naïve Bayes
Performance is often similar, but can be much better
[Charts: accuracy of Bins vs. Naïve Bayes on two tasks: Indoor/Outdoor, and Events (Politics, Struggle, Disaster, Crime, Other)]
Bins: Robust Version of Naïve Bayes
Performs well against other alternatives
[Charts: accuracy on Indoor/Outdoor and on Events (Politics, Struggle, Disaster, Crime, Other) for Bins, Naïve Bayes, Rocchio 1, KNN, PrInd, SVM, MaxEnt, Rocchio 2, Density]
Reuters
http://www.research.att.com/~lewis/reuters21578.html
• Common corpus for comparing methods
– Over 10,000 articles, 90 topic categories
• Modified method to output multiple categories for each document
– One category per document
  • Indoor/outdoor & politics/struggle/disaster/crime/other
– Multiple (0 or more) categories per document
  • Reuters
Example Reuters category assignments:
Doc #5: grain, wheat, corn, barley, oat, sorghum
Doc #9: earn
Doc #448: gold, acq, platinum
Evaluation for Reuters: Accuracy vs. Precision/Recall (F)
• Accuracy is misleading when documents are assigned multiple categories
• Use precision & recall instead
• F-measure: combines precision & recall
• Macro-averaging vs. micro-averaging
– Macro: average over categories
– Micro: average over documents
• Macro usually lower
– Since small categories are hard
Contingency Table:
| | “yes” is correct | “no” is correct |
|---|---|---|
| Assigned “yes” | a | b |
| Assigned “no” | c | d |
$$p = \frac{a}{a + b}, \qquad r = \frac{a}{a + c}, \qquad F_1 = \frac{2 \cdot p \cdot r}{p + r}$$
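The measures above can be sketched from per-category contingency counts. This is a minimal illustration: the function names and the counts are hypothetical, and micro-averaging is implemented by pooling the counts across categories before computing F1, which is a common convention and an assumption about what the slide intends.

```python
def f1_from_counts(a, b, c):
    """a: assigned yes, correct yes; b: assigned yes, correct no;
    c: assigned no, correct yes. Returns F1 = 2pr / (p + r)."""
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(tables):
    """Average the per-category F1 scores, so every category counts equally."""
    return sum(f1_from_counts(*t) for t in tables.values()) / len(tables)

def micro_f1(tables):
    """Pool the counts over all categories, then compute a single F1."""
    a = sum(t[0] for t in tables.values())
    b = sum(t[1] for t in tables.values())
    c = sum(t[2] for t in tables.values())
    return f1_from_counts(a, b, c)

# Hypothetical (a, b, c) counts: one large, easy category and one small, hard one.
tables = {"earn": (90, 10, 10), "gold": (1, 3, 1)}
print(round(macro_f1(tables), 3), round(micro_f1(tables), 3))
```

With these made-up counts the small category drags the macro average well below the micro average, which matches the slide's remark that macro is usually lower because small categories are hard.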
Bins: Robust Version of Naïve Bayes
Performance is often similar, but can be much better
[Charts: Reuters Micro-F1 and Macro-F1 for NB vs. Bin]
Bins: Robust Version of Naïve Bayes
[Charts: Reuters Micro-F1 and Macro-F1 for SVM, KNN, LSF, NNet, NB, Bin]
Conclusions
• Binning: Robust version of Naïve Bayes
– Often helps, rarely hurts
– Smoothing: borrowed from Speech Recognition
– Reliable log-likelihood ratios even for small counts:
  • airplane: 2 outdoor docs, 0 indoor docs
  • 14 times more likely to be outdoor than indoor
• Three Evaluations
– Indoor vs. Outdoor (mutually exclusive categories)
– Events (mutually exclusive categories)
– Reuters (many-to-many)