Page 1: Using Bins to Empirically Estimate Term Weights for Text Categorization

Carl Sable (Columbia University), Kenneth W. Church (AT&T)


Page 2: Binning Overview

I. Task and Corpus: Multimedia news documents

II. Related Work:
   – Naïve Bayes
   – Smoothing & Speech Recognition
   – Binning in Information Retrieval

III. Our Proposal:
   – Use bins for Text Categorization

IV. Results and Evaluation:
   – Binning: rarely hurts, sometimes helps

V. Reuters:
   – Standard benchmark evaluation

VI. Conclusions: Robust version of Naïve Bayes

Page 4: Clues for Indoor/Outdoor: Text (as opposed to Vision)

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21.

Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh.

Page 5: Event Categories

Politics, Struggle, Disaster, Crime, Other

Page 6: Manual Categorization Tool

Page 7: Related Work

• Naïve Bayes

• Jelinek, 1998
   – Smoothing techniques for Speech Recognition
   – Deleted Interpolation (binning)

• Umemura and Church, 2000
   – Applied binning to Information Retrieval

$$c^{*} = \arg\max_{c_j \in C} P(c_j) \prod_i P(w_i \mid c_j)$$
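A minimal Python sketch of the Naïve Bayes decision rule above. The toy counts and the add-one smoothing are assumptions for illustration only; the binning approach in the following slides replaces such per-word smoothing.

```python
import math
from collections import Counter

# Toy per-class training counts (hypothetical, not the paper's data).
class_docs = {"indoor": 50, "outdoor": 50}
word_counts = {
    "indoor":  Counter({"conference": 15, "bed": 1, "ceremony": 5}),
    "outdoor": Counter({"airplane": 2, "earthquake": 4, "ceremony": 6}),
}
vocab = set().union(*word_counts.values())

def classify(doc_words):
    """Return argmax_c P(c) * prod_i P(w_i | c), computed in log space."""
    total_docs = sum(class_docs.values())
    best_class, best_score = None, float("-inf")
    for c, counts in word_counts.items():
        score = math.log(class_docs[c] / total_docs)      # log P(c)
        denom = sum(counts.values()) + len(vocab)          # add-one smoothing
        for w in doc_words:
            score += math.log((counts[w] + 1) / denom)     # log P(w | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["airplane", "ceremony"]))   # -> "outdoor"
```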

Page 8: Bin System: Naïve Bayes + Smoothing

• Binning: based on smoothing in speech recognition

• Not enough training data to estimate weights (log likelihood ratios) for each word
   – But there would be enough training data if we group words with similar “features” into a common “bin”

• Estimate a single weight for each bin
   – This weight is assigned to all words in the bin

• Credible estimates even for small counts (zeros)

Page 9: Intuition

Intuition        Word        Indoor Freq  Outdoor Freq  IDF  Burstiness
Clearly Indoor   conference  15           0             2.5  0
Clearly Indoor   bed         1            0             4.5  0
Clearly Outdoor  airplane    0            2             5.4  1
Clearly Outdoor  earthquake  0            4             4.6  1
Unclear          Gore        1            1             4.5  1
Unclear          ceremony    5            6             3.9  0

Page 10: “airplane”

• Sparse data

• First half of training set:
   – “airplane” appears in
      • 2 outdoor documents
      • 0 indoor documents

• Infinitely more likely to be outdoor???

• Assign “airplane” to bins of words with similar features (e.g., IDF, burstiness, counts)
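A minimal sketch of grouping words into bins by quantized features such as IDF, burstiness, and counts. The bucket boundaries and the bin_key function are hypothetical, not the paper's exact binning scheme; the feature values come from the table on Page 9.

```python
from collections import defaultdict

def bin_key(idf, burstiness, indoor_freq, outdoor_freq):
    """Map a word's features to a coarse bin identifier.

    The cut-offs below are illustrative only; the paper bins on similar
    features (IDF, burstiness, counts) but its exact scheme differs.
    """
    idf_bucket = round(idf)                                     # quantize IDF
    count_bucket = (min(indoor_freq, 2), min(outdoor_freq, 2))  # cap small counts
    return (idf_bucket, burstiness, count_bucket)

# Feature values taken from the Page 9 table: (IDF, burstiness, indoor, outdoor).
features = {
    "conference": (2.5, 0, 15, 0),
    "bed":        (4.5, 0, 1, 0),
    "airplane":   (5.4, 1, 0, 2),
    "earthquake": (4.6, 1, 0, 4),
}

# Words with similar features land in a common bin (airplane and earthquake here).
bins = defaultdict(list)
for word, feats in features.items():
    bins[bin_key(*feats)].append(word)
```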

Page 11: Lambdas: Weights

• First half of training set: Assign words to bins

• Second half of training set: Calibrate
   – Average weights over words in bin

$$P(\text{obs} \mid \text{bin}) = \frac{1}{|\text{bin}|} \sum_{\text{word} \in \text{bin}} \frac{DF(\text{word})}{|\text{docs}|}$$

$$\lambda_{\text{bin}} = \log_2 P(\text{obs} \mid \text{bin})$$
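A minimal sketch of calibrating one weight per bin on the second half of the training set, following the formula above. The document frequencies and bin contents are hypothetical; the final per-word weight as a difference of per-category lambdas follows the worked example on the next slide.

```python
import math

def lambda_for_bin(words_in_bin, doc_freq, num_docs):
    """P(obs | bin) = average of DF(word)/|docs| over words in the bin;
    the bin weight is log2 of that probability."""
    p_obs = sum(doc_freq.get(w, 0) / num_docs for w in words_in_bin) / len(words_in_bin)
    return math.log2(p_obs)

# Hypothetical calibration data: document frequencies measured on the
# second half of the training set, separately for each category.
indoor_df  = {"airplane": 0, "glider": 1, "helicopter": 0}
outdoor_df = {"airplane": 3, "glider": 2, "helicopter": 4}
bin_words = ["airplane", "glider", "helicopter"]   # words assigned to the same bin

lam_indoor  = lambda_for_bin(bin_words, indoor_df, num_docs=1000)
lam_outdoor = lambda_for_bin(bin_words, outdoor_df, num_docs=1000)

# Term weight for every word in this bin:
# log2 [ P(obs | indoor bin) / P(obs | outdoor bin) ]  (see the next slide).
weight = lam_indoor - lam_outdoor
```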

Page 12: Lambdas for “airplane”: 14 times more likely to be outdoor than indoor

$$P(\text{obs} \mid \text{indoor bin}) = 2.11 \times 10^{-4}$$

$$P(\text{obs} \mid \text{outdoor bin}) = 2.90 \times 10^{-3}$$

$$\lambda = \log_2 \frac{P(\text{obs} \mid \text{indoor bin})}{P(\text{obs} \mid \text{outdoor bin})} = -3.78$$
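A quick arithmetic check of the slide's numbers (only the two probabilities above are used):

```python
import math

p_indoor, p_outdoor = 2.11e-4, 2.90e-3
lam = math.log2(p_indoor / p_outdoor)   # ~ -3.78
ratio = 2 ** abs(lam)                   # ~ 13.7, i.e. about 14 times more likely outdoor
```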

Page 13: Binning Credible Log Likelihood Ratios

Intuition        Word        Lambda  Indoor Freq  Outdoor Freq  IDF  Burstiness
Clearly Indoor   conference  5.9     15           0             2.5  0
Clearly Indoor   bed         4.6     1            0             4.5  0
Clearly Outdoor  airplane    -3.8    0            2             5.4  1
Clearly Outdoor  earthquake  -4.9    0            4             4.6  1
Unclear          Gore        0.7     1            1             4.5  1
Unclear          ceremony    -0.3    5            6             3.9  0

Page 14: Evaluation

• Mutually exclusive categories

• Performance measured by overall accuracy:

$$\text{Accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{total predictions}}$$

Page 15: Bins: Robust Version of Naïve Bayes

Performance is often similar, but can be much better

[Bar charts (y-axes 70% to 90% and 81% to 89%): overall accuracy of Bins vs. Naïve Bayes on Indoor/Outdoor and on Events (Politics, Struggle, Disaster, Crime, Other).]

Page 16: Bins: Robust Version of Naïve Bayes

Performs well against other alternatives

[Bar charts (y-axes 70% to 90% and 81% to 89%): overall accuracy comparing Bins, Naïve Bayes, Rocchio 1, KNN, PrInd, SVM, MaxEnt, Rocchio 2, and Density on Indoor/Outdoor and on Events (Politics, Struggle, Disaster, Crime, Other).]

Page 17: Reuters
http://www.research.att.com/~lewis/reuters21578.html

• Common corpus for comparing methods
   – Over 10,000 articles, 90 topic categories

• Modified method to output multiple cats for each doc
   – One category per document
      • Indoor/outdoor & politics/struggle/disaster/crime/other
   – Multiple (0 or more) categories per document
      • Reuters

Doc #5: grain, wheat, corn, barley, oat, sorghum
Doc #9: earn
Doc #448: gold, acq, platinum

Page 18: Evaluation for Reuters: Accuracy → Precision/Recall (F)

• Accuracy is misleading when documents are assigned multiple categories

• Use precision & recall instead

• F-measure: combines precision & recall

• Macro-averaging vs. micro-averaging
   – Macro: average over categories
   – Micro: average over documents

• Macro usually lower
   – Since small categories are hard

Contingency Table:

                  “yes” is correct   “no” is correct
Assigned “yes”    a                  b
Assigned “no”     c                  d

$$p = \frac{a}{a + b} \qquad r = \frac{a}{a + c} \qquad F_1 = \frac{2 \cdot p \cdot r}{p + r}$$
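A minimal sketch of computing F1 from per-category contingency counts and then macro- and micro-averaging; the category names and counts below are hypothetical.

```python
def f1(a, b, c):
    """a = true positives, b = false positives, c = false negatives."""
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-category contingency counts (a, b, c).
counts = {"earn": (900, 50, 40), "grain": (30, 10, 20), "platinum": (1, 2, 4)}

# Macro-F1: average the per-category F1 scores (small categories count equally,
# which is why macro is usually lower).
macro_f1 = sum(f1(*abc) for abc in counts.values()) / len(counts)

# Micro-F1: pool the counts over all categories, then compute one F1.
A = sum(a for a, _, _ in counts.values())
B = sum(b for _, b, _ in counts.values())
C = sum(c for _, _, c in counts.values())
micro_f1 = f1(A, B, C)
```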

Page 19: Bins: Robust Version of Naïve Bayes

Performance is often similar, but can be much better

[Bar charts: Reuters Micro-F1 (y-axis 79% to 87%) and Macro-F1 (y-axis 35% to 55%) for NB and Bin.]

Page 20: Bins: Robust Version of Naïve Bayes

[Bar charts: Reuters Micro-F1 (y-axis 79% to 87%) and Macro-F1 (y-axis 35% to 55%) for SVM, KNN, LSF, NNet, NB, and Bin.]

Page 21: Conclusions

• Binning: Robust version of Naïve Bayes
   – Often helps, rarely hurts
   – Smoothing: borrowed from Speech Recognition
   – Reliable log-likelihood ratios even for small counts:
      • airplane: 2 outdoor docs, 0 indoor docs
         – 14 times more likely to be outdoor than indoor

• Three Evaluations
   – Indoor vs. Outdoor (mutually exclusive categories)
   – Events (mutually exclusive categories)
   – Reuters (many-to-many)