TRANSCRIPT
Employing EM and Pool-Based Active Learning for Text
Classification
Andrew McCallum Kamal Nigam
Just Research and Carnegie Mellon University
Text Active Learning
• Many applications
• Scenario: ask for labels of a few documents
• While learning:
  – Learner carefully selects an unlabeled document
  – Trainer provides its label
  – Learner rebuilds the classifier
Query-By-Committee (QBC)
• Label documents with high classification variance
• Iterate:
  – Create a committee of classifiers
  – Measure committee disagreement about the class of unlabeled documents
  – Select a document for labeling
• Theoretical results promising [Freund et al. 97] [Seung et al. 92]
Text Framework
• “Bag of Words” document representation
• Naïve Bayes classification:
• For each class, estimate P(word|class)
  P(class|doc) = P(class) ∏_{word ∈ doc} P(word|class) / P(doc)
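The bag-of-words naïve Bayes classifier above can be sketched as follows; this is a minimal illustration (function names and the Laplace smoothing constant `alpha` are assumptions, not from the slides):

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, vocab, alpha=1.0):
    """Estimate P(class) and P(word|class) from labeled docs,
    with Laplace (add-alpha) smoothing over the vocabulary."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_probs = {}
    for c in classes:
        counts = Counter()
        for doc, y in zip(docs, labels):
            if y == c:
                counts.update(doc)
        total = sum(counts.values())
        word_probs[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                         for w in vocab}
    return prior, word_probs

def classify(doc, prior, word_probs):
    """Return argmax_class log P(class) + sum_word log P(word|class);
    P(doc) is constant across classes, so it can be dropped."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in doc:
            if w in word_probs[c]:
                score += math.log(word_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```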
Outline: Our approach
• Create committee by sampling from distribution over classifiers
• Measure committee disagreement with KL-divergence of the committee members
• Select documents from a large pool using both disagreement and density-weighting
• Add EM to use documents not selected for labeling
Creating Committees
• Each class is a distribution over word frequencies
• For each member, construct each class by:
  – Drawing from the Dirichlet distribution defined by the labeled data
[Diagram: the labeled data defines a distribution over classifiers; the MAP classifier sits at its mode, and committee members 1, 2, 3 are sampled from it to form the committee.]
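The committee-creation step above can be sketched as follows: per class, each member's word distribution is sampled from a Dirichlet parameterized by the (smoothed) word counts in the labeled data. This is an assumed illustration (the function name and `alpha` smoothing constant are not from the slides):

```python
import numpy as np

def sample_committee(class_word_counts, k=3, alpha=1.0, seed=0):
    """Draw k committee members. For each member and each class,
    sample a word distribution from Dirichlet(counts + alpha),
    where counts are word counts from the labeled data."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(k):
        member = {c: rng.dirichlet(counts + alpha)
                  for c, counts in class_word_counts.items()}
        members.append(member)
    return members
```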
Measuring Committee Disagreement
• Kullback-Leibler divergence to the mean
  – Compares differences in how members “vote” for classes
  – Considers the entire class distribution of each member
  – Considers the “confidence” of the top-ranked class
• Disagreement = Σ_{k ∈ committee} Σ_{c ∈ classes} P_k(c) log( P_k(c) / P_avg(c) )

  where P_avg(c) is the mean of the committee members’ class distributions.
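The disagreement measure above can be sketched directly. One scaling choice made here that the slide leaves open: this version averages the per-member KL divergences rather than summing them, which only rescales the scores and does not change the ranking of documents:

```python
import numpy as np

def kl_to_mean_disagreement(member_posteriors):
    """member_posteriors: (k, n_classes) array; row k holds P_k(c|doc)
    for committee member k. Returns the mean over members of
    KL(P_k || P_avg), where P_avg is the elementwise mean distribution."""
    P = np.asarray(member_posteriors, dtype=float)
    avg = P.mean(axis=0)
    eps = 1e-12  # guard against log(0) for zero-probability classes
    kl = np.sum(P * np.log((P + eps) / (avg + eps)), axis=1)
    return kl.mean()
```

Identical posteriors yield zero disagreement; members that confidently vote for different classes yield a large value.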
Selecting Documents
• Stream-based sampling:
  – Disagreement ⇒ probability of selection
  – Implicit (but crude) instance-distribution information
• Pool-based sampling:
  – Select the highest-disagreement document of all
  – Lose distribution information
Density-weighted pool-based sampling
• A balance of disagreement and distributional information
• Select documents by:

  argmax_{d ∈ unlabeled} Density(d) · Disagreement(d)

• Calculate Density by:
  – the (geometric) average Distance to all documents

  Distance(d_i, d_j) = e^{ −β D[ P(word|d_i) || P̃(word|d_j) ] }

  where D is KL divergence and P̃ is a smoothed word distribution.
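The density-weighted selection rule above can be sketched as follows. This is a simplified illustration: the smoothing of P̃ is omitted (each document's raw word distribution is used directly), and all function names are assumptions:

```python
import numpy as np

def exp_neg_kl_distance(p_i, p_j, beta=1.0, eps=1e-12):
    """Distance(d_i, d_j) = exp(-beta * KL(P(w|d_i) || P(w|d_j)))."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    kl = np.sum(p_i * np.log((p_i + eps) / (p_j + eps)))
    return np.exp(-beta * kl)

def select_document(doc_dists, disagreements, beta=1.0):
    """Pick argmax over unlabeled docs of Density(d) * Disagreement(d),
    where Density(d) is the geometric mean of Distance(d, d') over all
    documents d' in the pool."""
    n = len(doc_dists)
    scores = []
    for i in range(n):
        log_dist = [np.log(exp_neg_kl_distance(doc_dists[i], doc_dists[j],
                                               beta) + 1e-300)
                    for j in range(n)]
        density = np.exp(np.mean(log_dist))  # geometric mean of distances
        scores.append(density * disagreements[i])
    return int(np.argmax(scores))
```

With a pool of identical word distributions, density is uniform and the rule degenerates to picking the highest-disagreement document; with differing distributions, it favors documents that are both contentious and representative of dense regions.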
Datasets and Protocol
• Reuters-21578 and subset of Newsgroups
• One initial labeled document per class
• 200 iterations of active learning
[Diagram: Newsgroups subset “computers” with classes mac, ibm, graphics, windows, X; Reuters classes acq, corn, trade, ...]
QBC on Reuters
[Plot: acq, P(+) = 0.25]
[Plot: trade, P(+) = 0.038]
[Plot: corn, P(+) = 0.018]
Selection comparison on News5
[Plot]
EM after Active Learning
• After active learning, only a few documents have been labeled
• Use EM to predict the labels of the remaining unlabeled documents
• Use all documents to build a new classification model, which is often more accurate
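The EM step above can be sketched for multinomial naïve Bayes: the labeled documents keep their given labels, the unlabeled documents get soft (posterior) labels in the E-step, and the M-step refits the parameters from all documents. A minimal sketch, assuming count-matrix inputs and add-alpha smoothing (neither is specified on the slides):

```python
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iters=10, alpha=1.0):
    """Semi-supervised EM for multinomial naive Bayes.
    X_lab, X_unlab: (n_docs, vocab_size) word-count matrices.
    y_lab: integer class labels for the labeled documents.
    Returns P(class) and P(word|class) refit on all documents."""
    resp_lab = np.eye(n_classes)[y_lab]  # fixed one-hot responsibilities
    resp_u = np.full((X_unlab.shape[0], n_classes), 1.0 / n_classes)
    X = np.vstack([X_lab, X_unlab])
    for _ in range(n_iters):
        R = np.vstack([resp_lab, resp_u])
        # M-step: refit prior and word distributions from (soft) labels
        prior = R.sum(axis=0) / R.sum()
        word_counts = R.T @ X + alpha
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step: recompute class posteriors for unlabeled docs only
        log_post = np.log(prior) + X_unlab @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)  # stabilize exp
        resp_u = np.exp(log_post)
        resp_u /= resp_u.sum(axis=1, keepdims=True)
    return prior, word_probs
```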
QBC and EM on News5
[Plot]
Related Work
• Active learning with text:
  – [Dagan & Engelson 95]: QBC for part-of-speech tagging
  – [Lewis & Gale 94]: Pool-based, non-QBC
  – [Liere & Tadepalli 97 & 98]: QBC with Winnow & Perceptrons
• EM with text:
  – [Nigam et al. 98]: EM with unlabeled data
Conclusions & Future Work
• Small P(+) ⇒ greater benefit from active learning
• Leverage the unlabeled pool by:
  – pool-based sampling
  – density-weighting
  – Expectation-Maximization
• Different active learning approaches à la [Cohn et al. 96]
• Interleaved EM & active learning