TRANSCRIPT
Employing EM and Pool-Based Active Learning for Text
Classification
Andrew McCallum Kamal Nigam
Just Research and Carnegie Mellon University
Text Active Learning
• Many applications
• Scenario: ask for labels of a few documents
• While learning:
  – Learner carefully selects an unlabeled document
  – Trainer provides its label
  – Learner rebuilds the classifier
Query-By-Committee (QBC)
• Label documents with high classification variance
• Iterate:
  – Create a committee of classifiers
  – Measure committee disagreement about the class of unlabeled documents
  – Select a document for labeling
• Theoretical results promising [Freund et al. 97] [Seung et al. 92]
Text Framework
• “Bag of Words” document representation
• Naïve Bayes classification:
• For each class, estimate P(word|class)
  P(class|doc) = P(class) ∏_{word ∈ doc} P(word|class) / P(doc)
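The bag-of-words naïve Bayes classifier above can be sketched as follows; this is a minimal illustration (function names and the Laplace smoothing constant `alpha` are assumptions, not from the slides):

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, vocab, alpha=1.0):
    """Estimate P(class) and P(word|class) from labeled docs,
    with Laplace (add-alpha) smoothing over the vocabulary."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_probs = {}
    for c in classes:
        counts = Counter()
        for doc, y in zip(docs, labels):
            if y == c:
                counts.update(doc)
        total = sum(counts.values())
        word_probs[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                         for w in vocab}
    return prior, word_probs

def classify(doc, prior, word_probs):
    """Return argmax_class log P(class) + sum_word log P(word|class);
    P(doc) is constant across classes, so it can be dropped."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in doc:
            if w in word_probs[c]:
                score += math.log(word_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```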
Outline: Our approach
• Create committee by sampling from distribution over classifiers
• Measure committee disagreement with KL-divergence of the committee members
• Select documents from a large pool using both disagreement and density-weighting
• Add EM to use documents not selected for labeling
Creating Committees
• Each class is a distribution over word frequencies
• For each member, construct each class by:
  – Drawing from the Dirichlet distribution defined by the labeled data
[Diagram: the labeled data defines a distribution over classifiers; the MAP classifier sits at its mode, and committee members 1, 2, 3 are sampled from it to form the committee.]
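The committee-creation step above can be sketched as follows: per class, each member's word distribution is sampled from a Dirichlet parameterized by the (smoothed) word counts in the labeled data. This is an assumed illustration (the function name and `alpha` smoothing constant are not from the slides):

```python
import numpy as np

def sample_committee(class_word_counts, k=3, alpha=1.0, seed=0):
    """Draw k committee members. For each member and each class,
    sample a word distribution from Dirichlet(counts + alpha),
    where counts are word counts from the labeled data."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(k):
        member = {c: rng.dirichlet(counts + alpha)
                  for c, counts in class_word_counts.items()}
        members.append(member)
    return members
```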
Measuring Committee Disagreement
• Kullback-Leibler divergence to the mean
  – Compares differences in how members “vote” for classes
  – Considers the entire class distribution of each member
  – Considers the “confidence” of the top-ranked class
• Disagreement = Σ_{k ∈ committee} Σ_{c ∈ classes} P_k(c) log( P_k(c) / P_avg(c) )

  where P_avg(c) is the mean of the committee members’ class distributions.
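The disagreement measure above can be sketched directly. One scaling choice made here that the slide leaves open: this version averages the per-member KL divergences rather than summing them, which only rescales the scores and does not change the ranking of documents:

```python
import numpy as np

def kl_to_mean_disagreement(member_posteriors):
    """member_posteriors: (k, n_classes) array; row k holds P_k(c|doc)
    for committee member k. Returns the mean over members of
    KL(P_k || P_avg), where P_avg is the elementwise mean distribution."""
    P = np.asarray(member_posteriors, dtype=float)
    avg = P.mean(axis=0)
    eps = 1e-12  # guard against log(0) for zero-probability classes
    kl = np.sum(P * np.log((P + eps) / (avg + eps)), axis=1)
    return kl.mean()
```

Identical posteriors yield zero disagreement; members that confidently vote for different classes yield a large value.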
Selecting Documents
• Stream-based sampling:
  – Disagreement ⇒ probability of selection
  – Implicit (but crude) instance-distribution information
• Pool-based sampling:
  – Select the highest-disagreement document of all
  – Lose distribution information
Density-weighted pool-based sampling
• A balance of disagreement and distributional information
• Select documents by:

  argmax_{d ∈ unlabeled} Density(d) · Disagreement(d)

• Calculate Density by:
  – the (geometric) average Distance to all documents

  Distance(d_i, d_j) = e^{ −β D[ P(word|d_i) || P̃(word|d_j) ] }

  where D is KL divergence and P̃ is a smoothed word distribution.
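The density-weighted selection rule above can be sketched as follows. This is a simplified illustration: the smoothing of P̃ is omitted (each document's raw word distribution is used directly), and all function names are assumptions:

```python
import numpy as np

def exp_neg_kl_distance(p_i, p_j, beta=1.0, eps=1e-12):
    """Distance(d_i, d_j) = exp(-beta * KL(P(w|d_i) || P(w|d_j)))."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    kl = np.sum(p_i * np.log((p_i + eps) / (p_j + eps)))
    return np.exp(-beta * kl)

def select_document(doc_dists, disagreements, beta=1.0):
    """Pick argmax over unlabeled docs of Density(d) * Disagreement(d),
    where Density(d) is the geometric mean of Distance(d, d') over all
    documents d' in the pool."""
    n = len(doc_dists)
    scores = []
    for i in range(n):
        log_dist = [np.log(exp_neg_kl_distance(doc_dists[i], doc_dists[j],
                                               beta) + 1e-300)
                    for j in range(n)]
        density = np.exp(np.mean(log_dist))  # geometric mean of distances
        scores.append(density * disagreements[i])
    return int(np.argmax(scores))
```

With a pool of identical word distributions, density is uniform and the rule degenerates to picking the highest-disagreement document; with differing distributions, it favors documents that are both contentious and representative of dense regions.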
Datasets and Protocol
• Reuters-21578 and subset of Newsgroups
• One initial labeled document per class
• 200 iterations of active learning
[Diagram: Newsgroups subset “computers” with classes mac, ibm, graphics, windows, X; Reuters classes acq, corn, trade, ...]
QBC on Reuters
[Plot: acq, P(+) = 0.25]
[Plot: trade, P(+) = 0.038]
[Plot: corn, P(+) = 0.018]
Selection comparison on News5
[Plot]
EM after Active Learning
• After active learning, only a few documents have been labeled
• Use EM to predict the labels of the remaining unlabeled documents
• Use all documents to build a new classification model, which is often more accurate
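The EM step above can be sketched for multinomial naïve Bayes: the labeled documents keep their given labels, the unlabeled documents get soft (posterior) labels in the E-step, and the M-step refits the parameters from all documents. A minimal sketch, assuming count-matrix inputs and add-alpha smoothing (neither is specified on the slides):

```python
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iters=10, alpha=1.0):
    """Semi-supervised EM for multinomial naive Bayes.
    X_lab, X_unlab: (n_docs, vocab_size) word-count matrices.
    y_lab: integer class labels for the labeled documents.
    Returns P(class) and P(word|class) refit on all documents."""
    resp_lab = np.eye(n_classes)[y_lab]  # fixed one-hot responsibilities
    resp_u = np.full((X_unlab.shape[0], n_classes), 1.0 / n_classes)
    X = np.vstack([X_lab, X_unlab])
    for _ in range(n_iters):
        R = np.vstack([resp_lab, resp_u])
        # M-step: refit prior and word distributions from (soft) labels
        prior = R.sum(axis=0) / R.sum()
        word_counts = R.T @ X + alpha
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step: recompute class posteriors for unlabeled docs only
        log_post = np.log(prior) + X_unlab @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)  # stabilize exp
        resp_u = np.exp(log_post)
        resp_u /= resp_u.sum(axis=1, keepdims=True)
    return prior, word_probs
```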
QBC and EM on News5
[Plot]
Related Work
• Active learning with text:
  – [Dagan & Engelson 95]: QBC for part-of-speech tagging
  – [Lewis & Gale 94]: Pool-based, non-QBC
  – [Liere & Tadepalli 97 & 98]: QBC with Winnow & Perceptrons
• EM with text:
  – [Nigam et al. 98]: EM with unlabeled data
Conclusions & Future Work
• Small P(+) ⇒ greater benefit from active learning
• Leverage the unlabeled pool by:
  – pool-based sampling
  – density-weighting
  – Expectation-Maximization
• Different active learning approaches à la [Cohn et al. 96]
• Interleaved EM & active learning