Employing EM and Pool-Based Active Learning for Text Classification
Andrew McCallum, Kamal Nigam
Just Research and Carnegie Mellon University

TRANSCRIPT

Page 1: Employing EM and Pool-Based Active Learning for Text Classification. Andrew McCallum, Kamal Nigam. Just Research and Carnegie Mellon University.

Employing EM and Pool-Based Active Learning for Text

Classification

Andrew McCallum Kamal Nigam

Just Research and Carnegie Mellon University

Page 2:

Text Active Learning

• Many applications

• Scenario: ask for labels of a few documents

• While learning:
  – Learner carefully selects an unlabeled document
  – Trainer provides its label
  – Learner rebuilds the classifier
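The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function arguments (`label_fn`, `train_fn`, `select_fn`) are hypothetical placeholders for the trainer, the classifier, and the selection strategy.

```python
import random

def active_learning_loop(pool, label_fn, train_fn, select_fn, iterations=5):
    """Generic active-learning loop: the learner repeatedly picks one
    unlabeled document, asks the trainer for its label, and retrains."""
    labeled = []                 # (document, label) pairs collected so far
    unlabeled = list(pool)
    model = train_fn(labeled)    # initial (possibly trivial) model
    for _ in range(iterations):
        if not unlabeled:
            break
        doc = select_fn(model, unlabeled)      # learner carefully selects
        unlabeled.remove(doc)
        labeled.append((doc, label_fn(doc)))   # trainer provides label
        model = train_fn(labeled)              # learner rebuilds classifier
    return model, labeled

# Toy usage: the "model" is just the set of labels seen so far,
# and selection is uniformly random.
docs = ["d1", "d2", "d3", "d4"]
model, labeled = active_learning_loop(
    pool=docs,
    label_fn=lambda d: "pos" if d in ("d1", "d3") else "neg",
    train_fn=lambda lab: {l for _, l in lab},
    select_fn=lambda m, u: random.choice(u),
    iterations=3)
```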

Page 3:

Query-By-Committee (QBC)

• Label documents with high classification variance

• Iterate:
  – Create a committee of classifiers
  – Measure committee disagreement about the class of unlabeled documents
  – Select a document for labeling

• Theoretical results promising [Freund et al. 97] [Seung et al. 92]

Page 4:

Text Framework

• “Bag of Words” document representation

• Naïve Bayes classification:

• For each class, estimate P(word|class)

  P(class|doc) = P(class) · ∏_{word ∈ doc} P(word|class) / P(doc)
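The slide's "bag of words" naive Bayes classifier can be sketched as follows. This is an illustrative stdlib-only sketch (Laplace smoothing is my assumption, a common default); since P(doc) is constant across classes, the argmax drops it.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, class_label). Returns log P(class) and
    Laplace-smoothed log P(word|class) over the observed vocabulary."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)          # class -> word -> count
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total)
                       for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(model, words):
    """argmax_class log P(class) + sum_{word in doc} log P(word|class);
    P(doc) is constant and dropped. Unseen words are ignored."""
    log_prior, log_like, vocab = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in words if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

model = train_nb([("wheat corn harvest".split(), "grain"),
                  ("merger stock acquisition".split(), "acq")])
classify_nb(model, "corn harvest report".split())  # → "grain"
```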

Page 5:

Outline: Our approach

• Create committee by sampling from distribution over classifiers

• Measure committee disagreement with KL-divergence of the committee members

• Select documents from a large pool using both disagreement and density-weighting

• Add EM to use documents not selected for labeling

Page 6:

Creating Committees

• Each class is a distribution over word frequencies

• For each member, construct each class by:
  – Drawing from the Dirichlet distribution defined by the labeled data

[Diagram: labeled data defines a classifier distribution; the MAP classifier plus sampled Members 1–3 form the committee]
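A committee member can be drawn from the Dirichlet defined by the labeled-data word counts. This is a stdlib-only sketch of the slide's idea (the prior `alpha0` and the Gamma-normalization trick for sampling a Dirichlet are standard, but the exact parameterization here is my assumption).

```python
import random

def sample_committee(class_word_counts, k=3, alpha0=1.0):
    """Draw k committee members. Each member's per-class word multinomial
    is sampled from a Dirichlet whose parameters are the labeled-data
    word counts plus a prior pseudo-count alpha0."""
    def dirichlet(params):
        # A Dirichlet draw is a vector of Gamma(param, 1) draws, normalized.
        g = [random.gammavariate(p, 1.0) for p in params]
        s = sum(g)
        return [x / s for x in g]
    committee = []
    for _ in range(k):
        member = {c: dirichlet([n + alpha0 for n in counts])
                  for c, counts in class_word_counts.items()}
        committee.append(member)
    return committee

# Word counts per class over a 3-word vocabulary, from the labeled data.
counts = {"acq": [5, 1, 0], "grain": [0, 2, 6]}
members = sample_committee(counts, k=3)
```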

Page 7:

Measuring Committee Disagreement

• Kullback-Leibler divergence to the mean:
  – Compares differences in how members “vote” for classes
  – Considers the entire class distribution of each member
  – Considers the “confidence” of the top-ranked class

  Disagreement = Σ_{k ∈ committee} Σ_{c ∈ classes} P_k(c) · log( P_k(c) / P_avg(c) )
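The disagreement measure above, KL divergence to the mean, can be computed directly. A minimal sketch: each committee member is a dict mapping class to probability.

```python
import math

def disagreement(member_dists):
    """Sum over committee members k of KL( P_k || P_avg ),
    where P_avg is the mean class distribution of the committee."""
    n = len(member_dists)
    classes = member_dists[0].keys()
    avg = {c: sum(m[c] for m in member_dists) / n for c in classes}
    return sum(m[c] * math.log(m[c] / avg[c])
               for m in member_dists for c in classes if m[c] > 0)

# Two members that agree perfectly -> zero disagreement.
agree = disagreement([{"a": 0.7, "b": 0.3}, {"a": 0.7, "b": 0.3}])
# Two members that disagree -> strictly positive disagreement.
differ = disagreement([{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}])
```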

Page 8:

Selecting Documents

• Stream-based sampling:
  – Disagreement => probability of selection
  – Implicit (but crude) instance distribution information

• Pool-based sampling:
  – Select the highest-disagreement document of all
  – Lose distribution information
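The two selection schemes can be contrasted in a few lines. This is an illustrative sketch; in particular, the `scale` factor converting disagreement into a selection probability for the stream-based case is my assumption, not a detail from the slides.

```python
import random

def stream_select(docs, disagreement_fn, scale=1.0):
    """Stream-based: each arriving document is selected with probability
    proportional to its disagreement, implicitly respecting the
    instance distribution of the stream."""
    for d in docs:
        if random.random() < min(1.0, scale * disagreement_fn(d)):
            return d
    return None   # no document was selected on this pass

def pool_select(docs, disagreement_fn):
    """Pool-based: pick the single highest-disagreement document,
    ignoring the instance distribution."""
    return max(docs, key=disagreement_fn)

scores = {"d1": 0.1, "d2": 0.9, "d3": 0.4}
pool_select(scores, scores.get)  # → "d2"
```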

Page 9:

Disagreement

Page 10:

Density-weighted pool-based sampling

• A balance of disagreement and distributional information

• Select documents by:

  argmax_{d ∈ unlabeled} Density(d) · Disagreement(d)

• Calculate Density by:
  – (Geometric) average, over all documents d_j, of e^{−Distance(d_i, d_j)}, where

  Distance(d_i, d_j) = β · D[ P(word|d_i) || ~P(word|d_j) ]

  (D is Kullback-Leibler divergence; ~P denotes a smoothed estimate)
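The density-weighted selection rule can be sketched as below. This is a toy illustration under simplifying assumptions: `distance_fn` stands in for the slide's β-scaled KL divergence between word distributions, and the toy "documents" are 1-D points with absolute difference as distance.

```python
import math

def density(d, pool, distance_fn, beta=1.0):
    """Geometric mean over the pool of e^{-beta * Distance(d, d_j)}:
    a document close to many others gets high density."""
    logs = [-beta * distance_fn(d, dj) for dj in pool]
    return math.exp(sum(logs) / len(logs))

def select(pool, distance_fn, disagreement_fn):
    """argmax over unlabeled d of Density(d) * Disagreement(d)."""
    return max(pool, key=lambda d: density(d, pool, distance_fn)
                                   * disagreement_fn(d))

# Toy 1-D "documents"; 5.0 is an outlier with the highest disagreement,
# but density-weighting steers selection toward the dense cluster.
pool = [0.0, 0.1, 0.2, 5.0]
dis = {0.0: 0.5, 0.1: 0.55, 0.2: 0.5, 5.0: 0.9}
select(pool, lambda a, b: abs(a - b), dis.get)  # → 0.1
```

Note that plain pool-based sampling would pick the outlier 5.0 (highest disagreement); the density weight overrules it.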

Page 11:

Disagreement

Page 12:

Density

Page 13:

Datasets and Protocol

• Reuters-21578 and subset of Newsgroups

• One initial labeled document per class

• 200 iterations of active learning

[Diagram: class hierarchies. Newsgroups subset: computers → mac, ibm, graphics, windows, X. Reuters: acq, corn, trade, ...]

Page 14:

QBC on Reuters

[Figure: acq, P(+) = 0.25]

[Figure: trade, P(+) = 0.038]

[Figure: corn, P(+) = 0.018]

Page 15:

Selection comparison on News5

[Figure: comparison of selection strategies on News5]

Page 16:

EM after Active Learning

After active learning, only a few documents have been labeled.

Use EM to predict the labels of the remaining unlabeled documents.

Use all documents to build a new classification model, which is often more accurate.
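The EM step above can be sketched as a hard-EM loop around naive Bayes. This is an illustrative, stdlib-only simplification (the paper's EM uses soft, probabilistic label assignments; hard assignments and Laplace smoothing are my simplifying assumptions).

```python
import math
from collections import Counter, defaultdict

def em_labels(labeled, unlabeled, iters=5):
    """Hard-EM sketch: train naive Bayes on the labeled docs, predict
    labels for the unlabeled pool, retrain on everything, and repeat
    until the predicted labels stop changing. Documents are word lists;
    returns predicted labels for the unlabeled pool."""
    def train(docs):
        # Laplace-smoothed log P(class) and log P(word|class).
        vocab = {w for ws, _ in docs for w in ws}
        cc, wc = Counter(c for _, c in docs), defaultdict(Counter)
        for ws, c in docs:
            wc[c].update(ws)
        return {c: (math.log(cc[c] / len(docs)),
                    {w: math.log((wc[c][w] + 1) /
                                 (sum(wc[c].values()) + len(vocab)))
                     for w in vocab})
                for c in cc}, vocab

    def predict(model, words):
        params, vocab = model
        return max(params, key=lambda c: params[c][0] +
                   sum(params[c][1][w] for w in words if w in vocab))

    guesses = []
    for _ in range(iters):
        model = train(labeled + list(zip(unlabeled, guesses)))
        new = [predict(model, ws) for ws in unlabeled]
        if new == guesses:        # labels have stabilized
            break
        guesses = new
    return guesses

labeled = [("wheat corn".split(), "grain"), ("stock merger".split(), "acq")]
unlabeled = ["corn wheat harvest".split(), "merger acquisition".split()]
em_labels(labeled, unlabeled)  # → ["grain", "acq"]
```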

Page 17:

QBC and EM on News5

[Figure: QBC with and without EM on News5]

Page 18:

Related Work

• Active learning with text:
  – [Dagan & Engelson 95]: QBC part-of-speech tagging
  – [Lewis & Gale 94]: Pool-based, non-QBC
  – [Liere & Tadepalli 97 & 98]: QBC with Winnow & Perceptrons

• EM with text:
  – [Nigam et al. 98]: EM with unlabeled data

Page 19:

Conclusions & Future Work

• Small P(+) => better active learning

• Leverage the unlabeled pool by:
  – pool-based sampling
  – density-weighting
  – Expectation-Maximization

• Different active learning approaches à la [Cohn et al. 96]

• Interleaved EM & active learning