ACTIVE LEARNING FOR TEXT CLASSIFICATION

Ankit Bhutani, Y9094

AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY

MANUAL TEXT CLASSIFICATION: TAKES YEARS

ORGANIZING LARGE VOLUMES OF TEXT

A massive volume of online text is available.

Organisation into categories enables efficient search.

Text classification finds use in many applications, such as data mining, automatic query answering, learning user interests and making suggestions.

Learning approaches: unsupervised, supervised and semi-supervised.

Terms Used

Multinomial Naïve Bayes:

◦ Documents are represented in bag-of-words format.
◦ Words are assumed to be independent of one another given the class.
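The two points above can be illustrated with a minimal sketch of a multinomial Naïve Bayes classifier. The function names `train_mnb` and `predict_proba`, the Laplace smoothing constant `alpha`, and the toy documents are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed word probabilities (alpha is an assumed default)."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)  # bag of words: only counts matter
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict_proba(doc, log_prior, log_lik):
    """Return P(class | doc); the independence assumption turns the product into a sum of logs."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


docs = [["goal", "match"], ["vote", "election"]]
labels = ["sport", "politics"]
log_prior, log_lik = train_mnb(docs, labels)
probs = predict_proba(["goal"], log_prior, log_lik)
```

Words that never occurred in training are simply skipped at prediction time, which is one of several reasonable conventions.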

Terms Used

Semi-Supervised Learning:

◦ Makes use of labeled as well as unlabeled data to learn the parameters of the model.

Expectation Maximization:

◦ Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data.
◦ E-step (parameters of the model → document labels): provide soft labels to documents based on the estimated model parameters.
◦ M-step (document labels → parameters of the model): re-estimate the model parameters based on the soft labels.
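The alternation between soft labels and parameter re-estimation might look like the following sketch for a Naïve Bayes model. The names `e_step`, `m_step` and `semi_supervised_em`, the smoothing constant `alpha`, the fixed iteration count and the toy data are all assumptions for illustration, not the exact procedure used in the work described here.

```python
import math
from collections import defaultdict


def m_step(docs, soft_labels, classes, vocab, alpha=1.0):
    """Re-estimate priors and word probabilities from (possibly fractional) labels."""
    class_mass = {c: 0.0 for c in classes}
    counts = {c: defaultdict(float) for c in classes}
    for doc, resp in zip(docs, soft_labels):
        for c in classes:
            class_mass[c] += resp[c]
            for w in doc:
                counts[c][w] += resp[c]  # each word counts with weight P(c | doc)
    n = len(docs)
    log_prior = {
        c: math.log((class_mass[c] + alpha) / (n + alpha * len(classes))) for c in classes
    }
    log_lik = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def e_step(doc, log_prior, log_lik):
    """Compute the soft label P(c | doc) under the current parameters."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


def semi_supervised_em(labeled, unlabeled, classes, iterations=10):
    """Initialise from labeled data, then alternate E-step and M-step."""
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    lab_docs = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = m_step(lab_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, *model) for d in unlabeled]  # labeled docs keep hard labels
        model = m_step(lab_docs + unlabeled, hard + soft, classes, vocab)
    return model


labeled = [(["goal", "match"], "sport"), (["vote", "election"], "politics")]
unlabeled = [["goal", "striker"], ["vote", "ballot"],
             ["striker", "match"], ["ballot", "election"]]
model = semi_supervised_em(labeled, unlabeled, ["sport", "politics"])
striker = e_step(["striker"], *model)
```

In this toy run the word "striker" never appears in a labeled document, yet it ends up associated with "sport" because it co-occurs with labeled sport words in the unlabeled pool.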

Terms Used

Active Learning:

◦ Form of supervised machine learning.
◦ The learning algorithm is able to interactively query the user.
◦ Each query has an associated cost.
◦ The algorithm requests the label of the document that maximizes the gain in information about the model parameters.

But how to choose which document to request a label for?

Terms Used

Query by Committee:

◦ Divide the training set into 4 to 5 subsets.
◦ A model trained on each subset acts as a committee member and gives probability estimates.
◦ Disagreement is measured by the average KL divergence between all pairs of members; the document with maximum disagreement is queried.
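A minimal sketch of the disagreement measure described above, assuming each committee member has already produced a class distribution for every candidate document. The function names, the `eps` guard against division by zero, and the example distributions are illustrative assumptions.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions given as dicts over the same classes."""
    return sum(p[c] * math.log((p[c] + eps) / (q[c] + eps)) for c in p if p[c] > 0)


def average_pairwise_kl(member_probs):
    """Average KL divergence over all ordered pairs of committee members."""
    pairs = list(permutations(member_probs, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def select_query(committee_predictions):
    """Return the index of the document on which the committee disagrees most.

    committee_predictions[i] is the list of per-member distributions for document i.
    """
    scores = [average_pairwise_kl(members) for members in committee_predictions]
    return max(range(len(scores)), key=scores.__getitem__)


agree = [{"sport": 0.9, "politics": 0.1}] * 3
disagree = [
    {"sport": 0.9, "politics": 0.1},
    {"sport": 0.2, "politics": 0.8},
    {"sport": 0.5, "politics": 0.5},
]
chosen = select_query([agree, disagree])
```

KL divergence is not symmetric, so the sketch averages over ordered pairs; averaging over unordered pairs in one direction only would be a different, equally hypothetical choice.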

Terms Used

Semi-Supervised Frequency Estimate (SFE):

◦ Slight variation on basic EM: a different parameter re-estimation formula.

NOTICEABLE WORK: Semi-Supervised Learning

Nigam et al., 1998-99:

◦ MNB + EM
◦ 100 labeled + 2500 unlabeled documents
◦ 80 to 85% accuracy

Nigam & McCallum, 2000:

◦ MNB + EM + Active Learning
◦ Total 1000 documents
◦ Label requests: 50, accuracy: ~90%

NOTICEABLE WORK: Semi-Supervised Learning

LYRL, 2004:

◦ Compared various semi-supervised learning techniques
◦ Introduced the Reuters Corpus as a new benchmark

Su, Shirabad and Matwin, 2011:

◦ MNB + SFE

My work

MNB + SFE + Active Learning

◦ Data set: Reuters Corpus from LYRL 2004, containing around 8 lakh (800,000) documents
◦ Experiments on 10,000 documents, starting with:
  50 labeled documents + 100 requests
  100 labeled documents + 50 requests
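The experimental setup (a small labeled seed plus a fixed budget of label requests) corresponds to a pool-based active learning loop, sketched below. This is not the implementation used in the experiments: `train`, `disagreement` and `oracle` are hypothetical callables standing in for the semi-supervised learner, the committee disagreement score and the human labeler.

```python
def active_learning_loop(labeled, pool, oracle, train, disagreement, budget):
    """Spend `budget` label requests, each on the currently most contested document."""
    labeled = list(labeled)  # copy so the caller's lists are not modified
    pool = list(pool)
    for _ in range(min(budget, len(pool))):
        model = train(labeled, pool)  # learner may use unlabeled pool too
        idx = max(range(len(pool)), key=lambda i: disagreement(model, pool[i]))
        doc = pool.pop(idx)
        labeled.append((doc, oracle(doc)))  # one query, one unit of cost
    return train(labeled, pool), labeled


# Toy stand-ins, purely to show the call shape.
seed = [(["x"], "pos")]
pool = [["a"], ["a", "b", "c"], ["a", "b"]]
model, labeled = active_learning_loop(
    seed,
    pool,
    oracle=lambda doc: "neg",
    train=lambda lab, unl: len(lab),
    disagreement=lambda m, doc: len(doc),
    budget=2,
)
```

With 50 seed documents and `budget=100`, or 100 seed documents and `budget=50`, the loop ends with 150 labeled documents in both configurations, which makes the two settings comparable.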

Results so far
