active learning for text classification

ACTIVE LEARNING FOR TEXT CLASSIFICATION
Ankit Bhutani Y9094


Post on 15-Jan-2016



Page 1: ACTIVE LEARNING FOR TEXT CLASSIFICATION

ACTIVE LEARNING FOR TEXT CLASSIFICATION

Ankit Bhutani Y9094

Page 2: ACTIVE LEARNING FOR TEXT CLASSIFICATION

AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY

MANUAL TEXT CLASSIFICATION: TAKES YEARS

Page 3: ACTIVE LEARNING FOR TEXT CLASSIFICATION

ORGANIZING LARGE VOLUMES OF TEXT

◦ A massive volume of online text is available.
◦ Organisation into categories enables efficient search.
◦ Finds use in many applications, such as Data Mining, Automatic Query Answering, Learning User Interest, Making Suggestions, etc.
◦ Learning approaches: unsupervised, supervised and semi-supervised.

Page 4: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Multinomial Naïve Bayes:

◦ Documents in bag-of-words format
◦ Independence assumptions
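The two bullets above can be made concrete with a small sketch. It is not the presenter's code: the function names `train_mnb` and `predict_proba`, the Laplace smoothing constant `alpha`, and the toy documents are all assumptions added for illustration. Each document is a list of tokens (bag of words), and the independence assumption shows up as a plain sum of per-word log-probabilities.

```python
import math
from collections import Counter


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on bag-of-words documents.

    docs   : list of token lists
    labels : list of class names, one per document
    alpha  : Laplace smoothing constant (an assumption, not from the slides)
    """
    vocab = sorted({w for doc in docs for w in doc})
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in classes}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                      for w in vocab}
    return log_prior, log_lik


def predict_proba(model, doc):
    """Return P(class | doc); words outside the vocabulary are ignored."""
    log_prior, log_lik = model
    # Independence assumption: word log-probabilities simply add up.
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
              for c in log_prior}
    top = max(scores.values())
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Toy usage with made-up documents.
model = train_mnb([["goal", "match"], ["vote", "election"]],
                  ["sport", "politics"])
probs = predict_proba(model, ["goal"])
```

On this toy data the word "goal" pushes the posterior towards "sport", since it was seen only in the sport document.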

Page 5: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Semi-Supervised Learning:

◦ Makes use of labeled as well as unlabeled data to learn the parameters of the model.

Expectation Maximization:

◦ Class of iterative algorithms for Maximum Likelihood Estimation in problems with incomplete data.
◦ Alternates between two unknowns: the parameters of the model and the document labels.
◦ E-step: provide soft labels to documents based on the estimated model parameters.
◦ M-step: re-estimate the model parameters based on the soft labels.
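The alternation between soft labels and parameter re-estimation can be sketched as follows for a multinomial Naive Bayes model. This is a minimal illustration, not the presenter's implementation: the names `m_step`, `e_step` and `em_mnb`, the smoothing constant `alpha`, the fixed iteration count, and the toy data are assumptions.

```python
import math
from collections import defaultdict


def m_step(docs, soft_labels, classes, vocab, alpha=1.0):
    """Re-estimate class priors and word probabilities from soft labels."""
    class_mass = {c: 0.0 for c in classes}
    word_mass = {c: defaultdict(float) for c in classes}
    for doc, post in zip(docs, soft_labels):
        for c in classes:
            class_mass[c] += post[c]
            for w in doc:
                word_mass[c][w] += post[c]  # fractional word count
    n = len(docs)
    prior = {c: (class_mass[c] + alpha) / (n + alpha * len(classes))
             for c in classes}
    cond = {}
    for c in classes:
        total = sum(word_mass[c].values()) + alpha * len(vocab)
        cond[c] = {w: (word_mass[c][w] + alpha) / total for w in vocab}
    return prior, cond


def e_step(doc, prior, cond):
    """Soft label: posterior P(class | doc) under the current parameters."""
    log_p = {c: math.log(prior[c])
                + sum(math.log(cond[c][w]) for w in doc if w in cond[c])
             for c in prior}
    top = max(log_p.values())
    unnorm = {c: math.exp(v - top) for c, v in log_p.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def em_mnb(labeled, unlabeled, classes, iterations=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    labeled_docs = [d for d, _ in labeled]
    docs = labeled_docs + list(unlabeled)
    vocab = sorted({w for d in docs for w in d})
    # Labeled documents keep hard (0/1) labels throughout.
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initialise the parameters from the labeled documents only.
    prior, cond = m_step(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, prior, cond) for d in unlabeled]          # E-step
        prior, cond = m_step(docs, hard + soft, classes, vocab)     # M-step
    return prior, cond


# Toy usage with made-up documents.
classes = ["politics", "sport"]
labeled = [(["goal", "match"], "sport"), (["vote", "election"], "politics")]
unlabeled = [["goal", "striker"], ["election", "ballot"]]
prior, cond = em_mnb(labeled, unlabeled, classes)
striker = e_step(["striker"], prior, cond)
```

The word "striker" never appears in a labeled document, yet it ends up associated with "sport" because it co-occurs with "goal" in the unlabeled data.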

Page 6: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Active Learning:

◦ Form of supervised machine learning.
◦ The learning algorithm is able to interactively query the user.
◦ Each query has an associated cost.
◦ The algorithm requests the label for the document that maximizes the gain in information about the model parameters.

But how do we choose which document to request a label for?
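One common way to organise this is a pool-based loop with a fixed label budget, sketched below. The slides do not specify a selection score at this point, so the sketch swaps in predictive entropy as a simple stand-in for information gain. The callables `train`, `predict_proba` and `oracle`, and the function name `active_learning_loop`, are hypothetical placeholders for the model and the human annotator.

```python
import math


def entropy(dist):
    """Shannon entropy of a class distribution given as {class: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def active_learning_loop(labeled, pool, train, predict_proba, oracle, budget):
    """Pool-based active learning with an entropy score (an assumption).

    labeled       : list of (document, label) pairs to start from
    pool          : list of unlabeled documents
    train         : callable, labeled pairs -> model
    predict_proba : callable, (model, document) -> {class: prob}
    oracle        : callable, document -> label (the costly user query)
    budget        : maximum number of label requests
    """
    labeled = list(labeled)
    pool = list(pool)
    model = train(labeled)
    for _ in range(budget):
        if not pool:
            break
        # Query the document the current model is least certain about.
        idx = max(range(len(pool)),
                  key=lambda i: entropy(predict_proba(model, pool[i])))
        doc = pool.pop(idx)
        labeled.append((doc, oracle(doc)))
        model = train(labeled)  # retrain after every answered query
    return model, labeled
```

Because every call to `oracle` costs annotator time, `budget` is the quantity an experiment would vary, for example the number of label requests.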

Page 7: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Query by Committee:

◦ Divide the training set into 4 to 5 sets.
◦ Each set, acting as a committee member, gives probability estimates.
◦ Maximum disagreement is measured by the maximum average KL divergence between all pairs of members.
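The disagreement measure described above can be sketched as follows. The code assumes each committee member has already produced a class distribution for every pool document; the function names, the small constant `eps` that guards against division by zero, and the example distributions are assumptions for illustration.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for distributions given as {class: prob}."""
    return sum(p[c] * math.log((p[c] + eps) / (q[c] + eps))
               for c in p if p[c] > 0)


def committee_disagreement(member_dists):
    """Average KL divergence over all ordered pairs of committee members."""
    pairs = list(permutations(member_dists, 2))
    if not pairs:
        raise ValueError("need at least two committee members")
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def select_query(pool_predictions):
    """Index of the pool document with maximum committee disagreement.

    pool_predictions[i] is the list of member distributions for document i.
    """
    return max(range(len(pool_predictions)),
               key=lambda i: committee_disagreement(pool_predictions[i]))


# Made-up predictions from a three-member committee on two documents.
agree = [{"a": 0.9, "b": 0.1}, {"a": 0.9, "b": 0.1}, {"a": 0.9, "b": 0.1}]
split = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}, {"a": 0.5, "b": 0.5}]
chosen = select_query([agree, split])
```

KL divergence is not symmetric, so the sketch averages over ordered pairs; averaging over unordered pairs in one direction would be a different, equally plausible reading of the slide.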

Page 8: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Semi-Supervised Frequency Estimate (SFE):

◦ Slight variation on basic EM: uses a different parameter re-estimation formula.

Page 9: ACTIVE LEARNING FOR TEXT CLASSIFICATION

NOTABLE WORK: Semi-Supervised Learning

Nigam et al., 1998-99:

◦ MNB + EM
◦ 100 labeled + 2500 unlabeled documents
◦ 80–85% accuracy

Nigam & McCallum, 2000:

◦ MNB + EM + Active Learning
◦ Total 1000 documents
◦ Label requests: 50, accuracy: ~90%

Page 10: ACTIVE LEARNING FOR TEXT CLASSIFICATION

NOTABLE WORK: Semi-Supervised Learning

LYRL, 2004:

◦ Compared various semi-supervised learning techniques
◦ Introduced the Reuters Corpus as a new benchmark

Su, Shirabad and Matwin, 2011:

◦ MNB + SFE

Page 11: ACTIVE LEARNING FOR TEXT CLASSIFICATION

My Work

MNB + SFE + Active Learning

◦ Data set: Reuters Corpus from LYRL 2004; contains around 8 lakh (800,000) documents
◦ Experiments on 10,000 documents, starting with either (a) 50 labeled documents + 100 requests, or (b) 100 labeled documents + 50 requests

Page 12: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Results so far