active learning for text classification
ACTIVE LEARNING FOR TEXT CLASSIFICATION
Ankit Bhutani Y9094
AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY
MANUAL TEXT CLASSIFICATION: TAKES YEARS
ORGANIZING LARGE VOLUMES OF TEXT
◦Massive volume of online text available.
◦Organisation into categories to enable efficient search.
◦Finds use in many applications such as Data Mining, Automatic Query Answering, Learning User Interest, Making Suggestions, etc.
◦Learning Approaches : unsupervised, supervised and semi-supervised.
Terms Used
Multinomial Naïve Bayes :
◦Documents in bag of words format
◦Independence assumptions
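A minimal sketch of Multinomial Naïve Bayes on bag-of-words documents may help make the two points above concrete. The function names (`train_mnb`, `predict_proba`), the Laplace smoothing constant `alpha` and the toy documents are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and log word probabilities (alpha = assumed Laplace smoothing)."""
    vocab = sorted({w for doc in docs for w in doc})
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)  # bag of words: only counts matter, not order
    log_prior = {c: math.log(class_counts[c] / len(docs)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_proba(model, doc):
    """Posterior over classes; words are treated as independent given the class."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    top = max(scores.values())  # subtract the max before exponentiating for stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

# Toy data, invented for illustration only.
docs = [["goal", "match", "team"], ["election", "vote", "party"],
        ["team", "win", "match"], ["vote", "party", "policy"]]
labels = ["sport", "politics", "sport", "politics"]
model = train_mnb(docs, labels)
probs = predict_proba(model, ["team", "vote", "match"])
```

The independence assumption shows up as the plain sum of per-word log probabilities inside `predict_proba`.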
Terms Used
Semi-Supervised Learning :
◦Makes use of Labeled as well as Unlabeled Data to learn the parameters of the model.
Expectation Maximization :
◦Class of Iterative Algorithms for Maximum Likelihood Estimation in problems with incomplete data
◦Alternates between two unknowns: parameters of the model and document labels
◦E-step: provide Soft Labels to Documents based on estimated model parameters
◦M-step: re-estimate the model parameters based on the soft labels
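The soft-label loop can be sketched as below for a Multinomial Naïve Bayes model. This is a rough illustration under stated assumptions: the helper names (`m_step`, `e_step`, `em_mnb`), the smoothing constant `alpha`, the fixed iteration count and the toy documents are all hypothetical, and the model is initialised from the labeled documents only.

```python
import math
from collections import defaultdict

def m_step(docs, weights, classes, vocab, alpha=1.0):
    """Re-estimate priors and word probabilities from (possibly fractional) labels."""
    prior_mass = {c: alpha for c in classes}
    word_mass = {c: defaultdict(float) for c in classes}
    for doc, doc_weights in zip(docs, weights):
        for c in classes:
            prior_mass[c] += doc_weights[c]
            for w in doc:
                word_mass[c][w] += doc_weights[c]
    total_prior = sum(prior_mass.values())
    log_prior = {c: math.log(prior_mass[c] / total_prior) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(word_mass[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_mass[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def e_step(model, doc, classes):
    """Soft label: posterior class probabilities under the current parameters."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in classes
    }
    top = max(scores.values())
    exp = {c: math.exp(scores[c] - top) for c in classes}
    z = sum(exp.values())
    return {c: exp[c] / z for c in classes}

def em_mnb(labeled, unlabeled, classes, iterations=10):
    labeled_docs = [doc for doc, _ in labeled]
    vocab = sorted({w for doc in labeled_docs + unlabeled for w in doc})
    # Labeled documents keep fixed one-hot labels throughout.
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = m_step(labeled_docs, hard, classes, vocab)  # initial parameters
    for _ in range(iterations):
        soft = [e_step(model, doc, classes) for doc in unlabeled]  # E-step
        model = m_step(labeled_docs + unlabeled, hard + soft, classes, vocab)  # M-step
    return model

# Toy data, invented for illustration only.
classes = ["sport", "politics"]
labeled = [(["goal", "team"], "sport"), (["vote", "party"], "politics")]
unlabeled = [["team", "match"], ["match", "win"],
             ["party", "policy"], ["policy", "election"]]
model = em_mnb(labeled, unlabeled, classes)
probs = e_step(model, ["match", "win"], classes)
```

In this toy run the document `["match", "win"]` shares no word with any labeled document, yet it leans towards "sport" because "match" co-occurs with "team" in the unlabeled data.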
Terms Used
Active Learning :
◦Form of supervised machine learning
◦Learning Algorithm is able to interactively query the user
◦Each query has an associated cost.
◦Algorithm requests the label for the document such that the gain in information about model parameters is maximized
But how to choose which document to request a label for?
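The overall query loop can be sketched independently of how the document is chosen. Everything in this sketch is a hypothetical stand-in: `oracle` plays the user, `budget` caps the number of paid label requests, `informativeness` is a pluggable score, and the toy "documents" are plain numbers classified by a threshold so that the loop stays short.

```python
def active_learning(labeled, pool, oracle, train, informativeness, budget):
    """Pool-based loop: retrain, pick the highest-scoring document, pay for its label."""
    labeled, pool = list(labeled), list(pool)
    queries = []
    for _ in range(budget):  # each request has a cost, so the number of requests is capped
        if not pool:
            break
        model = train(labeled)
        pick = max(pool, key=lambda doc: informativeness(model, doc))
        pool.remove(pick)
        labeled.append((pick, oracle(pick)))  # interactive query to the user
        queries.append(pick)
    return train(labeled), queries

# Toy stand-ins (assumptions): a 1-D threshold classifier instead of a text model.
def train(labeled):
    highest_negative = max(x for x, y in labeled if not y)
    lowest_positive = min(x for x, y in labeled if y)
    return (highest_negative + lowest_positive) / 2  # decision threshold

def informativeness(threshold, x):
    return -abs(x - threshold)  # closer to the boundary = assumed more informative

def oracle(x):
    return x >= 0.6  # hidden true labelling rule

start = [(0.0, False), (1.0, True)]
pool = [0.1, 0.3, 0.5, 0.7, 0.9]
threshold, queries = active_learning(start, pool, oracle, train, informativeness, budget=2)
```

With a budget of two requests the loop asks about 0.5 and then 0.7, and the threshold moves from 0.5 to about 0.6. The distance-to-boundary score is only a placeholder for a real information-gain criterion.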
Terms Used
Query by Committee :
◦Divide the training set into 4 – 5 sets.
◦Each set, acting as a committee member, gives probability estimates.
◦Maximum disagreement measured by maximum average KL divergence between all pairs
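The disagreement measure can be sketched as follows, assuming each committee member has already produced a class distribution for every pool document. The function names, the choice to average over ordered pairs (KL divergence is asymmetric), the small `eps` guard and the example numbers are assumptions made for illustration.

```python
import math
from itertools import permutations

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two class distributions given as dicts over the same classes."""
    return sum(p[c] * math.log((p[c] + eps) / (q[c] + eps)) for c in p if p[c] > 0)

def committee_disagreement(member_predictions):
    """Average KL divergence over all ordered pairs of committee members."""
    pairs = list(permutations(member_predictions, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

def select_query(pool_predictions):
    """Index of the pool document with the largest committee disagreement."""
    return max(range(len(pool_predictions)),
               key=lambda i: committee_disagreement(pool_predictions[i]))

# One list per pool document, one distribution per committee member (invented numbers).
pool_predictions = [
    [{"sport": 0.9, "politics": 0.1}, {"sport": 0.9, "politics": 0.1},
     {"sport": 0.8, "politics": 0.2}],
    [{"sport": 0.9, "politics": 0.1}, {"sport": 0.2, "politics": 0.8},
     {"sport": 0.5, "politics": 0.5}],
]
query_index = select_query(pool_predictions)
```

Here the second document is selected, since its three committee members give clearly different estimates while the first document's members nearly agree.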
Terms Used
Semi-Supervised Frequency Estimate (SFE) :
◦Slight variation in basic EM : different parameter re-estimation formula.
NOTABLE WORK: Semi-Supervised Learning
Nigam et al, 1998-99 :
◦MNB + EM
◦100 Labeled + 2500 Unlabeled documents
◦80 – 85 % accuracy
Nigam & McCallum, 2000 :
◦MNB + EM + Active Learning
◦Total 1000 Documents
◦Label requests : 50, Accuracy : ~90%
NOTABLE WORK: Semi-Supervised Learning
LYRL, 2004 :
◦Compared various Semi-supervised Learning Techniques
◦Introduced Reuters Corpus as a new benchmark
Su, Shirabad and Matwin, 2011 :
◦MNB + SFE
My work
MNB + SFE + Active Learning
◦Data-set: Reuters Corpus from LYRL 2004: contains around 8 lakh (800,000) documents
◦Experiments on 10,000 documents starting with :
  50 Labeled Documents + 100 requests
  100 Labeled Documents + 50 requests
Results so far