
Page 1: Classifying text

Classifying text

NLTK Chapter 6

Page 2: Classifying text

Chapter 6 topics

• How can we identify particular features of language data that are salient for classifying it?

• How can we construct models of language that can be used to perform language processing tasks automatically?

• What can we learn about language from these models?

Page 3: Classifying text

From words to larger units

• We looked at how words are identified with a part of speech. That is an essential part of "understanding" textual material.

• Now, how can we classify whole documents?

– These techniques are used for spam detection, for identifying the subject matter of a news feed, and for many other tasks related to categorizing text.

Page 4: Classifying text

A supervised classifier

We saw a smaller version of this in our part-of-speech taggers.
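In code form, the pipeline this slide depicts has two phases: during training, a feature extractor turns each labeled example into a feature set for the learning algorithm; during prediction, the same extractor feeds unseen inputs to the trained model. A minimal sketch, assuming the gender_features extractor that the next slides define (the toy training pairs here are our own, not the corpus):

import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

# Training phase: labeled examples -> feature sets -> learning algorithm
labeled = [('Sydney', 'female'), ('Shrek', 'male')]   # toy data for illustration
train_set = [(gender_features(n), g) for (n, g) in labeled]
model = nltk.NaiveBayesClassifier.train(train_set)

# Prediction phase: unseen input -> feature set -> trained model -> label
print(model.classify(gender_features('Alex')))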

Page 5: Classifying text

Case study: male and female names

• Note that this is language-biased (English).

• These distinctions are harder given modern naming conventions.

– I have a granddaughter named Sydney, for example.

Page 6: Classifying text

Step 1: features and encoding

• Deciding what features to look for and how to represent those features is the first step, and is critical.

– All the training and classification will be based on these decisions.

• Initial choice for name identification: look at the last letter:

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}

The function returns a dictionary (note the { }) with a feature name and the corresponding value.

Page 7: Classifying text

Step 2: provide training values

• We provide a list of examples and their corresponding feature values.

>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> names
[('Kate', 'female'), ('Eleonora', 'female'), ('Germaine', 'male'), ('Helen', 'female'), ('Rachelle', 'female'), ('Nanci', 'female'), ('Aleta', 'female'), ('Catherin', 'female'), ('Clementia', 'female'), ('Keslie', 'female'), ('Callida', 'female'), ('Horatius', 'male'), ('Kraig', 'male'), ('Cindra', 'female'), ('Jayne', 'female'), ('Fortuna', 'female'), ('Yovonnda', 'female'), ('Pam', 'female'), ('Vida', 'female'), ('Margurite', 'female'), ('Maryellen', 'female'), …

Page 8: Classifying text

• Build the feature sets, split off test data, and train a classifier:

>>> import nltk
>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)

• Try it. Apply the classifier to your name:

>>> classifier.classify(gender_features('Sydney'))
'female'

• Try it on the test data and see how it does:

>>> print nltk.classify.accuracy(classifier, test_set)
0.758

Page 9: Classifying text

Your turn

• Modify the gender_features function to look at more of the name than the last letter. Does it help to look at the last two letters? The first letter? The length of the name? Try a few variations, like the sketch below.
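One possible starting point for that experimentation; the feature names here (last_two, first_letter, length) are our own choices, not from the slides:

>>> def gender_features2(name):
...     return {'last_letter': name[-1],          # the original feature
...             'last_two': name[-2:],            # two-letter suffix
...             'first_letter': name[0].lower(),  # does the start matter?
...             'length': len(name)}              # does name length matter?

Retrain and re-check accuracy after each variation to see which features actually help.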

Page 10: Classifying text

What is most useful

• There is even a function to show what was most useful in the classification. Each line gives a likelihood ratio: for example, names ending in 'k' occurred about 45.7 times more often as male than as female in the training data:

>>> classifier.show_most_informative_features(10)
Most Informative Features
    last_letter = 'k'    male : female  =  45.7 : 1.0
    last_letter = 'a'    female : male  =  38.4 : 1.0
    last_letter = 'f'    male : female  =  28.7 : 1.0
    last_letter = 'v'    male : female  =  11.2 : 1.0
    last_letter = 'p'    male : female  =  11.2 : 1.0
    last_letter = 'd'    male : female  =   9.8 : 1.0
    last_letter = 'm'    male : female  =   8.9 : 1.0
    last_letter = 'o'    male : female  =   8.3 : 1.0
    last_letter = 'r'    male : female  =   6.7 : 1.0
    last_letter = 'g'    male : female  =   5.6 : 1.0

Page 11: Classifying text

What features to use

• Overfitting (see the sketch below)

– Being too specific about the characteristics that you search for.

– Picks up idiosyncrasies of the training data and may not transfer well to the test data.

• Choose an initial feature set and then test.
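For instance, a deliberately over-specific extractor, our own illustration rather than anything from the slides, comes close to memorizing the training names and then generalizes poorly:

>>> def overfit_features(name):
...     # One feature per letter position essentially encodes the whole name,
...     # so the model latches onto training idiosyncrasies.
...     features = {'whole_name': name.lower()}
...     for i, letter in enumerate(name.lower()):
...         features['letter_at_%d' % i] = letter
...     return features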

Page 12: Classifying text

Dev test

• Divide the corpus into three parts: training, development testing, and final testing.

Page 13: Classifying text

Testing stages

>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]

>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n, g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.765

Accuracy noted, but where were the problems?

Page 14: Classifying text

• Check the classifier against the known values and see where it failed:

>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))

>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)
...
correct=female guess=male   name=Cindely
...
correct=female guess=male   name=Katheryn
correct=female guess=male   name=Kathryn
...
correct=male   guess=female name=Aldrich
...
correct=male   guess=female name=Mitch

Page 15: Classifying text

Error analysis

• It turns out that using the last two letters improves the accuracy.

• Did you find that in your experimentation? A sketch of that change follows.
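One way to act on that finding, sketched on the assumption that the rest of the pipeline stays as on the previous slides (the name gender_features_suffix is ours):

>>> def gender_features_suffix(word):
...     return {'suffix1': word[-1:], 'suffix2': word[-2:]}
>>> train_set = [(gender_features_suffix(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features_suffix(n), g) for (n, g) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, devtest_set)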

Page 16: Classifying text

Document classification

• Many uses.

• Case study: classifying movie reviews.

>>> from nltk.corpus import movie_reviews
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)

• Feature extraction for documents will use words.

• Find the most common words in the document set and see which words are in which types of documents.

Page 17: Classifying text

Feature extractor: are the words present in the document?

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

>>> print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{'contains(waste)': False, 'contains(lot)': False, ...}
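A version note: all_words.keys()[:2000] assumes FreqDist.keys() returns words in frequency order, which held in the older NLTK this deck uses but is not guaranteed in newer releases. Under NLTK 3 the same idea reads:

word_features = [w for (w, _) in all_words.most_common(2000)]  # explicit frequency order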

Page 18: Classifying text

Compute accuracy and see which feature values were most useful.

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

>>> print nltk.classify.accuracy(classifier, test_set)
0.81
>>> classifier.show_most_informative_features(5)
Most Informative Features
    contains(outstanding) = True    pos : neg  =  11.1 : 1.0
    contains(seagal)      = True    neg : pos  =   7.7 : 1.0
    contains(wonderfully) = True    pos : neg  =   6.8 : 1.0
    contains(damon)       = True    pos : neg  =   5.9 : 1.0
    contains(wasted)      = True    neg : pos  =   5.8 : 1.0
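Once trained, the same model classifies any tokenized document; this toy word list is our own, purely for illustration:

>>> classifier.classify(document_features(['an', 'outstanding', 'film']))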

Page 19: Classifying text

There is more

• As time allows, let's look at other sections of this chapter. We do not have time to do justice to all the topics, but we can take a few and look into them.