classifying text

Classifying text NLTK Chapter 6

Upload: druce

Post on 15-Feb-2016



TRANSCRIPT

Page 1: Classifying text

Classifying text

NLTK Chapter 6

Page 2: Classifying text

Chapter 6 topics

• How can we identify particular features of language data that are salient for classifying it?

• How can we construct models of language that can be used to perform language processing tasks automatically?

• What can we learn about language from these models?

Page 3: Classifying text

From words to larger units

• We looked at how words are identified with a part of speech. That is an essential part of “understanding” textual material.

• Now, how can we classify whole documents?
  – These techniques are used for spam detection, for identifying the subject matter of a news feed, and for many other tasks related to categorizing text.

Page 4: Classifying text

A supervised classifier

We saw a smaller version of this in our part of speech taggers

Page 5: Classifying text

Case study: Male and female names

• Note this is language biased (English).
• These distinctions are harder given modern naming conventions.
  – I have a granddaughter named Sydney, for example.

Page 6: Classifying text

Step 1: features and encoding

• Deciding what features to look for and how to represent those features is the first step, and is critical.
  – All the training and classification will be based on these decisions.
• Initial choice for name identification: look at the last letter:

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}

The function returns a dictionary (note the { }) with a feature name and the corresponding value.

Page 7: Classifying text

First gender check

import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

name = raw_input("What name shall we check?")
features = gender_features(name)
print "Gender features for ", name, ":", features

Page 8: Classifying text

Step 2: Provide training values

• We provide a list of examples and their corresponding feature values.

>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> names
[('Kate', 'female'), ('Eleonora', 'female'), ('Germaine', 'male'), ('Helen', 'female'), ('Rachelle', 'female'), ('Nanci', 'female'), ('Aleta', 'female'), ('Catherin', 'female'), ('Clementia', 'female'), ('Keslie', 'female'), ('Callida', 'female'), ('Horatius', 'male'), ('Kraig', 'male'), ('Cindra', 'female'), ('Jayne', 'female'), ('Fortuna', 'female'), ('Yovonnda', 'female'), ('Pam', 'female'), ('Vida', 'female'), ('Margurite', 'female'), ('Maryellen', 'female'), …

Page 9: Classifying text

>>> featuresets = [(gender_features(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)

• Try it. Apply the classifier to your name:

>>> classifier.classify(gender_features('Sydney'))
'female'

• Try it on the test data and see how it does:

>>> print nltk.classify.accuracy(classifier, test_set)
0.758

Page 10: Classifying text

Your turn

• Modify the gender_features function to look at more of the name than the last letter. Does it help to look at the last two letters? the first letter? the length of the name? Try a few variations
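One possible variation, as a starting point only: the sketch below pulls several features from the name at once. The function name richer_gender_features and the choice to lowercase the name are assumptions made for this illustration, not part of the original exercise. Whether any of these features actually helps is something to measure, not assume.

```python
def richer_gender_features(name):
    """Return a feature dictionary built from several parts of the name.

    Hypothetical variation on gender_features; lowercasing is an
    assumption made here so that 'S' and 's' count as the same letter.
    """
    lowered = name.lower()
    return {
        'last_letter': lowered[-1],
        'last_two_letters': lowered[-2:],  # whole name if it has one letter
        'first_letter': lowered[0],
        'length': len(lowered),
    }

features = richer_gender_features('Shrek')
print(features['last_two_letters'])
```

To try it, build the feature sets with richer_gender_features in place of gender_features and compare the accuracy figures.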

Page 11: Classifying text

What is most useful

• There is even a function to show what was most useful in the classification:

>>> classifier.show_most_informative_features(10)
Most Informative Features
    last_letter = 'k'    male : female = 45.7 : 1.0
    last_letter = 'a'    female : male = 38.4 : 1.0
    last_letter = 'f'    male : female = 28.7 : 1.0
    last_letter = 'v'    male : female = 11.2 : 1.0
    last_letter = 'p'    male : female = 11.2 : 1.0
    last_letter = 'd'    male : female =  9.8 : 1.0
    last_letter = 'm'    male : female =  8.9 : 1.0
    last_letter = 'o'    male : female =  8.3 : 1.0
    last_letter = 'r'    male : female =  6.7 : 1.0
    last_letter = 'g'    male : female =  5.6 : 1.0
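Each ratio compares how likely a feature value is under one label versus the other. The sketch below computes such a ratio by hand on a tiny made-up list of names (toy_names and likelihood_ratio are hypothetical, and the data is not the NLTK corpus). It uses raw counts with no smoothing, so NLTK's own figures, which come from smoothed estimates, will not match it exactly.

```python
from collections import Counter

# Hypothetical toy training data: (name, label) pairs, not the NLTK corpus.
toy_names = [
    ('Mark', 'male'), ('Frank', 'male'), ('Erik', 'male'), ('John', 'male'),
    ('Anna', 'female'), ('Maria', 'female'), ('Lena', 'female'),
    ('Nick', 'female'),
]

def likelihood_ratio(data, letter, label_a, label_b):
    """Return P(last letter | label_a) / P(last letter | label_b).

    Unsmoothed: raises ZeroDivisionError if label_b never has this letter.
    """
    label_counts = Counter(label for (_, label) in data)
    letter_counts = Counter(
        label for (name, label) in data if name[-1] == letter)
    p_a = letter_counts[label_a] / float(label_counts[label_a])
    p_b = letter_counts[label_b] / float(label_counts[label_b])
    return p_a / p_b

# 3 of 4 male names end in 'k'; 1 of 4 female names does.
ratio = likelihood_ratio(toy_names, 'k', 'male', 'female')
print("last_letter = 'k'   male : female = %.1f : 1.0" % ratio)
```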

Page 12: Classifying text

What features to use

• Overfitting
  – Being too specific about the characteristics that you search for
  – Picks up idiosyncrasies of the training data and may not transfer well to the test data
• Choose an initial feature set and then test.

The chair example. What features would you use?

Page 13: Classifying text

Dev test

• Divide the corpus into three parts: training, development testing, final testing

Page 14: Classifying text

Testing stages

>>> train_names = names[1500:]        # recall: from 1500 to the end
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]          # recall: the first 500 items

>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n,g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.765

Accuracy noted, but where were the problems?

Page 15: Classifying text

import nltk
from nltk.corpus import names
import random

def gender_features(word):
    return {'last_letter': word[-1]}

# Label every name, then shuffle so the three slices below are random samples
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)

print "Number of names: ", len(names)

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, devtest_set)

# show_most_informative_features prints its own table and returns None,
# so it is called without print
classifier.show_most_informative_features(10)

Page 16: Classifying text

Output from previous code

Number of names: 7944
0.771
Most Informative Features
    last_letter = 'k'    male : female = 39.7 : 1.0
    last_letter = 'a'    female : male = 31.4 : 1.0
    last_letter = 'f'    male : female = 16.0 : 1.0
    last_letter = 'v'    male : female = 14.1 : 1.0
    last_letter = 'd'    male : female = 10.3 : 1.0
    last_letter = 'p'    male : female =  9.8 : 1.0
    last_letter = 'm'    male : female =  8.6 : 1.0
    last_letter = 'o'    male : female =  7.8 : 1.0
    last_letter = 'r'    male : female =  6.6 : 1.0
    last_letter = 'w'    male : female =  4.8 : 1.0

Page 17: Classifying text

Checking where the errors are

• Next slide

Page 18: Classifying text

import nltk
from nltk.corpus import names
import random

def gender_features(word):
    return {'last_letter': word[-1]}

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
print "Number of names: ", len(names)

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Collect every dev-test name whose guessed label differs from its known label
print "Look for error cases:"
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors):
    print 'correct= %-8s guess= %-8s name =%-30s' % (tag, guess, name)

print "Number of errors: ", len(errors)
print nltk.classify.accuracy(classifier, devtest_set)

Page 19: Classifying text

• Check the classifier against the known values and see where it failed:

Number of names: 7944
Look for error cases:
correct= female   guess= male     name =Abagail
correct= female   guess= male     name =Adrian
correct= female   guess= male     name =Alex
correct= female   guess= male     name =Amargo
correct= female   guess= male     name =Anabel
correct= female   guess= male     name =Annabal
correct= female   guess= male     name =Annabel
correct= female   guess= male     name =Arabel
correct= female   guess= male     name =Ardelis
…

Page 20: Classifying text

Finding the error cases

• Look through the list of error cases.
• Do you see any patterns?
• Are there adjustments that we could make in our feature extractor to make it more accurate?
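One way to look for patterns is to tally the error cases by name ending. The sketch below does this for the nine error cases visible in the sample output; the helper suffix_counts is a hypothetical name introduced here. Nine cases are far too few to draw conclusions from, so treat the result as a hint about what to try next.

```python
from collections import Counter

# The error cases visible in the sample output: (correct tag, guess, name).
errors = [
    ('female', 'male', 'Abagail'), ('female', 'male', 'Adrian'),
    ('female', 'male', 'Alex'), ('female', 'male', 'Amargo'),
    ('female', 'male', 'Anabel'), ('female', 'male', 'Annabal'),
    ('female', 'male', 'Annabel'), ('female', 'male', 'Arabel'),
    ('female', 'male', 'Ardelis'),
]

def suffix_counts(error_list, n=2):
    """Count how often each n-letter name ending appears among the errors."""
    return Counter(name[-n:].lower() for (_, _, name) in error_list)

# Show the most frequent two-letter ending among the misclassified names.
for suffix, count in suffix_counts(errors).most_common(1):
    print("%s: %d" % (suffix, count))
```

In the full error list produced by the classifier, the same tally can be run separately for each (correct, guess) combination.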

Page 21: Classifying text

Error analysis

• It turns out that using the last two letters improves the accuracy.

• Did you find that in your experimentation?

Page 22: Classifying text

Summarize the process

• Train on a subset of the available data
  – Look for characteristics that relate to the “right” answer. Write the feature extractor to look at those characteristics.
• Run the classifier on other data – whose characteristics are known! – to see how well it performs
  – You have to know the answers to know whether the classifier got them right.
• When satisfied with the performance of the classifier, run it on new data for which you do not know the answer.
  – How confident can you be?

The disease example. If 98% of your cases are disease free …
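One common reading of the disease example, worked through as arithmetic: when 98% of cases are disease free, a classifier that always answers "healthy" scores 98% accuracy while finding no diseased case at all. The sketch below assumes a made-up population of 100 cases; the accuracy helper and the labels are hypothetical, for illustration only.

```python
def accuracy(gold, predicted):
    """Fraction of cases where the prediction matches the known answer."""
    correct = sum(1 for (g, p) in zip(gold, predicted) if g == p)
    return correct / float(len(gold))

# Hypothetical population: 98 disease-free cases and 2 diseased ones.
gold = ['healthy'] * 98 + ['diseased'] * 2

# A "classifier" that ignores its input and always answers 'healthy'.
predicted = ['healthy'] * len(gold)

print("accuracy: %.2f" % accuracy(gold, predicted))

# Of the truly diseased cases, how many did the classifier find?
found = sum(1 for (g, p) in zip(gold, predicted) if g == p == 'diseased')
print("diseased cases found: %d of %d" % (found, gold.count('diseased')))
```

So a high accuracy figure needs to be compared with what a trivial guesser would score on the same data.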

Page 23: Classifying text

Document classification

• So far, classified names as Male/Female
  – Not much to work with, not much to look at
• Now, look at whole documents
  – How can you classify a document?
  – Subject matter in a syllabus collection, positive and negative movie/restaurant/other reviews, bias in a summary or review, subject matter in a news feed, separate works by author, …
• Case study: classifying movie reviews

Page 24: Classifying text

Classifying documents

• To classify words (names), we looked at letters.

• Feature extraction for documents will use words

• Find the most common words in the document set and see which words are in which types of documents
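A minimal sketch of that idea on a made-up four-document collection, using only the standard library. The names toy_documents, all_counts and docs_with_word are hypothetical; the real example works on the NLTK movie_reviews corpus instead.

```python
from collections import Counter

# Hypothetical miniature document set: (list of words, category) pairs.
toy_documents = [
    (['a', 'wonderfully', 'acted', 'film'], 'pos'),
    (['an', 'outstanding', 'film'], 'pos'),
    (['a', 'dull', 'film'], 'neg'),
    (['dull', 'and', 'slow'], 'neg'),
]

# Count every word across the whole document set.
all_counts = Counter(w.lower() for (words, _) in toy_documents for w in words)
most_common_words = [w for (w, _) in all_counts.most_common(3)]

# For each category, count how many documents contain each word.
docs_with_word = {}
for words, category in toy_documents:
    for w in set(words):
        docs_with_word.setdefault(category, Counter())[w] += 1

print("most common: %s" % ", ".join(sorted(most_common_words)))
print("'dull' appears in %d neg and %d pos documents"
      % (docs_with_word['neg']['dull'], docs_with_word['pos']['dull']))
```

A word such as "film" is frequent but appears in both categories, while "dull" appears in only one; the second kind is what helps a classifier.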

Page 25: Classifying text

import nltk
import random
from nltk.corpus import movie_reviews

# One (list of words, category) pair per review file
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

cats = list(cat for cat in movie_reviews.categories())
print "Movie review Categories:", cats
print "Number of reviews:", len(documents)

Page 26: Classifying text

Feature extractor: are the words present in the documents?

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# In NLTK 2, keys() lists the words from most to least frequent.
# (In NLTK 3, use [w for (w, c) in all_words.most_common(2000)] instead.)
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(movie_reviews.words('pos/cv957_8737.txt'))

Line by line, what does this do?

This is something different, but we have seen its like before

What is this?
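One way to answer the line-by-line question is to run the same function on a tiny input. The sketch below keeps document_features as written but swaps in a made-up four-word vocabulary for word_features; that vocabulary and the sample review are assumptions for illustration, not corpus data.

```python
# Hypothetical 4-word vocabulary standing in for the 2000 most frequent words.
word_features = ['film', 'outstanding', 'dull', 'plot']

def document_features(document):
    """Map a list of words to {'contains(word)': True/False} per vocab word."""
    document_words = set(document)  # duplicates collapse; lookups are fast
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# Made-up review, already split into words.
review = ['an', 'outstanding', 'film', 'with', 'an', 'outstanding', 'cast']
features = document_features(review)
for name in sorted(features):
    print("%s = %s" % (name, features[name]))
```

Every document gets the same feature names; only the True/False values differ, and a word's number of occurrences is ignored.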

Page 27: Classifying text

And if you are not sure …

• What do you do?
  – Enter the code and run it
  – Go to a search engine and type “Python <issue description>”

Page 28: Classifying text

Compute accuracy and see what are the most useful feature values

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

Output (accuracy, then the most informative features):

0.81
Most Informative Features
    contains(outstanding) = True    pos : neg = 11.1 : 1.0
    contains(seagal) = True         neg : pos =  8.3 : 1.0
    contains(mulan) = True          pos : neg =  8.3 : 1.0
    contains(damon) = True          pos : neg =  8.1 : 1.0
    contains(wonderfully) = True    pos : neg =  6.8 : 1.0

• Just as we did with classifying names
• Create a feature set
• Create a training set and a testing set
• Apply to new data

Page 29: Classifying text

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print nltk.classify.accuracy(classifier, test_set)
# show_most_informative_features prints its own table and returns None,
# so it is called without print
classifier.show_most_informative_features(5)

Full code for this example

Page 30: Classifying text

From the text

• This note from the text attracted my attention:

Note: The reason that we compute the set of all words in a document in <figure reference>, rather than just checking if word in document, is that checking whether a word occurs in a set is much faster than checking whether it occurs in a list (4.7).

• What does that suggest?
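The claim in the note can be checked with a small timing experiment. The sketch below uses the standard timeit module on a made-up list of 50,000 distinct strings; the sizes and the names words_list, words_set and target are assumptions, and the actual times depend on the machine.

```python
import timeit

# Hypothetical "document" of 50,000 distinct word-like strings.
words_list = ['word%d' % i for i in range(50000)]
words_set = set(words_list)
target = 'word49999'  # the last item: the worst case for scanning a list

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: target in words_list, number=200)
set_time = timeit.timeit(lambda: target in words_set, number=200)

print("list: %.4f s   set: %.6f s" % (list_time, set_time))
```

A list is searched item by item, while a set uses hashing to go almost directly to the answer, so the gap widens as the document grows.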

Page 31: Classifying text

The time has come …

• We have learned a lot of Python
• Something about object-oriented programming
• A bit about Text Analysis
• A bit about network programming, web crawling, servers, etc.
• There is lots more to all of those subjects.

I am happy to review or discuss anything we did this semester. If you are doing some Python programming later and want to discuss it, I will be happy to talk to you about it.