1 CPE 641 Natural Language Processing Asst. Prof. Nuttanart Facundes Text Classification Adapted from Barbara Rosario’s slides – Sept. 27,2004



CPE 641 Natural Language Processing

Asst. Prof. Nuttanart Facundes

Text Classification

Adapted from Barbara Rosario’s slides – Sept. 27, 2004


Classification

• Classification: text categorization (and other applications)

• Various issues regarding classification: clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification…

• Introduce the steps necessary for a classification task:
  • Define classes
  • Label text
  • Features
  • Training and evaluation of a classifier


From: Foundations of Statistical Natural Language Processing, Manning and Schütze

Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem                   Object     Categories
Tagging                   Word       POS
Sense disambiguation      Word       The word’s senses
Information retrieval     Document   Relevant/not relevant
Sentiment classification  Document   Positive/negative
Author identification     Document   Authors


Author identification

They agreed that Mrs. X should only hear of the departure of the family, without being alarmed on the score of the gentleman's conduct; but even this partial communication gave her a great deal of concern, and she bewailed it as exceedingly unlucky that the ladies should happen to go away, just as they were all getting so intimate together.

Gas looming through the fog in divers places in the streets, much as the sun may, from the spongey fields, be seen to loom by husbandman and ploughboy. Most of the shops lighted two hours before their time--as the gas seems to know, for it has a haggard and unwilling look. The raw afternoon is rawest, and the dense fog is densest, and the muddy streets are muddiest near that leaden-headed old obstruction, appropriate ornament for the threshold of a leaden-headed old corporation, Temple Bar.


Author identification

Jane Austen (1775-1817), Pride and Prejudice

Charles Dickens (1812-70), Bleak House


*Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.

Author identification

Federalist papers
• 85 short essays (77 of them first published in newspapers in 1787-1788) written by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym
• The authorship of 12 papers was in dispute (the disputed papers)
• In 1964 Mosteller and Wallace* solved the problem
• They identified 70 function words as good candidates for authorship analysis
• Using statistical inference they concluded the author was Madison
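A rough illustration of the function-word idea, not Mosteller and Wallace's actual procedure: the sketch below computes how often a few function words occur per 1,000 tokens. The word list, the sample sentence and the name `function_word_rates` are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical short list of function words, chosen only for illustration;
# it is NOT the list used by Mosteller and Wallace.
FUNCTION_WORDS = ("upon", "while", "whilst", "on", "by", "to")

def function_word_rates(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each function word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {w: 1000.0 * counts[w] / total for w in words}

# Made-up sample sentence (13 tokens, "upon" occurs twice).
sample = "Upon the whole, the plan depends on the States and upon the people."
rates = function_word_rates(sample)
```

Rates like these, computed per author on texts of known authorship, could then serve as features for a classifier applied to the disputed texts.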


Function words for Author Identification

[Figures from two slides; images not reproduced]


From: Foundations of Statistical Natural Language Processing, Manning and Schütze

Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem                  Object     Categories
Author identification    Document   Authors
Language identification  Document   Language


Language identification

Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.

Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.

Universal Declaration of Human Rights, UN, in 363 languages


Language identification

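Cues that help identify the language: égaux (French), eguali (Italian), iguales (Spanish), edistämään (Finnish); characters such as Ü and ¿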
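One common way to turn such cues into a classifier is to compare character n-gram profiles. The sketch below is a minimal version of that idea, assuming tiny profiles built from the Italian and German sentences quoted earlier; the names `char_ngrams` and `identify_language` are hypothetical, and a realistic system would need far more training text.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, space-normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy profiles built from one sentence per language (an assumption for this sketch).
PROFILES = {
    "Italian": char_ngrams(
        "Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti."),
    "German": char_ngrams(
        "Alle Menschen sind frei und gleich an Würde und Rechten geboren."),
}

def identify_language(text, profiles=PROFILES):
    """Return the language whose profile shares the most trigram counts with text."""
    grams = char_ngrams(text)

    def overlap(profile):
        # Counter returns 0 for trigrams the profile has never seen.
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))

guess_de = identify_language("Sie sind mit Vernunft und Gewissen begabt")
guess_it = identify_language("Essi sono dotati di ragione e di coscienza")
```

Here the features are character trigrams and the classes are languages, so the same define-classes, extract-features, classify pattern applies.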


From: Foundations of Statistical Natural Language Processing, Manning and Schütze

Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem                  Object     Categories
Author identification    Document   Authors
Language identification  Document   Language
Text categorization      Document   Topics


Text categorization

Topic categorization: classify the document into semantic topics

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.


Text categorization

http://news.google.com/

Reuters
• Collection of 21,578 newswire documents
• For research purposes: a standard text collection to compare systems and algorithms
• 135 valid topic categories


Reuters

Top topics in Reuters


Reuters

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>
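A record like the one above carries its own labels in the TOPICS element. The sketch below shows one way to pull out the train/test split, the topic labels and the title with regular expressions. It is an illustration only: `parse_reuters_record` is a hypothetical helper, the embedded record is abbreviated, and a real pipeline would use a proper SGML parser.

```python
import re

def parse_reuters_record(record):
    """Extract the split, topic labels and title from one <REUTERS> record string."""
    split = re.search(r'LEWISSPLIT="([^"]*)"', record)
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", record, re.S)
    title = re.search(r"<TITLE>(.*?)</TITLE>", record, re.S)
    # Each topic label sits in its own <D>...</D> element.
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    return {
        "split": split.group(1) if split else None,
        "topics": topics,
        "title": title.group(1).strip() if title else None,
    }

# Abbreviated copy of the record shown on the slide (body text omitted).
record = '''<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
</REUTERS>'''

parsed = parse_reuters_record(record)
```

Note that this document has two topic labels, so it is an instance of multi-category classification.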


Text categorization: examples

• Topic categorization
  http://news.google.com/
  Reuters

• Spam filtering
  Determine if a mail message is spam (or not)

• Customer service message classification


Classification vs. Clustering

• Classification assumes labeled data: we know how many classes there are and we have examples for each class (labeled data)
• Classification is supervised
• In clustering we don’t have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are
• Clustering is unsupervised

Classification

[Figure sequence over four slides: data points labeled Class1 and Class2; images not reproduced]

Clustering

[Figure sequence over five slides; images not reproduced]


Categories (Labels, Classes)

Labeling data: 2 problems

• Decide the possible classes (which ones, how many)
  Domain and application dependent
  http://news.google.com

• Label text
  Difficult, time consuming, inconsistency between annotators


Reuters

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

Why not topic = policy?


Binary vs. multi-way classification

Binary classification: two classes

Multi-way classification: more than two classes

Sometimes it can be convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class
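The one-versus-all-others idea can be sketched as a relabeling step: for each class, build a binary label list in which that class is 1 and every other class is 0. The function name `one_vs_rest_labels` and the toy labels are assumptions for illustration.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-way problem as binary: positive_class vs. everything else."""
    return [1 if label == positive_class else 0 for label in labels]

# Toy multi-way labels for four documents.
labels = ["sport", "health", "world", "sport"]

# One binary problem per class; a separate binary classifier would be
# trained on each of these label lists.
binary_problems = {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}
```

At prediction time, one typical choice is to run every binary classifier and pick the class whose classifier is most confident.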


Flat vs. Hierarchical classification

Flat classification: relations between the classes undetermined

Hierarchical classification: a hierarchy where each node is a sub-class of its parent node


Single- vs. multi-category classification

In single-category text classification each text belongs to exactly one category

In multi-category text classification, each text can have zero or more categories


LabeledText class in NLTK

LabeledText class

>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
>>> labeled_text.text()
"Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> labeled_text.label()
"sport"


NLTK: The Classifier Interface

• classify determines which label is most appropriate for a given text token, and returns a labeled text token with that label.
• labels returns the list of category labels that are used by the classifier.

>>> token = Token("The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.")
>>> my_classifier.classify(token)
"The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment."/health
>>> my_classifier.labels()
("sport", "health", "world", …)


Features

>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)

• Here the classification takes as input the whole string
• What’s the problem with that?
• What are the features that could be useful for this example?


Feature terminology

Feature: an aspect of the text that is relevant to the task

Some typical features:
• Words present in text
• Frequency of words
• Capitalization
• Are there named entities (NE)?
• WordNet
• Others?


Feature terminology

Feature: an aspect of the text that is relevant to the task
Feature value: the realization of the feature in the text

• Words present in text: Kerry, Schumacher, China…
• Frequency of word: Kerry(10), Schumacher(1)…
• Are there dates? Yes/no
• Are there PERSONS? Yes/no
• Are there ORGANIZATIONS? Yes/no
• WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)


Feature Types

Boolean (or Binary) Features

Features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.

f1(text) = 1 if text contains “Kerry”, 0 otherwise

f2(text) = 1 if text contains a PERSON, 0 otherwise
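A boolean feature such as f1 can be sketched as a small function factory. The name `contains_word` and the simple word tokenization are assumptions for this example; f2 is not shown because detecting a PERSON would require a named-entity recognizer.

```python
import re

def contains_word(word):
    """Build a boolean feature: 1 if `word` occurs as a token in the text, else 0."""
    target = word.lower()

    def feature(text):
        tokens = re.findall(r"\w+", text.lower())
        return 1 if target in tokens else 0

    return feature

f1 = contains_word("Kerry")
with_kerry = f1("Senator Kerry spoke on Saturday.")
without_kerry = f1("Schumacher took on the Shanghai circuit.")
```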


Feature Types

Integer Features

Features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.

f1(text) = Number of times text contains “Kerry”

f2(text) = Number of times text contains PERSON
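The integer version of f1 can be sketched the same way, returning a count instead of 0/1. The name `count_word` and the tokenization are assumptions; counting PERSON mentions (f2) would again need a named-entity recognizer and is not shown.

```python
import re

def count_word(word):
    """Build an integer feature: number of times `word` occurs as a token in the text."""
    target = word.lower()

    def feature(text):
        return re.findall(r"\w+", text.lower()).count(target)

    return feature

f1_count = count_word("Kerry")
n_kerry = f1_count("Kerry spoke. Then Kerry left.")
```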


Features in NLTK

Feature Detectors

• Features can be defined using feature detector functions, which map LabeledTexts to values
• Method: detect, which takes a labeled text and returns a feature value

>>> def ball(ltext): return ("ball" in ltext.text())
>>> fdetector = FunctionFeatureDetector(ball)
>>> document1 = "John threw the ball over the fence".split()
>>> fdetector.detect(LabeledText(document1))
1
>>> document2 = "Mary solved the equation".split()
>>> fdetector.detect(LabeledText(document2))
0


Feature selection

How do we choose the “right” features?


Classification

• Define classes
• Label text
• Extract features
• Choose a classifier
  >>> my_classifier.classify(token)
  • The Naive Bayes classifier
  • NN (perceptron)
  • SVM
  • …
• Train it (and test it)
• Use it to classify new examples
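The steps above can be sketched end to end with a very small Naive Bayes classifier. This is a plain-Python illustration with add-one smoothing, not the NLTK implementation; the class name `NaiveBayesSketch`, its methods and the four toy training texts are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Feature extraction step: lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())

class NaiveBayesSketch:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def train(self, labeled_texts):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def labels(self):
        return sorted(self.class_counts)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.labels():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy labeled data (classes defined and texts labeled by hand).
training = [
    ("Schumacher took pole at the Grand Prix", "sport"),
    ("the Davis Cup final is on Saturday", "sport"),
    ("prevention of heart disease", "health"),
    ("doctors recommend treatment for cardiovascular ailments", "health"),
]
clf = NaiveBayesSketch().train(training)
prediction = clf.classify("heart disease treatment")
```

With only four training texts this says nothing about real accuracy; it only shows how the pieces (classes, labels, features, training, classification) fit together.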


Training

• (We’ll see exactly what we mean by training when we talk about the algorithms)
• Adaptation of the classifier to the data
• Usually the classifier is defined by a set of parameters
• Training is the procedure for finding a “good” set of parameters
• Goodness is determined by an optimization criterion such as misclassification rate
• Some classifiers are guaranteed to find the optimal set of parameters


Testing, evaluation of the classifier

After choosing the parameters of the classifier (i.e. after training it) we need to test how well it’s doing on a test set (not included in the training set)

Calculate the misclassification rate on the test set
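The misclassification rate is the fraction of held-out examples that the classifier labels incorrectly. A minimal sketch follows; `misclassification_rate`, the stand-in `toy_classify` rule and the three test examples are all hypothetical.

```python
def misclassification_rate(classify, test_set):
    """Fraction of (text, gold_label) pairs that `classify` gets wrong."""
    if not test_set:
        raise ValueError("test set is empty")
    errors = sum(1 for text, gold in test_set if classify(text) != gold)
    return errors / len(test_set)

# Stand-in for a trained classifier: a single hand-written rule.
def toy_classify(text):
    return "sport" if "cup" in text.lower() else "health"

# Held-out examples that were not used for training.
test_set = [
    ("Davis Cup final", "sport"),
    ("heart disease", "health"),
    ("Grand Prix qualifying", "sport"),  # the toy rule gets this one wrong
]
rate = misclassification_rate(toy_classify, test_set)
```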


Evaluating classifiers

Contingency table for the evaluation of a binary classifier

                      GREEN is correct   RED is correct
GREEN was assigned           a                 b
RED was assigned             c                 d

Accuracy = (a+d)/(a+b+c+d)
Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
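The formulas above translate directly into code. In this sketch the function name `binary_metrics` and the counts a=40, b=10, c=5, d=45 are made up for illustration, and no guard is included for empty rows or columns (which would cause a division by zero).

```python
def binary_metrics(a, b, c, d):
    """Accuracy, precision and recall from the contingency table cells.

    a: GREEN assigned, GREEN correct    b: GREEN assigned, RED correct
    c: RED assigned, GREEN correct      d: RED assigned, RED correct
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision_green": a / (a + b),
        "precision_red": d / (c + d),
        "recall_green": a / (a + c),
        "recall_red": d / (b + d),
    }

# Hypothetical counts for 100 test documents.
m = binary_metrics(a=40, b=10, c=5, d=45)
```

For these counts accuracy is 85/100, precision for GREEN is 40/50 and recall for GREEN is 40/45, which shows that the three measures can differ on the same table.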


Training size

• The more the better! (usually)
• Results for text classification*

*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang

[Figures for two further Training size slides, from the same source (Shen and Yang); images not reproduced]


Training size

Author identification

From: Authorship Attribution: A Comparison of Three Methods, Matthew Care