
SIMS 290-2: Applied Natural Language Processing

Preslav Nakov

October 6, 2004

 


Today

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo


20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/

Source: originally collected by Ken Lang

Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups

Some categories are strongly related (and thus hard to discriminate):

– comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x (computers)
– rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
– sci.crypt, sci.electronics, sci.med, sci.space
– misc.forsale
– talk.politics.misc, talk.politics.guns, talk.politics.mideast
– talk.religion.misc, alt.atheism, soc.religion.christian

Sample Posting: “talk.politics.guns”

From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status

In article <[email protected]>, [email protected] (Larry Cipriani) writes:

> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.

"Why Hillary! Your government smells so... FRESH!"
--
[email protected] --If you believe that I speak for my company,
OR [email protected] write today for my special Investors' Packet...

Need special handling during feature extraction: the “from” line, the “subject” line, the “… writes:” line, the quoted reply, and the signature.
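One way the special handling of these parts could look is sketched below. This is a minimal illustration, not the course script: the function name `split_posting` is hypothetical, and it assumes that headers end at the first blank line, quoted lines start with ">", and the signature follows a line containing only "--".

```python
import re

def split_posting(raw):
    """Split a posting into a dict of header fields and the cleaned body text."""
    head, _, body = raw.partition("\n\n")   # assumption: headers end at the first blank line
    headers = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    kept = []
    for line in body.splitlines():
        if line.strip() == "--":            # assumption: signature delimiter on its own line
            break
        if line.lstrip().startswith(">"):   # quoted text from the posting being replied to
            continue
        if re.match(r"In article .* writes:\s*$", line):  # attribution line
            continue
        kept.append(line)
    return headers, "\n".join(kept).strip()
```

The header fields come back separately, so they can be ignored, or turned into their own features, rather than being mixed in with the words of the body.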


Slide adapted from Eibe Frank's

WEKA: The Bird

Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand

WEKA: Terminology

Some synonyms/explanations for the terms used by WEKA, which may differ from what we adopted:

Attribute: feature
Relation: collection of examples
Instance: example (WEKA's Instances object is the collection in use)
Class: category

Slide adapted from Eibe Frank's

WEKA: The Software Toolkit

Machine learning/data mining software in Java
GNU License
Used for research, education and applications
Complements “Data Mining” by Witten & Frank
Main features:
– data pre-processing tools
– learning algorithms
– evaluation methods
– graphical interface (incl. data visualization)
– environment for comparing learning algorithms

http://www.cs.waikato.ac.nz/ml/weka

Slide adapted from Eibe Frank's

WEKA GUI Chooser

java -Xmx1000M -jar weka.jar

Slide adapted from Eibe Frank's

Our Toy Example

We demonstrate WEKA on a toy example:

3 categories from “20 Newsgroups”:
– misc.forsale
– rec.sport.hockey
– comp.graphics

20 documents per category

features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed
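The three feature rules of the toy example (lowercasing, a minimum frequency of 2, stopword removal) can be sketched as follows. The tiny stoplist and the name `extract_features` are placeholders for illustration only, and the sketch assumes that the frequency threshold is applied to whatever text is passed in; the slide does not say whether it is counted per document or over the collection.

```python
import re
from collections import Counter

# Illustrative stoplist only; a real run would load a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def extract_features(text, min_freq=2, stopwords=STOPWORDS):
    """Return {word: count} for lowercase non-stopwords seen at least min_freq times."""
    words = re.findall(r"[a-z]+", text.lower())          # lowercase, letters only
    counts = Counter(w for w in words if w not in stopwords)
    return {w: c for w, c in counts.items() if c >= min_freq}

print(extract_features("The puck and the goal. Puck for sale, puck!"))
```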

Slide adapted from Eibe Frank's

Explorer: Pre-Processing The Data

WEKA can import data from:
– files: ARFF, CSV, C4.5, binary
– URL
– SQL database (using JDBC)

Pre-processing tools (filters) are used for:
– discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

The Preprocessing Tab

[Screenshot of the Explorer. Callouts: Preprocessing; Classification; Filter selection; Manual attribute selection; Statistical attribute selection; List of attributes (last: class variable); Statistics about the values of the selected attribute; Frequency and categories for the selected attribute]

Slide adapted from Eibe Frank's

Explorer: Building “Classifiers”

Classifiers in WEKA are models for:
– classification (predict a nominal class)
– regression (predict a numerical quantity)

Learning algorithms:
– Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.

Meta-classifiers:
– cannot be used alone
– always combined with a learning algorithm
– examples: boosting, bagging etc.

The Classification Tab

Choice of classifier

The attribute whose value is to be predicted from the values of the remaining ones. Default is the last attribute. Here (in our toy example) it is named “class”.

Cross-validation: split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one
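The cross-validation procedure described on this slide can be sketched in a few lines. This is only an illustration of the fold logic: the name `k_fold_splits` is made up, and the round-robin assignment below does no shuffling or class balancing, which a real tool may well do.

```python
def k_fold_splits(n_examples, k=10):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]   # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With the toy example's 60 documents: 10 rounds, 54 for training and 6 for testing.
for train, test in k_fold_splits(60, k=10):
    print(len(train), len(test))
```

Every example is used for testing exactly once, so the reported accuracy is computed over the whole data set.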


Choosing a classifier

[Screenshot of the classifier options. Callouts:]
– False: Gaussian; True: kernels (better)
– displays synopsis and options
– numerical to nominal conversion by discretization
– outputs additional information

[Screenshot of the classifier output. Callouts:]
– accuracy
– different/easy class
– all other numbers can be obtained from it

Confusion matrix

Contains information about the actual and the predicted classification

              predicted
              –     +
true    –     a     b
        +     c     d

All measures can be derived from it:
– accuracy: (a+d)/(a+b+c+d)
– recall: d/(c+d) => R
– precision: d/(b+d) => P
– F-measure: 2PR/(P+R)
– false positive (FP) rate: b/(a+b)
– true negative (TN) rate: a/(a+b)
– false negative (FN) rate: c/(c+d)

These extend for more than 2 classes: see previous lecture slides for details
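The formulas on this slide translate directly into code. The sketch below uses the slide's cell names a, b, c, d (rows are the true class, columns the predicted class); the function name `binary_measures` is hypothetical, and it does not guard against empty rows or columns, which would cause a division by zero.

```python
def binary_measures(a, b, c, d):
    """Measures derived from a 2x2 confusion matrix laid out as on the slide."""
    recall = d / (c + d)
    precision = d / (b + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": b / (a + b),
        "tn_rate": a / (a + b),
        "fn_rate": c / (c + d),
    }

# Made-up counts, for illustration only.
for name, value in binary_measures(a=50, b=10, c=5, d=35).items():
    print(f"{name}: {value:.3f}")
```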

Predictions Output

Outputs the probability distribution for each example

Predictions Output

Probability distribution for a wrong example: predicted 1 instead of 3

Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its prediction, regardless of whether it is correct or not.
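A small numeric sketch of why the independence assumption leads to over-confidence: if the same weak piece of evidence appears as several perfectly correlated features, Naïve Bayes multiplies it in once per feature. The likelihood values 0.6 and 0.4 and the function name are made up for illustration.

```python
def posterior_after_copies(likelihood_pos, likelihood_neg, copies, prior_pos=0.5):
    """Two-class posterior when one feature is counted `copies` times as if independent."""
    pos = prior_pos * likelihood_pos ** copies
    neg = (1 - prior_pos) * likelihood_neg ** copies
    return pos / (pos + neg)

for copies in (1, 5, 10):
    print(copies, round(posterior_after_copies(0.6, 0.4, copies), 3))
```

With one copy the posterior equals the weak evidence itself; with ten correlated copies it is pushed close to 1, although no new information has been added.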

Error Visualization

Little squares designate errors

Axes show example number

Slide adapted from Eibe Frank's

Explorer: Attribute Selection

Find which attributes are the most predictive ones

Two parts:
search method:
– best-first, forward selection, random, exhaustive, genetic algorithm, ranking
evaluation method:
– information gain, chi-squared, etc.

Very flexible: WEKA allows (almost) arbitrary combinations of these two
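As an illustration of one evaluation method named on this slide, information gain for a single discrete feature can be computed as below. This is a textbook formulation written from scratch, not WEKA's code; the function names and the tiny example data are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = ["hockey", "hockey", "forsale", "forsale"]
print(information_gain([1, 1, 0, 0], labels))  # feature separates the classes perfectly
print(information_gain([1, 0, 1, 0], labels))  # feature carries no information
```

Ranking all features by this score and keeping the top ones corresponds to the "ranking" search method combined with the information gain evaluator.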

Individual Features Ranking

[Screenshots of the ranked attributes. Callouts: misc.forsale; comp.graphics; rec.sport.hockey; ???; random number seed]

Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Interactions

[Diagram: category C and features A and B. Labels: importance of feature A; importance of feature B; feature correlation; 2-Way Interactions]

3-Way Interaction: what is common to A, B and C together, and cannot be inferred from pairs of features.

Slide adapted from Guozhu Dong's

Feature Subsets Selection

Problem illustration
– Full set
– Empty set
– Enumeration

Search
– Exhaustive/Complete (enumeration, branch and bound)
– Heuristic (sequential forward/backward)
– Stochastic (generate/evaluate)
– Individual features or subsets generation/evaluation
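The heuristic "sequential forward" search mentioned on this slide can be sketched as a greedy loop: start from the empty set and keep adding the single feature that improves the score most. The `score` callback stands in for whatever subset evaluator is used; the function name and the toy scoring function are assumptions for illustration.

```python
def forward_selection(features, score, max_features=None):
    """Greedy sequential forward selection; returns (selected_features, best_score)."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:     # no single feature helps any more: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best

# Toy evaluator: each feature has a made-up usefulness, with a small cost per feature.
usefulness = {"puck": 0.5, "sale": 0.3, "the": 0.0}
toy_score = lambda subset: sum(usefulness[f] for f in subset) - 0.05 * len(subset)
print(forward_selection(usefulness, toy_score))
```

Unlike exhaustive enumeration, this examines only a small fraction of the possible subsets, at the price of possibly missing the best one.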

Feature Subsets Selection

[Screenshots of the selected attributes. Callouts: misc.forsale; comp.graphics; rec.sport.hockey]

17,309 subsets considered
21 attributes selected

Saving the Selected Features

All we can do from this tab is save the buffer to a text file. Not very useful...

But we can also perform feature selection during the pre-processing step... (see the following slides)

Feature Selection on Preprocessing

679 attributes: 678 + 1 (for the class)

Just 22 attributes remain: 21 + 1 (for the class)

Run Naïve Bayes With the 21 Features

21 Attributes

higher accuracy

(AGAIN) Naïve Bayes With All Features

ALL 679 Attributes (repeated slide)

different/easy class

accuracy

Some Important Algorithms

WEKA sometimes uses unexpected names for algorithms

Here is how to find the algorithms Barbara introduced:
– Naïve Bayes: weka.classifiers.bayes.NaiveBayes
– Perceptron: weka.classifiers.functions.VotedPerceptron
– Winnow: weka.classifiers.functions.Winnow
– Decision tree: weka.classifiers.trees.J48
– Support vector machines: weka.classifiers.functions.SMO
– k nearest neighbor: weka.classifiers.lazy.IBk

Some of these are more sophisticated versions of the classic algorithms

e.g. I cannot find the classic Naïve Bayes in WEKA (although there are 5 available implementations).
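These class names can also be used outside the GUI. The sketch below only builds a command line that cross-validates a classifier from Python; it assumes `weka.jar` is in the current directory and uses WEKA's `-t` (training file) and `-x` (number of folds) options. The file name `toy.arff` and the helper `weka_command` are hypothetical.

```python
def weka_command(classifier, arff_path, folds=10, jar="weka.jar"):
    """Build a java command line that cross-validates one WEKA classifier."""
    return ["java", "-cp", jar, classifier,
            "-t", arff_path,      # training file
            "-x", str(folds)]     # number of cross-validation folds

for name in ["weka.classifiers.bayes.NaiveBayes", "weka.classifiers.trees.J48"]:
    print(" ".join(weka_command(name, "toy.arff")))
    # To actually run it: subprocess.run(weka_command(name, "toy.arff"), check=True)
```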


Slide adapted from Eibe Frank's

Performing Experiments

Experimenter makes it easy to compare the performance of different learning schemes

Problems:
– classification
– regression

Results: written into file or database

Evaluation options:
– cross-validation
– learning curve
– hold-out

Can also iterate over different parameter settings

Significance-testing built in!

Experiments Setup

[Screenshots of the Experimenter setup. Callouts: datasets; algorithms; CSV file: can be opened in Excel]

Results (accuracy):
– SVM is the best
– Decision tree is the worst
– SVM is statistically better than Naïve Bayes
– Decision tree is statistically worse than Naïve Bayes
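To give a feel for what "statistically better" means here, the sketch below computes a plain paired t statistic over per-fold accuracies of two classifiers. The slides do not say which test the Experimenter applies, so this should be read as a generic illustration rather than a description of WEKA; the function name and the accuracy values are made up.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences between two classifiers."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)              # sample standard deviation
    return mean / (sd / math.sqrt(len(diffs)))

# Hypothetical accuracies on the same three folds.
t = paired_t_statistic([0.9, 0.9, 0.9], [0.8, 0.7, 0.6])
print(round(t, 3))
```

The statistic is then compared against a t distribution with (number of folds - 1) degrees of freedom; a large absolute value suggests the difference is unlikely to be due to chance.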

Experiments: Excel

Results are output into a CSV file, which can be read in Excel!


Slide adapted from Eibe Frank's

WEKA File Format: ARFF

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

Numerical attribute: e.g. age
Nominal attribute: e.g. sex
Missing value: ?

Other attribute types:
• String
• Date
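A file in this format is easy to generate from a script. The sketch below writes the header and data sections for numeric and nominal attributes, using "?" for missing values as in the example; the function name `arff_text` and its argument layout are assumptions for illustration, and values are not quoted or escaped.

```python
def arff_text(relation, attributes, rows):
    """attributes: list of (name, "numeric") or (name, [nominal values]); None means missing."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ", ".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

print(arff_text(
    "heart-disease-simplified",
    [("age", "numeric"), ("cholesterol", "numeric"), ("class", ["present", "not_present"])],
    [(63, 233, "not_present"), (38, None, "not_present")],
))
```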

WEKA File Format: Sparse ARFF

Value 0 is not represented explicitly
Same header (i.e. @relation and @attribute tags)
The @data section is different

Instead of

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

We have

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

This is especially useful for textual data (why?)
But! Problems with feature selection: cannot save results
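The conversion shown on this slide can be sketched as a one-line transformation: keep only the non-zero values, each prefixed by its attribute index, counting from 0 as in the slide's example. The function name `to_sparse_row` is hypothetical, and values are assumed to be already formatted as strings.

```python
def to_sparse_row(values):
    """Format one dense data row (list of strings) as a sparse ARFF row."""
    pairs = [f"{i} {v}" for i, v in enumerate(values) if v != "0"]
    return "{" + ", ".join(pairs) + "}"

print(to_sparse_row(["0", "X", "0", "Y", '"class A"']))
print(to_sparse_row(["0", "0", "W", "0", '"class B"']))
```

For text, where each document contains only a tiny fraction of the vocabulary, almost every value in a row is 0, so the sparse rows are far shorter than the dense ones.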

Python Interface to WEKA

Works on the 20 newsgroups collection

Extracts the features
– currently words
– easy to modify, just change one or more of:
  extract_features_and_freqs()
  is_feature_good()
  build_stoplist()

Allows filtering out:
– the stopwords
– the infrequent features

Features are weighted by their frequency in the document

Produces an ARFF file to be used by WEKA

Python Interface to WEKA

Allows you to specify:
– which subset of classes to consider
– the number of documents for each class
– the minimum feature frequency
– regular expression pattern a feature should match
– whether to remove the stopwords
– whether to convert words to lowercase
– kind of output to produce:
  sparse (i.e., feature = value)
  full vector (list of values)

Python Interface to WEKA: How To

Needs "20_newsgroups" and "stopwords" installed

To get things working under Windows:
– open “__init__.py”
– in the code below, substitute “/” with “\\”

    #####################################################
    # 20 Newsgroups
    groups = [(ng, ng+'/.*') for ng in '''
        alt.atheism rec.autos sci.space comp.graphics rec.motorcycles
        soc.religion.christian comp.os.ms-windows.misc rec.sport.baseball
        talk.politics.guns comp.sys.ibm.pc.hardware rec.sport.hockey
        talk.politics.mideast comp.sys.mac.hardware sci.crypt
        talk.politics.misc comp.windows.x sci.electronics
        talk.religion.misc misc.forsale sci.med'''.split()]
    twenty_newsgroups = SimpleCorpusReader(
        '20_newsgroups', '20_newsgroups/', '.*/.*', groups,
        description_file='../20_newsgroups.readme')
    del groups # delete temporary variable


Python Interface to WEKA

The Main Function


Python Interface to WEKA

Example Usage

Python dictionary

Estimated over the whole set! cross-validation: OK; test/train: not OK

Use 1

Python Interface to WEKA: Functions You Will Probably Want To Modify

convert to lowercase

Also: stemming!
Also: word+POS!
Also: compounds!

Python Interface to WEKA: You might want to add… Stemming

Porter stemmer

>>> cats = Token(TEXT='cats', POS='NN')
>>> from nltk.stemmer.porter import *
>>> porter = PorterStemmer()
>>> porter.stem(cats)
>>> print cats
<POS='NN', STEM='cat', TEXT='cats'>

WordNet stemmer
– morphy: morphological analyzer
– you need the following packages installed:
  nltk.wordnet
  nltk-contrib.pywordnet

>>> from nltk_contrib.pywordnet.stemmer import *
>>> morphy('dogs')
'dog'

Python Interface to WEKA: You might want to add… TF.IDF

TF.IDF: t_ij · log(N/n_i)

TF: t_ij
– t_ij: frequency of term i in document j
– this is how features are currently weighted

IDF: log(N/n_i)
– n_i: number of documents containing term i
– N: total number of documents

Modify the function extract_features_and_freqs_forall()
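The weighting defined on this slide could be computed along the following lines. This is a standalone sketch, not a patch for extract_features_and_freqs_forall(), whose code is not shown here; the function name `tf_idf` and the input layout (one token list per document) are assumptions, and the natural logarithm is used since the slide does not fix a base.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: t_ij * log(N / n_i)} dict per document."""
    n_docs = len(docs)                                              # N
    doc_freq = Counter(term for doc in docs for term in set(doc))   # n_i
    weighted = []
    for doc in docs:
        tf = Counter(doc)                                           # t_ij
        weighted.append({t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()})
    return weighted

docs = [["puck", "puck", "game"], ["game", "sale"]]
for weights in tf_idf(docs):
    print(weights)
```

A term that occurs in every document gets weight 0, which is the point of the IDF factor: such terms cannot help tell the categories apart.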


Summary

The 20 Newsgroups Text Collection

WEKA: The Toolkit
– Explorer: classification, feature selection
– Experimenter
– ARFF file format

Python Interface to WEKA
– feature extraction
– stemming
– weighting: TF.IDF

WEKA: Real-time Demo