© Lingfeng Mo Classifying Programming Newsgroup Discussions using Text Categorization Algorithms 1/19/2011 1

A Study of Text Categorization

Classifying Programming Newsgroup Discussions using Text Categorization Algorithms

by

Lingfeng Mo

What is text categorization?

Definition

– Classification of documents into a fixed number of predefined categories.

– Sometimes alternately referred to as text data mining.

What is Data Mining?

Many Definitions

– Non-trivial extraction of implicit, previously unknown and potentially useful information from data

– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks...

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]

Data Mining Tasks

Prediction Methods

– Use some variables to predict unknown or future values of other variables.

Description Methods

– Find human-interpretable patterns that describe the data.

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classification: Definition

Given a collection of records (training set)

– Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
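The split-and-validate procedure described above can be sketched in Python. This is a minimal illustration rather than the study's code: the function names `split_train_test` and `accuracy`, the fixed random seed, and the 75% training portion in the usage note are all assumptions made for the example.

```python
import random


def split_train_test(records, train_portion, seed=0):
    """Shuffle a copy of the records and cut it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_portion)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For instance, `split_train_test(records, 0.75)` on 24 records returns 18 training records and 6 test records; the model is built from the first list and `accuracy` is computed only on the second.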

Classification Example

[Diagram: a data set of 12 German documents and 12 English documents. A randomly chosen portion forms the Training Set, from which the Model (Classifier) is learned; the Test Set is used with the Classifier.]

Classification Example Result

Why our study?

Programmers often go online to seek and exchange information about problems with using a certain library, framework, or API.

Titles often do not correspond to the content of a newsgroup discussion.

Novices often do not know how to phrase a question precisely.

By categorizing an ongoing discussion, such techniques could be used to point developers who ask questions directly to previous discussions of similar problems.

What’s this study for?

Ideal Goal: Automatically classifying discussions into meaningful semantic categories.

– Approach:

Collect and save raw data

Import and optimize data

Select a certain portion of the data to train a classifier model

Classify data

Evaluate results

Tool we use

MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Via: http://mallet.cs.umass.edu/

Data Collecting

Download discussions from a Java programming forum

Save each discussion as a text document (.txt) – Article(text, name)

Manually put similar discussions into the same folder (labels)

Import Data

Input: Labels included with their articles

– How does it work?

Output: Mallet document (Instance)

Instance field    Content
Data              Char Sequence → TokenSequence → FeatureVectors
Name/Source       Name of Article (Name of Drive + Name of Folder + Name of Article)
Target            Label

Example of Import Data

Import-file

Import-dir

Training Classifier

Input the data produced by the importing process

Set the training portion for the training set and test set

K-Fold Cross-Validation – 10 trials usually

Set the trainer – NaiveBayesTrainer
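As a rough illustration of the K-fold idea mentioned above, the sketch below partitions a list of instances into k folds and yields one (training set, test set) pair per trial. It assumes the textbook form of K-fold cross-validation; the function name `k_fold_splits` is hypothetical, and nothing here is meant to describe how MALLET itself partitions the data.

```python
def k_fold_splits(items, k):
    """Yield k (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With 10 folds, each trial trains on roughly 90% of the instances and tests on the remaining 10%, and the per-trial test accuracies are then averaged. In practice the instances would be shuffled before being split.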

Basic Bayes theorem

A probabilistic framework for solving classification problems

Conditional Probability:

$$P(C \mid A) = \frac{P(A, C)}{P(A)}, \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$

Bayes theorem:

$$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$$
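To show how Bayes theorem turns into a text classifier, here is a minimal word-count (multinomial) Naive Bayes sketch with add-one smoothing. It is an illustrative assumption, not a description of MALLET's NaiveBayesTrainer; the class name `TinyNaiveBayes` and the labels in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Picks the class C maximizing log P(C) + sum of log P(word | C)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior P(C)
            for t in tokens:
                if t in self.vocab:                         # skip unseen words
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

After fitting on a few token lists labeled, say, "gui" and "io", `predict(["button", "listener"])` returns whichever label has the larger posterior score. The denominator P(A) is the same for every class, so it can be dropped when only the ranking matters.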

Example of Bayes Theorem

Given:

– A doctor knows that a cold causes a cough 50% of the time

– Prior probability of any patient having a cold is 1/50,000

– Prior probability of any patient having a cough is 1/20

If a patient has a cough, what is the probability that he/she has a cold? (Here M denotes cold and S denotes cough.)

$$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
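The arithmetic of this example can be checked with a few lines of Python. Exact fractions are used to avoid rounding noise; the function name `posterior` is only an illustrative choice.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(cause | symptom) = P(symptom | cause) * P(cause) / P(symptom)."""
    return likelihood * prior / evidence


# P(cough | cold) = 1/2, P(cold) = 1/50000, P(cough) = 1/20
p_cold_given_cough = posterior(Fraction(1, 2), Fraction(1, 50000), Fraction(1, 20))
print(float(p_cold_given_cough))  # 0.0002
```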

Example of Train Classifier

Output of Classification

Confusion matrix

Test data accuracy for every trial

Train data accuracy mean

– Standard Deviation

– Standard Error

Test data accuracy mean

– Standard Deviation

– Standard Error

Exp. picture of Classification(1 of 2)

Exp. picture of Classification(2 of 2)

How to improve the accuracy?

Increase recognition rate – Words Splitting

Unify words' tense – Words Stemming

Get rid of noisy data – Remove Stop Words

Handle overlapped categories – Top N Method

Words stemming

Change a verb's tense back to its base form

– Ex. Performed -> perform
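The slides do not say which stemmer was used, so the following is only a deliberately crude suffix stripper that reproduces the example above. It is not the Porter algorithm, and the name `naive_stem` and the suffix list are assumptions for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Lower-case the word and strip one common suffix, keeping at least 3 letters."""
    w = word.lower()
    if w.endswith("ss"):  # leave words such as "class" untouched
        return w
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w
```

With this sketch, "Performed", "performing", and "performs" all map to "perform", so the classifier sees them as one feature. A real stemmer handles many more cases (doubled consonants, irregular forms) that this sketch ignores.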

Words Splitting (1 of 3)

In which cases can we split a word?

– Punctuation

– Blank (whitespace)

– Underscore

Words Splitting (2 of 3)

Examples:

– Ex. Set_Value -> Set Value;

– ImageIcon("myIcon.gif")); -> ImageIcon myIcon gif;

– actionPerformed(ActionEvent e) -> actionPerformed ActionEvent e

See any problems?

– People often write several words, or words and numbers, together as a single token.

– Ex. JButton, actionListener, Button1, Button2, Button3, etc.

Words Splitting (3 of 3)

What are the special cases?

– 1. Begins with any number of capital letters followed by one or more words.

Ex. JFrame -> J Frame; JJJJJJJButton -> JJJJJJJ Button; JButtonApple -> J Button Apple

– 2. A lower-case letter, several lower-case letters, or a lower-case word followed by a word beginning with a capital letter.

Ex. cButton -> c Button; ccccccccButton -> cccccccc Button; setValue -> set Value; addActionListener -> add Action Listener

– 3. Several words, ALL beginning with a capital letter, joined together.

Ex. MyFrame -> My Frame; SetActionCommand -> Set Action Command

– 4. A word combined with numbers.

Ex. Button1 -> Button 1; 1Button -> 1 Button; Button123 -> Button 123; 123Button -> 123 Button
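The four special cases above, together with the punctuation, blank, and underscore rules, can be approximated with a handful of regular expressions. This is a sketch under the assumption that identifiers use only ASCII letters and digits; the function name `split_identifier` is hypothetical and the study's actual implementation is not shown in the slides.

```python
import re


def split_identifier(token):
    """Split a token on punctuation/blank/underscore, then on case and digit boundaries."""
    words = []
    for part in re.split(r"[\W_]+", token):
        if not part:
            continue
        # Case 4: boundary between letters and digits (Button1, 123Button)
        part = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", part)
        # Cases 2 and 3: a lower-case letter followed by a capital (setValue, MyFrame)
        part = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part)
        # Case 1: a run of capitals followed by a capitalized word (JButton)
        part = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", part)
        words.extend(part.split())
    return words
```

For example, `split_identifier("JButtonApple")` yields `["J", "Button", "Apple"]` and `split_identifier("addActionListener")` yields `["add", "Action", "Listener"]`, matching the examples on the slide.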

Remove Stopwords (1 Of 2)

What are stop words?

– The most common, short function words, such as the, is, at, which, and, on.

Any special cases?

Remove Stopwords (2 Of 2)

Extra Stop Words

– Programming words.

Ex. public, private, class, new, etc.

A word frequency counter helps to identify them.
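A small sketch of both ideas, stop-word filtering and a word frequency counter for spotting further stop-word candidates, might look as follows. The two word sets contain only the examples named on the slides; the function names are hypothetical.

```python
from collections import Counter

ENGLISH_STOP_WORDS = {"the", "is", "at", "which", "and", "on"}
PROGRAMMING_STOP_WORDS = {"public", "private", "class", "new"}


def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


def most_frequent(tokens, n):
    """Return the n most common lower-cased tokens with their counts."""
    return Counter(t.lower() for t in tokens).most_common(n)
```

Running `most_frequent` over the whole corpus surfaces very common words; those that appear in nearly every category carry little signal and are candidates for the extra stop-word list.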

Overlapped Categories

Each category is treated as an independent label by default.

How can realistic problems, where categories overlap, be handled?

– Top N Method

Top N Method

[Figure panels: Regular Way; Top N Method]
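The surviving slide text does not spell out how the Top N Method scores a prediction. One common reading, assumed here, is that a prediction counts as correct when the true label is among the N highest-scoring labels, rather than only when it is ranked first (the "regular way"). The function names and the label names in the usage note are hypothetical.

```python
def top_n_correct(scores, true_label, n):
    """True if true_label is among the n labels with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:n]


def top_n_accuracy(predictions, n):
    """predictions: list of (label -> score dict, true label) pairs."""
    hits = sum(top_n_correct(scores, label, n) for scores, label in predictions)
    return hits / len(predictions)
```

With `n=1` this reduces to ordinary accuracy. For a discussion scored `{"swing": 0.5, "io": 0.3, "threads": 0.2}` whose true label is "io", the regular way counts a miss while `n=2` counts a hit, which is why overlapping categories are penalized less under this scheme.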

Some of our test results

Tests are based on 10 different labels and 45 instances in total.

The following charts show the results visually.

Classify data with Original Mallet

[Chart: Raw Mallet. x-axis: Number of times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial        1     2     3     4     5     6     7     8     9     10
Raw Mallet   0.26  0.19  0.28  0.24  0.24  0.28  0.26  0.28  0.32  0.24

Lowest: 19%

Highest: 32%

Average: 25.9%
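The summary figures above, and the standard deviation and standard error listed among the classifier outputs, can be recomputed from the ten trial accuracies. The sketch assumes the sample standard deviation (n − 1 denominator); whether MALLET reports the sample or population form is not stated in the slides, and the function name `summarize` is hypothetical.

```python
import math
import statistics

# Test accuracy of the ten Raw Mallet trials, as read from the chart data
raw_mallet = [0.26, 0.19, 0.28, 0.24, 0.24, 0.28, 0.26, 0.28, 0.32, 0.24]


def summarize(accuracies):
    """Lowest, highest, mean, standard deviation, and standard error of the mean."""
    sd = statistics.stdev(accuracies)  # sample standard deviation (assumption)
    return {
        "lowest": min(accuracies),
        "highest": max(accuracies),
        "mean": statistics.mean(accuracies),
        "sd": sd,
        "se": sd / math.sqrt(len(accuracies)),
    }
```

Calling `summarize(raw_mallet)` gives a lowest value of 0.19, a highest value of 0.32, and a mean of about 0.259, in line with the 19%, 32%, and 25.9% shown on the slide.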

After Stemming & Words Splitting

[Chart: Tokenization & Words Splitting. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial                        1     2     3     4     5     6     7     8     9     10
Stemming & Words Splitting   0.36  0.40  0.40  0.26  0.44  0.28  0.30  0.36  0.34  0.36

Lowest: 26%

Highest: 44%

Average: 35%

After Removing Stop Words

[Chart: Stop Words Removed. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial                1     2     3     4     5     6     7     8     9     10
Stop Words Removed   0.62  0.42  0.44  0.50  0.44  0.40  0.40  0.50  0.42  0.36

Lowest: 36%

Highest: 62%

Average: 45%

With the Top N Method

Lowest: 54%

Highest: 72 %

Average: 63.6%

[Chart: Top N Method. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial          1     2     3     4     5     6     7     8     9     10
Top N Method   0.64  0.72  0.58  0.70  0.58  0.64  0.72  0.68  0.54  0.56

Any way to improve accuracy further?

Highlight the key feature.

Use only code data as training data.

Only code Data

Delete all the text other than code.

What is considered code?

Code includes not only multi-line code snippets, but also class names, such as JButton and JActionListener, and method calls, such as addActionListener(aListener).

Test Result of Code only Data

[Chart: Code Only Version. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial               1     2     3     4     5     6     7     8     9     10
Code Only Version   0.24  0.34  0.40  0.32  0.36  0.30  0.34  0.42  0.40  0.28

Code only:

Lowest: 24%

Highest: 42%

Average: 34%

For comparison (Top N Method on the full text):

Lowest: 54%

Highest: 72%

Average: 63.6%

Why did this happen?

Code-only data is not enough.

We cannot remove too much data, especially data that actually contributes to feature selection.

Is our data set too small?

Increase the Data Scale

What did we do?

- Increased the total number of instances from 45 to 158

- Increased the number of labels from 10 to 17

Data analysis and quality improvement, since categories may overlap

After Data Scale Increased

[Chart: Data Scale Increased. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial                  1     2     3     4     5     6     7     8     9     10
Data Scale Increased   0.62  0.56  0.63  0.68  0.52  0.54  0.63  0.64  0.60  0.55

For comparison (After Removing Stop Words, before the increase):

Lowest: 36%

Highest: 62%

Average: 45%

After the data scale was increased:

Lowest: 51.88%

Highest: 67.5%

Average: 59.75%

After Data Scale Increased with Top N

[Chart: Data Scale Increased (with Top N). x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial                  1     2     3     4     5     6     7     8     9     10
Data Scale Increased   0.75  0.69  0.69  0.66  0.79  0.76  0.70  0.74  0.72  0.71

For comparison (Top N Method, before the increase):

Lowest: 54%

Highest: 72%

Average: 63.6%

After the data scale was increased, with Top N:

Lowest: 66.25%

Highest: 79.38%

Average: 72.05%

Why did the accuracy increase?

A Naïve Bayes classifier can use a Gaussian distribution to represent the class-conditional probability of continuous attributes, so we wondered whether the frequency distribution of each word in the articles resembles a normal distribution.

We counted the frequency of each word in the articles and plotted a histogram to see whether it resembles a normal distribution.
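One way to read the histogram check described above is: for a chosen word, count how often it occurs in each article, then tally how many articles share each count. The sketch below follows that reading, which is an assumption; the function names `word_count_histogram` and `render` are hypothetical.

```python
from collections import Counter


def word_count_histogram(articles, word):
    """Map 'occurrences of word in an article' -> 'number of such articles'."""
    per_article = [Counter(tokens)[word] for tokens in articles]
    return Counter(per_article)


def render(histogram):
    """Return one text bar per occurrence count, in ascending order."""
    return [f"{count:>3} | {'#' * n}" for count, n in sorted(histogram.items())]
```

Printing the rendered lines gives a quick text histogram. A roughly bell-shaped outline would support the Gaussian assumption, whereas word counts in short documents are often heavily skewed toward zero.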

Histogram for Word A

Histogram for Word B

Only Code Data again

Lowest: 51.54 %

Highest: 64.62 %

Average: 58.23 %

[Chart: Only Code. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial                1      2      3      4      5      6      7      8      9      10
New Data Code Only   0.615  0.515  0.569  0.623  0.577  0.554  0.600  0.592  0.646  0.531

For comparison (After Data Scale Increased with Top N):

Lowest: 66.25%

Highest: 79.38%

Average: 72.05%

Data Without Code

Lowest: 67.5 %

Highest: 76.25 %

Average: 71.44 %

[Chart: Data without Code. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial                   1      2      3      4      5      6      7      8      9      10
New Data without Code   0.675  0.719  0.763  0.694  0.706  0.700  0.682  0.744  0.738  0.725

For comparison (After Data Scale Increased with Top N):

Lowest: 66.25%

Highest: 79.38%

Average: 72.05%

Results Analysis

Unlike for human readers, code is not the decisive factor for the classifier.

Based on our prepared data, code makes up only a small part of a single instance.

Comparison with Maximum Entropy

Lowest: 57.5 %

Highest: 70 %

Average: 63.06%

[Chart: MaxEnt. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial    1       2       3       4       5       6       7       8       9       10
MaxEnt   0.6375  0.6625  0.5750  0.7000  0.6688  0.6375  0.6063  0.6500  0.5750  0.5938

For comparison (Naive Bayes, After Data Scale Increased):

Lowest: 51.88%

Highest: 67.5%

Average: 59.75%

Maximum Entropy with Top N

Lowest: 69.38%

Highest: 83.75 %

Average: 78.63%

[Chart: MaxEnt With Top N Method. x-axis: Number of Times (trials 1 to 10); y-axis: Test Accuracy Mean]

Trial     1        2        3        4        5        6        7        8        9        10
Max Ent   0.81250  0.80625  0.83750  0.77500  0.78750  0.76875  0.69375  0.75625  0.80000  0.82500

For comparison (Naive Bayes, After Data Scale Increased with Top N):

Lowest: 66.25%

Highest: 79.38%

Average: 72.05%

Generative and discriminative models

Generative (Joint) models -> P(c,d)

Place probabilities over both the observed data and the hidden stuff.

- Ex. Naive Bayes

Discriminative (Conditional) models -> P(c|d)

Take the data as given, and place a probability over the hidden structure given the data.

- Ex. Maximum Entropy, SVMs
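In standard textbook notation (not taken from the slide itself), with a document d made up of features d_1, ..., d_n, the two families differ in what they model. Naive Bayes models the joint probability and classifies by maximizing it:

$$P(c, d) = P(c) \prod_{i=1}^{n} P(d_i \mid c), \qquad \hat{c} = \arg\max_{c} P(c, d)$$

A Maximum Entropy model instead parameterizes the conditional probability directly, with feature functions f_j and weights \lambda_j (both symbols introduced here for illustration):

$$P(c \mid d) = \frac{\exp\left(\sum_j \lambda_j f_j(c, d)\right)}{\sum_{c'} \exp\left(\sum_j \lambda_j f_j(c', d)\right)}$$

The two views are linked by $P(c \mid d) = P(c, d) / \sum_{c'} P(c', d)$, so a joint model can always produce a conditional one, but it is trained to explain the data as well as the labels.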

Let’s see a picture directly.

Generative and discriminative models

Bayes net diagrams draw circles for random variables, and lines for direct dependencies

Some variables are observed; some are hidden

Each node (conditional model) is a little classifier based on incoming arcs

[Diagrams: two Bayes nets over a class node c and data nodes d1, d2, d3, one labeled "Joint models" and one labeled "Conditional models"]

SVM-light

SVM-light is an implementation of Support Vector Machines (SVMs) in C.

Download at http://svmlight.joachims.org/

Solves classification and regression problems.

Solves ranking problems.

Efficiently computes Leave-One-Out estimates of the error rate, the precision, and the recall.

Supports standard kernel functions and lets you define your own.

Format of input file

The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

Explanation:

<target> .=. +1 | -1 | 0 | <float>

<feature> .=. <integer>

<value> .=. <float>

<info> .=. <string>
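To make the line format concrete, the following sketch writes and reads single training examples in that layout. The helper names `format_example` and `parse_example` are hypothetical and are not part of SVM-light; features are emitted in increasing index order, which SVM-light expects, and zero-valued features are simply omitted.

```python
def format_example(target, features, info=None):
    """Build '<target> <feature>:<value> ... # <info>' from a {index: value} dict."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()) if v != 0)
    line = f"{target} {pairs}"
    return f"{line} # {info}" if info else line


def parse_example(line):
    """Split one example line back into (target, {index: value}, info or None)."""
    body, _, info = line.partition("#")
    fields = body.split()
    target = float(fields[0])
    features = {int(i): float(v) for i, v in (f.split(":") for f in fields[1:])}
    return target, features, info.strip() or None
```

For example, `format_example("+1", {3: 0.5, 1: 1.0}, "doc42")` produces `+1 1:1.0 3:0.5 # doc42`, where each feature index would stand for one word in the vocabulary and the value for its weight in the discussion.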

How it Works – Linear Mapping

picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf

Polynomial mapping

picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf

Future Work

Ongoing experiments and result analysis based on SVM

Continue to increase the data scale

Further increase the identification rate with SVM

Compare the accuracy of generative (Naive Bayes) and discriminative (SVM) models based on our results, since we think the main factor that determines accuracy is not the tool itself but selecting the tool that best matches the data model underlying a given text categorization problem.

Reference

McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining, 2005

Lecture 7 of Artificial Intelligence | Natural Language Processing Course at Stanford. Instructor: Manning, Christopher D.

K. Nigam, J. Lafferty, and A. Mccallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.

T. Joachims, Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. University at Dortmund, LS VIII, 1997.