
Automated Classification of Short Message Service (SMS) using Naïve Bayes Algorithm

ALOYSIUS OCHOLA

oaloxde@yahoo.co.uk

MAKERERE UNIVERSITY

ARTIFICIAL INTELLIGENCE GROUP

Artificial Intelligence Seminar . May 30, 2013


Classification

• A supervised learning technique that involves assigning labels to a set of unlabeled input objects.

• Based on the number of classes present, there are two types of

classification:


– Binary classification: classifies the members of a given set of objects into one of two classes.

– Multi-class classification: classifies instances into more than two classes.

• Compared with the better understood binary classification, multi-class classification is more complex and less researched.


Text Classification/Categorization

• Text documents are one of several areas where classification can be applied.

• TC (text classification/categorization) is the application of classification algorithms to text documents in order to automatically group them into predefined categories.


• How to represent text documents

– Preprocessing and feature selection

• How to build the classifier: compute a classification function.

– Training the classifier and classifying


Short Text Documents

• Normal documents like emails, journals, etc. are typically large and rich in content (natural language).

– It is easy to apply traditional classification approaches, which rely on word frequencies.


• This is unlike short text documents such as SMS and Twitter messages, forum posts, etc., where word occurrence is too small.

– Dealing with short text therefore requires a little more than traditional techniques.

• Especially during preprocessing and feature selection


Applications of TC

• Spam filtering, a process which tries to discern E-mail spam messages from

legitimate emails

• Email routing, sending an email sent to a general address to a specific address or

mailbox depending on topic.


• Language identification, automatically determining the language of a text

• Genre classification, automatically determining the genre of a text.

• Movie reviewing, automatically classifying reviews as good, bad, or neutral.

• Etc . . .


Data Preprocessing

• Data captured in the real world is noisy, inconsistent, and of poor quality; some cleaning and transformation is required.

• For quality results from short text, most of the major steps of text preprocessing are skipped and selected ones are modified.

• Tokenization and lowercasing: splitting text streams to tokens and forced


lowercasing.

– Word boundary detection, using whitespace and punctuation

– Note: Prepared corpus was lowercased.

• Minor spell-correction: although there is a growing culture of using (informal) short-hands in SMS texts, some spell corrections can still be done.


Data Preprocessing (cont)

– Regular expression replacer: replacing apostrophe (contraction) words with their expanded forms via matching regular expressions.

• a list of pairs of contraction patterns and corrections, e.g. won’t : will not, didn’t : did not, . . .

– Repeat replacer: people are often not strictly grammatical and may write "I looooove it" to emphasize the word "love".


• Before replacing any characters from the supplied word:

– Note that the module reduces any run of more than two repeating characters, as no such words exist in the English vocabulary, for example "goooooooose" to "goose".

– It first looks up whether WordNet (a lexical database for English natural language) recognizes the supplied word.


Data Preprocessing (cont)

• If it does not, the regular expression (RE) (\w*)(\w)\2(\w*) is used to remove the extra repeated characters from the word.

– Matches 0 or more starting characters (\w*)

– A single character (\w), followed by another instance of that character \2

– Then 0 or more ending characters (\w*)


• Stop-words filtering: the process of removing the most frequent words that exist in a document.

– Looking up a file containing stop words and returning only the words not in the file/dictionary.
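To make these steps concrete, below is a minimal sketch in Python/NLTK (the toolkit named later in this presentation) of the tokenization/lowercasing, repeat-replacer and stop-word filtering steps described above; the helper names and the sample message are illustrative, not taken from the project code.

# Minimal preprocessing sketch (assumed helper names; requires NLTK's
# 'punkt', 'wordnet' and 'stopwords' data to be downloaded).
import re
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))
REPEAT_RE = re.compile(r'(\w*)(\w)\2(\w*)')

def fix_repeats(word):
    # Keep the word as soon as WordNet recognizes it; otherwise strip one
    # repeated character at a time, e.g. 'goooooooose' -> 'goose'.
    if wordnet.synsets(word):
        return word
    shortened = REPEAT_RE.sub(r'\1\2\3', word)
    return fix_repeats(shortened) if shortened != word else shortened

def preprocess(sms_text):
    # Tokenize on whitespace/punctuation boundaries, force lowercase,
    # normalize repeated characters, and drop stop words and punctuation.
    tokens = [t.lower() for t in word_tokenize(sms_text)]
    tokens = [fix_repeats(t) for t in tokens]
    return [t for t in tokens if t.isalnum() and t not in STOP_WORDS]

print(preprocess("I looooove the new school, it is sooo good!"))
# -> ['love', 'new', 'school', 'good']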


A Classifier

• A classifier is built on a function f which determines the category of an input feature vector x, given a fixed set of classes C = {c1, c2, …, cn} and a description of features x ∈ X

– where X is the feature space and C the set of output class labels.

• In simple terms: f(x) ∈ C

– where f(x) is the classification function whose domain is X and whose range is C. The class labels C can be ordered or unordered (categorical).

• A classifier is expected to learn from a set of N input-output pairs (the training data set) and predict the class of unseen input. That is to say, it maps X to C: f : X → C


Building the Text Classifier

• For this particular case, we deal with a probabilistic text classifier ft based on the Naïve Bayes classification (NBC) theorem.

• Building the classifier therefore involves a recursive process of creating a functional classifier by training it with an example data set (NB learning) and running the trained classifier on unknown content to determine class membership for that content (Bayesian classification).

• A probabilistic classifier, to predict the class membership of a certain new document X, calculates the probability of a class C given that document, that is: P(C|X)


Naïve Bayes Algorithm

• It is a simple probabilistic learning and classification method built upon the Bayes probability theorem.

• It assumes that the presence (or absence) of a particular feature of a class

is not related to the presence (or absence) of any other feature (naïve


assumption).

• Uses the prior probability P(C) of each category, given no information about an item.

• Categorization produces a posterior probability distribution P(C|X) over the possible categories, given a description of an item.


Naïve Bayes (NB) Probability Theorem

• Derived from the definition of conditional probability

– the probability that an event will occur, when another event is known to occur or to have occurred.

• From the product rule, given events C and X:

P(C ∩ X) = P(C|X) · P(X) = P(X|C) · P(C)

• Conditional probability is given as:

P(C|X) = P(C ∩ X) / P(X),  P(X) ≠ 0

• Bayes Rule:

P(C|X) = P(X|C) · P(C) / P(X),  P(X) ≠ 0        Equation (1)

P(C): Prior probability, the initial probability that C holds before seeing any evidence

P(X): Probability that X is observed

P(X|C): Likelihood, probability of observing X given that C holds

P(C|X): Posterior probability, the probability that C holds given X is observed
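As a quick illustration of Equation (1), a small Python example with made-up numbers (the class "health" and the probabilities below are purely hypothetical):

# Tiny numeric illustration of Bayes Rule (Equation (1)) with invented values:
# suppose 20% of SMS are about "health" (prior), the word "clinic" appears in
# 30% of health messages (likelihood) and in 9% of all messages (evidence).
p_c = 0.20          # P(C): prior probability of the class
p_x_given_c = 0.30  # P(X|C): likelihood of the observation given the class
p_x = 0.09          # P(X): probability of the observation

p_c_given_x = p_x_given_c * p_c / p_x   # P(C|X) = P(X|C) * P(C) / P(X)
print(round(p_c_given_x, 3))            # posterior, about 0.667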


Deriving NB Classification Algorithm

• Given a set of feature vectors for each possible class C, the task of the NBC (NB classification) algorithm is to approximate the probability that new input features X belong to C, that is, the class posterior or simply the greatest P(C=c | X).


• Assume a boolean random variable C and a vector space X containing n boolean attributes:

– If ci is the ith possible value of C and xk denotes the kth attribute of X

– Applying the NB probability theorem (Equation (1)):

P(C=ci | X=xk) = P(X=xk | C=ci) · P(C=ci) / Σj P(X=xk | C=cj) · P(C=cj)        Equation (2)


Deriving NBC Algorithm

• NB conditional independence assumption: features (term presence) are independent of each other given the class. A new document of n features can therefore be classified into one of the C classes using Equation (2) as:

P(C|X) ∝ ∏(k=1..n) P(xk | C)

• The aim of the classifier is to return the maximum posterior probability of c, thus:

c = argmax(ci) [ P(C=ci) · ∏k P(xk | C=ci) / Σj P(C=cj) · ∏k P(xk | C=cj) ]

• Further, because the sample space (denominator) is always constant for all the classes and does not depend on any class ci of C, the NBC theorem is given as:

c = argmax(ci) P(C=ci) · ∏k P(xk | C=ci)        Equation (3)


Training Naïve Bayes Text Classifier

• During the training process, the classification function ft extracts and selects the most useful

features from the example corpus and labels

them with their appropriate class.

– Construct and store a mapping of feature-set:label pair sets (the training dataset), which ft will learn from.

• feature-set is a list of preprocessed and unique term

occurrences from the document samples

• label is the known class of that feature-set.
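A minimal sketch of what such feature-set:label training pairs can look like and how a Naïve Bayes classifier can be trained on them with NLTK (the toolkit named later in the presentation); the two example messages and labels are invented for illustration, not the project's data.

# Hypothetical feature-set:label pairs (the training dataset) and training of
# NLTK's built-in Naive Bayes classifier on them.
import nltk

train_set = [
    ({'clinic': True, 'fever': True}, 'health'),       # feature-set : label
    ({'school': True, 'teacher': True}, 'education'),
]
ft = nltk.NaiveBayesClassifier.train(train_set)
print(ft.classify({'fever': True, 'malaria': True}))    # -> 'health'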


Feature Representation

• Features describe and represent texts in a format suitable for further machine processing.

• Final performance depends on how descriptive the features used for text description are.

• Supervised learning classifiers can use any sort of feature:

– URLs, email addresses, punctuation, capitalization, dictionaries, network features

• Word-based features (Bag of Words): a feature extraction process that transforms the plain documents, which are merely strings of text, into a feature set containing the (frequency of) occurrence of each word, usable by a classifier.
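A sketch of the bag-of-words idea: each (preprocessed) token becomes a feature recording its presence; a frequency count could be stored instead of True. The function name is illustrative.

def bag_of_words(tokens):
    # Map every unique token to True so the classifier sees term presence.
    return {token: True for token in tokens}

print(bag_of_words(['school', 'opening', 'school']))
# -> {'school': True, 'opening': True}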


Feature Selection

• Text collections have a large number of features, yet some classifiers cannot handle a very large number of features. Performing feature selection therefore reduces training time and improves performance, as it eliminates noise from the features and avoids overfitting.

• Term weighting: each term in a document vector must be associated with a value (weight) which measures the importance of the term and denotes how much it contributes to the categorization task of the document.

– Based on information theory: the frequency count of every word

– Chi-squared statistical distribution: a score measure of the bigrams of each word per label
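One possible way to realize the chi-squared bigram scoring mentioned above is with NLTK's collocation utilities; the token list below is illustrative, not the project's corpus.

# Scoring word bigrams with the chi-squared measure and keeping the top ones.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = ['free', 'airtime', 'free', 'airtime', 'register', 'to', 'win']
finder = BigramCollocationFinder.from_words(tokens)
print(finder.nbest(BigramAssocMeasures.chi_sq, 2))      # two best-scoring bigrams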


Text Classification

• A one-step classifier testing process of taking the built text classifier ft and running it on unknown content to determine class membership for that content.

• New input (test) SMS stream is passed to the classifier.

• Preprocesses the stream and compares it with the set of pre-classified examples (training set).


Numerical underflow

• In Equation (3), many conditional probabilities are multiplied, one for each position of X

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-

point underflow.

• Since log(x·c) = log(x) + log(c), it is better to perform all computations by summing the natural logs of the probabilities rather than multiplying them. Therefore, during text classification, the normalized NBC equation given below is used.

c = argmax(ci) [ log P(C=ci) + Σ(k=1..n) log P(xk | C=ci) ]
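A tiny illustration of why the log form matters: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays finite (the values here are arbitrary).

import math

probs = [0.01] * 200                         # many small conditional probabilities
product = 1.0
for p in probs:
    product *= p                             # underflows to 0.0
log_score = sum(math.log(p) for p in probs)  # stays finite: 200 * log(0.01)
print(product, log_score)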


Implementation Pseudo Algorithm

for a given unknown input document:
• break the input stream into word tokens
• preprocess the tokens

for a given training set:
• count the number of documents in each class
• for every training document, for each class:
– if a preprocessed token appears in the document: increment the count for that token
• for each class, for each preprocessed token:
– divide the token count by the total token count to get the conditional probabilities
• return the log conditional probabilities for each class

for all the individual class log conditional probabilities:
• compute a comparison of the probability values
• return the class with the greatest probability (maximum likelihood hypothesis)
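A compact, self-contained Python sketch of the counting-based training and log-probability classification the pseudo-algorithm describes; the add-one smoothing, helper names and toy data are assumptions for illustration, not the project's code.

# Count token occurrences per class, turn the counts into (add-one smoothed)
# log conditional probabilities, and return the class with the greatest score.
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    # labeled_docs: list of (token_list, label) pairs from the training set.
    class_docs = Counter()                      # documents per class
    token_counts = defaultdict(Counter)         # token counts per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, token_counts, vocab

def classify(tokens, class_docs, token_counts, vocab):
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # log prior + sum of log conditional probabilities (Equation (3), logged)
        score = math.log(class_docs[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokens:
            count = token_counts[label][token] + 1          # add-one smoothing
            score += math.log(count / (total_tokens + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)          # greatest (log) posterior class

data = [(['clinic', 'fever'], 'health'), (['school', 'teacher'], 'education')]
model = train(data)
print(classify(['fever', 'clinic'], *model))    # -> 'health'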


Evaluation and Implementation Approach

• Evaluation: test SMS text documents to assess the classifier's success in predicting the class.

• Implementation: a complete text classification application with an interactive user interface.

Accuracy = Correct_Predictions / Total_number_of_tests


– Natural Language Processing approach

• The Natural Language Toolkit (NLTK) is used with the Python programming language.

– NLTK is entirely self-contained and provides convenient functions and

wrappers that can be used as building blocks for common NLP tasks.
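A hedged sketch of the evaluation step with NLTK: hold out part of the labeled feature sets, train on the rest, and compute Correct_Predictions / Total_number_of_tests with nltk.classify.accuracy; the tiny data set is illustrative only.

import nltk

labeled = [
    ({'clinic': True, 'fever': True}, 'health'),
    ({'school': True, 'teacher': True}, 'education'),
    ({'malaria': True, 'clinic': True}, 'health'),
    ({'teacher': True, 'exams': True}, 'education'),
]
train_set, test_set = labeled[:2], labeled[2:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))     # -> 1.0 on this toy split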


BIBLIOGRAPHY

Aloysius Ochola. Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling. MSc. Computer Science Project, Makerere University Kampala (2013). oaloxde@yahoo.co.uk


DEMO . . .

• Training samples were collected from manually categorized SMS messages compiled by Ureport, an SMS-based opinion forum.

• Problem: they receive up to 10,000 SMS messages a day and are supposed to reply to all the messages that are relevant and worthy.

smsTextClassificationApplication
