
Elements of Text Mining, Part I
Basics of Text Mining with Python

Jaganadh G
http://jaganadhg.in
CC BY-SA

Jaganadh G Elements of Text Mining

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.

Tokenizing text with Python

import re

def tokenize(text):
    tokenizer = re.compile(r'\W+')
    return tokenizer.split(text.lower())

doc = "John likes to watch movies. Mary likes too."
words = tokenize(doc)
print words

Jaganadh G Elements of Text Mining


Twokenization

The rise of social media introduced new orthographic patterns in digital text. A typical example is a tweet, where people use abbreviated forms of words, emoticons, hash-tags, etc. Generic text tokenization techniques won't yield good results in separating words in social media text like tweets. A good social media tokenizer has to take care of emoticons, hash-tags, shortened URLs, etc.

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweet, pc=True):
    twokenizer = Tokenizer(preserve_case=pc)
    return twokenizer.tokenize(tweet)

tweet = "RT @USER Relevant 2 clinical text > Recursive neural networks Deep Learning Natural Language Processing #NLProc http://t.co/"
twokens = twokenize(tweet)

1. https://bitbucket.org/jaganadhg/twittertokenize

Jaganadh G Elements of Text Mining

Sentence Tokenization

Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of . ? ! (and maybe ; : --)

Move the boundary after following quotation marks, if any.

Disqualify a period boundary in the following circumstances:

If it is preceded by a known abbreviation of a sort that does not normally occur word-finally, but is commonly followed by a capitalized proper name, such as Prof. or vs.
If it is preceded by a known abbreviation and not followed by an uppercase word. This will deal correctly with most usages of abbreviations like etc. or Jr. which can occur sentence-medially or finally.

Disqualify a boundary with a ? or ! if:

It is followed by a lowercase letter (or a known name).

Regard other putative sentence boundaries as sentence boundaries.

2. Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
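A rough Python sketch of these heuristics (a minimal illustration only; the abbreviation list is made up and far from complete, and the NLTK punkt tokenizer shown on the next slide is the practical option):

import re

# Illustrative abbreviation list; a real system needs a much larger one.
ABBREVS = set(['prof', 'dr', 'mr', 'mrs', 'vs', 'etc', 'jr'])

def naive_sent_tokenize(text):
    sents = []
    start = 0
    # Step 1: putative boundaries after . ? ! optionally followed by a quote
    for match in re.finditer(r'[.?!]["\']?\s+', text):
        end = match.end()
        before = text[start:match.start()].split()
        prev_word = before[-1].lower().rstrip('.') if before else ''
        next_char = text[end:end + 1]
        # Step 2: disqualify boundaries after known abbreviations,
        # or when the next word does not start with an uppercase letter
        if prev_word in ABBREVS or (next_char and not next_char.isupper()):
            continue
        sents.append(text[start:end].strip())
        start = end
    sents.append(text[start:].strip())
    return [s for s in sents if s]

print naive_sent_tokenize("Prof. Smith arrived. He met Dr. Jones etc. They talked.")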

Jaganadh G Elements of Text Mining

Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltk.data import load

tokenizer = load('tokenizers/punkt/english.pickle')

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

sents = tokenizer.tokenize(text)

for sent in sents:
    print sent

Jaganadh G Elements of Text Mining

Counting Words

Word Count - Python

def word_count(text):
    words = tokenize(text)
    word_freq = dict([(word, words.count(word)) for word
        in set(words)])
    return word_freq

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

wc = word_count(text)

for word, count in wc.items():
    print word, "\t", count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

wl = word_length(text)

for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), i.e. simply the total number of words.

Then p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

wp = word_propo(text)

for word, propo in wp.items():
    print word, "\t\t", propo

Jaganadh G Elements of Text Mining


Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the
count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / float(len(set(words)))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining


Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()

topwords = top_words(text, n=50)

for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    nw = len(tfw)  # number of words to plot
    y = []
    for item in range(nw):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, nw + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    nw = len(tfw)
    y = []
    for item in range(nw):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()

plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list 3.

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()

dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3. Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from the GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()

create_tag_cloud(gpl)

4. https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a. Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

     w1   w2   w3   ...  wn
w1   m11  m12  m13  ...  m1n
w2   m21  m22  m23  ...  m2n
w3   m31  m32  m33  ...  m3n
...
wn   mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four
associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a. http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

a. Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5. http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency - inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. where tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)

V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| × F}

where |D| is the cardinality of the document space:

M_{|D| × F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the
normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm
(magnitude or length) of the vector v in the L^p space (Lebesgue spaces). a

a. http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)

With the p-norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

which can be written compactly as:

‖v‖_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6. http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v̂_d1 = v_d1 / ‖v_d1‖_2

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂_d1 now has an L2 norm of ‖v̂_d1‖_2 = 1.0.
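This normalization can be checked with a few lines of NumPy (a minimal sketch using the example vector v_d1 above):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.linalg.norm(v_d1)      # L2 norm = sqrt(8), about 2.8284
v_hat = v_d1 / norm              # the normalized vector
print v_hat                      # ~ [0, 0, 0.354, 0.707, 0.354, 0.354, 0.354]
print np.linalg.norm(v_hat)      # 1.0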

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2 / 1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
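These values can be reproduced with a short Python sketch (a minimal illustration; the two term-frequency vectors and the vocabulary order are the ones from the example above, and math.log is the natural logarithm, matching the numbers on this slide):

import math

# Term-frequency vectors for d1 and d2 in vocabulary order:
# football, games, john, likes, mary, movies, watch
tf = [[0, 0, 1, 2, 1, 1, 1],
      [1, 1, 1, 1, 0, 0, 1]]

n_docs = len(tf)
n_terms = len(tf[0])

# document frequency: in how many documents each term occurs
df = [sum(1 for doc in tf if doc[t] > 0) for t in range(n_terms)]

# idf(t) = log(|D| / (1 + df(t))), as defined on the previous slides
idf = [math.log(n_docs / float(1 + df[t])) for t in range(n_terms)]
print idf      # [0.0, 0.0, -0.405..., -0.405..., 0.0, 0.0, 0.0]

# tf-idf weights: tf(t, d) * idf(t)
tfidf = [[doc[t] * idf[t] for t in range(n_terms)] for doc in tf]
print tfidf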

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| × F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1) ]
[ tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]

×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tfidf = M_tfidf / ‖M_tfidf‖_2 (each document row is divided by its L2 norm)

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7. http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a. http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

Example:

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

Example:

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories 8.

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8. http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier (or classifier function) γ that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or
Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be
predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the
evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the
evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds. Then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C              P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
Terrorism      0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
Entertainment  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a
uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill  bomb  kidnap  music  movie  tv   C
Dt         2     1     2       0      0      1    ?

To find the posterior probability:

P(ci | W) = P(ci) × Π_{j=1}^{|V|} P(wj | ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism | W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² ×
P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment | W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ ×
P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will
train the classifier with thousands of documents. In such cases the conditional probability values
will be too low for the CPU to handle. This problem is called underflow. To resolve the
problem we can take the logarithm of the probabilities:

P(Terrorism | W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment | W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism".
From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes
Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
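A small Python sketch of this calculation (a minimal illustration of multinomial Naïve Bayes with Laplace smoothing in log space; the per-class word totals follow the worked example above, and log base 10 is used so the scores match the slide):

from __future__ import division
import math

# Per-class word totals (kill, bomb, kidnap, music, movie, tv) as used
# in the worked example: n_Terrorism = 15, n_Entertainment = 12.
counts = {
    'Terrorism':     [4, 3, 6, 0, 1, 1],
    'Entertainment': [0, 1, 1, 5, 4, 1],
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
V = 6                                 # vocabulary size
test_doc = [2, 1, 2, 0, 0, 1]         # the test document Dt

for cls, wc in counts.items():
    n = sum(wc)
    score = math.log10(prior[cls])    # log prior
    for freq, c in zip(test_doc, wc):
        p_w_c = (c + 1) / (n + V)     # Laplace-smoothed P(w | class)
        score += freq * math.log10(p_w_c)
    print cls, score                  # ~ -4.24 (Terrorism), ~ -6.63 (Entertainment)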

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt
from the vocabulary. Dimension t of the vector for document di is written Bit, and is
either 0 or 1, indicating whether word wt occurs at least once in the document. With
such a document representation we make the naive Bayes assumption: that the
probability of each word occurring in a document is independent of the occurrence of
other words in a document. a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and
Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
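A minimal scikit-learn sketch of this setup (an illustration only, using BernoulliNB on the binary document vectors from the table above; alpha=1.0 is Laplace smoothing):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary document vectors (kill, bomb, kidnap, music, movie, tv) from the table above
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

clf = BernoulliNB(alpha=1.0).fit(X, y)

# Test document Dt, reduced to presence/absence of each vocabulary word
print clf.predict([[1, 1, 1, 0, 0, 1]])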

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency
information in documents. In the multinomial model a document is an ordered sequence of word events
drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the
probability of each word event in a document is independent of the word's context and position in the
document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple
bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial
naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training
set, attempts to generalize, and makes correct predictions on new data. Consider a
training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the
corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our
input feature vectors are linearly separable, that is, there exists a function
f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can
separate positive and negative examples, but all of them are not equal. SVM tries to
find the particular hyperplane that maximizes the margin. The vectors which are closest to
the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we
have to use kernel tricks to find soft margins.
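A tiny sketch of this idea on linearly separable 2-D points (a minimal illustration with scikit-learn's LinearSVC; the points are made up):

import numpy as np
from sklearn.svm import LinearSVC

# Two linearly separable clusters of made-up 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],      # class -1
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = LinearSVC().fit(X, y)
print clf.coef_, clf.intercept_                # the learned w and b of <w, x> + b = 0
print clf.predict([[1.2, 1.1], [5.8, 5.9]])    # [-1  1]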

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are
mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that
it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance
from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything
on one side of the separating hyperplane belongs to one category and everything on the other side
belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter
(squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin.
The figures closest to the separating hyperplane are known as support vectors 10. If the data is not
linearly separable we have to use kernel tricks 11.

10. This is just a non-theoretical definition, "just to get an idea only". For more, refer to
http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python and NLTK 12:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more
detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of
an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A
confusion matrix for the classifier will look like:

                         Actual
                         Positive                Negative
Predicted   Positive     True Positive (TP)      False Positive (FP)
            Negative     False Negative (FN)     True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent
to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is
the number of correct results divided by the number of all returned results, and r is the
number of correct results divided by the number of results that should have been
returned. The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a. http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
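These metrics are easy to compute directly from the confusion matrix counts; a minimal sketch using the numbers from the slides above:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)        # 0.76
precision_pos = TP / (TP + FP)                    # 0.87
recall_pos = TP / (TP + FN)                       # 0.71
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)   # 0.78

print accuracy, precision_pos, recall_pos, f1_pos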

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that
are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance
of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis
testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews correlation coefficient (MCC) is used in machine learning as a measure
of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
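A quick check of this value in Python, with the same counts as above:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc    # ~ 0.55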

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
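With scikit-learn, the ROC curve and AUC can be computed from true labels and classifier scores (a minimal sketch; the labels and scores below are made up):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 1, 0, 0, 1, 0, 0]                      # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # predicted scores (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
print auc(fpr, tpr)                                    # 0.875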

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I.
http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II.
http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational
Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification.
http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification.
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text
Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on
March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing
Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information
Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 2: Elements of Text Mining Part - I

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining

Tokenizig text with Python

import re

def tokenize(text)

tokenizer = recompile(rsquoWrsquo)

return tokenizersplit(textlower())

doc = John likes to watch movies Mary likes too

words = tokenize(doc)

print words

Jaganadh G Elements of Text Mining

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining

Tokenizig text with Python

import re

def tokenize(text)

tokenizer = recompile(rsquoWrsquo)

return tokenizersplit(textlower())

doc = John likes to watch movies Mary likes too

words = tokenize(doc)

print words

Jaganadh G Elements of Text Mining

Twokenization

Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweetpc=True)

twokenizer = Tokenizer(preserve_case=pc)

return twokenizertokenize(tweet)

tweet = RT USER Relevant 2 clinical text gt Recursive neural networks

Deep Learning Natural Language Processing NLProc httptco

twokens = tokenize(tweet)

1httpsbitbucketorgjaganadhgtwittertokenize

Jaganadh G Elements of Text Mining

Sentence Tokenization

Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of ( and maybe - )

Move the boundary after following quotation marks if any

Disqualify a period boundary in the following circumstances

If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally

Disqualify a boundary with a or if

It is followed by a lowercase letter (or a known name)

Regard other putative sentence boundaries as sentence boundaries

2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Jaganadh G Elements of Text Mining

Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltkdata import load

tokenizer = load(rsquotokenizerspunktenglishpicklersquo)

text = How can this be implemented There are a lot of subtleties

such as dot being used in abbreviations

sents = tokenizertokenize(text)

for sent in sents

print sent

Jaganadh G Elements of Text Mining

Counting Words

Word Count - Python

def word_count(text)

words = tokenize(text)

word_freq = dict([(word wordscount(word)) for word

in set(words)])

return word_freq

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wc = word_count(text)

for wordcount in wcitems()

print word t t count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text)

words = tokenize(text)

word_length =

[word_length__setitem__(len(word)1 +

word_lengthget(len(word)0)) for word in words]

return word_length

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wl = word_length(text)

for length count in wlitems()

print There are d words of length d (count length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), i.e. simply the total number of words in C.
So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, '\t:\t', propo

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding Word Type Ratio

from __future__ import division  # float division, as in the earlier slides

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print '%s \t %d' % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list. 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word wi co-occurs with word wj within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...   ...   ...   ...   ...   ...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
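
The returned matrix behaves like a nested dictionary, so individual co-occurrence counts can be read off directly; a quick usage check on the toy corpus above:

print ccm['w1']['w2']   # 1, 'w1' and 'w2' share the first sub-list
print ccm['w2']['w1']   # 1, the matrix is symmetric
print ccm['w1']['w4']   # 0, these two words never share a context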

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted again.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
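
If NLTK is installed, its bundled English stop word list can be used instead of maintaining a hand-written one (the next slide shows such a hand-written list); a minimal sketch, assuming the NLTK stopwords corpus has been downloaded and reusing the tokenize function from the earlier slides:

from nltk.corpus import stopwords

def stop_filter_nltk(words):
    stops = set(stopwords.words('english'))   # NLTK's built-in English stop word list
    return [word for word in words if word not in stops]

print stop_filter_nltk(tokenize("this is a simple test of the filter"))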

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) =
    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python, just as a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. count(d) such that tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} =
    [ 0 0 1 2 1 1 1
      1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||p, where v̂ is the unit vector, or the normalized vector, v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||2
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
      = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
      = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2-norm of ||v̂(d1)||2 = 1.0
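
The same normalization can be checked numerically with NumPy; a small sketch (added here, not from the original slides) that reproduces the numbers above:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm, sqrt(8)
print v_hat                                  # [ 0.  0.  0.3535  0.7071  0.3535  0.3535  0.3535 ]
print np.linalg.norm(v_hat, ord=2)           # 1.0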

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. count(d) such that tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games)    = log( 2 / (1+1) ) = 0.0
idf(john)     = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1+1) ) = 0.0
idf(movies)   = log( 2 / (1+1) ) = 0.0
idf(watch)    = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
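
A few lines of Python (a sketch added here, not from the original slides) reproduce these idf values from the two toy documents and the vocabulary index used earlier:

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1 + df))    # log(|D| / (1 + df)), as defined above

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]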

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1)
  tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]
x
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from a list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance

unsupervised document classification: entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) =
    1 if dj belongs to ci
    0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the a priori probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5     15   0.238095238   0.19047619   0.33333333     0.047619048   0.095238095   0.095238095
E   0.5     12   0.05555556    0.11111111   0.11111111     0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total number of word frequencies in each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill   bomb   kidnap   music   movie   tv   C
Dt         2      1      2        0       0       1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
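
The whole worked example can be condensed into a few lines of Python; a sketch added here (not part of the original slides) that evaluates the two classes in log space, with the smoothed conditional probabilities hard-coded from the model table above (small differences from the slide's figures are only rounding):

import math

# smoothed P(w|c) values for (kill, bomb, kidnap, music, movie, tv), from the model table
p_w = {'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
       'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test = [2, 1, 2, 0, 0, 1]   # word frequencies of the test document Dt

for label in p_w:
    log_score = math.log10(prior[label])
    for freq, p in zip(test, p_w[label]):
        log_score += freq * math.log10(p)
    print label, round(log_score, 4)
# Terrorism      roughly -4.24
# Entertainment  roughly -6.63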

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
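
In scikit-learn this event model corresponds to BernoulliNB, which binarizes the term counts internally; a minimal sketch added here (not from the original slides), using two tiny made-up documents in place of the training table:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ('kill bomb kidnap tv', 'music movie tv')   # toy 'terrorism' and 'entertainment' texts
labels = ['Terrorism', 'Entertainment']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(docs)
clf = BernoulliNB(binarize=0.0)                    # counts > 0 are treated as 1
clf.fit(X, labels)
print clf.predict(vectorizer.transform(['bomb kidnap threat']))   # should favour 'Terrorism'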

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
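
The decision rule itself is just the sign of ⟨w, x⟩ + b; a tiny NumPy sketch added here as an illustration, with a hand-picked w and b rather than learned ones:

import numpy as np

w = np.array([1.0, -1.0])          # hypothetical weight vector
b = -0.5                           # hypothetical bias
X = np.array([[2.0, 0.5],          # a few 2-d feature vectors
              [0.0, 1.5]])

scores = X.dot(w) + b              # <w, x> + b for every row of X
labels = np.sign(scores)           # +1 on one side of the hyperplane, -1 on the other
print scores                       # [ 1. -2. ]
print labels                       # [ 1. -1. ]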

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                         Positive               Negative
Predicted   Positive     True Positive (TP)     False Positive (FP)
            Negative     False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
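
All of the figures above follow mechanically from the four confusion matrix cells; a small Python sketch added here that recomputes them for TP = 562, FP = 77, FN = 225, TN = 436 (the small differences from the slides are only rounding):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

print 'accuracy   ', (TP + TN) / (TP + FP + FN + TN)      # ~0.768
print 'error rate ', (FP + FN) / (TP + FP + FN + TN)      # ~0.232
print 'precision  ', TP / (TP + FP)                       # ~0.879
print 'recall     ', TP / (TP + FN)                       # ~0.714
print 'fall-out   ', FP / (FP + TN)                       # ~0.150
print 'specificity', TN / (FP + TN)                       # ~0.850
print 'NPV        ', TN / (TN + FN)                       # ~0.660
print 'FDR        ', FP / (FP + TP)                       # ~0.120
p, r = TP / (TP + FP), TP / (TP + FN)
print 'F1         ', 2 * p * r / (p + r)                  # ~0.788
print 'MCC        ', (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))        # ~0.551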

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
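
scikit-learn ships helpers for both the curve and the area; a short sketch added here, with made-up labels and scores, that computes the ROC points and the AUC:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]     # predicted scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print 'FPR:', fpr
print 'TPR:', tpr
print 'AUC:', auc(fpr, tpr)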

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 3: Elements of Text Mining Part - I

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining

Tokenizig text with Python

import re

def tokenize(text)

tokenizer = recompile(rsquoWrsquo)

return tokenizersplit(textlower())

doc = John likes to watch movies Mary likes too

words = tokenize(doc)

print words

Jaganadh G Elements of Text Mining

Twokenization

Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweetpc=True)

twokenizer = Tokenizer(preserve_case=pc)

return twokenizertokenize(tweet)

tweet = RT USER Relevant 2 clinical text gt Recursive neural networks

Deep Learning Natural Language Processing NLProc httptco

twokens = tokenize(tweet)

1httpsbitbucketorgjaganadhgtwittertokenize

Jaganadh G Elements of Text Mining

Sentence Tokenization

Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of ( and maybe - )

Move the boundary after following quotation marks if any

Disqualify a period boundary in the following circumstances

If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally

Disqualify a boundary with a or if

It is followed by a lowercase letter (or a known name)

Regard other putative sentence boundaries as sentence boundaries

2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Jaganadh G Elements of Text Mining

Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltkdata import load

tokenizer = load(rsquotokenizerspunktenglishpicklersquo)

text = How can this be implemented There are a lot of subtleties

such as dot being used in abbreviations

sents = tokenizertokenize(text)

for sent in sents

print sent

Jaganadh G Elements of Text Mining

Counting Words

Word Count - Python

def word_count(text)

words = tokenize(text)

word_freq = dict([(word wordscount(word)) for word

in set(words)])

return word_freq

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wc = word_count(text)

for wordcount in wcitems()

print word t t count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text)

words = tokenize(text)

word_length =

[word_length__setitem__(len(word)1 +

word_lengthget(len(word)0)) for word in words]

return word_length

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wl = word_length(text)

for length count in wlitems()

print There are d words of length d (count length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words

so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C

Finding Word Proportion

from __future__ import division

def word_propo(text)

words = tokenize(text)

wc = word_count(text)

propo = dict([(word wc[word]len(words)) for word

in set(words)])

return propo

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wp = word_propo(text)

for word propo in wpitems()

print word tt propo

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words

so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C

Finding Word Proportion

from __future__ import division

def word_propo(text)

words = tokenize(text)

wc = word_count(text)

propo = dict([(word wc[word]len(words)) for word

in set(words)])

return propo

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wp = word_propo(text)

for word propo in wpitems()

print word tt propo

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times mi co-occurs with word mj within a specific context: a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

     w1    w2    w3    ...   wn
w1   m11   m12   m13   ...   m1n
w2   m21   m22   m23   ...   m2n
w3   m31   m32   m33   ...   m3n
...
wn   mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(dictionary built after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n comes from older scikit-learn releases; newer versions use ngram_range=(1, 1)
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just a toy sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)
where 't' is a term in document 'd' and tf(t, d) is how many times the term 't' is present in 'd':
tf(t, d) = Σ_{x ∈ d} fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
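A direct translation of this definition into Python, as a small sketch (the document is assumed to be a list of lowercased tokens):

def tf(term, doc_tokens):
    # tf(t, d): how many times term 't' occurs in document 'd'
    return sum(1 for token in doc_tokens if token == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2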

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:
v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))
To represent 'd1' and 'd2' as vectors:
v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)
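The same vectors can be produced with a few lines of Python (a sketch; the vocabulary order is assumed to be the vocabulary index shown earlier):

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = "john likes to watch movies mary likes too".split()
d2 = "john also likes to watch football games".split()

def doc_vector(tokens, vocab):
    # one tf entry per vocabulary term, in vocabulary-index order
    return [tokens.count(term) for term in vocab]

print doc_vector(d1, vocab)   # [0, 0, 1, 2, 1, 1, 1]
print doc_vector(d2, vocab)   # [1, 1, 1, 1, 0, 0, 1]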

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space and F is the size of the feature (vocabulary) space:

M_{|D|×F} =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue space).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)
With norm p: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
‖v‖_p = (Σ_{i=1}^{n} |vi|^p)^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v(d1) / ‖v(d1)‖_2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8
v̂(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2 norm ‖v̂(d1)‖_2 = 1.0
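The same normalization can be checked with NumPy (a minimal sketch):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.linalg.norm(v_d1)              # L2 norm, sqrt(8)
v_d1_normalized = v_d1 / norm
print v_d1_normalized                    # [0. 0. 0.3535... 0.7071... ...]
print np.linalg.norm(v_d1_normalized)    # 1.0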

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games)    = log( 2 / (1+1) ) = 0.0
idf(john)     = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1+1) ) = 0.0
idf(movies)   = log( 2 / (1+1) ) = 0.0
idf(watch)    = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
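A small Python sketch of the same idf formula (the corpus here is the two toy documents with stop words removed, which is an assumption made for illustration):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    # number of documents in which the term appears
    doc_count = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1.0 + doc_count))

print idf('john', docs)      # -0.405..., term present in both documents
print idf('football', docs)  # 0.0, term present in one document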

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
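In NumPy this product is just the term-frequency matrix times a diagonal idf matrix (a sketch using the numbers computed above):

import numpy as np

tf_matrix = np.array([[0, 0, 1, 2, 1, 1, 1],
                      [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf_vector = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])
tf_idf = np.dot(tf_matrix, np.diag(idf_vector))   # element-wise tf x idf per column
print tf_idf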

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# vocabulary_ maps each term to its column index in idf_
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies', 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies', 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
                      (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
                      / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
                    / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.0555556    0.11111111  0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
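A small sketch of the same Laplace-smoothed estimate in Python (the word counts per class are the ones taken from the training table above):

def cond_prob(word_count_in_class, total_words_in_class, vocab_size):
    # Laplace (add-one) smoothed P(w|c)
    return (word_count_in_class + 1.0) / (total_words_in_class + vocab_size)

print cond_prob(4, 15, 6)   # P(kill|Terrorism)     = 5/21 ~ 0.2380
print cond_prob(0, 12, 6)   # P(kill|Entertainment) = 1/18 ~ 0.0555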

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
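A sketch of the same log-space scoring in Python (the priors, conditional probabilities and test counts are copied from the slides above, using base-10 logarithms as in the worked example):

import math

# P(w|c), in the order kill, bomb, kidnap, music, movie, tv
cond_probs = {'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
              'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

for label in priors:
    score = math.log10(priors[label])
    for count, p in zip(test_doc, cond_probs[label]):
        score += count * math.log10(p)
    print label, score   # Terrorism ~ -4.24, Entertainment ~ -6.63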

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
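Converting the multinomial count vectors of the earlier training table into this Bernoulli representation is a one-liner (a minimal sketch):

count_vectors = [[2, 1, 3, 0, 0, 1],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 2, 0, 1, 0],
                 [0, 1, 0, 2, 1, 1],
                 [0, 0, 1, 1, 1, 0],
                 [0, 0, 0, 2, 2, 2]]
# keep only presence/absence of each word
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]
for row in binary_vectors:
    print row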

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial Naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, either (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
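A small helper to compute these scores directly from the four confusion matrix cells (a sketch; the numbers are the ones from the table above):

def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print scores(562, 77, 225, 436)   # (0.767..., 0.879..., 0.714..., 0.788...)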

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
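With scikit-learn the ROC curve and AUC can be computed from true labels and predicted scores (a sketch; the labels and scores below are made-up illustration values):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]    # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)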

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
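A quick usage sketch (not in the original slides; the sentence is made up):

words = "this is a simple text and it has some stop words in it".split()
print(stop_filter(words))   # ['simple', 'text', 'stop', 'words']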

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
# note: newer scikit-learn releases use ngram_range instead of min_n
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
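As a minimal sketch (assuming d1 is the stop-word-filtered token list of the first example document), tf can be computed by counting matching tokens:

def tf(term, doc_tokens):
    # tf(t, d): how many times the term t occurs in document d
    return sum(1 for token in doc_tokens if token == term)

d1 = ['john', 'likes', 'watch', 'movies', 'mary', 'likes']
print(tf('john', d1))    # 1
print(tf('likes', d1))   # 2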

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D|×F}

where |D| is the cardinality of the document space.

M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector (unit vector) is obtained as v / ||v||_p, where v is the vector to be normalized and ||v||_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ||v|| = √(v1² + v2² + v3² + ... + vn²)

With norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

normalized v(d1) = v(d1) / ||v(d1)||_2
                 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
                 = (0, 0, 1, 2, 1, 1, 1) / √8
                 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
                 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v(d1) has an L2 norm of 1.0.
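The same normalization can be checked with NumPy (a rough sketch, not from the original slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm, sqrt(8)
print(v_norm)                                 # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print(np.linalg.norm(v_norm, ord=2))          # 1.0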

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
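A small sketch reproducing these idf values (assuming docs holds the stop-word-filtered token lists of d1 and d2 from the earlier slides):

import math

vocabulary = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)      # |{d : t in d}|
    return math.log(len(docs) / float(1 + df))

for term in vocabulary:
    print(term + ' ' + str(idf(term, docs)))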

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf =

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

×

[ idf(t1)     0        0        0        0        0        0     ]
[    0     idf(t2)     0        0        0        0        0     ]
[    0        0     idf(t3)     0        0        0        0     ]
[    0        0        0     idf(t4)     0        0        0     ]
[    0        0        0        0     idf(t5)     0        0     ]
[    0        0        0        0        0     idf(t6)     0     ]
[    0        0        0        0        0        0     idf(t7)  ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
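As a rough sketch with NumPy (the count and idf vectors are hard-coded from the earlier slides), the weight matrix is the count matrix multiplied by the diagonal idf matrix:

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])

tf_idf = np.dot(tf, np.diag(idf))   # multiply by the diagonal idf matrix
print(tf_idf)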

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
# note: newer scikit-learn releases use ngram_range instead of min_n
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
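A quick usage sketch (assuming the ngrams function from the previous slide is in scope; the sentence is made up):

words = "she likes strong tea and he likes strong coffee".split()
mi = mutual_info(words)
print(mi['strong tea'])   # log2(N * C(strong, tea) / (C(strong) * C(tea)))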

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or classifier function, γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0                2     1     3       0      0     1   Terrorism
D1                1     1     1       0      0     0   Terrorism
D2                1     1     2       0      1     0   Terrorism
D3                0     1     0       2      1     1   Entertainment
D4                0     0     1       1      1     0   Entertainment
D5                0     0     0       2      2     2   Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15   0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12   0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt            2     1     2       0      0     1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 ×
                 P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 ×
                     P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.
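As a rough sketch (not part of the original slides), the worked example above can be reproduced in a few lines of Python; the smoothed conditional probabilities are copied from the model table:

p_terrorism = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
               'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entertainment = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                   'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    score = prior
    for word, count in doc.items():
        score *= cond_probs[word] ** count
    return score

print(posterior(0.5, p_terrorism, test_doc))      # about 5.7e-05
print(posterior(0.5, p_entertainment, test_doc))  # about 2.3e-07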

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
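A self-contained sketch of the same log-space computation (the numbers are taken from the slides above):

import math

log_p_terrorism = (math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904)
                   + 2 * math.log10(0.3333) + math.log10(0.0952))
log_p_entertainment = (math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111)
                       + 2 * math.log10(0.1111) + math.log10(0.1111))
print(log_p_terrorism)       # about -4.24
print(log_p_entertainment)   # about -6.63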

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
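In scikit-learn this presence/absence setup corresponds to BernoulliNB; a rough sketch (not from the original slides) of producing 0/1 features with CountVectorizer(binary=True):

from sklearn.feature_extraction.text import CountVectorizer

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
vectorizer = CountVectorizer(binary=True, stop_words='english')  # presence/absence instead of counts
print(vectorizer.fit_transform(docs).toarray())

The resulting matrix can be fed to sklearn.naive_bayes.BernoulliNB in the same way raw counts are fed to MultinomialNB.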

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, either (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
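A toy sketch of the linear decision function f(x) = ⟨w, x⟩ + b; the weight vector and bias here are hand-picked for illustration, not learned:

import numpy as np

w = np.array([1.0, -1.0])   # weight vector
b = 0.5                     # bias

def predict(x):
    # the sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([2.0, 0.5])))   # +1
print(predict(np.array([0.0, 3.0])))   # -1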

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                           Actual
                     Positive              Negative
Predicted  Positive  True Positive (TP)    False Positive (FP)
           Negative  False Negative (FN)   True Negative (TN)
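If scikit-learn is available, the same table can be produced from lists of actual and predicted labels with sklearn.metrics.confusion_matrix (a sketch with made-up labels; note that scikit-learn puts actual classes on the rows and predicted classes on the columns):

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))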

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
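All of the measures above can be recomputed from the four cells of the example confusion matrix; a minimal sketch:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # ~0.76
precision   = TP / (TP + FP)                                   # ~0.87
recall      = TP / (TP + FN)                                   # ~0.71
f1          = 2 * precision * recall / (precision + recall)    # ~0.78
fallout     = FP / (FP + TN)                                   # ~0.15
specificity = TN / (FP + TN)                                   # ~0.84
npv         = TN / (TN + FN)                                   # ~0.65
fdr         = FP / (FP + TP)                                   # ~0.12
mcc         = (TP * TN - FP * FN) / math.sqrt(
                  (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # ~0.55

print(accuracy, precision, recall, f1, mcc)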

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
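A rough sketch (assuming scikit-learn; the labels and scores below are made up): roc_curve takes the true binary labels and a score or probability for the positive class, and auc integrates the resulting curve:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))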

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, also called the Euclidean norm.a

It is the most common norm used to measure the length of a vector, with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}(d_1) = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}(d_1) = \frac{\vec{v}(d_1)}{\|\vec{v}(d_1)\|_2} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}(d_1) = \left(\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}\right)$

$\hat{v}(d_1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}(d_1)$ has L2-norm $\|\hat{v}(d_1)\|_2 = 1.0$.

Jaganadh G Elements of Text Mining
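
The same normalization is a one-liner with numpy; this is just an illustrative check, not part of the original slides:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_d1_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm
print(v_d1_hat)                   # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print(np.linalg.norm(v_d1_hat))   # 1.0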

IDF

IDF

$\mathrm{idf}(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present, i.e. where $\mathrm{tf}(t, d) \neq 0$.

Jaganadh G Elements of Text Mining

Finding IDF

$\mathrm{idf}(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

$\mathrm{idf}(\text{football}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{games}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{john}) = \log \frac{2}{1+2} = -0.40546510810816444$
$\mathrm{idf}(\text{likes}) = \log \frac{2}{1+2} = -0.40546510810816444$
$\mathrm{idf}(\text{mary}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{movies}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{watch}) = \log \frac{2}{1+1} = 0.0$

$\mathrm{idf}(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)$

Jaganadh G Elements of Text Mining
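
These idf values are easy to reproduce; the short sketch below hard-codes the two stop-word-filtered token lists (an assumption, just for illustration) and applies the same 1 + |{d : t in d}| smoothing:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / float(1 + df))

for term in vocab:
    print('%s => %f' % (term, idf(term, docs)))
# 'john' and 'likes' get log(2/3) = -0.4054...; the rest get log(2/2) = 0.0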

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{\mathrm{idf}}$

$\begin{bmatrix} \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \cdots & \mathrm{tf}(t_7, d_1) \\ \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \cdots & \mathrm{tf}(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} \mathrm{idf}(t_1) & 0 & \cdots & 0 \\ 0 & \mathrm{idf}(t_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathrm{idf}(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining


TF-IDF weight

$\begin{bmatrix} \mathrm{tf}(t_1, d_1)\,\mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1)\,\mathrm{idf}(t_2) & \cdots & \mathrm{tf}(t_7, d_1)\,\mathrm{idf}(t_7) \\ \mathrm{tf}(t_1, d_2)\,\mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2)\,\mathrm{idf}(t_2) & \cdots & \mathrm{tf}(t_7, d_2)\,\mathrm{idf}(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{\mathrm{tfidf}} = \frac{M_{\mathrm{tfidf}}}{\|M_{\mathrm{tfidf}}\|_2}$

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')

docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')

bow = vectorizer.fit_transform(docs)

freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")

tfd = tfidf.fit(freq_term_matrix)

print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)

print tf_idf_matrix.todense()

for w, f in vectorizer.vocabulary_.items():   # word -> column index
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
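
Current scikit-learn also offers TfidfVectorizer, which folds the counting and tf-idf steps of the snippet above into a single object; a hedged, present-day sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                             stop_words='english', norm='l2')
tf_idf_matrix = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)
print(tf_idf_matrix.todense())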

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$I(\text{strong}, \text{tea}) = \log_2 \frac{P(\text{strong}, \text{tea})}{P(\text{strong})\,P(\text{tea})} \approx \log_2 \frac{N\,C(\text{strong}, \text{tea})}{C(\text{strong})\,C(\text{tea})}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo

Jaganadh G Elements of Text Mining
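
A possible way to call it (assuming the ngrams helper from the earlier slide is in scope; the sample sentence is made up):

words = "strong tea is strong and strong tea is good".lower().split()
scores = mutual_info(words)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print('%s => %.3f' % (gram, score))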

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{\mathrm{mean}(P(w_i, w_j)) - \mathrm{mean}(P(w_i))\,\mathrm{mean}(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$t(\text{strong}, \text{tea}) = \frac{C(\text{strong}, \text{tea}) - \frac{1}{N}\,C(\text{strong})\,C(\text{tea})}{\sqrt{C(\text{strong}, \text{tea})}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
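
NLTK ships ready-made association measures, including PMI and Student's t, so the hand-rolled functions above can be cross-checked against it; a small sketch, assuming NLTK is installed:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bigram_measures = BigramAssocMeasures()
words = "strong tea is strong and strong tea is good".lower().split()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.pmi, 3))        # top bigrams by PMI
print(finder.nbest(bigram_measures.student_t, 3))  # top bigrams by t-score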

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

$P(\text{raining} \mid \text{dark cloud}) = \frac{P(\text{dark cloud} \mid \text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$

For multiple evidences:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15   0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12   0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt        2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final scores you can observe that the values are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
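
The whole worked example can be wrapped up in a few lines of Python. The sketch below is illustrative only: it hard-codes the toy count matrix from the slides, applies the same Laplace smoothing, and scores the test document in log space to avoid underflow (the per-class totals are recomputed from the matrix, so individual probabilities may differ slightly from the slide figures, but the winning class is the same: Terrorism).

import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
test_doc = [2, 1, 2, 0, 0, 1]

def train_nb(train):
    priors, cond = {}, {}
    total_docs = sum(len(docs) for docs in train.values())
    for label, docs in train.items():
        priors[label] = len(docs) / float(total_docs)
        word_totals = [sum(col) for col in zip(*docs)]
        n = sum(word_totals)
        # Laplace smoothing: (count + 1) / (n + |V|)
        cond[label] = [(c + 1) / float(n + len(vocab)) for c in word_totals]
    return priors, cond

def classify(doc, priors, cond):
    scores = {}
    for label in priors:
        score = math.log(priors[label])          # log space prevents underflow
        for count, p in zip(doc, cond[label]):
            score += count * math.log(p)
        scores[label] = score
    return scores

priors, cond = train_nb(train)
print(classify(test_doc, priors, cond))   # 'Terrorism' gets the higher (less negative) score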

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining
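
scikit-learn implements both event models as BernoulliNB and MultinomialNB; a small, purely illustrative sketch on the toy matrix from the slides:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# rows D0..D5, columns [kill, bomb, kidnap, music, movie, tv]
X = [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
     [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
test = [[2, 1, 2, 0, 0, 1]]

print(MultinomialNB().fit(X, y).predict(test))            # multinomial model: uses the counts
print(BernoulliNB(binarize=0.5).fit(X, y).predict(test))  # Bernoulli model: presence/absence only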

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
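
After fitting a linear SVM, scikit-learn exposes exactly this w, b and decision function; the 2-D points below are invented purely to illustrate:

from sklearn.svm import LinearSVC

X = [[2.0, 1.0], [3.0, 2.5], [2.5, 3.0],        # class +1
     [-2.0, -1.0], [-3.0, -2.5], [-1.5, -2.0]]  # class -1
y = [1, 1, 1, -1, -1, -1]

clf = LinearSVC().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]        # weight vector and bias of the hyperplane
print(w)
print(b)
print(clf.decision_function([[1.0, 1.0]]))    # <w, x> + b; the sign picks the side
print(clf.predict([[1.0, 1.0]]))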

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography
11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive               Negative
Predicted   Positive    True Positive (TP)     False Positive (FP)
            Negative    False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining
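
In practice the table is rarely filled in by hand; scikit-learn's metrics module computes it (and the derived scores below) from gold and predicted labels. A tiny sketch with made-up labels:

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

y_true = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']   # gold labels (toy data)
y_pred = ['pos', 'neg', 'neg', 'pos', 'neg', 'pos']   # predictions from some classifier

print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))   # precision, recall and F1 per class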

Evaluating Performance of a Classifier

Accuracy of a Classifier

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$\mathrm{Precision} = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$\mathrm{Recall} = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$\mathrm{Error\ Rate} = \frac{FP + FN}{TP + FP + FN + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$\mathrm{Fall\text{-}out} = \frac{FP}{FP + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$\mathrm{PPV} = \frac{TP}{TP + FP}$

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$\mathrm{NPV} = \frac{TN}{TN + FN}$

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$\mathrm{Specificity} = \frac{TN}{FP + TN}$

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$\mathrm{FDR} = \frac{FP}{FP + TP}$

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

MCC = (562 × 436 - 77 × 225) / √((562+77)(562+225)(436+77)(436+225)) = 0.55

Jaganadh G Elements of Text Mining
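
A two-line check of that arithmetic in plain Python (just to verify the 0.55):

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
print(round(mcc, 2))   # 0.55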

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
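
scikit-learn can compute both the curve and the area from the true labels and the classifier's scores; the labels and scores below are toy values for illustration only:

from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]   # predicted scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                   # area under the ROC curve
print(roc_auc_score(y_true, y_score))  # the same value, computed directly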

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining



Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = [ mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ] / sqrt( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ [ C(w_i, w_j) - (1/N) C(w_i) C(w_j) ] / sqrt( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = [ C(strong, tea) - (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
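A quick usage sketch for the two association measures above. It assumes the ngrams, mutual_info and tscore functions from the previous slides are already defined, and the toy sentence is made up purely for illustration; it simply ranks the bigrams of the text by mutual information and prints the t-score next to each.

words = "strong tea is better than strong coffee".lower().split()
mi = mutual_info(words)
ts = tscore(words)
# rank bigrams by mutual information, highest first
for gram, score in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
    print gram, score, ts[gram]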

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier, or classifier function, γ that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = [ P(E|H) × P(H) ] / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be
predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the
evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the
evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = [ P(dark cloud | raining) × P(raining) ] / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = [ P(E1, E2, ..., En | H) × P(H) ] / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = [ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ] / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619   0.333333333   0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a
uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
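The Laplace-smoothed estimates above are easy to reproduce programmatically. This is only a minimal sanity-check sketch, assuming the counts of 'kill', the class totals (15 and 12) and the vocabulary size 6 from the worked example; it is not a full classifier.

from __future__ import division

V = 6                               # vocabulary size
n_terror, n_entertain = 15, 12      # total word frequency per class, as on the slide

# Laplace (add-one) smoothing: (count + 1) / (n_class + |V|)
p_kill_terror = (2 + 1 + 1 + 1) / (n_terror + V)        # 'kill' counts in D0, D1, D2, plus 1
p_kill_entertain = (0 + 0 + 0 + 1) / (n_entertain + V)  # 'kill' never occurs in D3, D4, D5

print p_kill_terror       # 0.2380...
print p_kill_entertain    # 0.0555...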

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1      2       0      0     1  ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) ×
P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) ×
P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will
train the classifier with thousands of documents. In such cases the conditional probability values
will be too low for the CPU to handle. This problem is called Underflow. To resolve the
problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism".
From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes
Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
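A compact sketch of the log-space trick, assuming the smoothed conditional probabilities and the test-document word counts from the slides above (all values copied from the worked example). Taking logarithms turns the long product into a sum, which avoids underflow.

import math

# smoothed P(w|class) from the model table, and the test document word counts
p_terror = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
            'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entertain = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
               'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}
prior = 0.5

def log_posterior(cond_probs):
    # log P(c) + sum over words of count * log P(w|c)
    return math.log10(prior) + sum(n * math.log10(cond_probs[w])
                                   for w, n in test_doc.items())

print log_posterior(p_terror)      # about -4.24
print log_posterior(p_entertain)   # about -6.63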

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt
from the vocabulary. Dimension t of the vector for document di is written Bit, and is
either 0 or 1, indicating whether word wt occurs at least once in the document. With
such a document representation we make the naive Bayes assumption that the
probability of each word occurring in a document is independent of the occurrence of
other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and
Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
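For completeness, a minimal sketch of the Bernoulli setup with scikit-learn's BernoulliNB on two made-up toy documents (not the author's pipeline). With the default binarize=0.0, any count greater than zero becomes a 1, which matches the presence/absence table above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["kill bomb kidnap tv", "music movie tv music"]   # toy documents
labels = [0, 1]                                          # 0 = Terrorism, 1 = Entertainment

vectorizer = CountVectorizer(analyzer="word")
X = vectorizer.fit_transform(docs)

clf = BernoulliNB(binarize=0.0)   # counts > 0 are mapped to 1 (presence/absence)
clf.fit(X, labels)
print clf.predict(vectorizer.transform(["bomb kidnap"]))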

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency
information in documents. In the multinomial model a document is an ordered sequence of word events,
drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the
probability of each word event in a document is independent of the word's context and position in the
document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of
words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training
set, attempts to generalize, and makes correct predictions on new data. Consider a
training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the
corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our
input feature vectors are linearly separable, that is, there exists a function
f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can
separate positive and negative examples, but all of them are not equal. SVM tries to
find the particular hyperplane that maximizes the margin. The vectors closest to
the maximum margin hyperplane are called support vectors. If the data is not linearly separable we
have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
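A tiny illustrative sketch of the decision rule above (not the training procedure): given an already-learned weight vector w and bias b, a new point is classified by the sign of ⟨w, x⟩ + b. The numbers here are made up purely for illustration.

import numpy as np

w = np.array([0.4, -0.7])   # hypothetical learned weight vector
b = 0.1                     # hypothetical learned bias

def predict(x):
    # sign of the decision function <w, x> + b
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.2]))   # falls on the +1 side of the hyperplane
print predict(np.array([0.0, 1.0]))   # falls on the -1 side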

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are
mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that
it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance
from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything
on one side of the separating hyperplane belongs to one category and everything on the other side
belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter
(square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin.
The figures closest to the separating hyperplane are known as support vectors.10 If the data is not
linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to
http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more
detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of
an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A
confusion matrix for the classifier will look like:

                             Actual
                     Positive              Negative
Predicted  Positive  True Positive (TP)    False Positive (FP)
           Negative  False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to
"hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent
to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is
the number of correct results divided by the number of all returned results, and r is the
number of correct results divided by the number of results that should have been
returned. The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining
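The numbers above are simple enough to verify in a few lines of Python. This is just a sanity-check sketch using the same confusion matrix counts (TP=562, FP=77, FN=225, TN=436) as the slides; small differences from the printed values are only rounding.

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)          # positive precision
recall = TP / (TP + FN)             # positive recall
f1 = 2 * precision * recall / (precision + recall)
error_rate = (FP + FN) / (TP + FP + FN + TN)
fallout = FP / (FP + TN)

# accuracy, positive precision, positive recall, F1, error rate, fall-out
print accuracy, precision, recall, f1, error_rate, fallout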

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that
are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance
of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis
testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure
of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
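A small illustrative sketch of plotting an ROC curve and computing AUC with scikit-learn. The true labels and predicted scores below are made-up toy values, not output from the classifiers built earlier.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # toy ground-truth labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # toy predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], ls='dotted')   # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()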

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13:

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I,
http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II,
http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational
Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification,
http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification,
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text
Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on
March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing
Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information
Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 7: Elements of Text Mining Part - I

Counting Words

Word Count - Python

def word_count(text):
    words = tokenize(text)  # tokenize() from the earlier slide
    word_freq = dict([(word, words.count(word)) for word
        in set(words)])
    return word_freq

text = ("How can this be implemented? There are a lot of "
    "subtleties, such as dot being used in abbreviations.")
wc = word_count(text)
for word, count in wc.items():
    print word, "\t\t", count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = ("How can this be implemented? There are a lot of "
    "subtleties, such as dot being used in abbreviations.")
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), simply the total number of words.

So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = ("How can this be implemented? There are a lot of "
    "subtleties, such as dot being used in abbreviations.")
wp = word_propo(text)
for word, propo in wp.items():
    print word, "\t\t", propo

Jaganadh G Elements of Text Mining


Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 15 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the
count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = ("I shot an elephant in my pajamas. He saw the fine "
    "fat trout in the brook.")
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining


Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words on the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it
aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total
number of unique words in a corpus. A cell mij contains the number of times mi
co-occurs with word mj within a specific context, a natural unit such as a sentence or a
certain window of m words. Note that the upper and lower triangles of the matrix are
identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in
Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

       w1    w2    w3   ...  wn
w1     m11   m12   m13  ...  m1n
w2     m21   m22   m23  ...  m2n
w3     m31   m32   m33  ...  m3n
...
wn     mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence


Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four
associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally, stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words.
Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language
processing and information retrieval (IR). In this model a text (such as a sentence or a
document) is represented as an un-ordered collection of words, disregarding grammar
and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating a Bag of Words with Python and sklearn 5:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining

Bag of Words

Creating a Bag of Words with plain Python. Just for a sample :-)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects
how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present (i.e. tf(t, d) ≠ 0),
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
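A direct, minimal translation of the tf definition above into Python; it assumes documents are given as plain token lists, and is only meant to mirror the formula.

def fr(x, t):
    # indicator function: 1 if token x equals term t, else 0
    return 1 if x == t else 0

def tf(t, d):
    # term frequency: sum the indicator over every token in the document
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2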

Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_|D|×F,
where |D| is the cardinality of the document space:

M_|D|×F = [ 0 0 1 2 1 1 1
            1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_norm = v / ||v||_p, where v_norm is the unit vector (the
normalized vector), v is the vector to be normalized, and ||v||_p is the norm
(magnitude or length) of the vector v in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1_normalized = v_d1 / ||v_d1||_2
= (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
= (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
= (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1_normalized has an L2-norm of 1.0.

Jaganadh G Elements of Text Mining
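The same normalization can be checked with numpy; a small sketch, assuming the document vector from the slide above.

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)

l2 = np.sqrt(np.sum(v_d1 ** 2))   # equivalently: np.linalg.norm(v_d1)
v_norm = v_d1 / l2

print v_norm                      # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print np.linalg.norm(v_norm)      # 1.0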

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present (i.e. tf(t, d) ≠ 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
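These values are easy to reproduce in Python; a small sketch assuming the two-document toy corpus and the idf definition used on this slide (natural logarithm, with 1 added to the document frequency).

from __future__ import division
import math

docs = [
    "john likes to watch movies mary likes too".split(),
    "john also likes to watch football games".split(),
]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1 + df))

for t in vocab:
    print t, idf(t, docs)
# 'john' and 'likes' appear in both documents, so their idf is log(2/3) = -0.4054...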

TF-IDF weight

Finding the TF-IDF weight:

M_|D|×F × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining
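The matrix product above, written out with numpy for the toy tf vectors and the idf values computed earlier; a sketch for illustration only.

import numpy as np

M = np.array([[0, 0, 1, 2, 1, 1, 1],     # tf vector of d1
              [1, 1, 1, 1, 0, 0, 1]],    # tf vector of d2
             dtype=float)
idf = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])

M_idf = np.diag(idf)        # idf values on the diagonal
tfidf = np.dot(M, M_idf)    # M_|D|xF x M_idf

print tfidf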

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
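The same log-space trick is easy to sketch in Python. The probabilities and the test-document counts below are the smoothed values from the slides above; math.log10 is used so the numbers match the base-10 logs of the worked example.

import math

test_counts = [2, 1, 2, 0, 0, 1]    # kill bomb kidnap music movie tv
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertain = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]

def log_score(prior, cond_probs, counts):
    # log P(c) + sum over words of count * log P(w|c)
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(cond_probs, counts))

print log_score(0.5, p_terrorism, test_counts)   # ~ -4.24
print log_score(0.5, p_entertain, test_counts)   # ~ -6.63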

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
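The switch from frequencies to presence/absence is a one-line transformation. A minimal sketch (the rows below are the Terrorism rows copied from the frequency table above, as an assumption):

freq_rows = {'D0': [2, 1, 3, 0, 0, 1],
             'D1': [1, 1, 1, 0, 0, 0],
             'D2': [1, 1, 2, 0, 1, 0]}

# replace each count with 1 if the word occurs at least once, else 0
binary_rows = dict((doc, [1 if c > 0 else 0 for c in row])
    for doc, row in freq_rows.items())

print binary_rows['D0']   # [1, 1, 1, 0, 0, 1]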

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

<w, xi> + b > 0 for yi = +1
<w, xi> + b < 0 for yi = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
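A toy sketch of the linear decision rule f(x) = <w, x> + b with NumPy; the weight vector and bias here are made-up numbers, not a trained model, just to show how the sign of the decision function assigns the label.

import numpy as np

w = np.array([0.4, -0.2, 0.7])   # assumed weight vector
b = -0.1                         # assumed bias

def predict(x):
    # the sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.0, 1.0]))   # +1 side of the hyperplane
print predict(np.array([0.0, 2.0, 0.0]))   # -1 side of the hyperplane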

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK.12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive             Negative
Predicted  Positive    True Positive (TP)   False Positive (FP)
           Negative    False Negative (FN)  True Negative (TN)
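If you already have predicted and actual labels, the matrix can be built with sklearn. A small sketch (the two label lists are made up for illustration); note that sklearn places actual classes on the rows and predicted classes on the columns, which is the transpose of the layout shown above.

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

print confusion_matrix(actual, predicted, labels=['pos', 'neg'])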

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

The false discovery rate (FDR) is the proportion of positive predictions that are false positives. FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
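All of the evaluation measures in this section can be reproduced from the four confusion-matrix cells. A minimal sketch using the counts from the running example (TP=562, FP=77, FN=225, TN=436):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # 0.76
error_rate  = (FP + FN) / (TP + FP + FN + TN)                  # 0.23
precision   = TP / (TP + FP)                                   # 0.87
recall      = TP / (TP + FN)                                   # 0.71
fall_out    = FP / (FP + TN)                                   # 0.15
f1          = 2 * precision * recall / (precision + recall)    # 0.78
npv         = TN / (TN + FN)                                   # 0.65
specificity = TN / (FP + TN)                                   # 0.84
fdr         = FP / (FP + TP)                                   # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))             # 0.55

print accuracy, precision, recall, f1, mcc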

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
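ROC and AUC need a score (for example a class probability) per test example rather than a hard label. A minimal sketch with sklearn; the labels and scores below are made-up illustrative values.

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under that curve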

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 8: Elements of Text Mining Part - I

Finding Word Length

Word Length

def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), simply the total number of words.

So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, "\t\t", propo

Jaganadh G Elements of Text Mining


Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining


Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    npoints = len(tfw)   # renamed so it does not shadow the numpy alias
    y = []
    for item in range(npoints):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, npoints + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot shows the words on the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    npoints = len(tfw)   # renamed so it does not shadow the numpy alias
    y = []
    for item in range(npoints):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word wi co-occurs with word wj within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1   w2   w3   ...  wn
w1    m11  m12  m13  ...  m1n
w2    m21  m22  m23  ...  m2n
w3    m31  m32  m33  ...  m3n
...
wn    mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, the associated words are again plotted.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary indices, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python. Just for sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) = how many times the term 't' is present in 'd':

tf(t, d) = Σ x∈d fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0, and
|D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ x∈d fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M|D|×F, where |D| is the cardinality of the document space:

M|D|×F =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_norm = v / ||v||p, where v_norm is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be written compactly as:

||v||p = ( Σ i=1..n |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm. a
It is the common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v_norm(d1) = v(d1) / ||v(d1)||2
= (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
= (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
= (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_norm(d1) has an L2 norm ||v_norm(d1)||2 = 1.0
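The same normalization can be reproduced with NumPy, as a quick check of the numbers above:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1)   # divide by the Euclidean (L2) norm

print v_norm                  # [ 0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535 ]
print np.linalg.norm(v_norm)  # 1.0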

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); for example, log(2/1) = 0.69314718

idf(football) = log(2/(1+1)) = 0.0
idf(games) = log(2/(1+1)) = 0.0
idf(john) = log(2/(1+2)) = -0.40546510810816444
idf(likes) = log(2/(1+2)) = -0.40546510810816444
idf(mary) = log(2/(1+1)) = 0.0
idf(movies) = log(2/(1+1)) = 0.0
idf(watch) = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
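The toy idf values can be verified with a few lines of Python. This sketch assumes the two example documents and the vocabulary order used above; math.log is the natural logarithm, matching the -0.4054... figures.

import math

docs = [['john', 'likes', 'to', 'watch', 'movies', 'mary', 'likes', 'too'],
    ['john', 'also', 'likes', 'to', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1.0 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]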

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|×F × Midf

where M|D|×F is the term-frequency matrix

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

and Midf is the diagonal matrix diag(idf(t1), idf(t2), ..., idf(t7)):

[ idf(t1) 0       0       0       0       0       0
  0       idf(t2) 0       0       0       0       0
  0       0       idf(t3) 0       0       0       0
  0       0       0       idf(t4) 0       0       0
  0       0       0       0       idf(t5) 0       0
  0       0       0       0       0       idf(t6) 0
  0       0       0       0       0       0       idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = (mean(P(wi, wj)) - mean(P(wi)) mean(P(wj))) / sqrt(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
≈ (C(wi, wj) - (1/N) C(wi) C(wj)) / sqrt(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

                         Actual
                         Positive                Negative
Predicted   Positive     True Positive (TP)      False Positive (FP)
            Negative     False Negative (FN)     True Negative (TN)
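As a minimal sketch (assuming scikit-learn is available; the labels below are toy values), such a table can be produced from gold and predicted label vectors with sklearn.metrics.confusion_matrix. Note that scikit-learn prints actual classes as rows and predicted classes as columns, the transpose of the layout above:

from sklearn.metrics import confusion_matrix

# toy gold labels and predictions; 1 = Positive, 0 = Negative
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print confusion_matrix(y_true, y_pred, labels=[1, 0])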

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
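A quick check in Python (not from the original slides; the counts are those in the table above, and the slides truncate results to two decimal places):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print (TP + TN) / (TP + FP + FN + TN)   # 0.7676...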

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
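The same per-class figures can be reproduced with a few lines of Python (a quick check, not part of the original slides):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print 'positive precision', TP / (TP + FP)   # 0.8795
print 'negative precision', TN / (TN + FN)   # 0.6596
print 'positive recall   ', TP / (TP + FN)   # 0.7141
print 'negative recall   ', TN / (TN + FP)   # 0.8499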

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that are wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
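A quick check of the two F1 values using the precision and recall figures from the earlier slides (not part of the original deck):

p_pos, r_pos = 0.87, 0.71
p_neg, r_neg = 0.65, 0.84
print 2 * p_pos * r_pos / (p_pos + r_pos)   # ~0.78
print 2 * p_neg * r_neg / (p_neg + r_neg)   # ~0.73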

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, the FDR is the proportion of predicted positives that are actually negative.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
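Error rate, fall-out, NPV, specificity and FDR are all simple ratios over the same confusion matrix, so one small sketch (not from the original slides) covers them:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print 'error rate ', (FP + FN) / (TP + FP + FN + TN)   # 0.2323
print 'fall-out   ', FP / (FP + TN)                    # 0.1501
print 'NPV        ', TN / (TN + FN)                    # 0.6596
print 'specificity', TN / (FP + TN)                    # 0.8499
print 'FDR        ', FP / (FP + TP)                    # 0.1205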

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
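The MCC value can be verified directly from the counts (a quick check, not part of the original slides):

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc   # ~0.55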

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
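As a minimal sketch of how such a curve can be drawn (assuming scikit-learn and matplotlib are available; y_test and scores below are toy values standing in for gold labels and classifier scores such as predict_proba()[:, 1] or decision_function() output):

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# toy gold labels and scores for the positive class
y_test = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]
fpr, tpr, thresholds = roc_curve(y_test, scores)
plt.plot(fpr, tpr, 'bo-', ls='dotted')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()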

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
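A minimal sketch of computing the AUC from the same toy labels and scores used in the ROC example above (sklearn.metrics.auc integrates the curve; newer scikit-learn releases also offer roc_auc_score):

from sklearn.metrics import roc_curve, auc

y_test = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]
fpr, tpr, _ = roc_curve(y_test, scores)
print auc(fpr, tpr)   # 0.875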

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Page 10: Elements of Text Mining Part - I

Word Proportion

Word Proportion

Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words

so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C

Finding Word Proportion

from __future__ import division

def word_propo(text):
    # proportion of each distinct word: its frequency / total number of tokens
    # tokenize() and word_count() are defined in earlier slides
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, '\t', propo

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words are the valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text):
    # ratio of total tokens to distinct tokens (types)
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    # sort the word-frequency dictionary by count and keep the n most frequent
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print '%s\t%d' % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    # plot the frequency of the top 20 words against their rank
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    # plot top 50 word frequencies and label each point with the word itself
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    # bar chart (histogram) of the top 50 word frequencies
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    # x = offset of each matching token, y = index of the matched target word
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    # build tags from the first 100 tokens and render them to a PNG image
    words = tokenize(text)
    doc = ' '.join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word mi co-occurs with word mj within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

       w1    w2    w3   ...   wn
w1    m11   m12   m13   ...  m1n
w2    m21   m22   m23   ...  m2n
w3    m31   m32   m33   ...  m3n
...
wn    mn1   mn2   mn3   ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    # symmetric co-occurrence counts for every word pair inside each document
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure] Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its own associated words are plotted again.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    # drop common English function words from a token list
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary (after removing stop words):
{'football': 0, 'games': 1, 'john': 2, 'likes': 3, 'mary': 4, 'movies': 5, 'watch': 6}

Document vectors:
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

        football  games  john  likes  mary  movies  watch
doc1        0       0      1     2      1      1      1
doc2        1       1      1     1      0      0      1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    # vocabulary = union of all tokens, minus stop words and single characters
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) x idf(t)

where 't' is a term in document 'd',
and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = sum over x in d of fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present and tf(t, d) != 0,
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = sum over x in d of fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example:
tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| x F},
where |D| is the cardinality of the document space:

M_{|D| x F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v^ = v / ||v||_p, where v^ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( sum over i = 1..n of |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v^(d1) = v(d1) / ||v(d1)||_2
       = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
       = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
       = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
       = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v^(d1) has an L2 norm ||v^(d1)||_2 = 1.0
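A quick numeric check of the normalization above can be done with NumPy (a small sketch, assuming numpy is available):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm, sqrt(8)
print v_norm                           # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print np.linalg.norm(v_norm)           # 1.0, as expected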

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present and tf(t, d) != 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti in d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
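The numbers above are easy to reproduce in a couple of lines of Python (a minimal sketch using the same two-document corpus):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

for t in vocab:
    df = sum(1 for d in docs if t in d)          # document frequency of t
    print t, math.log(len(docs) / (1.0 + df))    # idf as defined on the slide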

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} x M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
x
[ idf(t1)      0        0        0        0        0        0
     0      idf(t2)     0        0        0        0        0
     0         0     idf(t3)     0        0        0        0
     0         0        0     idf(t4)     0        0        0
     0         0        0        0     idf(t5)     0        0
     0         0        0        0        0     idf(t6)     0
     0         0        0        0        0        0     idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1) x idf(t1)  tf(t2,d1) x idf(t2)  tf(t3,d1) x idf(t3)  tf(t4,d1) x idf(t4)  tf(t5,d1) x idf(t5)  tf(t6,d1) x idf(t6)  tf(t7,d1) x idf(t7)
  tf(t1,d2) x idf(t1)  tf(t2,d2) x idf(t2)  tf(t3,d2) x idf(t3)  tf(t4,d2) x idf(t4)  tf(t5,d2) x idf(t5)  tf(t6,d2) x idf(t6)  tf(t7,d2) x idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ~ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ~ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    # all contiguous n-token sequences, joined back into strings
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ~ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ~ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    # frequency of each word and of each bigram
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( var(P(wi, wj)) + var(P(wi)) var(P(wj)) )
          ~ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents <d, c>, where <d, c> belongs to D x C, using a learning algorithm we wish to learn a classifier (or classifier function) gamma that maps documents to classes: gamma : D -> C.
A supervised learning algorithm Gamma takes a training set T and emits a learned classification function gamma: Gamma(T) = gamma

gamma(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) x P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the a priori (prior) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) x P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) x P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) x P(E2|H) x ... x P(En|H) x P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will be:

|V| = 6

C   P(Ci)  ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5    15   0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E   0.5    12   0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1      2       0      0     1   ?

To find the posterior probability:

P(ci | W) = P(ci) x product over j = 1..|V| of P(wj | ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism | W) = P(Terrorism) x P(kill|Terrorism)^2 x P(bomb|Terrorism)^1 x P(kidnap|Terrorism)^2 x P(music|Terrorism)^0 x P(movie|Terrorism)^0 x P(tv|Terrorism)^1
= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5

P(Entertainment | W) = P(Entertainment) x P(kill|Entertainment)^2 x P(bomb|Entertainment)^1 x P(kidnap|Entertainment)^2 x P(music|Entertainment)^0 x P(movie|Entertainment)^0 x P(tv|Entertainment)^1
= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1
= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism | W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment | W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
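The same log-space computation can be written in a few lines of Python; this is only a sketch that hard-codes the smoothed probabilities worked out above:

import math

# smoothed P(word | class) in the order (kill, bomb, kidnap, music, movie, tv)
p_terror = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertain = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_counts = [2, 1, 2, 0, 0, 1]   # word counts of the test document

def log_posterior(prior, probs, counts):
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(probs, counts))

print log_posterior(0.5, p_terror, test_counts)      # about -4.24
print log_posterior(0.5, p_entertain, test_counts)   # about -6.63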

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t in {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
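If you want to try this event model in code, scikit-learn ships a Bernoulli Naive Bayes estimator. The sketch below feeds it the binarized toy table above; the class encoding (1 = Terrorism, 0 = Entertainment) is my own choice, not from the slides.

from sklearn.naive_bayes import BernoulliNB

# rows follow the table above: (kill, bomb, kidnap, music, movie, tv)
X = [[1, 1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 1]]
y = [1, 1, 1, 0, 0, 0]   # 1 = Terrorism, 0 = Entertainment

clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])   # expected: [1] (Terrorism)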

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi is in R^p (input feature vector) and yi in {+1, -1} is the corresponding label. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w in R^p (weight vector) and b in R (bias), such that:

<w, xi> + b > 0 for yi = +1
<w, xi> + b < 0 for yi = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # binary (presence/absence) features, one per word
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                 Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
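scikit-learn can build the confusion matrix and the accuracy directly from predicted and true labels; the snippet below is a small sketch where y_true and y_pred are toy placeholders for your own evaluation data:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predictions

cm = confusion_matrix(y_true, y_pred)
print cm
# accuracy = (TP + TN) / total
print float(cm[1][1] + cm[0][0]) / cm.sum()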

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 x (precision x recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP x TN - FP x FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 x 436 - 77 x 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
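The arithmetic is easy to check in Python (plain math, no extra libraries):

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print round(mcc, 2)   # 0.55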

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
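Plotting an ROC curve with scikit-learn and matplotlib takes only a few lines; the scores below are toy values standing in for a real classifier's decision scores:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # toy ground truth
y_score = [0.9, 0.8, 0.35, 0.6, 0.4, 0.7, 0.3, 0.1]  # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, 'go-', label='ROC (AUC = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'b--')      # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()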

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 11: Elements of Text Mining Part - I

Words Types and Ratio

Words and Types

Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. [a]

Analyzing text by looking only at the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. [a]

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')

bow = vectorizer.fit_transform(docs)

print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python. Just for sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from an earlier slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\text{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd', and $tf(t, d)$ is how many times the term 't' is present in 'd':

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the count of documents in which 't' is present (i.e. $tf(t, d) \neq 0$), and $|D|$ is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple function:

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Example: $tf(\text{'john'}, d_1) = 1$

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

$\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors:

$\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$

Jaganadh G Elements of Text Mining
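As a quick illustration of the slide above, here is a small sketch of my own (not from the deck) that builds the two term-frequency vectors from the vocabulary index; the variable names are assumptions.

docs = ('john likes to watch movies mary likes too'.split(),
        'john also likes to watch football games'.split())
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

# tf(t, d) for every vocabulary term, in index order
vectors = [[doc.count(term) for term in vocab] for doc in docs]
print(vectors)  # [[0, 0, 1, 2, 1, 1, 1], [1, 1, 1, 1, 0, 0, 1]]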

Vector Space Matrix

The document vectors can be represented as matrix

$M_{|D| \times F}$

where $|D|$ is the cardinality of the document space:

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector (the normalized vector), $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces). [a]

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. [6]

Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$

Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

With the p-norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$

It can be simplified as

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. [a] It is a common norm used to measure the length of a vector, with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$

$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)$

$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}_{d_1}$ has an L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$

Jaganadh G Elements of Text Mining
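A minimal sketch of my own (not from the slides) to verify the L2 normalization worked out above:

import math

v_d1 = [0, 0, 1, 2, 1, 1, 1]
l2 = math.sqrt(sum(x * x for x in v_d1))      # sqrt(8)
v_hat = [x / l2 for x in v_d1]
print(v_hat)                                  # [0.0, 0.0, 0.3535..., 0.7071..., ...]
print(math.sqrt(sum(x * x for x in v_hat)))   # 1.0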

IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the count of documents in which 't' is present (i.e. $tf(t, d) \neq 0$).

Jaganadh G Elements of Text Mining

Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

$idf(football) = \log \frac{2}{1+1} = 0.0$

$idf(games) = \log \frac{2}{1+1} = 0.0$

$idf(john) = \log \frac{2}{1+2} = -0.40546510810816444$

$idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444$

$idf(mary) = \log \frac{2}{1+1} = 0.0$

$idf(movies) = \log \frac{2}{1+1} = 0.0$

$idf(watch) = \log \frac{2}{1+1} = 0.0$

$idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)$

Jaganadh G Elements of Text Mining
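The idf values above can be reproduced with a few lines of Python. This is my own sketch (not from the deck), using the same 1 + |{d : t ∈ d}| smoothing as the slide:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

idf = [math.log(len(docs) / float(1 + sum(1 for d in docs if t in d)))
       for t in vocab]
print(idf)  # [0.0, 0.0, -0.405..., -0.405..., 0.0, 0.0, 0.0]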

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1,d_1) & tf(t_2,d_1) & tf(t_3,d_1) & tf(t_4,d_1) & tf(t_5,d_1) & tf(t_6,d_1) & tf(t_7,d_1) \\ tf(t_1,d_2) & tf(t_2,d_2) & tf(t_3,d_2) & tf(t_4,d_2) & tf(t_5,d_2) & tf(t_6,d_2) & tf(t_7,d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF weight

$\begin{bmatrix} tf(t_1,d_1) \times idf(t_1) & tf(t_2,d_1) \times idf(t_2) & \cdots & tf(t_7,d_1) \times idf(t_7) \\ tf(t_1,d_2) \times idf(t_1) & tf(t_2,d_2) \times idf(t_2) & \cdots & tf(t_7,d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$

Jaganadh G Elements of Text Mining
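Putting the pieces together, here is a small sketch of my own (assuming numpy is available; not code from the deck) that multiplies the term-frequency matrix by the idf diagonal and then L2-normalizes each row, mirroring the matrices above:

import numpy as np

M_tf = np.array([[0, 0, 1, 2, 1, 1, 1],
                 [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])

M_tfidf = M_tf * idf                       # same as M_tf.dot(np.diag(idf))
norms = np.linalg.norm(M_tfidf, axis=1, keepdims=True)
norms[norms == 0] = 1.0                    # guard against all-zero rows
M_tfidf_l2 = M_tfidf / norms
print(M_tfidf_l2)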

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')

bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()

for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. [a]

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()

ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()

itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) \cdot C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong) \cdot C(tea)}$

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \cdot mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \cdot \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. [8]

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier, or a classifier function $\gamma$, that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have

$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T  0.5    15   0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E  0.5    12   0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
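A small sketch of my own (not from the slides) that computes Laplace-smoothed conditional probabilities P(w|c) straight from the training count table; the structure of the `train` dictionary is an assumption.

train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']

cond_prob = {}
for label, rows in train.items():
    totals = [sum(col) for col in zip(*rows)]     # per-word counts in the class
    n = sum(totals)                               # total word frequency in the class
    cond_prob[label] = dict((w, (c + 1.0) / (n + len(vocab)))
                            for w, c in zip(vocab, totals))

print(cond_prob['Terrorism']['kill'])    # 5/21 = 0.238...
print(cond_prob['Terrorism']['music'])   # 1/21 = 0.0476...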

Testing the NB model

Our test document is:

Test Doc   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0     1    ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

log P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

log P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
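A log-space classification sketch of my own (math.log10 is used to match the base-10 figures above; the probability table simply reuses the slide's rounded values):

import math

priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
cond_prob = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

scores = {}
for label in priors:
    score = math.log10(priors[label])
    for word, count in test_doc.items():
        score += count * math.log10(cond_prob[label][word])
    scores[label] = score

print(scores)                       # {'Terrorism': -4.24..., 'Entertainment': -6.62...}
print(max(scores, key=scores.get))  # 'Terrorism'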

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. [a]

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
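A one-liner sketch of my own showing how the earlier count table is reduced to presence/absence for the multi-variate Bernoulli model:

counts = [[2, 1, 3, 0, 0, 1],
          [1, 1, 1, 0, 0, 0],
          [1, 1, 2, 0, 1, 0],
          [0, 1, 0, 2, 1, 1],
          [0, 0, 1, 1, 1, 0],
          [0, 0, 0, 2, 2, 2]]

binary = [[1 if c > 0 else 0 for c in row] for row in counts]
print(binary)  # matches the 0/1 table above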

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. [10] If the data is not linearly separable we have to use kernel tricks. [11]

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK [12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                     Positive              Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
          Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

Positive Precision = $\frac{562}{562 + 77} = 0.87$

Negative Precision = $\frac{436}{225 + 436} = 0.65$

Positive Recall = $\frac{562}{562 + 225} = 0.71$

Negative Recall = $\frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$Error\ Rate = \frac{FP + FN}{TP + FP + FN + TN}$

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

$Error\ Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP + TN}$

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

$Fall\text{-}out = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = $2 \cdot \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

F1 Score (Negative) = $2 \cdot \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$

Jaganadh G Elements of Text Mining
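All of the scores above can be reproduced from the four confusion-matrix cells. This is my own sketch, not code from the deck:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(accuracy, precision, recall, f1)                 # 0.76..., 0.87..., 0.71..., 0.78...
print(error_rate, fall_out, npv, specificity, fdr, mcc)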

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
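As a small illustration of my own (assuming scikit-learn is installed; the labels and scores below are made-up toy data), roc_curve and auc from sklearn.metrics compute the FPR/TPR points and the area for a binary problem:

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.7]    # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(auc(fpr, tpr))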

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., w_{i−1}) ≈ P(wi | w_{i−2}, w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi, wj) = log₂( P(wi, wj) / (P(wi) P(wj)) ) ≈ log₂( N·C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log₂( P(strong, tea) / (P(strong) P(tea)) ) ≈ log₂( N·C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
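A quick usage sketch (not on the original slide), assuming the ngrams and mutual_info functions from the previous slides are already defined; the same pattern works for the tscore function shown later:

words = "strong tea and strong coffee keep me awake".lower().split()
scores = mutual_info(words)
for gram in sorted(scores, key=scores.get, reverse=True):
    print gram, scores[gram]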

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi))·mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi))·σ²(P(wj)) )

≈ ( C(wi, wj) − (1/N)·C(wi)·C(wj) ) / √( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N)·C(strong)·C(tea) ) / √( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.⁸

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci
            0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
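A small numeric sketch of Bayes' rule with a single evidence; the numbers here are made up purely for illustration and are not from the slides:

# Hypothetical numbers: P(raining) = 0.3, P(dark cloud | raining) = 0.9, P(dark cloud) = 0.4
p_raining = 0.3
p_cloud_given_raining = 0.9
p_cloud = 0.4
p_raining_given_cloud = p_cloud_given_raining * p_raining / p_cloud
print p_raining_given_cloud   # 0.675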

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest score.
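A minimal Python sketch (not part of the original slides) of the same multinomial Naïve Bayes computation, Laplace smoothing included. It recounts the class word totals directly from the toy training table, so the Entertainment figures can differ slightly from the rounded values printed on the slide, but the decision is the same:

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
test = [2, 1, 2, 0, 0, 1]   # the test document Dt

n_docs = sum(len(d) for d in train.values())
for c, docs in train.items():
    prior = len(docs) / n_docs
    word_totals = [sum(col) for col in zip(*docs)]           # per-word counts in class c
    n_c = sum(word_totals)
    score = prior
    for i, freq in enumerate(test):
        p_w_c = (word_totals[i] + 1) / (n_c + len(vocab))    # Laplace smoothing
        score *= p_w_c ** freq
    print c, score
# Terrorism gets the larger score, so Dt is labelled 'Terrorism'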

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2·log(0.2380) + 1·log(0.1904) + 2·log(0.3333) + 0·log(0.0476) + 0·log(0.0952) + 1·log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437
P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2·log(0.0555) + 1·log(0.1111) + 2·log(0.1111) + 0·log(0.3333) + 0·log(0.2777) + 1·log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
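A small standalone sketch (my own addition) that repeats the log-space computation with the rounded probabilities from the table above:

from __future__ import division
import math

probs = {'Terrorism':     [0.5, 0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
         'Entertainment': [0.5, 0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
freqs = [2, 1, 2, 0, 0, 1]   # test-document counts for kill, bomb, kidnap, music, movie, tv

for c, p in probs.items():
    prior, conditionals = p[0], p[1:]
    log_score = math.log10(prior) + sum(f * math.log10(pc)
                                        for f, pc in zip(freqs, conditionals))
    print c, log_score
# Terrorism -> about -4.24, Entertainment -> about -6.63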

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.ᵃ

ᵃ A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
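A minimal sketch (my own illustration, not from the slides) of turning the count table into the Bernoulli presence/absence representation:

import numpy as np

counts = np.array([[2, 1, 3, 0, 0, 1],
                   [1, 1, 1, 0, 0, 0],
                   [1, 1, 2, 0, 1, 0],
                   [0, 1, 0, 2, 1, 1],
                   [0, 0, 1, 1, 1, 0],
                   [0, 0, 0, 2, 2, 2]])
bernoulli = (counts > 0).astype(int)   # 1 if the word occurs at least once, else 0
print bernoulli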

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
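A tiny sketch (illustrative only, with made-up numbers) of the linear decision function f(x) = ⟨w, x⟩ + b; the sign of f(x) gives the predicted label:

import numpy as np

w = np.array([0.4, -1.2, 0.7])   # hypothetical weight vector
b = -0.1                          # hypothetical bias
x = np.array([1.0, 0.0, 2.0])     # a new feature vector

f = np.dot(w, x) + b
label = 1 if f > 0 else -1
print f, label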

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.¹⁰ If the data is not linearly separable we have to use kernel tricks.¹¹

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted  Positive     True Positive (TP)    False Positive (FP)
           Negative     False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.ᵃ

F1 Score = 2 × (precision × recall) / (precision + recall)

ᵃ http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) )

= 0.55
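A small sketch (not on the slides) that recomputes the metrics above from the same confusion matrix (TP = 562, FP = 77, FN = 225, TN = 436):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)          # positive predictive value
recall      = TP / (TP + FN)          # sensitivity / hit rate
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)          # false positive rate
f1          = 2 * precision * recall / (precision + recall)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc         = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc
# roughly 0.76, 0.87, 0.71, 0.78, 0.55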

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
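A hedged sklearn sketch (my own addition, not from the slides) of plotting an ROC curve and computing AUC from predicted scores; y_true and y_scores here are made-up values:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                       # actual labels (hypothetical)
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]     # classifier scores (hypothetical)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print "AUC:", auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()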

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 13: Elements of Text Mining Part - I

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final scores you can observe that the values are scaled nicely.
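A minimal sketch of this log-space scoring in Python, plugging in the smoothed probabilities from the slides (the variable names and layout are mine):

from __future__ import division
import math

# Laplace-smoothed P(w|c) from the slide, and the test-document term counts
cond = {'T': [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
        'E': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
prior = {'T': 0.5, 'E': 0.5}
test_counts = [2, 1, 2, 0, 0, 1]        # kill, bomb, kidnap, music, movie, tv

def log_posterior(c):
    # sum of logs instead of a product of probabilities, to avoid underflow
    score = math.log10(prior[c])
    for p, n in zip(cond[c], test_counts):
        score += n * math.log10(p)
    return score

print log_posterior('T')   # about -4.24
print log_posterior('E')   # about -6.63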

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
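A tiny sketch of this binarization step in plain Python (the variable names are mine):

# term-count vectors from the multinomial example
count_vectors = [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
                 [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]

# Bernoulli representation: 1 if the word occurs at least once, else 0
bernoulli_vectors = [[1 if c > 0 else 0 for c in doc] for doc in count_vectors]

for vec in bernoulli_vectors:
    print vec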

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation; we can also use a smoothed bag of words model, like TF-IDF, in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

<w, xi> + b > 0 for yi = +1
<w, xi> + b < 0 for yi = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
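A small sketch of such a linear decision function with NumPy; the weight vector and bias below are made-up illustrative values, not learned ones:

import numpy as np

# made-up weight vector and bias for a 2-dimensional toy problem
w = np.array([0.4, -0.7])
b = 0.1

def predict(x):
    # the sign of <w, x> + b decides the class (+1 or -1)
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))    # falls on the +1 side of the hyperplane
print predict(np.array([-1.0, 1.5]))   # falls on the -1 side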

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                          Positive                Negative
Predicted   Positive      True Positive (TP)      False Positive (FP)
            Negative      False Negative (FN)     True Negative (TN)
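In practice such a table can be produced directly; a short sketch using scikit-learn's confusion_matrix on made-up labels (note that sklearn puts the actual classes on the rows):

from sklearn.metrics import confusion_matrix

# made-up gold labels and predictions, for illustration only
y_true = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# rows are actual classes, columns are predicted classes
print confusion_matrix(y_true, y_pred, labels=['pos', 'neg'])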

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436)

= 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of instances classified incorrectly:

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55
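All of the metrics in this section can be reproduced from the four counts; a small sketch in plain Python using the numbers of the running example (variable names are mine):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)                 # also positive predictive value
recall      = TP / (TP + FN)                 # also sensitivity / hit rate
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)                 # false positive rate
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)                 # true negative rate
fdr         = FP / (FP + TP)
f1          = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc   # compare with the preceding slides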

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
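A sketch of how the ROC curve and AUC are typically computed with scikit-learn, on made-up labels and scores:

from sklearn.metrics import roc_curve, auc

# made-up true labels and classifier scores, for illustration only
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under that curve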

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 14: Elements of Text Mining Part - I

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)       # word_count() as defined earlier
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
                      reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)        # top_words() from the previous slide
    x = range(len(tfw))
    y = [item[1] for item in tfw]      # word frequencies
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[i][1] for i in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)            # tokenize() from the earlier slides
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word mi co-occurs with word mj within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

        w1    w2    w3    ...   wn
  w1    m11   m12   m13   ...   m1n
  w2    m21   m22   m23   ...   m2n
  w3    m31   m32   m33   ...   m3n
  ...
  wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure: Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.]

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only looking at the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

~V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

~V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

~V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
~V_d1 = (0, 0, 1, 2, 1, 1, 1)

~V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| x F}

where |D| is the cardinality of the document space:

M_{|D| x F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v / ||~v||_p, where v is the unit vector (the normalized vector), ~v is the vector to be normalized, and ||~v||_p is the norm (magnitude or length) of the vector ~v in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector ~v = (v1, v2, v3, ..., vn)
Length of vector ~v = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With norm: ||~v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||~v||_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

~v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1 = ~v_d1 / ||~v_d1||_2

v_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

v_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

v_d1 = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

v_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1 has an L2 norm ||v_d1||_2 = 1.0
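The same normalization can be checked quickly with NumPy (a sketch using the toy document vector above):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)     # divide by the L2 norm

print v_hat                             # [0. 0. 0.3535... 0.7071... ...]
print np.linalg.norm(v_hat)             # 1.0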

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0

idf(games) = log(2 / (1 + 1)) = 0.0

idf(john) = log(2 / (1 + 2)) = -0.40546510810816444

idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444

idf(mary) = log(2 / (1 + 1)) = 0.0

idf(movies) = log(2 / (1 + 1)) = 0.0

idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} × M_idf

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1)
  tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

= [ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
    tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()

for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too" \
        .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = (mean(P(wi, wj)) - mean(P(wi)) mean(P(wj))) / sqrt(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
          ≈ (C(wi, wj) - (1/N) C(wi) C(wj)) / sqrt(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = (P(E|H) × P(H)) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds. Then we have:

P(raining | dark cloud) = (P(dark cloud | raining) × P(raining)) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = (P(E1, E2, ..., En | H) × P(H)) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = (P(E1|H) × P(E2|H) × ... × P(En|H) × P(H)) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55

Jaganadh G Elements of Text Mining
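A quick Python check of the MCC value above (a small sketch added here; it is not from the original slides):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)   # 0.55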

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
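A small sketch, assuming scikit-learn is available (not part of the original slides), showing how ROC points and the AUC can be computed from a binary classifier's scores; the labels and scores below are made-up toy values:

from sklearn.metrics import roc_curve, auc

# hypothetical true labels and predicted scores for a binary classifier
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under the curve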

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # newer NLTK versions use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 15: Elements of Text Mining Part - I

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)   # top_words() was defined in an earlier slide
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)   # top_words() was defined in an earlier slide
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)   # top_words() was defined in an earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)   # tokenize() was defined in an earlier slide
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3. Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)   # tokenize() was defined in an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4. https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times word m_i co-occurs with word m_j within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a. Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1    w2    w3    ...  wn
w1    m11   m12   m13   ...  m1n
w2    m21   m22   m23   ...  m2n
w3    m31   m32   m33   ...  m3n
...
wn    mn1   mn2   mn3   ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

(Figure) Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted again.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a. http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

Vocabulary (after removing stop words):
{'football': 0, 'games': 1, 'john': 2, 'likes': 3, 'mary': 4, 'movies': 5, 'watch': 6}

d1 -> [0, 0, 1, 2, 1, 1, 1]
d2 -> [1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a. Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# note: recent scikit-learn versions use ngram_range=(1, 1) instead of min_n=1
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5. http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just for sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
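A minimal Python sketch of tf(t, d) as defined above (added for illustration; it is not from the original slides):

def tf(term, doc_tokens):
    # tf(t, d): how many times term t occurs in the tokenized document d
    return sum(1 for token in doc_tokens if token == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)    # 1
print tf('likes', d1)   # 2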

Document Vector

To create a document vector space

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M_{|D| x F}

where |D| is the cardinality of the document space.

M_{|D| x F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_norm = v / ||v||_p, where v_norm is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue space). a

a. http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the general p-norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6. http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1_norm = v_d1 / ||v_d1||_2

          = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

          = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

          = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

          = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1_norm has L2-norm ||v_d1_norm||_2 = 1.0

Jaganadh G Elements of Text Mining
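The same normalization can be checked quickly with NumPy (a sketch added here; it is not part of the original slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm
print v_norm                                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print np.linalg.norm(v_norm, ord=2)           # 1.0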

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); for example, log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0

idf(games) = log(2 / (1 + 1)) = 0.0

idf(john) = log(2 / (1 + 2)) = -0.40546510810816444

idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444

idf(mary) = log(2 / (1 + 1)) = 0.0

idf(movies) = log(2 / (1 + 1)) = 0.0

idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
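A short Python sketch of the idf definition above, checked on two of the terms (added for illustration; it is not from the original slides):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1 + df))    # idf(t) = log(|D| / (1 + df))

print round(idf('john', docs), 4)       # -0.4055 (appears in both documents)
print round(idf('football', docs), 4)   #  0.0    (appears in one document)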

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1)
  tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# note: recent scikit-learn versions use ngram_range=(1, 1) instead of min_n=1
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# print each vocabulary term with its idf weight
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7. http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a. http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )

          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8. http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

(Figure) A supervised document classification pipeline. 9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior (a priori) probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining
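A tiny numeric sketch of Bayes' rule (added for illustration; the probability values below are made-up assumptions, not from the original slides):

# hypothetical values: P(dark cloud | raining), P(raining), P(dark cloud)
p_e_given_h = 0.8
p_h = 0.3
p_e = 0.4

p_h_given_e = p_e_given_h * p_h / p_e   # Bayes' rule
print p_h_given_e                       # 0.6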

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv)   (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
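A small Python sketch (added for illustration; it is not part of the original slides) that reproduces the worked example: it computes the log-posterior of the test document for each class from the smoothed conditional probabilities above.

import math

# smoothed class-conditional probabilities P(w|c) from the model above,
# in the order (kill, bomb, kidnap, music, movie, tv)
p_w_terror = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_w_entert = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
prior = 0.5
test_counts = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

def log_posterior(p_words, counts, prior):
    # log P(c|W) ~ log P(c) + sum_j count_j * log P(w_j|c)
    return math.log10(prior) + sum(c * math.log10(p) for p, c in zip(p_words, counts))

print log_posterior(p_w_terror, test_counts, prior)   # about -4.24  -> Terrorism
print log_posterior(p_w_entert, test_counts, prior)   # about -6.63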

Naıve Bayes Classifier

There are two different ways to set up a Naive Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a. Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
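A one-line sketch of this binarization (an illustration added here, not from the original slides), turning term-frequency vectors from the earlier table into presence/absence vectors:

freq_vectors = [[2, 1, 3, 0, 0, 1],   # D0 term frequencies (kill, bomb, kidnap, music, movie, tv)
                [0, 1, 0, 2, 1, 1]]   # D3 term frequencies

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary_vectors = [[1 if count > 0 else 0 for count in vec] for vec in freq_vectors]
print binary_vectors   # [[1, 1, 1, 0, 0, 1], [0, 1, 0, 1, 1, 1]]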

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10. This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict() returns an array; take its first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict() returns an array; take its first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining


Page 16: Elements of Text Mining Part - I

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
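A quick usage sketch (reusing the tokenize function defined earlier in these slides):

doc = "John likes to watch movies and Mary likes movies too"
print(stop_filter(tokenize(doc)))
# expected: ['john', 'likes', 'watch', 'movies', 'mary', 'likes', 'movies']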

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text using only the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
'likes': 3, 'john': 2, 'mary': 4}

(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just a toy sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf\text{-}idf(t, d) = tf(t, d) \times idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = \sum_{x \in d} fr(x, t)

where

fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the number of documents in which 't' is present (i.e. tf(t, d) \neq 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = \sum_{x \in d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))

To represent 'd1' and 'd2' as vectors:

\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))

\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to
\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)
\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M_{|D| \times F}

where |D| is the cardinality of the document space and F is the number of features (the vocabulary size):

M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}, where \hat{v} is the unit vector (the normalized vector), \vec{v} is the vector to be normalized, and \|\vec{v}\|_p is the norm (magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, \ldots, v_n)
Length of the vector: \|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}

With the p-norm: \|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}

It can be written compactly as

\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, obtained with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)

\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}

\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}

\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}

\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)

\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector \hat{v}_{d_1} has an L2 norm \|\hat{v}_{d_1}\|_2 = 1.0.
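A quick numpy sketch (not part of the original slides) that reproduces this normalization:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_d1_hat = v_d1 / np.linalg.norm(v_d1, ord=2)    # divide by the L2 norm, sqrt(8)
print(v_d1_hat)                                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536] (approx.)
print(np.linalg.norm(v_d1_hat, ord=2))           # 1.0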

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the number of documents in which 't' is present (i.e. tf(t, d) \neq 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}, e.g. \log \frac{2}{1} = 0.69314718

idf(football) = \log \frac{2}{1+1} = 0.0
idf(games) = \log \frac{2}{1+1} = 0.0
idf(john) = \log \frac{2}{1+2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1+1} = 0.0
idf(movies) = \log \frac{2}{1+1} = 0.0
idf(watch) = \log \frac{2}{1+1} = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
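A short plain-Python sketch (using the natural log, as above) that reproduces these values:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

idf = [math.log(len(docs) / (1.0 + sum(1 for d in docs if t in d))) for t in vocab]
print(zip(vocab, idf))
# football, games, mary, movies, watch -> 0.0; john, likes -> -0.405465...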

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| \times F} \times M_{idf}

\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}

Jaganadh G Elements of Text Mining


TF-IDF weight

\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) \, C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong) \, C(tea)}

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo
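A quick usage sketch on the toy sentence (assuming the ngrams function above is in scope):

words = "John likes to watch movies Mary likes too".lower().split()
mi = mutual_info(words)
print(mi['john likes'])   # log2(8 * 1 / (1 * 2)) = 2.0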

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) \, C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) \, C(tea)}{\sqrt{C(strong, tea)}}

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, \ldots, c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, \ldots, d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (classifier function) \gamma that maps documents to classes: \gamma : D \rightarrow C.
A supervised learning algorithm \Gamma takes a training set T and emits a learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:

P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

With the independence assumption we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt        2     1     2       0      0      1   -

To find the posterior probability:

P(c_i | W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j | c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) x P(kill|Terrorism)^2 x P(bomb|Terrorism)^1 x P(kidnap|Terrorism)^2 x P(music|Terrorism)^0 x P(movie|Terrorism)^0 x P(tv|Terrorism)^1

= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1

= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5

P(Entertainment|W) = P(Entertainment) x P(kill|Entertainment)^2 x P(bomb|Entertainment)^1 x P(kidnap|Entertainment)^2 x P(music|Entertainment)^0 x P(movie|Entertainment)^0 x P(tv|Entertainment)^1

= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1

= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7

The document is classified as "Terrorism" because that class got the highest value.
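A small plain-Python sketch (a toy illustration, using the smoothed conditional probabilities exactly as listed in the model table above) that roughly reproduces these figures:

# Laplace-smoothed conditional probabilities, as listed in the model table
p_terror = {'kill': 5/21., 'bomb': 4/21., 'kidnap': 7/21.,
            'music': 1/21., 'movie': 2/21., 'tv': 2/21.}
p_entert = {'kill': 1/18., 'bomb': 2/18., 'kidnap': 2/18.,
            'music': 6/18., 'movie': 5/18., 'tv': 2/18.}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    score = prior
    for word, count in doc.items():
        score *= cond_probs[word] ** count
    return score

print(posterior(0.5, p_terror, test_doc))  # ~5.7e-05
print(posterior(0.5, p_entert, test_doc))  # ~2.3e-07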

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation; we can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
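A minimal scikit-learn sketch (a toy illustration built on the terrorism/entertainment counts from the worked example; not part of the original slides) contrasting the two event models. Note that sklearn's BernoulliNB binarizes the counts internally, mirroring the presence/absence representation above:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB
import numpy as np

# rows: D0..D5; columns: kill, bomb, kidnap, music, movie, tv
X = np.array([[2, 1, 3, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 2, 2, 2]])
y = ['T', 'T', 'T', 'E', 'E', 'E']
test = np.array([[2, 1, 2, 0, 0, 1]])   # the test document Dt

mnb = MultinomialNB(alpha=1.0).fit(X, y)              # uses the counts themselves
bnb = BernoulliNB(alpha=1.0, binarize=0.5).fit(X, y)  # uses only presence/absence
print(mnb.predict(test), bnb.predict(test))           # both should predict ['T']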

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, either (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
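A minimal scikit-learn sketch (a toy 2-D linearly separable set, not from the slides) showing the learned weight vector w and bias b:

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = [1, 1, 1, -1, -1, -1]

svm = LinearSVC(C=1.0).fit(X, y)
print(svm.coef_, svm.intercept_)                 # w and b of the separating hyperplane
print(svm.predict([[2.0, 2.0], [-2.0, -2.0]]))   # expected: [ 1 -1]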

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
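A quick scikit-learn sketch (toy labels, not from the slides) that produces such a matrix; note that sklearn puts the actual classes on the rows and the predicted classes on the columns:

from sklearn.metrics import confusion_matrix

y_true = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
y_pred = ['pos', 'pos', 'neg', 'neg', 'neg', 'pos']
print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))
# [[2 1]
#  [1 2]]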

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 x (precision x recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
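A plain-Python sketch (not part of the original slides) that reproduces the evaluation numbers above from the same confusion matrix:

import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / float(TP + FP + FN + TN)           # ~0.77
precision   = TP / float(TP + FP)                            # ~0.88
recall      = TP / float(TP + FN)                            # ~0.71
f1          = 2 * precision * recall / (precision + recall)  # ~0.79
error_rate  = (FP + FN) / float(TP + FP + FN + TN)           # ~0.23
fallout     = FP / float(FP + TN)                            # ~0.15
npv         = TN / float(TN + FN)                            # ~0.66
specificity = TN / float(FP + TN)                            # ~0.85
fdr         = FP / float(FP + TP)                            # ~0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))      # ~0.55
print(accuracy, precision, recall, f1, mcc)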

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
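A short scikit-learn sketch (toy scores, not from the slides) computing an ROC curve and its AUC:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 1, 0, 0, 0]                  # actual labels
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]     # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                         # expected: about 0.89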

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning :: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning :: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 17: Elements of Text Mining Part - I

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)   # top_words function from an earlier slide
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    npoints = len(tfw)
    y = []
    for item in range(npoints):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)   # top_words function from an earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list. 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)   # tokenize function from an earlier slide
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud. 4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)   # tokenize function from an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = log2 [ P(w_i, w_j) / (P(w_i) P(w_j)) ] ≈ log2 [ N C(w_i, w_j) / (C(w_i) C(w_j)) ]

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

I(strong, tea) = log2 [ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2 [ N C(strong, tea) / (C(strong) C(tea)) ]

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) /
            math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = [ mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ] / sqrt( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ [ C(w_i, w_j) − (1/N) C(w_i) C(w_j) ] / sqrt( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
            math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining
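A quick usage sketch for the two association measures above. This assumes the ngrams, mutual_info, and tscore functions from the previous slides are already defined in the same session; the sample sentence is only an illustration.

words = "strong tea is better than weak tea she said".lower().split()

mi = mutual_info(words)   # bigram -> mutual information
ts = tscore(words)        # bigram -> t-score

# rank bigrams by mutual information, strongest association first
for gram, score in sorted(mi.items(), key=lambda x: x[1], reverse=True):
    print gram, round(score, 3), round(ts[gram], 3)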

Document Classification

Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or classifier function, γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = [ P(E|H) × P(H) ] / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = [ P(dark cloud | raining) × P(raining) ] / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = [ P(E1, E2, ..., En | H) × P(H) ] / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = [ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ] / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining
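As a quick numeric illustration of the rule (the probability values below are made up purely for this example, not taken from any data set):

# hypothetical numbers, only to illustrate Bayes' rule
p_rain = 0.2              # P(H): prior probability of raining
p_cloud_given_rain = 0.9  # P(E|H): probability of dark cloud when raining
p_cloud = 0.3             # P(E): overall probability of dark cloud

# P(H|E) = P(E|H) x P(H) / P(E)
p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print p_rain_given_cloud  # 0.6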

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6   T 0.5   15  0.238095238 0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
    E 0.5   12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
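The Laplace-smoothed estimates above are easy to check with a couple of lines of Python. This is only a sketch of the arithmetic on this slide, using the counts as stated (|V| = 6, n_terrorism = 15, n_entertainment = 12, and the frequency of "kill" in each class):

V = 6  # vocabulary size

def laplace(word_count_in_class, total_words_in_class, vocab_size=V):
    # add-one (Laplace) smoothed estimate of P(w|c)
    return (word_count_in_class + 1.0) / (total_words_in_class + vocab_size)

print laplace(4, 15)  # P(kill|Terrorism)     = 5/21 = 0.238...
print laplace(0, 12)  # P(kill|Entertainment) = 1/18 = 0.055...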

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1

To find the posterior probability:

P(c_i|W) = P(c_i) × ∏_{j=1}^{|V|} P(w_j|c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
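The log-space arithmetic above can be reproduced with a short script; this is only a sketch using the smoothed probabilities from the model table, base-10 logarithms, and the test document's word counts (2, 1, 2, 0, 0, 1):

import math

# smoothed P(w|c), in vocabulary order: kill, bomb, kidnap, music, movie, tv
p_terrorism     = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_counts     = [2, 1, 2, 0, 0, 1]  # word frequencies in the test document
prior = 0.5

def log_score(cond_probs, counts, prior):
    # log P(c) + sum over words of count * log P(w|c)
    return math.log10(prior) + sum(c * math.log10(p)
                                   for p, c in zip(cond_probs, counts))

print log_score(p_terrorism, test_counts, prior)      # about -4.24
print log_score(p_entertainment, test_counts, prior)  # about -6.63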

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
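Converting the multinomial count vectors into the Bernoulli (presence/absence) representation is a one-liner; a small sketch using the document vectors from the training table:

# term-frequency vectors in vocabulary order: kill, bomb, kidnap, music, movie, tv
count_vectors = [
    [2, 1, 3, 0, 0, 1],  # D0
    [1, 1, 1, 0, 0, 0],  # D1
    [1, 1, 2, 0, 1, 0],  # D2
    [0, 1, 0, 2, 1, 1],  # D3
    [0, 0, 1, 1, 1, 0],  # D4
    [0, 0, 0, 2, 2, 2],  # D5
]

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary_vectors = [[1 if c > 0 else 0 for c in doc] for doc in count_vectors]
for doc in binary_vectors:
    print doc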

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}, i = 1..n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, whether (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
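The prediction rule itself is just the sign of ⟨w, x⟩ + b. A tiny sketch with a hand-picked weight vector and bias (the numbers are illustrative only, not a trained model):

# hypothetical weight vector and bias for a 2-dimensional feature space
w = [2.0, -1.0]
b = -0.5

def predict(x):
    # sign of the decision function <w, x> + b
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score > 0 else -1

print predict([1.0, 0.5])  # 2*1.0 - 1*0.5 - 0.5 =  1.0 -> +1
print predict([0.0, 1.0])  #        -1.0    - 0.5 = -1.5 -> -1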

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edges of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
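Before moving on to the evaluation metrics in the next slides, either pipeline can be sanity-checked on held-out documents. A rough sketch reusing the classifier and training_data objects from the slide above (a crude slice-based split; load_files was called with shuffle=True, but in practice you would cross-validate):

# keep the last 200 documents aside for testing
train_docs, test_docs = training_data.data[:-200], training_data.data[-200:]
train_y, test_y = training_data.target[:-200], training_data.target[-200:]

classifier.fit(train_docs, train_y)
predicted = classifier.predict(test_docs)

correct = sum(1 for p, t in zip(predicted, test_y) if p == t)
print "held-out accuracy:", correct / float(len(test_y))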

Practice Time

Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                            Actual
                            Positive                Negative
Predicted    Positive       True Positive (TP)      False Positive (FP)
             Negative       False Negative (FN)     True Negative (TN)

Jaganadh G Elements of Text Mining
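With scikit-learn the same table can be produced directly from true and predicted labels (sklearn.metrics.confusion_matrix); a small sketch with toy label lists. Note that sklearn orders rows/columns by label value, with rows as actual classes and columns as predicted classes:

from sklearn.metrics import confusion_matrix

# toy example: 1 = Positive, 0 = Negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]

print confusion_matrix(y_true, y_pred)
# [[2 2]
#  [1 3]]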

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive F1 Score = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

Negative F1 Score = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining
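All of the scores in this section are simple functions of the four confusion-matrix cells, so they can be computed in one place. A small sketch that reproduces the running example (TP = 562, FP = 77, FN = 225, TN = 436; the printed values match the worked figures above up to rounding):

import math

def classifier_scores(tp, fp, fn, tn):
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return {
        "accuracy":  (tp + tn) / float(tp + fp + fn + tn),
        "precision": precision,
        "recall":    recall,
        "f1":        2 * precision * recall / (precision + recall),
        "mcc":       (tp * tn - fp * fn) /
                     math.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

for name, value in classifier_scores(562, 77, 225, 436).items():
    print name, value
# accuracy 0.76..., precision 0.87..., recall 0.71..., f1 0.78..., mcc 0.55...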

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
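scikit-learn can compute the ROC points and the AUC directly from true labels and classifier scores (sklearn.metrics.roc_curve and sklearn.metrics.auc); a minimal sketch with toy scores. In a real experiment the scores would come from decision_function or predict_proba of the classifier.

from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print "AUC:", auc(fpr, tpr)  # 0.75 for this toy data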

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting terms with Python topia.termextract13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way. If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, exactly equal in distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                          Positive               Negative
Predicted   Positive      True Positive (TP)     False Positive (FP)
            Negative      False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
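The same arithmetic as a small Python sketch, using the counts from the table above (values match the slides up to rounding):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)          # ~0.768
precision = TP / (TP + FP)                          # ~0.879
recall = TP / (TP + FN)                             # ~0.714
f1 = 2 * precision * recall / (precision + recall)  # ~0.789
print accuracy, precision, recall, f1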

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
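A minimal sketch with scikit-learn; the labels and scores below are invented just to show the API. roc_curve returns the false positive and true positive rates at each threshold, and auc integrates the curve:

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])                      # actual classes (toy)
y_score = np.array([0.9, 0.8, 0.65, 0.6, 0.55, 0.4, 0.3, 0.1])   # predicted scores (toy)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)
# plt.plot(fpr, tpr) would draw the ROC curve itself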

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 19: Elements of Text Mining Part - I

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)  # top_words() from an earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)  # tokenize() from an earlier slide
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)  # tokenize() from an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

       w1    w2    w3   ...   wn
w1     m11   m12   m13  ...   m1n
w2     m21   m22   m23  ...   m2n
w3     m31   m32   m33  ...   m3n
...
wn     mn1   mn2   mn3  ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
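For instance, individual cells of the returned matrix can be read directly (using the toy corpus above):

print ccm['w1']['w2']  # 1: 'w1' and 'w2' co-occur once
print ccm['w1']['w4']  # 0: they never appear in the same corpus entry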

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n is a parameter of the scikit-learn API of the time
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents where 't' is present and tf(t, d) ≠ 0,
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| × F}, where |D| is the cardinality of the document space:

M_{|D| × F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)

With norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v‖_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v̂_d1 = v_d1 / ‖v_d1‖_2

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂_d1 has an L2-norm ‖v̂_d1‖_2 = 1.0
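The same computation, sketched with NumPy (np.linalg.norm computes the L2 norm by default):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)   # divide by sqrt(8)
print v_hat                           # ~(0, 0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
print np.linalg.norm(v_hat)           # 1.0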

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents where 't' is present and tf(t, d) ≠ 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
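These values can be reproduced with a few lines of Python (a sketch; log here is the natural logarithm, as in the slide):

import math

D = 2  # number of documents
df = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
    'mary': 1, 'movies': 1, 'watch': 2}  # document frequency of each term

for term in sorted(df):
    print "%s => %f" % (term, math.log(float(D) / (1 + df[term])))
# e.g. idf(john) = log(2/3) = -0.405465, idf(football) = log(2/2) = 0.0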

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| × F} × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
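Row-wise, this is what the normalization does; a NumPy sketch using the tf-idf values from the worked example above (scikit-learn's TfidfTransformer(norm='l2') applies the same row normalization as its final step):

import numpy as np

tf_idf = np.array([[0.0, 0.0, -0.4054, -0.8109, 0.0, 0.0, 0.0],
    [0.0, 0.0, -0.4054, -0.4054, 0.0, 0.0, 0.0]])
row_norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)  # L2 norm of each document row
print tf_idf / row_norms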

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is a parameter of the scikit-learn API of the time
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ ( C(w_i, w_j) - (1/N) C(w_i) C(w_j) ) / √( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed. P(H) is called the prior probability: the probability of an event before the evidence is observed. P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
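A toy numeric sketch (all probabilities below are invented, just to show the arithmetic of the rule with two pieces of evidence):

p_e1_given_h = 0.8   # P(dark cloud | raining), invented
p_e2_given_h = 0.7   # P(strong wind | raining), invented
p_h = 0.3            # P(raining), invented
p_e1_e2 = 0.2        # P(dark cloud, strong wind), invented

posterior = (p_e1_given_h * p_e2_given_h * p_h) / p_e1_e2
print posterior      # 0.84 = P(raining | dark cloud, strong wind)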

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill bomb kidnap music movie tv   C
Dt          2    1    2      0     0     1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{V} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
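The whole worked example can be reproduced with a short script; this sketch takes the Laplace-smoothed P(w|C) values straight from the model table above and uses log10 so the numbers match the slide:

import math

# order of terms: kill, bomb, kidnap, music, movie, tv
p_w_given_c = {'Terrorism': [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
    'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test = [2, 1, 2, 0, 0, 1]  # word frequencies in the test document Dt

for label in p_w_given_c:
    log_score = math.log10(priors[label])
    for freq, p in zip(test, p_w_given_c[label]):
        log_score += freq * math.log10(p)
    print "%s => %f" % (label, log_score)
# Terrorism => about -4.24, Entertainment => about -6.63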

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining


Page 20: Elements of Text Mining Part - I

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    # tokenize() is the function defined in the Tokenization section
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times mi co-occurs with word mj within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

        w1    w2    w3   ...   wn
  w1   m11   m12   m13   ...   m1n
  w2   m21   m22   m23   ...   m2n
  w3   m31   m32   m33   ...   m3n
  ...
  wn   mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining
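Instead of maintaining such a list by hand, the same filtering can also be done with the stop word corpus that ships with NLTK. This is only a hedged alternative sketch (it assumes the NLTK stopwords corpus has been downloaded; the list it uses is not identical to the one above).

from nltk.corpus import stopwords

def stop_filter_nltk(words):
    stops = set(stopwords.words('english'))   # requires nltk.download('stopwords')
    return [word for word in words if word not in stops]

print stop_filter_nltk("john likes to watch movies".split())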

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag-of-words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n belongs to the older (2012-era) scikit-learn API used in these slides;
# newer releases use ngram_range instead
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just for a sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    # tokenize() is the function defined in the Tokenization section
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd' and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = ∑x∈d fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. tf(t, d) ≠ 0;
|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = ∑x∈d fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
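A quick way to sanity-check the definition is to compute the raw term frequencies of the two running example documents in plain Python. This is only an illustrative sketch; the documents and the term order are the ones used throughout these slides.

docs = {'d1': "john likes to watch movies mary likes too".split(),
        'd2': "john also likes to watch football games".split()}
terms = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

for name, doc in sorted(docs.items()):
    tf = [doc.count(t) for t in terms]   # tf(t, d) = number of occurrences of t in d
    print name, tf
# d1 [0, 0, 1, 2, 1, 1, 1]
# d2 [1, 1, 1, 1, 0, 0, 1]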

Document Vector

To create a document vector space

V(dn) = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )

To represent 'd1' and 'd2' as vectors:

V(d1) = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V(d2) = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M|D|×F, where |D| is the cardinality of the document space:

M|D|×F = [ 0 0 1 2 1 1 1
           1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = √( v1² + v2² + v3² + ... + vn² )

With norm: ||v||p = ( |v1|^p + |v2|^p + |v3|^p + ... + |vn|^p )^(1/p)

It can be simplified as:

||v||p = ( ∑i=1..n |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||2

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8

v̂(d1) = ( 0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8 )

v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2-norm ||v̂(d1)||2 = 1.0

Jaganadh G Elements of Text Mining
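The same normalization can be checked numerically. A minimal sketch with NumPy (NumPy is an assumption here; it is not used elsewhere in these slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)      # divide by the L2 norm
print v_hat                              # matches the hand computation above (up to rounding)
print np.linalg.norm(v_hat)              # 1.0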

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. tf(t, d) ≠ 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ),  e.g. log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games)    = log( 2 / (1+1) ) = 0.0
idf(john)     = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1+1) ) = 0.0
idf(movies)   = log( 2 / (1+1) ) = 0.0
idf(watch)    = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
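These values can be reproduced with a few lines of Python. A small sketch following the slide's formula (natural log, with the +1 in the denominator); the toy documents are the running d1/d2 example:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)        # |{d : t in d}|
    return math.log(len(docs) / float(1 + df))    # log(|D| / (1 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]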

TF-IDF weight

Finding TF-IDF weight

M|D|×F × Midf

M|D|×F = [ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
           tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

Midf = diag( idf(t1), idf(t2), idf(t3), idf(t4), idf(t5), idf(t6), idf(t7) )

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining
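A hedged NumPy sketch of this matrix product, using the tf matrix and the (rounded) idf vector from the running example; NumPy is an assumption here, the slides themselves compute this step with scikit-learn below.

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0])

tfidf = tf.dot(np.diag(idf))     # M|D|xF x Midf
print tfidf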

TF-IDF Normalization

L2 Normalization

M(tf-idf) = M(tf-idf) / ||M(tf-idf)||2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n belongs to the older (2012-era) scikit-learn API used in these slides
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )

          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will be like:

|V|   C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)    P(movie|Ci)   P(tv|Ci)
 6    T    0.5    15   0.238095238   0.19047619   0.33333333     0.047619048    0.095238095   0.095238095
      E    0.5    12   0.05555556    0.11111111   0.11111111     0.33333333     0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation, assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv    C
Dt           2     1      2       0      0     1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏j=1..|V| P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/FileDL/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
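The whole worked example (training counts, Laplace smoothing and log-space scoring) can be reproduced with a short script. This is only an illustrative sketch: it recomputes every probability from the raw counts in the training table, so the printed scores may differ slightly from the rounded figures used in the hand calculation above.

from __future__ import division
import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
         'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]}
test = [2, 1, 2, 0, 0, 1]                            # the test document Dt

n_docs = sum(len(docs) for docs in train.values())
for label, docs in train.items():
    prior = len(docs) / n_docs                       # P(Ci)
    word_totals = [sum(col) for col in zip(*docs)]   # per-word counts in this class
    n = sum(word_totals)                             # ni
    log_post = math.log10(prior)                     # log-space to prevent underflow
    for i, count in enumerate(test):
        p_w = (word_totals[i] + 1) / (n + len(vocab))   # Laplace smoothing
        log_post += count * math.log10(p_w)
    print label, round(log_post, 4)
# the class with the larger (less negative) log score wins: Terrorism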

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
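For completeness, a hedged scikit-learn sketch of the multi-variate Bernoulli setup: CountVectorizer(binary=True) produces the 0/1 document vectors described above and BernoulliNB fits that event model. The toy documents and labels below are made up for illustration; they are not the slide's training table.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs   = ['kill bomb kidnap kidnap kidnap tv',      # toy terrorism-like doc
          'music movie tv music',                   # toy entertainment-like doc
          'kill bomb kidnap movie']
labels = ['terrorism', 'entertainment', 'terrorism']

vectorizer = CountVectorizer(binary=True)            # presence/absence features
X = vectorizer.fit_transform(docs)
clf = BernoulliNB().fit(X, labels)
print clf.predict(vectorizer.transform(['kill bomb tv']))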

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks 11.

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                    Positive               Negative
Predicted Positive  True Positive (TP)     False Positive (FP)
          Negative  False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall    = 562 / (562 + 225) = 0.71
Negative Recall    = 436 / (77 + 436)  = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining
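The evaluation numbers above can be recomputed from the confusion matrix with a small plain-Python sketch; the printed values agree with the worked examples up to rounding.

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436        # confusion matrix from the slides

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
error     = (FP + FN) / (TP + FP + FN + TN)
fallout   = FP / (FP + TN)
f1        = 2 * precision * recall / (precision + recall)

print "accuracy  %.2f" % accuracy
print "precision %.2f" % precision
print "recall    %.2f" % recall
print "error     %.2f" % error
print "fall-out  %.2f" % fallout
print "F1        %.2f" % f1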

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining
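A quick check of the MCC value from the same confusion matrix; a small sketch in plain Python:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)   # 0.55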

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
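scikit-learn can compute the ROC curve and AUC from true labels and predicted scores. A minimal hedged sketch; the labels and scores below are made-up illustrative values, not output from the classifiers built earlier.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]                        # hypothetical gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.8, 0.2, 0.7, 0.3]   # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC =", auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.show()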

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln."
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 21: Elements of Text Mining Part - I

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) \, P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) \, C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) \, P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) \, C(tea)}

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) \, C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) \, C(tea)}{\sqrt{C(strong, tea)}}

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, \ldots, c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, \ldots, d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier or a classifier function \gamma that maps documents to classes: \gamma : D \rightarrow C
A supervised learning algorithm \Gamma takes a training set T and emits a learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}
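
Read as code, \gamma is just an indicator over (class, document) pairs. A hypothetical Python sketch, not from the original slides; 'learned_classifier' is a stand-in for whatever function D -> C a learning algorithm produces.

def learned_classifier(document):
    # stub classifier, for illustration only
    return 'terrorism' if 'bomb' in document.lower() else 'entertainment'

def gamma(c_i, d_j):
    return 1 if learned_classifier(d_j) == c_i else 0

print(gamma('terrorism', 'A bomb scare was reported'))      # 1
print(gamma('entertainment', 'A bomb scare was reported'))  # 0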

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have
P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:
P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
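
The full set of smoothed estimates can be reproduced with a short script. This is a sketch rather than the original author's code; the counts are taken from the training table above. The Terrorism row reproduces the slide's numbers exactly (the Entertainment counts in the table sum to 14 rather than the 12 used on the slides, so those values come out slightly different).

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}

for c, docs in train.items():
    n_c = sum(sum(doc) for doc in docs)           # total word count in the class
    for i, w in enumerate(vocab):
        count_w = sum(doc[i] for doc in docs)     # frequency of w in the class
        p = (count_w + 1) / (n_c + len(vocab))    # Laplace-smoothed P(w|c)
        print('P(%s|%s) = %.4f' % (w, c, p))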

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt        2     1     2       0      0      1   ?

To find the posterior probability:

P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism) \times P(bomb|Terrorism) \times P(kidnap|Terrorism) \times P(music|Terrorism) \times P(movie|Terrorism) \times P(tv|Terrorism)

= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1

= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment) \times P(bomb|Entertainment) \times P(kidnap|Entertainment) \times P(music|Entertainment) \times P(movie|Entertainment) \times P(tv|Entertainment)

= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1

= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
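
The same computation in log space is a one-liner per class. Here is a sketch (assuming base-10 logarithms, as in the worked example above); summing logs avoids multiplying many tiny probabilities.

import math

# Laplace-smoothed P(w|class) from the slides, in the order
# kill, bomb, kidnap, music, movie, tv
p_terrorism     = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
prior = 0.5
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

def log_score(prior, cond_probs, counts):
    # log P(c) + sum_j count_j * log P(w_j|c)
    return math.log10(prior) + sum(n * math.log10(p)
                                   for n, p in zip(counts, cond_probs))

print('Terrorism:     %.4f' % log_score(prior, p_terrorism, test_doc))
print('Entertainment: %.4f' % log_score(prior, p_entertainment, test_doc))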

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t \in \{1, \ldots, |V|\}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_{it}, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
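
The binarization itself is a one-liner; a small sketch (not from the slides) converting the count table used earlier into this presence/absence representation:

counts = [[2, 1, 3, 0, 0, 1],
          [1, 1, 1, 0, 0, 0],
          [1, 1, 2, 0, 1, 0],
          [0, 1, 0, 2, 1, 1],
          [0, 0, 1, 1, 1, 0],
          [0, 0, 0, 2, 2, 2]]

# 1 if the word occurs at least once in the document, else 0
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print(row)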

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set \{(x_i, y_i)\}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in \{1, -1\} is the corresponding label, either (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
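
A minimal numeric sketch of the decision rule (the weight vector and bias below are made-up values, purely for illustration):

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias

def predict(x):
    # the sign of <w, x> + b decides on which side of the hyperplane x falls
    return +1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.5])))   # +1
print(predict(np.array([0.0, 1.0])))   # -1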

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way. If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84
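
These numbers are easy to verify with a few lines of Python. A small sketch (not from the slides) using the confusion-matrix counts from the table; the printed values agree with the slides up to rounding.

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)
pos_precision = TP / (TP + FP)
neg_precision = TN / (TN + FN)
pos_recall = TP / (TP + FN)
neg_recall = TN / (TN + FP)

print('Accuracy           : %.4f' % accuracy)
print('Positive precision : %.4f' % pos_precision)
print('Negative precision : %.4f' % neg_precision)
print('Positive recall    : %.4f' % pos_recall)
print('Negative recall    : %.4f' % neg_recall)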

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Error Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = \frac{FP}{FP + TN}

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score (Negative) = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
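
In practice these scores are rarely computed by hand; scikit-learn's metrics module covers most of them. A small sketch (y_true and y_pred are toy label lists, not data from the slides):

from sklearn.metrics import (confusion_matrix, precision_score,
    recall_score, f1_score, matthews_corrcoef)

# toy gold labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
print('Precision: %.2f' % precision_score(y_true, y_pred))
print('Recall   : %.2f' % recall_score(y_true, y_pred))
print('F1 score : %.2f' % f1_score(y_true, y_pred))
print('MCC      : %.2f' % matthews_corrcoef(y_true, y_pred))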

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
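
A small sketch with scikit-learn (the labels and scores below are made up, purely for illustration): roc_curve returns the false and true positive rates at each score threshold, and auc integrates the resulting curve.

from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC = %.2f' % auc(fpr, tpr))   # 0.75 for this toy example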

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008

Jaganadh G Elements of Text Mining


classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 23: Elements of Text Mining Part - I

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags

from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()

create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure] Word co-occurrence visualization from 100 positive movie reviews (words such as film, movie, story, like, one, chan, hong, czech, sverak, woody, dating, tango, martial, toy, zdenek, julie, first). The plot shows the top first word and its top four associated words; for each associated word, its own associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process, because the frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

each entry of the vectors refers to the count of
the corresponding entry in the dictionary a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) = 0 if t is 'football'
         1 if t is 'games'
         2 if t is 'john'
         3 if t is 'likes'
         4 if t is 'mary'
         5 if t is 'movies'
         6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n comes from the older scikit-learn API used in this deck;
# recent versions use ngram_range=(1, 1) instead
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')

docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')

bow = vectorizer.fit_transform(docs)

print vectorizer.vocabulary_

print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_index = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_index.keys()]
        for tokens in token_list]
    print vocab_index
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) x idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = sum over x in d of fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present (i.e. tf(t, d) != 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = sum over x in d of fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
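A minimal pure-Python sketch of these two definitions (the function names tf and fr are illustrative, not taken from the deck's code):

def fr(x, t):
    # indicator function: 1 if token x equals term t, else 0
    return 1 if x == t else 0

def tf(t, d):
    # term frequency: sum the indicator over all tokens of the document
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2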

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to

V(d1) = (0, 0, 1, 2, 1, 1, 1)

V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| x F}

where |D| is the cardinality of the document space:

M_{|D| x F} = [ 0 0 1 2 1 1 1 ]
              [ 1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_hat = v / ||v||_p, where v_hat is the unit (normalized) vector, v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue space). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

In general, for the p-norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be written more compactly as

||v||_p = ( sum for i = 1 to n of |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v_hat(d1) = v(d1) / ||v(d1)||_2

= (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

= (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

= (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_hat(d1) has L2-norm ||v_hat(d1)||_2 = 1.0
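The same normalization can be checked with a short NumPy sketch (the vector and the expected output come from the slide above; this is just an illustration):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
l2 = np.sqrt((v_d1 ** 2).sum())    # Euclidean (L2) norm, sqrt(8)
v_hat = v_d1 / l2                  # normalized vector
print v_hat                        # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print np.sqrt((v_hat ** 2).sum())  # 1.0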

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present (i.e. tf(t, d) != 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i in d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2/(1+1)) = 0.0

idf(games)    = log(2/(1+1)) = 0.0

idf(john)     = log(2/(1+2)) = -0.40546510810816444

idf(likes)    = log(2/(1+2)) = -0.40546510810816444

idf(mary)     = log(2/(1+1)) = 0.0

idf(movies)   = log(2/(1+1)) = 0.0

idf(watch)    = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} x M_idf

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

x

[ idf(t1)     0        0        0        0        0        0     ]
[    0     idf(t2)     0        0        0        0        0     ]
[    0        0     idf(t3)     0        0        0        0     ]
[    0        0        0     idf(t4)     0        0        0     ]
[    0        0        0        0     idf(t5)     0        0     ]
[    0        0        0        0        0     idf(t6)     0     ]
[    0        0        0        0        0        0     idf(t7)  ]

Jaganadh G Elements of Text Mining


TF-IDF weight

The product gives the tf-idf weight matrix:

[ tf(t1,d1)*idf(t1)  tf(t2,d1)*idf(t2)  tf(t3,d1)*idf(t3)  tf(t4,d1)*idf(t4)  tf(t5,d1)*idf(t5)  tf(t6,d1)*idf(t6)  tf(t7,d1)*idf(t7) ]
[ tf(t1,d2)*idf(t1)  tf(t2,d2)*idf(t2)  tf(t3,d2)*idf(t3)  tf(t4,d2)*idf(t4)  tf(t5,d2)*idf(t5)  tf(t6,d2)*idf(t6)  tf(t7,d2)*idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2   (each document row is divided by its L2 norm)
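Putting the last few slides together, here is a small NumPy sketch (illustrative only; the idf values are the ones computed in the "Finding IDF" slide) that builds the tf matrix for d1 and d2, applies the idf weights and L2-normalizes each row:

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],     # d1
               [1, 1, 1, 1, 0, 0, 1]],    # d2
              dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])
tfidf = tf * idf                           # same result as M_tf x diag(idf)
norms = np.sqrt((tfidf ** 2).sum(axis=1, keepdims=True))
tfidf_l2 = tfidf / norms                   # each row now has unit L2 norm
print tfidf_l2

Note that production implementations such as scikit-learn use a smoothed, non-negative variant of idf, so their numbers will differ from this toy calculation.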

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')

docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')

bow = vectorizer.fit_transform(docs)

freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")

tfd = tfidf.fit(freq_term_matrix)

print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)

print tf_idf_matrix.todense()

for word, index in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[index])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) is approximated by P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"

words = d1.lower().split()

ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]

bigrams = [" ".join(bigram) for bigram in ibigrams]

print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) is approximated by P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"

words = d1.lower().split()

itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]

trigrams = [" ".join(trigram) for trigram in itrigrams]

print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)

trigrams = ngrams(words, n=3)

print bigrams

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) which is approximated by log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) which is approximated by log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo
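For instance, applied to the running example sentence (an illustrative usage of the function above; output omitted):

words = "john likes to watch movies mary likes too".split()
scores = mutual_info(words)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print gram, score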

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = ( mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ) / sqrt( var(P(w_i, w_j)) + var(P(w_i)) var(P(w_j)) )

which is approximated by ( C(w_i, w_j) - (1/N) C(w_i) C(w_j) ) / sqrt( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents, or parts of documents, are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents <d, c>, where <d, c> belongs to D x C, using a learning algorithm we wish to learn a classifier (or classifier function) gamma that maps documents to classes: gamma : D -> C.
A supervised learning algorithm Gamma takes a training set T and emits the learned classification function gamma: Gamma(T) = gamma

gamma(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) x P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

P(raining | dark cloud) = P(dark cloud | raining) x P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) x P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = P(E1|H) x P(E2|H) x ... x P(En|H) x P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15

P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   (class to be predicted)

To find the posterior probability:

P(ci|W) = P(ci) x product for j = 1 to |V| of P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) x P(kill|Terrorism) x P(bomb|Terrorism) x P(kidnap|Terrorism) x P(music|Terrorism) x P(movie|Terrorism) x P(tv|Terrorism)

= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1

= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5

P(Entertainment|W) = P(Entertainment) x P(kill|Entertainment) x P(bomb|Entertainment) x P(kidnap|Entertainment) x P(music|Entertainment) x P(movie|Entertainment) x P(tv|Entertainment)

= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1

= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
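A small Python sketch of this log-space scoring, using the smoothed conditional probabilities and base-10 logarithms from the worked example (the numbers are taken from the slides, so this is an illustration rather than a general-purpose implementation):

import math

# P(w|class) after Laplace smoothing, from the worked example above
cond_prob = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(cls):
    # log P(c) + sum over words of count(w) * log P(w|c):
    # sums of logs instead of products of tiny probabilities, so no underflow
    score = math.log10(prior[cls])
    for word, count in test_doc.items():
        score += count * math.log10(cond_prob[cls][word])
    return score

for cls in prior:
    print cls, round(log_score(cls), 4)   # roughly -4.24 and -6.63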

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t in {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word. A runnable sketch using this representation follows below.
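If you want to try this event model in code, scikit-learn's BernoulliNB works directly on such a binary term matrix. The sketch below is illustrative: the matrix is typed in by hand to mirror the table above.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# columns: kill, bomb, kidnap, music, movie, tv  (presence/absence only)
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']

model = BernoulliNB(binarize=None)   # input is already 0/1
model.fit(X, y)
print model.predict([[1, 1, 1, 0, 0, 1]])   # -> ['Terrorism']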

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}, i = 1, ..., n, where x_i in R^p (input feature vector) and y_i in {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w in R^p (weight vector) and b in R (bias), such that

<w, x_i> + b > 0 for y_i = +1
<w, x_i> + b < 0 for y_i = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
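A tiny NumPy sketch of this decision rule (w, b and the points are made up for illustration; they are not learned from data here):

import numpy as np

w = np.array([1.0, -1.0])   # weight vector (assumed, not learned)
b = -0.5                    # bias (assumed)

def predict(x):
    # the sign of <w, x> + b decides which side of the hyperplane x falls on
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))   # +1 side
print predict(np.array([0.0, 1.0]))   # -1 side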

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, defines the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']

training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline

from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']

training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12

import nltk.classify.util

from nltk.classify import NaiveBayesClassifier

from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')

posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]

posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4

poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"

words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                    Positive               Negative
Predicted Positive  True Positive (TP)     False Positive (FP)
          Negative  False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436)

= 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Positive Precision = 562 / (562 + 77)

= 0.87

Negative Precision = 436 / (225 + 436)

= 0.65

Positive Recall = 562 / (562 + 225)

= 0.71

Negative Recall = 436 / (77 + 436)

= 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a

F1 Score = 2 x (precision x recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP x TN - FP x FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 x 436 - 77 x 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) )

= 0.55
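All of the metrics above can be reproduced from the four confusion-matrix cells. A short sketch using the numbers from these slides (variable names are illustrative):

import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / float(TP + FP + FN + TN)
precision   = TP / float(TP + FP)            # positive predictive value
recall      = TP / float(TP + FN)            # sensitivity / hit rate
f1          = 2 * precision * recall / (precision + recall)
fallout     = FP / float(FP + TN)            # false positive rate
npv         = TN / float(TN + FN)
specificity = TN / float(FP + TN)
fdr         = FP / float(FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))

print accuracy, precision, recall, f1
print fallout, npv, specificity, fdr, mcc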

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
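With scikit-learn, the ROC curve and AUC can be computed from true labels and predicted scores. A minimal, self-contained sketch (the labels and scores below are made up for illustration):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                        # actual classes
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # x = FPR, y = TPR
print auc(fpr, tpr)                                # area under the ROC curve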

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()

text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 24: Elements of Text Mining Part - I

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    # Remove common English stop words from a list of tokens
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(dictionary built after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    # tokenize() is the function defined on the Tokenization slide
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
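As a quick illustration, here is a minimal Python sketch of this indicator-style term frequency (the tokenized document d1 is the example used throughout):

def tf(term, doc_tokens):
    # tf(t, d): how many tokens x in d satisfy x == t
    return sum(1 for token in doc_tokens if token == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2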

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

Vd_n = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

Vd1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
Vd2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

Vd1 = (0, 0, 1, 2, 1, 1, 1)
Vd2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| x F}, where |D| is the cardinality of the document space:

M_{|D| x F} =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v⃗ / ||v⃗||_p, where v̂ is the unit (normalized) vector, v⃗ is the vector to be normalized, and ||v⃗||_p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of the vector: ||v⃗|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm: ||v⃗||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v⃗||_p = ( Σ_{i=1..n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v⃗d1 = (0, 0, 1, 2, 1, 1, 1)

v̂d1 = v⃗d1 / ||v⃗d1||_2
    = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
    = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
    = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
    = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂d1 has an L2 norm ||v̂d1||_2 = 1.0.
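The same calculation in a minimal Python sketch (plain Python, no external libraries):

import math

def l2_normalize(vec):
    # Divide each component by the Euclidean (L2) length of the vector
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

v_d1 = [0, 0, 1, 2, 1, 1, 1]
print l2_normalize(v_d1)
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]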

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) )

For example, log(2 / 1) = 0.69314718

idf(football) = log( 2 / (1 + 1) ) = 0.0
idf(games)    = log( 2 / (1 + 1) ) = 0.0
idf(john)     = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1 + 1) ) = 0.0
idf(movies)   = log( 2 / (1 + 1) ) = 0.0
idf(watch)    = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
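These values can be reproduced with a few lines of Python (a minimal sketch; the tokenized documents follow the running example, with stop words already removed):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    # Document frequency: number of documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1.0 + df))

for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print term, idf(term, docs)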

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
x
[ idf(t1) 0       0       0       0       0       0
  0       idf(t2) 0       0       0       0       0
  0       0       idf(t3) 0       0       0       0
  0       0       0       idf(t4) 0       0       0
  0       0       0       0       idf(t5) 0       0
  0       0       0       0       0       idf(t6) 0
  0       0       0       0       0       0       idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    # vocabulary_ maps each word to its column index in idf_
    print '%r => %r' % (word, tfd.idf_[idx])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    # Generate n-grams (as space-joined strings) from a list of words
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    # Pointwise mutual information for each bigram in the word list
    grams = ngrams(words, n=2)  # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo
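A quick usage sketch (a made-up toy sentence; it assumes the ngrams and mutual_info functions above are in scope):

words = "she likes strong tea he prefers strong coffee".lower().split()
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True):
    print gram, mi[gram]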

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    # t-score for each bigram in the word list
    grams = ngrams(words, n=2)  # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[w1] * wordcount[w2]) / math.sqrt(gramcount[gram])
    return tsc
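Usage mirrors the mutual information example (again a toy sentence, assuming ngrams and tscore are defined as above):

words = "John likes to watch movies Mary likes too".lower().split()
ts = tscore(words)
for gram, score in ts.items():
    print gram, score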

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv)  (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class

Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply the Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
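The same smoothed estimates in a minimal Python sketch (counts and totals taken from the training table and the slide above):

from __future__ import division

V = 6                 # vocabulary size
n_terrorism = 15      # total word count in the Terrorism class
n_entertainment = 12  # total word count in the Entertainment class, as used above

count_kill_T = 2 + 1 + 1   # occurrences of 'kill' in D0, D1, D2
count_kill_E = 0 + 0 + 0   # occurrences of 'kill' in D3, D4, D5

# Laplace (add-one) smoothing: P(w|c) = (count(w, c) + 1) / (n_c + |V|)
print (count_kill_T + 1) / (n_terrorism + V)       # 5/21 = 0.238...
print (count_kill_E + 1) / (n_entertainment + V)   # 1/18 = 0.0555...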

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1..|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict returns an array; take the first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict returns an array; take the first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Boolean (presence/absence) feature dictionary
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
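All of the metrics above can be reproduced from the example confusion matrix with a short Python sketch (plain Python; the numbers come from the table used throughout this section):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)    # 0.76
precision   = TP / (TP + FP)                     # 0.87
recall      = TP / (TP + FN)                     # 0.71
f1          = 2 * precision * recall / (precision + recall)  # 0.78
error_rate  = (FP + FN) / (TP + FP + FN + TN)    # 0.23
fallout     = FP / (FP + TN)                     # 0.15
npv         = TN / (TN + FN)                     # 0.65
specificity = TN / (FP + TN)                     # 0.84
fdr         = FP / (FP + TP)                     # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55

print accuracy, precision, recall, f1, mcc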

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
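A minimal sketch of computing an ROC curve and its AUC with scikit-learn (the labels and scores below are made-up toy values; roc_curve and auc come from sklearn.metrics):

from sklearn.metrics import roc_curve, auc

y_true   = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual labels
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print auc(fpr, tpr)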

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    # Collect named entities from all sentences in the text
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 25: Elements of Text Mining Part - I

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier or a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

$P(raining|dark\ cloud) = \frac{P(dark\ cloud|raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             2     1     3       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     2       0      1      0   Terrorism
D3             0     1     0       2      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       2      2      2   Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
6    E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
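The same Laplace estimate can be written as a small Python sketch (the word counts and ni values are taken from the slides above; the function and variable names are only illustrative):

from __future__ import division

V = 6  # vocabulary size
class_totals = {'Terrorism': 15, 'Entertainment': 12}  # ni values from the model table
kill_counts = {'Terrorism': 4, 'Entertainment': 0}     # raw counts of 'kill' per class

def laplace(word_count, cls):
    return (word_count + 1) / (class_totals[cls] + V)

print laplace(kill_counts['Terrorism'], 'Terrorism')          # 5/21 = 0.2380...
print laplace(kill_counts['Entertainment'], 'Entertainment')  # 1/18 = 0.0555...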

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
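The same log-space scoring can be sketched in a few lines of Python (base-10 logarithms and the Laplace-smoothed probabilities from the model table; the helper name is only illustrative):

import math

def log_score(prior, cond_probs, counts):
    # counts[i] = how many times the i-th vocabulary word occurs in the test document
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(cond_probs, counts))

test_counts = [2, 1, 2, 0, 0, 1]  # kill, bomb, kidnap, music, movie, tv
print log_score(0.5, [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952], test_counts)  # ~ -4.24
print log_score(0.5, [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111], test_counts)  # ~ -6.63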

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
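With scikit-learn the same presence/absence representation can be obtained by passing binary=True to the vectorizer (a minimal sketch; the two documents are made-up stand-ins for the training table):

from sklearn.feature_extraction.text import CountVectorizer

docs = ('kill bomb kidnap kidnap kidnap tv kill',  # illustrative Terrorism-style doc
        'music movie tv music music')              # illustrative Entertainment-style doc
vectorizer = CountVectorizer(analyzer='word', binary=True)  # 1 if the word occurs, else 0
print vectorizer.fit_transform(docs).toarray()
print vectorizer.vocabulary_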

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ Rp (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, exactly equidistant from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

           Actual
           Positive             Negative
Predicted
Positive   True Positive (TP)   False Positive (FP)
Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
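These values can be reproduced with a few lines of Python (a small sketch using the confusion-matrix counts above):

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436  # counts from the confusion matrix above

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.76
pos_precision = tp / (tp + fp)              # 0.87
neg_precision = tn / (fn + tn)              # 0.65
pos_recall = tp / (tp + fn)                 # 0.71
neg_recall = tn / (fp + tn)                 # 0.84
print accuracy, pos_precision, neg_precision, pos_recall, neg_recall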

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
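With scikit-learn, the ROC curve and AUC can be computed from the true labels and the classifier scores (a minimal sketch; y_true and y_score are small made-up arrays used only to show the calls):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels (illustrative)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # predicted scores (illustrative)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)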

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # tree.label() in newer NLTK versions
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 26: Elements of Text Mining Part - I

Word co-occurrence

     w1   w2   w3   ...  wn
w1   m11  m12  m13  ...  m1n
w2   m21  m22  m23  ...  m2n
w3   m31  m32  m33  ...  m3n
...
wn   mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure: word co-occurrence visualization from 100 positive movie reviews, with nodes such as "film", "movie", "story", "like" and "chan". The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn.]

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
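An equivalent filter can be built from NLTK's ready-made stop word list instead of maintaining the list by hand (a small sketch; it assumes the NLTK stopwords corpus has been downloaded):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
words = "john likes to watch movies mary likes too".split()
print [word for word in words if word not in stops]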

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

$VI(t) = \begin{cases} 0 & \text{if } t \text{ is 'football'} \\ 1 & \text{if } t \text{ is 'games'} \\ 2 & \text{if } t \text{ is 'john'} \\ 3 & \text{if } t \text{ is 'likes'} \\ 4 & \text{if } t \text{ is 'mary'} \\ 5 & \text{if } t \text{ is 'movies'} \\ 6 & \text{if } t \text{ is 'watch'} \end{cases}$

Jaganadh G Elements of Text Mining

Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')  # older scikit-learn; newer versions use ngram_range=(1, 1)
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from the Tokenization slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\textrm{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd' and $tf(t, d)$ is how many times the term 't' is present in 'd':

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = 1$ if $x = t$, and $0$ otherwise.

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present and $tf(t, d) \neq 0$, and $|D|$ is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple indicator function:

$fr(x, t) = 1$ if $x = t$, and $0$ otherwise.

Example: $tf('john', d1) = 1$

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

$\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors:

$\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$

$\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to:

$\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix $M_{|D| \times F}$, where $|D|$ is the cardinality of the document space:

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit (normalized) vector, $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of the vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

The general p-norm is $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$,

which can be simplified as

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |\vec{v}_i|^p \right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$

$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)$

$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}_{d_1}$ has an L2 norm of $\|\hat{v}_{d_1}\|_2 = 1.0$.
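The same normalization takes a couple of lines with numpy (a small sketch; numpy is not used elsewhere in these slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)  # divide by the L2 norm, sqrt(8)
print v_hat                          # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print np.linalg.norm(v_hat)          # 1.0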

Jaganadh G Elements of Text Mining

IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present and $tf(t, d) \neq 0$.

Jaganadh G Elements of Text Mining

Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
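A small Python sketch reproducing the idf values above (natural logarithm, as in the slide; the token lists are the two toy documents with stop words removed):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return math.log(len(docs) / (1 + df))

for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print term, idf(term, docs)  # john and likes get -0.4054..., the rest 0.0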

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\textrm{-}idf} = \frac{M_{tf\textrm{-}idf}}{\|M_{tf\textrm{-}idf}\|_2}$

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')  # older scikit-learn; newer versions use ngram_range=(1, 1)
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words.

Bigrams are sequences of two words.

Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i|w_1, w_2, \ldots, w_{i-1}) \approx P(w_i|w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i|w_1, w_2, \ldots, w_{i-1}) \approx P(w_i|w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
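The metrics above do not have to be computed by hand for every experiment. Below is a minimal sketch of how they can be obtained with scikit-learn's metrics module; the y_true and y_pred arrays are made-up labels for illustration, not the movie review counts from the worked example.

from sklearn.metrics import (confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score, matthews_corrcoef)

# made-up gold labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print confusion_matrix(y_true, y_pred)
print "Accuracy :", accuracy_score(y_true, y_pred)
print "Precision:", precision_score(y_true, y_pred)
print "Recall   :", recall_score(y_true, y_pred)
print "F1 Score :", f1_score(y_true, y_pred)
print "MCC      :", matthews_corrcoef(y_true, y_pred)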

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining
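A quick way to check the result is to look up a pair of words from the toy corpus above; the counts below are what the function produces for it (a usage sketch, not part of the original slide).

# every pair inside the same document was seen once
print ccm['w1']['w2']   # 1
print ccm['w2']['w1']   # 1 (the matrix is symmetric)
print ccm['w1']['w4']   # 0 ('w1' and 'w4' never co-occur)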

Word co-occurrence

[Figure: Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn. Words appearing in the plot include: film, movie, like, story, one, first, chan, hong, martial, toy, woody, dating, julie, czech, sverak, zdenek, tango.]

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining
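A short usage example for the filter above; the input tokens are assumed to come from a whitespace split or the tokenize() function shown earlier. NLTK also ships a ready-made English stop word list (nltk.corpus.stopwords) which can be used instead of a hand-written list.

words = "john likes to watch movies and mary likes movies too".split()
print stop_filter(words)
# ['john', 'likes', 'watch', 'movies', 'mary', 'likes', 'movies']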

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python. Just a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from the earlier slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the number of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
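The term-frequency definition above can be spelled out in a few lines of plain Python, using the two example documents and the vocabulary from the Bag of Words slides; tf() below is an illustrative helper, not code from the original deck.

d1 = "john likes to watch movies mary likes too".split()
d2 = "john also likes to watch football games".split()
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def tf(term, doc):
    # tf(t, d) = number of times 't' occurs in 'd'
    return doc.count(term)

print [tf(t, d1) for t in vocab]   # [0, 0, 1, 2, 1, 1, 1]
print [tf(t, d2) for t in vocab]   # [1, 1, 1, 1, 0, 0, 1]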

Document Vector

To create a document vector space

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V(d1) = (0, 0, 1, 2, 1, 1, 1)

V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M_|D|×F

where |D| is the cardinality of the document space

M_|D|×F = [ 0 0 1 2 1 1 1
            1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v / ||~v||_p, where v is the unit vector (the normalized vector), ~v is the vector going to be normalized, and ||~v||_p is the norm (magnitude or length) of the vector ~v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: ~v = (v1, v2, v3, ..., vn)
Length of vector: ||~v|| = √(v1^2 + v2^2 + v3^2 + ... + vn^2)

With norm: ||~v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||~v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

~v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1 = ~v_d1 / ||~v_d1||_2

v_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

v_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1 has L2-norm ||v_d1||_2 = 1.0

Jaganadh G Elements of Text Mining
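The same normalization can be checked with a few lines of NumPy; this is only a verification sketch, NumPy is not used elsewhere in the slides.

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm, sqrt(8)

print v_norm                            # [0. 0. 0.3535... 0.7071... 0.3535... ...]
print np.linalg.norm(v_norm)            # 1.0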

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0)

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) )    (for example, log(2/1) = 0.69314718)

idf(football) = log( 2 / (1 + 1) ) = 0.0

idf(games) = log( 2 / (1 + 1) ) = 0.0

idf(john) = log( 2 / (1 + 2) ) = −0.40546510810816444

idf(likes) = log( 2 / (1 + 2) ) = −0.40546510810816444

idf(mary) = log( 2 / (1 + 1) ) = 0.0

idf(movies) = log( 2 / (1 + 1) ) = 0.0

idf(watch) = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
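The idf values above can be reproduced with a few lines of Python; math.log is the natural logarithm, which matches the numbers on the slide, and the doc_count dictionary is just the document frequency of each vocabulary term in d1 and d2 (an illustrative sketch, not code from the original deck).

import math

D = 2  # number of documents
# document frequency of each vocabulary term in (d1, d2)
doc_count = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
             'mary': 1, 'movies': 1, 'watch': 1}

for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    idf = math.log(float(D) / (1 + doc_count[term]))
    print term, idf
# john and likes get log(2/3) = -0.405465..., every other term gets log(2/2) = 0.0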

TF-IDF weight

Finding TF-IDF weight

M_|D|×F × M_idf =

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) ]
[ tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]

        ×

[ idf(t1)    0       0       0       0       0       0    ]
[   0     idf(t2)    0       0       0       0       0    ]
[   0        0    idf(t3)    0       0       0       0    ]
[   0        0       0    idf(t4)    0       0       0    ]
[   0        0       0       0    idf(t5)    0       0    ]
[   0        0       0       0       0    idf(t6)    0    ]
[   0        0       0       0       0       0    idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# map each vocabulary term (term -> column index) to its idf weight
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
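As a side note, scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step. The parameter names below follow current scikit-learn releases, so treat this as an approximate sketch if you are on an older version.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games')
tfidf_vec = TfidfVectorizer(stop_words='english', norm='l2', use_idf=True)
tf_idf_matrix = tfidf_vec.fit_transform(docs)
print tfidf_vec.get_feature_names()
print tf_idf_matrix.todense()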

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−2}, w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )

            ≈ ( C(w_i, w_j) − (1/N) C(w_i) C(w_j) ) / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining


t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|   C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6    T   0.5     15   0.238095238   0.19047619   0.33333333     0.047619048   0.095238095   0.095238095
      E   0.5     12   0.05555556    0.11111111   0.11111111     0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill   bomb   kidnap   music   movie   tv   C
Dt         2      1      2        0       0       1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= –0.3010 – 1.2468 – 0.7203 – 0.9543 + 0 + 0 – 1.0213
= –4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= –0.3010 – 2.511 – 0.9542 – 1.9085 + 0 + 0 – 0.9542
= –6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
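The whole worked example can be reproduced with a short, self-contained Python sketch; the counts below are copied from the training table, and the log-space scoring mirrors the underflow discussion. This is an illustrative re-implementation, not code from the original deck.

import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
# word counts per class, taken from the training table
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 1},
}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    n = sum(counts[cls].values())   # 15 for Terrorism, 12 for Entertainment
    score = math.log10(priors[cls])
    for word, freq in test_doc.items():
        p = (counts[cls][word] + 1.0) / (n + len(vocab))   # Laplace smoothing
        score += freq * math.log10(p)
    return score

for cls in priors:
    print cls, log_posterior(cls)
# Terrorism scores higher (about -4.24 vs -6.63), so the test document is labeled Terrorism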

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining
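For what it is worth, scikit-learn exposes both event models: BernoulliNB corresponds to the multi-variate Bernoulli model and MultinomialNB to the multinomial model. A minimal sketch on the two toy documents follows; the feature matrix is the bag-of-words count matrix from the earlier slides, and the class labels are made up for illustration.

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# rows = documents d1 and d2, columns = vocabulary counts
X = [[0, 0, 1, 2, 1, 1, 1],
     [1, 1, 1, 1, 0, 0, 1]]
y = ['class_a', 'class_b']   # made-up labels

bern = BernoulliNB()    # counts are binarized to presence/absence internally
mult = MultinomialNB()  # works directly on the counts
bern.fit(X, y)
mult.fit(X, y)
print bern.predict([[0, 0, 1, 1, 1, 0, 1]])
print mult.predict([[0, 0, 1, 1, 1, 0, 1]])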

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, whether (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
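A single train/predict run says little about quality; a quick way to estimate the accuracy of the pipeline above is k-fold cross-validation. A rough sketch follows, reusing the classifier and training_data objects from the previous slide; note that the import path is sklearn.model_selection in recent scikit-learn releases, while older releases used sklearn.cross_validation.

from sklearn.model_selection import cross_val_score

# 'classifier' and 'training_data' are the pipeline and corpus loaded above
scores = cross_val_score(classifier, training_data.data,
                         training_data.target, cv=5)
print scores
print "Mean accuracy:", scores.mean()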

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 28: Elements of Text Mining Part - I

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining
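
If NLTK is available, the same n-grams can also be produced with its helper function nltk.util.ngrams (shown here as an alternative sketch, not something the slides rely on):

from nltk.util import ngrams as nltk_ngrams

words = "John likes to watch movies Mary likes too".lower().split()
print [' '.join(g) for g in nltk_ngrams(words, 2)]   # bigrams
print [' '.join(g) for g in nltk_ngrams(words, 3)]   # trigrams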

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) C(tea)}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining
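
A quick usage sketch, reusing the example sentence from the earlier slides (the expected value for 'john likes' follows directly from the formula: log2(8 x 1 / (1 x 2)) = 2.0):

words = "John likes to watch movies Mary likes too".lower().split()
minfo = mutual_info(words)
print minfo['john likes']   # 2.0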

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining
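
And a matching usage sketch for the t-score (for 'john likes': (1 - (1/8) x 1 x 2) / sqrt(1) = 0.75):

words = "John likes to watch movies Mary likes too".lower().split()
tsc = tscore(words)
print tsc['john likes']   # 0.75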

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(\text{raining} | \text{dark cloud}) = \frac{P(\text{dark cloud} | \text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$
For multiple evidences:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naïve Bayes Model for the training set will be like:

|V| | C | P(Ci) | ni | P(kill|Ci) | P(bomb|Ci) | P(kidnap|Ci) | P(music|Ci) | P(movie|Ci) | P(tv|Ci)
6 | T | 0.5 | 15 | 0.238095238 | 0.19047619 | 0.33333333 | 0.047619048 | 0.095238095 | 0.095238095
6 | E | 0.5 | 12 | 0.055555556 | 0.11111111 | 0.11111111 | 0.33333333 | 0.27777778 | 0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15

P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
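
A small sketch of the same Laplace-smoothed estimates in Python (the per-class word counts are taken from the training table above; the variable and function names are mine):

from __future__ import division

counts = {'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
          'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 3}}
V = 6  # vocabulary size

def cond_prob(word, cls):
    n = sum(counts[cls].values())          # 15 for Terrorism, 12 for Entertainment
    return (counts[cls][word] + 1) / (n + V)

print cond_prob('kill', 'Terrorism')       # 5/21 = 0.2380...
print cond_prob('kill', 'Entertainment')   # 1/18 = 0.0555...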

Testing the NB model

Our test document is:

Test Docs | kill | bomb | kidnap | music | movie | tv | C
Dt | 2 | 1 | 2 | 0 | 0 | 1 |

To find the posterior probability:

$P(c_i | W) = P(c_i) \times \prod_{j=1}^{V} P(w_j | c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability scores assigned to the test document are very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
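
A minimal log-space scoring sketch in Python, using base-10 logarithms as on the slide (the smoothed probabilities are copied from the model table; the variable and function names are mine):

import math

p_terror = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
            'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entert = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
            'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(prior, cond_probs, doc):
    # sum of logs instead of product of probabilities, to avoid underflow
    score = math.log10(prior)
    for word, freq in doc.items():
        score += freq * math.log10(cond_probs[word])
    return score

print log_score(0.5, p_terror, test_doc)   # about -4.24
print log_score(0.5, p_entert, test_doc)   # about -6.63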

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
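
A toy sketch of the linear decision rule described above, sign(⟨w, x⟩ + b), with NumPy; the weight vector and bias below are made-up values, only to illustrate the prediction step, not something learned from data:

import numpy as np

w = np.array([0.4, -0.2, 0.1])   # hypothetical learned weight vector
b = -0.05                        # hypothetical learned bias

def predict(x):
    # returns +1 or -1 depending on which side of the hyperplane x falls
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.5, 0.2]))   # +1 for this made-up point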

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                    Positive                Negative
Predicted
  Positive    True Positive (TP)     False Positive (FP)
  Negative    False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining
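
The evaluation measures discussed on the next few slides all derive from these four cells. A small sketch, using the counts from the worked example that follows (the layout and names are mine):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)                   # positive predictive value
recall    = TP / (TP + FN)                   # sensitivity / hit rate
fallout   = FP / (FP + TN)                   # false positive rate
f1        = 2 * precision * recall / (precision + recall)
print accuracy, precision, recall, fallout, f1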

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

$Precision = \frac{TP}{TP + FP}$

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

$Recall = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP + TN}$

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

$Fall\text{-}out = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

$F_1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 29: Elements of Text Mining Part - I

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

each entry of the vectors refers to the count of

the corresponding entry in the dictionary

a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

$VI(t) = \begin{cases} 0 & \text{if } t \text{ is 'football'} \\ 1 & \text{if } t \text{ is 'games'} \\ 2 & \text{if } t \text{ is 'john'} \\ 3 & \text{if } t \text{ is 'likes'} \\ 4 & \text{if } t \text{ is 'mary'} \\ 5 & \text{if } t \text{ is 'movies'} \\ 6 & \text{if } t \text{ is 'watch'} \end{cases}$

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\text{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd';
$tf(t, d)$: how many times the term 't' is present in 'd'

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents where 't' is present and $tf(t, d) \neq 0$;
$|D|$ is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple function:

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Example: $tf('john', d_1) = 1$

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

$\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors:

$\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$

$\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to
$\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

$M_{|D| \times F}$

where $|D|$ is the cardinality of the document space

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue space). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector $\vec{v}$: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

With norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$

It can be simplified as

$\|\vec{v}\|_p = \left(\sum_{i=1}^{n} |v_i|^p\right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}_{d1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d1} = \frac{\vec{v}_{d1}}{\|\vec{v}_{d1}\|_2}$

$\hat{v}_{d1} = \frac{(0,0,1,2,1,1,1)}{\sqrt{0^2+0^2+1^2+2^2+1^2+1^2+1^2}}$

$\hat{v}_{d1} = \frac{(0,0,1,2,1,1,1)}{\sqrt{8}}$

$\hat{v}_{d1} = \left(\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}\right)$

$\hat{v}_{d1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$
Now our normalized vector $\hat{v}_{d1}$ has an L2 norm $\|\hat{v}_{d1}\|_2 = 1.0$

Jaganadh G Elements of Text Mining

IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents where 't' is present and $tf(t, d) \neq 0$

Jaganadh G Elements of Text Mining

Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

idf(football) = log 2/(1+1) = 0.0

idf(games) = log 2/(1+1) = 0.0

idf(john) = log 2/(1+2) = -0.40546510810816444

idf(likes) = log 2/(1+2) = -0.40546510810816444

idf(mary) = log 2/(1+1) = 0.0

idf(movies) = log 2/(1+1) = 0.0

idf(watch) = log 2/(1+1) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining


TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & tf(t_3, d_1) \times idf(t_3) & tf(t_4, d_1) \times idf(t_4) & tf(t_5, d_1) \times idf(t_5) & tf(t_6, d_1) \times idf(t_6) & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & tf(t_3, d_2) \times idf(t_3) & tf(t_4, d_2) \times idf(t_4) & tf(t_5, d_2) \times idf(t_5) & tf(t_6, d_2) \times idf(t_6) & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) C(tea)}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(\text{raining} | \text{dark cloud}) = \frac{P(\text{dark cloud} | \text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$
For multiple evidences:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naïve Bayes Model for the training set will be like:

|V| | C | P(Ci) | ni | P(kill|Ci) | P(bomb|Ci) | P(kidnap|Ci) | P(music|Ci) | P(movie|Ci) | P(tv|Ci)
6 | T | 0.5 | 15 | 0.238095238 | 0.19047619 | 0.33333333 | 0.047619048 | 0.095238095 | 0.095238095
6 | E | 0.5 | 12 | 0.055555556 | 0.11111111 | 0.11111111 | 0.33333333 | 0.27777778 | 0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15

P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs | kill | bomb | kidnap | music | movie | tv | C
Dt | 2 | 1 | 2 | 0 | 0 | 1 |

To find the posterior probability:

$P(c_i | W) = P(c_i) \times \prod_{j=1}^{V} P(w_j | c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability scores assigned to the test document are very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive                Negative
Predicted   Positive    True Positive (TP)      False Positive (FP)
            Negative    False Negative (FN)     True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
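These figures can be verified directly from the confusion matrix counts; a small sketch, assuming the TP/FP/FN/TN values from the table above.

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

accuracy = (tp + tn) / (tp + fp + fn + tn)
pos_precision = tp / (tp + fp)
neg_precision = tn / (tn + fn)
pos_recall = tp / (tp + fn)
neg_recall = tn / (tn + fp)

# matches the values worked out above (up to rounding)
print("accuracy %.2f" % accuracy)
print("positive precision %.2f recall %.2f" % (pos_precision, pos_recall))
print("negative precision %.2f recall %.2f" % (neg_precision, neg_recall))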

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
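In practice these metrics are rarely computed by hand; a short sketch, assuming scikit-learn's metrics module and hypothetical label/score arrays y_true, y_pred and y_score.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # hypothetical gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # hypothetical predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]    # hypothetical positive-class scores

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))   # area under the ROC curve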

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 30: Elements of Text Mining Part - I

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
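A quick usage sketch, assuming the tokenize function defined on the earlier slide is in scope:

doc = "John likes to watch movies Mary likes too"
words = tokenize(doc)
print(stop_filter(words))   # stop words such as 'to' and 'too' are dropped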

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by looking only at the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1),
                             stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) = how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present (tf(t, d) ≠ 0),
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
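The same count can be written as a couple of lines of Python; a small sketch, assuming documents are plain lists of lower-cased tokens.

def tf(term, doc):
    # term frequency: how many tokens x in doc are equal to the term
    return sum(1 for x in doc if x == term)

d1 = "john likes to watch movies mary likes too".split()
print(tf('john', d1))    # 1
print(tf('likes', d1))   # 2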

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V_dn = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )

To represent 'd1' and 'd2' as vectors:

V_d1 = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V_d2 = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1 ]
            [ 1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||_p, where v̂ is the unit vector (the normalized vector), v is the vector going to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = sqrt( v1^2 + v2^2 + v3^2 + ... + vn^2 )
With the general norm: ||v||_p = ( |v1|^p + |v2|^p + |v3|^p + ... + |vn|^p )^(1/p)
It can be simplified as:
||v||_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

V_d1 = (0, 0, 1, 2, 1, 1, 1)
v̂_d1 = V_d1 / ||V_d1||_2
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt( 0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 )
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
v̂_d1 = ( 0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8) )
v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂_d1 has an L2 norm ||v̂_d1||_2 = 1.0
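The same normalization in a few lines of Python, a sketch using only the math module:

import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

print(l2_normalize([0, 0, 1, 2, 1, 1, 1]))
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]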

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present (tf(t, d) ≠ 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); for example log(2/1) = 0.69314718

idf(football) = log( 2 / (1 + 1) ) = 0.0
idf(games)    = log( 2 / (1 + 1) ) = 0.0
idf(john)     = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1 + 1) ) = 0.0
idf(movies)   = log( 2 / (1 + 1) ) = 0.0
idf(watch)    = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
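A small sketch that reproduces these idf values for the two-document example, assuming the documents are given as stop-word-filtered token lists:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)   # document frequency
    return math.log(len(docs) / (1.0 + df))

for term in vocab:
    print("%s %s" % (term, idf(term, docs)))
# 'john' and 'likes' occur in both documents: log(2/3) = -0.405...
# every other term occurs in one document:   log(2/2) = 0.0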

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1) ]
[ tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]

multiplied by the diagonal idf matrix

diag( idf(t1), idf(t2), idf(t3), idf(t4), idf(t5), idf(t6), idf(t7) )

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1),
                             stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm='l2')
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.

P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will look like:

|V| = 6

C   P(Ci)  ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5    15   0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E   0.5    12   0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class, for example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
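A short sketch of the same Laplace (add-one) estimate as a function; the counts are the ones from the toy training table above.

def laplace_smoothed(word_count_in_class, class_word_total, vocab_size):
    # (count + 1) / (class total + |V|)
    return (word_count_in_class + 1.0) / (class_word_total + vocab_size)

V = 6
print(laplace_smoothed(4, 15, V))   # P(kill|Terrorism)     = 5/21 = 0.2380...
print(laplace_smoothed(0, 12, V))   # P(kill|Entertainment) = 1/18 = 0.0555...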

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill   bomb   kidnap   music   movie   tv   C
Dt           2      1       2        0       0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
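A minimal sketch of the same log-space scoring in Python, using the smoothed probabilities and test-document counts from the worked example (base-10 logs, as above):

import math

# smoothed P(w|class), in the order kill, bomb, kidnap, music, movie, tv
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_doc = [2, 1, 2, 0, 0, 1]   # word counts in the test document
prior = 0.5

def log_score(prior, probs, counts):
    return math.log10(prior) + sum(c * math.log10(p) for p, c in zip(probs, counts))

print(log_score(prior, p_terrorism, test_doc))      # about -4.24 -> Terrorism
print(log_score(prior, p_entertainment, test_doc))  # about -6.63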

The section on the Naive Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word
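Binarizing the original count table is a one-liner per row; a small sketch with the six training documents:

count_rows = [[2, 1, 3, 0, 0, 1],   # D0
              [1, 1, 1, 0, 0, 0],   # D1
              [1, 1, 2, 0, 1, 0],   # D2
              [0, 1, 0, 2, 1, 1],   # D3
              [0, 0, 1, 1, 1, 0],   # D4
              [0, 0, 0, 2, 2, 2]]   # D5

binary_rows = [[1 if c > 0 else 0 for c in row] for row in count_rows]
for row in binary_rows:
    print(row)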

Jaganadh G Elements of Text Mining


Page 31: Elements of Text Mining Part - I

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i)C(w_j)}$

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong)C(tea)}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[gram.split()[0]] *
             wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] *
            wordcount[gram.split()[1]]) / math.sqrt(gramcount[gram])
    return tsc
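A quick usage sketch for the two association measures above (the input sentence here is an invented example, and it reuses the ngrams helper defined earlier):

words = "strong tea is strong and strong tea is good".lower().split()
print mutual_info(words)   # mutual information score for each bigram, e.g. 'strong tea'
print tscore(words)        # t-score for each bigram, e.g. 'strong tea'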

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set T of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set T and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$
For multiple evidences:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V| C P(Ci) ni P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci) P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6   T 0.5   15 0.238095238  0.19047619  0.33333333   0.047619048  0.095238095  0.095238095
    E 0.5   12 0.05555556   0.11111111  0.11111111   0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
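A quick check of the smoothed estimates in Python (a throwaway sketch; the counts are taken from the training table above):

V = 6
print (2 + 1 + 1 + 1) / float(15 + V)   # P(kill|Terrorism)     = 0.2380...
print (0 + 0 + 0 + 1) / float(12 + V)   # P(kill|Entertainment) = 0.0555...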

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.
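These two scores are easy to verify in Python. A small sketch using the rounded intermediate values shown above:

p_T = 0.5 * 0.0566 * 0.1904 * 0.1110 * 1 * 1 * 0.0952
p_E = 0.5 * 0.0030 * 0.1111 * 0.0123 * 1 * 1 * 0.1111
print p_T   # ~5.7e-05
print p_E   # ~2.27e-07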

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
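The log trick from the underflow discussion above, written out in Python (a minimal sketch; log10 is used so the figures match the base-10 logs in the slide):

from math import log10

log_p_T = log10(0.5) + 2*log10(0.2380) + log10(0.1904) + 2*log10(0.3333) + log10(0.0952)
log_p_E = log10(0.5) + 2*log10(0.0555) + log10(0.1111) + 2*log10(0.1111) + log10(0.1111)
print log_p_T   # ~ -4.24
print log_p_E   # ~ -6.63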

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.
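A tiny sketch of this binarization step (the count vectors below are copied from the frequency table of the earlier multinomial example):

counts = {'D0': [2, 1, 3, 0, 0, 1],
          'D1': [1, 1, 1, 0, 0, 0]}
binary = dict((doc, [1 if c > 0 else 0 for c in vec])
              for doc, vec in counts.items())
print binary   # D0 -> [1, 1, 1, 0, 0, 1], D1 -> [1, 1, 1, 0, 0, 0]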

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in R^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in R^p$ (weight vector) and $b \in R$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP+FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP+FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$Error\ Rate = \frac{FP+FN}{TP+FP+FN+TN}$

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP+TN}$

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP+FP}$

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN+FN}$

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP+TN}$

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP+TP}$

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
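All of the evaluation figures above can be reproduced from the four confusion matrix cells; the values printed by this small plain-Python sketch match the slides up to rounding:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
print (TP + TN) / (TP + FP + FN + TN)        # accuracy
print TP / (TP + FP), TP / (TP + FN)         # positive precision, positive recall
print (FP + FN) / (TP + FP + FN + TN)        # error rate
print FP / (FP + TN)                         # fall-out (false positive rate)
print TN / (TN + FN), TN / (FP + TN)         # NPV, specificity
print FP / (FP + TP)                         # false discovery rate
print (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # MCC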

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 32: Elements of Text Mining Part - I

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84
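These figures are easy to reproduce with a few lines of Python. A minimal sketch using the counts from the table above:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
print (TP + TN) / (TP + FP + FN + TN)  # accuracy, 0.76
print TP / (TP + FP)                   # positive precision, 0.87
print TN / (TN + FN)                   # negative precision, 0.65
print TP / (TP + FN)                   # positive recall, 0.71
print TN / (TN + FP)                   # negative recall, 0.84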

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of items that were classified incorrectly:

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Error Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = \frac{FP}{FP + TN}

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73
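Continuing the small sketch from the precision and recall slide, with the same values:

pos_p, pos_r = 0.87, 0.71
neg_p, neg_r = 0.65, 0.84
print 2 * pos_p * pos_r / (pos_p + pos_r)  # F1 positive, 0.78
print 2 * neg_p * neg_r / (neg_p + neg_r)  # F1 negative, 0.73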

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives:

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
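The same confusion-matrix counts give the MCC in a couple of lines of Python. A minimal sketch:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc  # roughly 0.55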

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
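scikit-learn can compute the ROC points and the AUC from gold labels and classifier scores. A minimal sketch with hypothetical labels and decision scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # hypothetical gold labels
y_score = [0.9, 0.4, 0.2, 0.8, 0.3, 0.6, 0.7, 0.1]  # hypothetical decision scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)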

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln."
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python topia.termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln."
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 34: Elements of Text Mining Part - I

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier or a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)
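As a toy illustration of the Γ and γ notation above (not from the original slides), the sketch below defines a trivial learning algorithm Γ that memorizes keyword-to-class associations from a labeled training set T and returns a classifier function γ; the data and helper names are made up for illustration only.

def gamma_learner(training_set):              # a toy Gamma : T -> gamma
    keyword_class = {}
    for doc, label in training_set:
        for word in doc.lower().split():
            keyword_class[word] = label       # remember the class each keyword was seen with
    def gamma(doc):                           # the learned gamma : D -> C
        votes = [keyword_class[w] for w in doc.lower().split() if w in keyword_class]
        return max(set(votes), key=votes.count) if votes else None
    return gamma

T = [("bomb blast kills many", "terrorism"),
     ("new movie music launch", "entertainment")]
gamma = gamma_learner(T)
print(gamma("music concert tonight"))         # entertainment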

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
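A small numeric sketch (not from the slides) of the independence form of Bayes' rule above; all probability values below are made-up placeholders, only the arithmetic is the point.

p_h = 0.2                        # P(H): prior probability of raining (made up)
p_e_given_h = [0.9, 0.7]         # P(E1|H), P(E2|H): e.g. dark cloud, high humidity (made up)
p_e = [0.3, 0.5]                 # P(E1), P(E2): marginal evidence probabilities (made up)

numerator = p_h
for p in p_e_given_h:
    numerator *= p               # P(E1|H) x P(E2|H) x ... x P(H)

denominator = 1.0
for p in p_e:
    denominator *= p             # P(E1, E2, ...), also factored under independence

print(numerator / denominator)   # P(H | E1, E2) is about 0.84 for these numbers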

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V| C P(Ci) ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6   T 0.5   15  0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
6   E 0.5   12  0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
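The smoothed estimates above are easy to reproduce; a minimal sketch in Python, where class_counts simply restates the aggregated per-class word counts from the model slide.

vocab = ["kill", "bomb", "kidnap", "music", "movie", "tv"]
class_counts = {
    "Terrorism":     {"kill": 4, "bomb": 3, "kidnap": 6, "music": 0, "movie": 1, "tv": 1},
    "Entertainment": {"kill": 0, "bomb": 1, "kidnap": 1, "music": 5, "movie": 4, "tv": 1},
}

def p_word_given_class(word, cls):
    counts = class_counts[cls]
    n_c = sum(counts.values())                        # 15 for Terrorism, 12 for Entertainment
    return (counts[word] + 1.0) / (n_c + len(vocab))  # Laplace (add-one) estimation

print(p_word_given_class("kill", "Terrorism"))        # 5/21 = 0.238...
print(p_word_given_class("kill", "Entertainment"))    # 1/18 = 0.055...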

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1

To find the posterior probability:

P(ci|W) = P(ci) × ∏ j=1..V P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952 = 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111 = 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.
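The two products above can be checked with a few lines of Python; a sketch that recomputes both unnormalized posterior scores from the smoothed probabilities on the model slide.

test_counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}
prior = {"Terrorism": 0.5, "Entertainment": 0.5}
cond_prob = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}

for cls in ("Terrorism", "Entertainment"):
    score = prior[cls]
    for word, freq in test_counts.items():
        score *= cond_prob[cls][word] ** freq    # P(w|c) raised to the word's frequency
    print("%s %e" % (cls, score))                # about 5.7e-05 vs 2.3e-07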

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
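A log-space version of the same scoring (a sketch, not the author's code), summing base-10 logarithms instead of multiplying raw probabilities.

import math

test_counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}
cond_prob = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}

for cls in ("Terrorism", "Entertainment"):
    log_score = math.log10(0.5)                               # log of the prior P(c)
    for word, freq in test_counts.items():
        log_score += freq * math.log10(cond_prob[cls][word])  # freq * log P(w|c)
    print("%s %.4f" % (cls, log_score))                       # about -4.24 vs -6.63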

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.
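A tiny sketch of that count-to-presence conversion; doc_counts below is just the D0 row of the earlier frequency table.

doc_counts = {"kill": 2, "bomb": 1, "kidnap": 3, "music": 0, "movie": 0, "tv": 1}   # D0
bernoulli_doc = dict((word, 1 if count > 0 else 0)
                     for word, count in doc_counts.items())
print(bernoulli_doc)   # each word mapped to 1 (present) or 0 (absent)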

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
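A minimal sketch of the linear decision rule f(x) = ⟨w, x⟩ + b described above; the weight vector and bias are made-up numbers, not a trained model.

w = [0.4, -0.3, 0.8]          # weight vector (made up)
b = -0.1                      # bias (made up)
x = [1.0, 0.5, 0.2]           # an input feature vector

f = sum(wi * xi for wi, xi in zip(w, x)) + b     # <w, x> + b
label = 1 if f > 0 else -1
print("f(x) = %.2f -> class %+d" % (f, label))   # f(x) = 0.31 -> class +1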

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edges of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive               Negative
Predicted  Positive     True Positive (TP)     False Positive (FP)
           Negative     False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
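The same numbers can be reproduced in a few lines of Python; a sketch computing accuracy, precision and recall for both classes from the confusion matrix above (printed values match the slide's figures up to rounding).

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / float(TP + FP + FN + TN)
pos_precision = TP / float(TP + FP)
neg_precision = TN / float(TN + FN)
pos_recall = TP / float(TP + FN)
neg_recall = TN / float(TN + FP)

print("accuracy %.2f" % accuracy)
print("positive precision %.2f recall %.2f" % (pos_precision, pos_recall))
print("negative precision %.2f recall %.2f" % (neg_precision, neg_recall))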

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
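A sketch of the same F1 computation, reusing the precision and recall values from the slides above.

def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print("%.2f" % f1_score(0.87, 0.71))   # positive class, 0.78
print("%.2f" % f1_score(0.65, 0.84))   # negative class, 0.73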

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
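A sketch recomputing the MCC from the same confusion matrix.

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print("%.2f" % mcc)   # 0.55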

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
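A small sketch of ROC and AUC with scikit-learn, assuming a reasonably recent version where sklearn.metrics.roc_curve and roc_auc_score are available; y_true and y_scores are made-up values.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]               # actual labels (made up)
y_scores = [0.1, 0.4, 0.35, 0.8]    # predicted probability of the positive class (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr)                               # false positive rate points (X axis)
print(tpr)                               # true positive rate points (Y axis)
print(roc_auc_score(y_true, y_scores))   # area under the ROC curve; 0.75 here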

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # tree.label() in newer NLTK versions
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 35: Elements of Text Mining Part - I

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 36: Elements of Text Mining Part - I

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    # tokenize() is the simple regex tokenizer defined earlier in the deck
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = \sum_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = \log \frac{|D|}{1 + |\{d : t ∈ d\}|}

where |\{d : t ∈ d\}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = \sum_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), ..., tf(t_n, d_n))

To represent 'd1' and 'd2' as vectors:

\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))
\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to:

\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)
\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| × F}

where |D| is the cardinality of the document space.

M_{|D| × F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]
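The term-frequency vectors above can be reproduced with a few lines of plain Python. This is only a sketch: the ordering t1 ... t7 is assumed to be the alphabetical vocabulary (football, games, john, likes, mary, movies, watch) after stop word removal, which is consistent with the vectors shown.

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
docs = ('john likes to watch movies mary likes too',
    'john also likes to watch football games')
# count each vocabulary term in each document
matrix = [[doc.split().count(term) for term in vocab] for doc in docs]
print matrix    # [[0, 0, 1, 2, 1, 1, 1], [1, 1, 1, 1, 0, 0, 1]]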

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as \hat{v} = \vec{v} / ||\vec{v}||_p, where \hat{v} is the unit vector (the normalized vector), \vec{v} is the vector going to be normalized, and ||\vec{v}||_p is the norm (magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, ..., v_n)
Length of vector: ||\vec{v}|| = \sqrt{v_1^2 + v_2^2 + v_3^2 + ... + v_n^2}

With norm: ||\vec{v}||_p = (|v_1|^p + |v_2|^p + |v_3|^p + ... + |v_n|^p)^{1/p}

It can be simplified as:

||\vec{v}||_p = (\sum_{i=1}^{n} |\vec{v}_i|^p)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)

\hat{v}_{d_1} = \vec{v}_{d_1} / ||\vec{v}_{d_1}||_2

\hat{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1) / \sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}

\hat{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1) / \sqrt{8}

\hat{v}_{d_1} = (0/\sqrt{8}, 0/\sqrt{8}, 1/\sqrt{8}, 2/\sqrt{8}, 1/\sqrt{8}, 1/\sqrt{8}, 1/\sqrt{8})

\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector \hat{v}_{d_1} has an L2-norm ||\hat{v}_{d_1}||_2 = 1.0
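A quick way to check this arithmetic is with NumPy; this is just a sketch of the L2 normalization step, assuming NumPy is available.

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the Euclidean length
print v_hat                  # [ 0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535 ]
print np.linalg.norm(v_hat)  # 1.0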

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = \log \frac{|D|}{1 + |\{d : t ∈ d\}|}

where |\{d : t ∈ d\}| is the number of documents in which 't' is present and tf(t, d) ≠ 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i ∈ d\}|} = \log \frac{2}{1} = 0.69314718

idf(football) = \log \frac{2}{1+1} = 0.0
idf(games) = \log \frac{2}{1+1} = 0.0
idf(john) = \log \frac{2}{1+2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1+1} = 0.0
idf(movies) = \log \frac{2}{1+1} = 0.0
idf(watch) = \log \frac{2}{1+1} = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
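These idf values are easy to verify with the math module. A minimal sketch, with the document frequencies of the two example documents hard-coded:

from math import log

D = 2   # number of documents in the example corpus
df = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
      'mary': 1, 'movies': 1, 'watch': 1}
idf = dict((term, log(D / (1.0 + n))) for term, n in df.items())
print idf['john']   # -0.40546510810816444
print idf['mary']   # 0.0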

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| × F} × M_{idf}

[ tf(t_1, d_1) tf(t_2, d_1) tf(t_3, d_1) tf(t_4, d_1) tf(t_5, d_1) tf(t_6, d_1) tf(t_7, d_1) ]
[ tf(t_1, d_2) tf(t_2, d_2) tf(t_3, d_2) tf(t_4, d_2) tf(t_5, d_2) tf(t_6, d_2) tf(t_7, d_2) ]

×

[ idf(t_1) 0 0 0 0 0 0 ]
[ 0 idf(t_2) 0 0 0 0 0 ]
[ 0 0 idf(t_3) 0 0 0 0 ]
[ 0 0 0 idf(t_4) 0 0 0 ]
[ 0 0 0 0 idf(t_5) 0 0 ]
[ 0 0 0 0 0 idf(t_6) 0 ]
[ 0 0 0 0 0 0 idf(t_7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t_1, d_1)×idf(t_1)  tf(t_2, d_1)×idf(t_2)  tf(t_3, d_1)×idf(t_3)  tf(t_4, d_1)×idf(t_4)  tf(t_5, d_1)×idf(t_5)  tf(t_6, d_1)×idf(t_6)  tf(t_7, d_1)×idf(t_7) ]
[ tf(t_1, d_2)×idf(t_1)  tf(t_2, d_2)×idf(t_2)  tf(t_3, d_2)×idf(t_3)  tf(t_4, d_2)×idf(t_4)  tf(t_5, d_2)×idf(t_5)  tf(t_6, d_2)×idf(t_6)  tf(t_7, d_2)×idf(t_7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_{tf-idf} = \frac{M_{tf-idf}}{||M_{tf-idf}||_2}

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} ≈ \log_2 \frac{N C(w_i, w_j)}{C(w_i) C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} ≈ \log_2 \frac{N C(strong, tea)}{C(strong) C(tea)}

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \sigma^2(P(w_j))}} ≈ \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

Main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naive Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = \frac{P(E|H) × P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naive Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = \frac{P(dark cloud | raining) × P(raining)}{P(dark cloud)}

For multiple evidences:

P(H | E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n | H) × P(H)}{P(E_1, E_2, ..., E_n)}

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, ..., E_n) = \frac{P(E_1|H) × P(E_2|H) × ... × P(E_n|H) × P(H)}{P(E_1, E_2, ..., E_n)}
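To make the rule concrete, here is a small worked example with hypothetical numbers (these values are illustrative only, not taken from the slides). Suppose P(raining) = 0.3, P(dark cloud | raining) = 0.9, and P(dark cloud) = 0.5. Then

P(raining | dark cloud) = (0.9 × 0.3) / 0.5 = 0.54

so observing a dark cloud raises the probability of rain from 0.3 to 0.54.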

Jaganadh G Elements of Text Mining

Naive Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naive Bayes Model

Naive Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naive Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
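The smoothed estimates for a whole class can be computed in a couple of lines. A minimal sketch for the Terrorism class, with the word counts read off the training table above:

V = 6   # vocabulary size
counts = {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1}  # Terrorism counts
n = sum(counts.values())   # 15
probs = dict((w, (c + 1.0) / (n + V)) for w, c in counts.items())
print probs['kill']    # 0.238095238...
print probs['music']   # 0.047619047...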

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(c_i | W) = P(c_i) × \prod_{j=1}^{|V|} P(w_j | c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
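The whole worked example, including the log trick, fits in a few lines of Python. This is a sketch of the toy calculation only: the smoothed class-conditional probabilities are hard-coded from the slides, and base-10 logarithms are used to match the numbers above.

import math

# P(w|c) in the order kill, bomb, kidnap, music, movie, tv (from the slides)
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt
prior = 0.5

def log_posterior(prior, cond_probs, counts):
    # sum of logs instead of a product of tiny probabilities
    score = math.log10(prior)
    for p, n in zip(cond_probs, counts):
        score += n * math.log10(p)
    return score

print log_posterior(prior, p_terrorism, test_doc)      # ~ -4.2437
print log_posterior(prior, p_entertainment, test_doc)  # ~ -6.6289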

Jaganadh G Elements of Text Mining

Naive Bayes Classifier

There are two different ways to set up a Naive Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
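For reference, scikit-learn ships a Bernoulli Naive Bayes implementation that works directly on this kind of binary table. The sketch below is not from the slides; it simply fits sklearn's BernoulliNB on the six binarized training rows above and classifies the test document.

from sklearn.naive_bayes import BernoulliNB

# columns: kill, bomb, kidnap, music, movie, tv (0/1 presence)
X = [[1, 1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 1]]
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])   # test document: kill, bomb, kidnap, tv present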

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
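The decision rule itself is only a dot product and a sign check. A tiny sketch with made-up numbers (w, b, and x here are hypothetical, not learned from any data):

import numpy as np

w = np.array([0.4, -0.7])   # hypothetical weight vector
b = 0.1                     # hypothetical bias
x = np.array([1.0, 0.5])    # a new feature vector to classify
print np.sign(np.dot(w, x) + b)   # +1.0 -> positive class, -1.0 -> negative class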

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool! I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong:

ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = \frac{FP}{FP + TN}

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × \frac{precision × recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × \frac{0.87 × 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 × \frac{0.65 × 0.84}{0.65 + 0.84} = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = \frac{TP × TN - FP × FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 × 436 - 77 × 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
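All of the metrics above follow mechanically from the four confusion matrix cells, so they are easy to compute in one place. A small sketch using the worked example's numbers (TP=562, FP=77, FN=225, TN=436):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)           # 0.768
precision = TP / (TP + FP)                           # 0.879
recall = TP / (TP + FN)                              # 0.714
f1 = 2 * precision * recall / (precision + recall)   # 0.788
error_rate = (FP + FN) / (TP + FP + FN + TN)         # 0.232
fallout = FP / (FP + TN)                             # 0.150
specificity = TN / (FP + TN)                         # 0.850
npv = TN / (TN + FN)                                 # 0.660
fdr = FP / (FP + TP)                                 # 0.120
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # 0.551

print accuracy, precision, recall, f1, mcc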

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
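scikit-learn can compute both the ROC points and the AUC directly from true labels and classifier scores. The labels and scores below are hypothetical, just to show the call pattern:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # hypothetical gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)   # area under the ROC curve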

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 37: Elements of Text Mining Part - I

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)
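A minimal sketch (toy labels, not from the deck) of building such a confusion matrix with scikit-learn. Note that sklearn puts actual classes on the rows and predicted classes on the columns, i.e. the transpose of the layout above.

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos']

print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))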

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73
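The numbers on these evaluation slides can be reproduced directly from the four confusion-matrix counts; a small sketch in plain Python (not from the deck; the slides truncate the results to two decimal places):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy      = (TP + TN) / (TP + FP + FN + TN)   # 0.7677
pos_precision = TP / (TP + FP)                    # 0.8795
pos_recall    = TP / (TP + FN)                    # 0.7141
f1_positive   = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)  # 0.7882

print(accuracy)
print(pos_precision)
print(pos_recall)
print(f1_positive)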

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
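A sketch reproducing the MCC value from the same counts (for full label vectors, sklearn.metrics.matthews_corrcoef computes the same quantity):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(mcc)  # ~0.55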

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
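A minimal sketch (toy scores, not from the deck) of computing ROC points and the AUC with scikit-learn:

from sklearn.metrics import roc_curve, auc

y_true   = [1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(auc(fpr, tpr))  # area under the ROC curve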

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 38: Elements of Text Mining Part - I

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
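A quick sketch (plain Python, not part of the original slides) of this tf definition using collections.Counter on the example document d1:

from collections import Counter

d1 = "john likes to watch movies mary likes too".split()
tf_d1 = Counter(d1)
print(tf_d1['john'])   # 1
print(tf_d1['likes'])  # 2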

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v⃗(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v⃗(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v⃗(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)
v⃗(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D|×F}

where |D| is the cardinality of the document space.

M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖_p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector to be normalized, and ‖v⃗‖_p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ‖v⃗‖ = sqrt(v1² + v2² + v3² + ... + vn²)

With the p-norm: ‖v⃗‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v⃗‖_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v⃗(d1) / ‖v⃗(d1)‖_2

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

v̂(d1) = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂(d1) now has an L2 norm of ‖v̂(d1)‖_2 = 1.0
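The same normalization can be checked with NumPy (a sketch, assuming NumPy is available):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm, sqrt(8)
print(v_hat)                          # [0.  0.  0.3535...  0.7071...  0.3535... ...]
print(np.linalg.norm(v_hat))          # 1.0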

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf('football') = log(2 / (1 + 1)) = 0.0
idf('games')    = log(2 / (1 + 1)) = 0.0
idf('john')     = log(2 / (1 + 2)) = -0.40546510810816444
idf('likes')    = log(2 / (1 + 2)) = -0.40546510810816444
idf('mary')     = log(2 / (1 + 1)) = 0.0
idf('movies')   = log(2 / (1 + 1)) = 0.0
idf('watch')    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
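A sketch reproducing these idf values with the natural logarithm (plain Python, not from the deck):

import math

D = 2  # number of documents in the collection

def idf(docs_containing_term):
    return math.log(D / (1.0 + docs_containing_term))

print(idf(1))  # 'football', 'games', 'mary', 'movies', 'watch' -> 0.0
print(idf(2))  # 'john', 'likes' -> -0.4054651081081644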

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf =

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

    ×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n/max_n were replaced by ngram_range in newer scikit-learn versions
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes, γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes Model for the training set will be like:

|V|  C              P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    Terrorism      0.5    15   0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     Entertainment  0.5    12   0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest score.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naive Bayes Classification is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
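A sketch (not part of the original deck) that reproduces this worked example in log space, using the Laplace-smoothed conditional probabilities derived above and base-10 logarithms as on the slide:

import math

# P(w|Terrorism) and P(w|Entertainment) for kill, bomb, kidnap, music, movie, tv
p_terrorism     = [5/21.0, 4/21.0, 7/21.0, 1/21.0, 2/21.0, 2/21.0]
p_entertainment = [1/18.0, 2/18.0, 2/18.0, 6/18.0, 5/18.0, 2/18.0]
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt
prior = 0.5

def log_score(cond_probs):
    return math.log10(prior) + sum(
        count * math.log10(p) for count, p in zip(test_doc, cond_probs))

print(log_score(p_terrorism))      # ~ -4.24
print(log_score(p_entertainment))  # ~ -6.63, so 'Terrorism' wins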

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naive Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining


Page 39: Elements of Text Mining Part - I

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes the training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event ($H$) can be predicted based on some evidence ($E$) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let $H$ be the event of raining and $E$ be the evidence of dark cloud; then we have
$P(\text{raining}|\text{dark cloud}) = \frac{P(\text{dark cloud}|\text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$

For multiple evidences:
$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:
$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
$C = (\text{terrorism}, \text{entertainment})$
$D = (D0, D1, D2, D3, D4, D5)$
$BoW = (\text{kill}, \text{bomb}, \text{kidnap}, \text{music}, \text{movie}, \text{tv})$ (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni P(kill|Ci)  P(bomb|Ci) P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6   T 0.5   15 0.238095238 0.19047619 0.33333333  0.047619048 0.095238095 0.095238095
    E 0.5   12 0.05555556  0.11111111 0.11111111  0.33333333  0.27777778  0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
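
This estimation step is easy to reproduce in code. The sketch below is not from the original slides (the helper name train_multinomial_nb is mine); it computes the priors and the Laplace-smoothed conditionals directly from the toy training table.

from __future__ import division

# toy training table from the earlier slide: per-document term counts and labels
vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = [
    ([2, 1, 3, 0, 0, 1], 'Terrorism'),
    ([1, 1, 1, 0, 0, 0], 'Terrorism'),
    ([1, 1, 2, 0, 1, 0], 'Terrorism'),
    ([0, 1, 0, 2, 1, 1], 'Entertainment'),
    ([0, 0, 1, 1, 1, 0], 'Entertainment'),
    ([0, 0, 0, 2, 2, 2], 'Entertainment'),
]

def train_multinomial_nb(train, vocab):
    classes = set(label for _, label in train)
    priors, cond = {}, {}
    for c in classes:
        docs = [counts for counts, label in train if label == c]
        priors[c] = len(docs) / len(train)            # P(Ci)
        totals = [sum(col) for col in zip(*docs)]     # per-term counts in the class
        n_c = sum(totals)                             # ni
        # Laplace smoothing: P(w|c) = (count(w, c) + 1) / (ni + |V|)
        cond[c] = dict((w, (totals[j] + 1) / (n_c + len(vocab)))
                       for j, w in enumerate(vocab))
    return priors, cond

priors, cond = train_multinomial_nb(train, vocab)
print priors
print cond['Terrorism']['kill']   # 5/21 = 0.2380..., matching the slide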

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc kill bomb kidnap music movie tv C
Dt       2    1    2      0     0     1  ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

$P(\text{Terrorism}|W) = P(\text{Terrorism}) \times P(\text{kill}|\text{Terrorism}) \times P(\text{bomb}|\text{Terrorism}) \times P(\text{kidnap}|\text{Terrorism}) \times P(\text{music}|\text{Terrorism}) \times P(\text{movie}|\text{Terrorism}) \times P(\text{tv}|\text{Terrorism})$
$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$
$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}$

$P(\text{Entertainment}|W) = P(\text{Entertainment}) \times P(\text{kill}|\text{Entertainment}) \times P(\text{bomb}|\text{Entertainment}) \times P(\text{kidnap}|\text{Entertainment}) \times P(\text{music}|\text{Entertainment}) \times P(\text{movie}|\text{Entertainment}) \times P(\text{tv}|\text{Entertainment})$
$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$
$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}$

The document has been classified as "Terrorism" because it got the highest score.
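
A brief sketch that reproduces this arithmetic by plugging the smoothed probabilities from the model table and the test document's counts into the product formula (variable names are my own):

# smoothed conditional probabilities from the model table above
p_w = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(c):
    score = prior[c]
    for word, freq in test_doc.items():
        score *= p_w[c][word] ** freq   # each P(w|c) raised to the word's count
    return score

for c in ('Terrorism', 'Entertainment'):
    print "%-13s %.2e" % (c, posterior(c))   # roughly 5.7e-05 vs 2.3e-07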

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

$P(\text{Terrorism}|W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
$= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)$
$= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437$

$P(\text{Entertainment}|W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
$= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)$
$= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289$

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
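
The same numbers can be checked with a few lines of Python (base-10 logarithms, matching the values above):

import math

# sum of base-10 logs instead of a product of probabilities
log_terrorism = (math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904)
                 + 2 * math.log10(0.3333) + math.log10(0.0952))
log_entertainment = (math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111)
                     + 2 * math.log10(0.1111) + math.log10(0.1111))
print log_terrorism, log_entertainment   # about -4.24 and -6.63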

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
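
A small sketch of that switch from counts to presence/absence; in scikit-learn, BernoulliNB performs a similar binarization internally (via its binarize parameter), but here it is spelled out on the toy rows:

# term-frequency rows from the multinomial table (D0 .. D5)
freq_rows = [
    [2, 1, 3, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 2, 0, 1, 0],
    [0, 1, 0, 2, 1, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 2, 2, 2],
]
# Bernoulli representation: 1 if the word occurs at least once, 0 otherwise
bernoulli_rows = [[1 if freq > 0 else 0 for freq in row] for row in freq_rows]
for row in bernoulli_rows:
    print row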

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
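
A tiny numeric sketch of that decision rule; the weight vector and bias below are made up for illustration, not learned from data:

import numpy as np

w = np.array([0.4, -0.3, 0.8])   # hypothetical weight vector w in R^p
b = -0.5                         # hypothetical bias

def predict(x):
    # the sign of <w, x> + b decides which side of the hyperplane x falls on
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.2, 0.9]))   # +1: positive side
print predict(np.array([0.1, 0.9, 0.1]))   # -1: negative side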

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10. This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive             Negative
Predicted  Positive     True Positive (TP)   False Positive (FP)
           Negative     False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Accuracy} = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$\text{Precision} = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$\text{Recall} = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Positive Precision} = \frac{562}{562 + 77} = 0.87$

$\text{Negative Precision} = \frac{436}{225 + 436} = 0.65$

$\text{Positive Recall} = \frac{562}{562 + 225} = 0.71$

$\text{Negative Recall} = \frac{436}{77 + 436} = 0.84$
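
These ratios can be reproduced in a few lines; the sketch below simply recomputes them from the confusion-matrix counts (the values agree with the slides up to rounding):

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436   # counts from the confusion matrix above

metrics = {
    'accuracy':      (tp + tn) / (tp + fp + fn + tn),
    'pos precision': tp / (tp + fp),
    'neg precision': tn / (tn + fn),
    'pos recall':    tp / (tp + fn),
    'neg recall':    tn / (tn + fp),
}
for name in sorted(metrics):
    print "%-14s %.4f" % (name, metrics[name])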

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$\text{Error Rate} = \frac{FP + FN}{TP + FP + FN + TN}$

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Error Rate} = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$\text{Fall-out} = \frac{FP}{FP + TN}$

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Fall-out} = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

a. http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

$F_1 \text{ Score (Positive)} = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

$F_1 \text{ Score (Negative)} = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$\text{Positive predictive value} = \frac{TP}{TP + FP}$

$\text{Positive predictive value} = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$\text{Specificity} = \frac{TN}{FP + TN}$

$\text{Specificity} = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
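
A short sketch of how an ROC curve and its AUC are typically computed with scikit-learn; the labels and scores below are toy values made up for illustration:

from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores for the positive class
y_true   = [1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # points of the ROC curve
print "AUC = %.2f" % auc(fpr, tpr)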

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 41: Elements of Text Mining Part - I

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$
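A numpy sketch (an assumption, not part of the original slides) of the whole TF-IDF step: multiply the count matrix by the diagonal idf matrix, then L2-normalize each row.

import numpy as np

tf = np.array([[0., 0., 1., 2., 1., 1., 1.],
               [1., 1., 1., 1., 0., 0., 1.]])
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])

tf_idf = tf.dot(np.diag(idf))                 # M_{|D| x F} x M_idf
norms = np.sqrt((tf_idf ** 2).sum(axis=1))    # per-row L2 norms
norms[norms == 0] = 1.0                       # guard against all-zero rows
tf_idf_l2 = tf_idf / norms[:, np.newaxis]     # row-wise L2 normalization
print tf_idf_l2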

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is the old scikit-learn argument; newer versions use ngram_range=(1, 1)
vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')

docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')

bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()

for w, i in vectorizer.vocabulary_.items():   # term -> column index
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)

print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong\ tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N\,C(strong\ tea)}{C(strong)\,C(tea)}$

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong\ tea) - \frac{1}{N}C(strong)\,C(tea)}{\sqrt{C(strong\ tea)}}$

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds; then we have

$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
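The smoothing step is easy to sketch as a one-line function; the numbers plugged in below are the ones used on this slide.

def laplace(word_count_in_class, total_words_in_class, vocab_size):
    # P(w|c) with add-one (Laplace) smoothing
    return (word_count_in_class + 1.0) / (total_words_in_class + vocab_size)

V = 6
print laplace(4, 15, V)    # P(kill|Terrorism)     = 5/21 ~ 0.2380
print laplace(0, 12, V)    # P(kill|Entertainment) = 1/18 ~ 0.0555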

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
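A sketch of the log-space scoring, reusing the rounded conditional probabilities from the model slide (so the scores come out close to the hand-computed values above):

import math

p_terrorism     = {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                   "music": 0.0476, "movie": 0.0952, "tv": 0.0952}
p_entertainment = {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                   "music": 0.3333, "movie": 0.2777, "tv": 0.1111}
prior = 0.5
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_score(prior, cond_probs, doc):
    # sum of log probabilities instead of a product, to prevent underflow
    score = math.log10(prior)
    for word, freq in doc.items():
        score += freq * math.log10(cond_probs[word])
    return score

print log_score(prior, p_terrorism, test_doc)      # ~ -4.24
print log_score(prior, p_entertainment, test_doc)  # ~ -6.63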

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
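For comparison, a minimal sketch (not part of the original slides) using scikit-learn's BernoulliNB, fed the binary table above (feature order: kill, bomb, kidnap, music, movie, tv):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = ["Terrorism", "Terrorism", "Terrorism",
     "Entertainment", "Entertainment", "Entertainment"]

clf = BernoulliNB(alpha=1.0)                 # add-one smoothing
clf.fit(X, y)
print clf.predict(np.array([[1, 1, 1, 0, 0, 1]]))   # the test document, binarized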

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP + FP}$

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
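The same numbers, computed directly in Python from the confusion matrix:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy           = (TP + TN) / (TP + FP + FN + TN)   # 0.76
positive_precision = TP / (TP + FP)                    # 0.87
positive_recall    = TP / (TP + FN)                    # 0.71
negative_precision = TN / (TN + FN)                    # 0.65
negative_recall    = TN / (TN + FP)                    # 0.84

print accuracy, positive_precision, positive_recall
print negative_precision, negative_recall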

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fallout = \frac{FP}{FP + TN}$

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

$Fallout = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
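A minimal sketch with scikit-learn's roc_curve and auc; the labels and scores below are made-up values, used only to show the API:

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])                    # hypothetical actual labels
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.75, 0.2])   # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rates per threshold
print "AUC:", auc(fpr, tpr)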

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # newer NLTK versions use tree.label()
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 42: Elements of Text Mining Part - I

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a binary classifier, the FDR is the proportion of predicted positives that are actually negative.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
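
A small sketch, assuming the same confusion-matrix counts as before, that reproduces this value:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc   # approximately 0.55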

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
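
A hedged sklearn sketch, not in the original slides: roc_curve from sklearn.metrics returns the false positive and true positive rates at each score threshold, and auc integrates them. The y_true labels and y_score values below are invented purely for illustration.

from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores, purely illustrative
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print 'AUC:', auc(fpr, tpr)   # 0.75 for these toy values

Plotting fpr on the X axis against tpr on the Y axis gives the ROC curve described above.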

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # POS tag each sentence and chunk it; binary=True labels chunks simply as 'NE'
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                # join the leaf tokens of the NE subtree into one entity string
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
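
For the sample sentence this should print something like ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln']; the exact output depends on the NLTK version and the tagger and chunker models installed.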

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
# keep only the term string from each result tuple
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 44: Elements of Text Mining Part - I

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way. If there is a clear blank space on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.[10] If the data is not linearly separable we have to use kernel tricks.[11]

[10] This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

[11] http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

# predict() returns an array of labels, so take the first element
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
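A possible follow-up (my own sketch, reusing the classifier pipeline and training_data loaded above): hold out part of the corpus and score the predictions, which connects to the evaluation metrics discussed later. In newer scikit-learn versions train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation.

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.metrics import accuracy_score, confusion_matrix

train_docs, test_docs, train_y, test_y = train_test_split(
    training_data.data, training_data.target, test_size=0.25)

classifier.fit(train_docs, train_y)
predicted = classifier.predict(test_docs)

print(accuracy_score(test_y, predicted))
print(confusion_matrix(test_y, predicted))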

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

# predict() returns an array of labels, so take the first element
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK [12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # every word is simply marked as present (multi-variate style features)
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

[12] This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                   Actual
                         Positive                Negative
Predicted   Positive     True Positive (TP)      False Positive (FP)
            Negative     False Negative (FN)     True Negative (TN)
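The metrics defined on the following slides can all be computed directly from these four counts. A small sketch (my own, not from the slides, using the worked example numbers TP=562, FP=77, FN=225, TN=436 that appear below):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.76
precision = TP / (TP + FP)                          # 0.87 (positive predictive value)
recall = TP / (TP + FN)                             # 0.71 (sensitivity / hit rate)
error_rate = (FP + FN) / (TP + FP + FN + TN)        # 0.23
fall_out = FP / (FP + TN)                           # 0.15 (false positive rate)
f1 = 2 * precision * recall / (precision + recall)  # 0.78
npv = TN / (TN + FN)                                # 0.65
specificity = TN / (FP + TN)                        # 0.84
fdr = FP / (FP + TP)                                # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55

print("accuracy = %.2f, f1 = %.2f, mcc = %.2f" % (accuracy, f1, mcc))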

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that were wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

[a] http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure: the proportion of negative predictions that are true negatives.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In the classification setting, the FDR is the proportion of predicted positives that are actually negative.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
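As a sketch (hypothetical labels and scores, not from the slides), scikit-learn can compute the ROC points and the AUC directly:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # classifier scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # points of the ROC curve
print(auc(fpr, tpr))                                 # area under that curve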

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # binary=True labels chunks simply as NE instead of PERSON/ORGANIZATION/...
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python topia.termextract [13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]  # keep only the term strings, drop the counts
print terms

[13] http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 45: Elements of Text Mining Part - I

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 46: Elements of Text Mining Part - I

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952 = 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111 = 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
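The same computation can be scripted. Below is a minimal sketch (our own helper, using the smoothed probabilities and priors from the toy model above) that scores the test document in log space, so the comparison stays numerically stable.

import math

# A minimal sketch of the log-space scoring used above.
# Probabilities are the Laplace-smoothed values from the toy model;
# the test document counts are kill=2, bomb=1, kidnap=2, music=0, movie=0, tv=1.
model = {
    'Terrorism':     {'prior': 0.5, 'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'prior': 0.5, 'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(class_name):
    params = model[class_name]
    score = math.log10(params['prior'])
    for word, count in test_doc.items():
        score += count * math.log10(params[word])
    return score

for c in model:
    print c, log_score(c)   # Terrorism ~ -4.24, Entertainment ~ -6.63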

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.

Jaganadh G Elements of Text Mining
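For a library-backed version of this model, a Bernoulli Naïve Bayes can be fitted directly on the binary table above. The sketch below uses scikit-learn's BernoulliNB; the array layout mirrors the table, and the variable names are our own.

from sklearn.naive_bayes import BernoulliNB

# A minimal sketch: fit BernoulliNB on the binary document vectors above.
# Columns: kill, bomb, kidnap, music, movie, tv
X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']

clf = BernoulliNB()
clf.fit(X, y)
# Test document with kill, bomb, kidnap and tv present
print clf.predict([[1, 1, 1, 0, 0, 1]])   # ['Terrorism']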

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining
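The toy count table can also be fed straight to scikit-learn's MultinomialNB, which implements this event model with add-one smoothing by default (alpha=1.0). The sketch below is our own arrangement of the counts from the training table.

from sklearn.naive_bayes import MultinomialNB

# A minimal sketch: multinomial NB on the raw count vectors from the training table.
# Columns: kill, bomb, kidnap, music, movie, tv
X = [[2, 1, 3, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 2, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 2, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 2, 2, 2]]   # D5 Entertainment
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']

clf = MultinomialNB(alpha=1.0)   # alpha=1.0 is add-one (Laplace) smoothing
clf.fit(X, y)

test_doc = [[2, 1, 2, 0, 0, 1]]  # the Dt test document from the slides
print clf.predict(test_doc)      # ['Terrorism']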

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1..n, where xi ∈ Rp (input feature vector) and yi ∈ {+1, -1} is the corresponding label, either (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
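A quick way to see w and b in practice is to fit a linear SVM on a tiny, linearly separable toy set. The sketch below uses scikit-learn's LinearSVC; the 2-D points are made up purely for illustration.

from sklearn.svm import LinearSVC

# A minimal sketch: fit a linear SVM on a tiny, linearly separable toy set.
# The 2-D points are made up for illustration only.
X = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],     # class -1
     [3.0, 3.0], [3.5, 4.0], [4.0, 3.5]]     # class +1
y = [-1, -1, -1, 1, 1, 1]

svm = LinearSVC()
svm.fit(X, y)

w = svm.coef_[0]        # weight vector w
b = svm.intercept_[0]   # bias b
print w, b
print svm.predict([[0.2, 0.4], [3.8, 3.2]])   # [-1  1]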

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the edges of the square and rectangle regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares vs blue rectangles). A line drawn along the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane marks the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted   Positive    True Positive (TP)    False Positive (FP)
            Negative    False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that were wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure. It is the proportion of negative predictions that are true negatives.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

The false discovery rate (FDR) is the expected proportion of positive predictions that are false positives; FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55

Jaganadh G Elements of Text Mining
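All of the metrics above can be reproduced from the example confusion matrix in a few lines. The sketch below uses our own helper variables with TP = 562, FP = 77, FN = 225, TN = 436 from the slides.

import math

# A minimal sketch: the evaluation metrics above, computed from the example
# confusion matrix used in the slides (TP=562, FP=77, FN=225, TN=436).
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)          # positive predictive value
recall      = TP / (TP + FN)          # sensitivity / hit rate
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)          # false positive rate
f1          = 2 * precision * recall / (precision + recall)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc         = (TP * TN - FP * FN) / math.sqrt(
                  (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc   # ~0.768 0.879 0.714 0.788 0.551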

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
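scikit-learn can compute both the ROC points and the AUC from true labels and predicted scores. The sketch below uses roc_curve and auc from sklearn.metrics; the labels and scores are made up purely for illustration.

from sklearn.metrics import roc_curve, auc

# A minimal sketch: ROC curve points and AUC from true labels and scores.
# y_true and y_score are made up for illustration only.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print fpr      # false positive rate at each threshold (X axis)
print tpr      # true positive rate at each threshold (Y axis)
print auc(fpr, tpr)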

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 48: Elements of Text Mining Part - I

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (or classifier function) \gamma that maps documents to classes: \gamma : D \rightarrow C.
A supervised learning algorithm \Gamma takes the training set T and emits the learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:
P(H|E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}

With the independence assumption we can rewrite Bayes' rule as follows:
P(H|E_1, E_2, ..., E_n) = \frac{P(E_1|H) \times P(E_2|H) \times ... \times P(E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}

Jaganadh G Elements of Text Mining
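To make the rule concrete, here is a tiny numeric sketch; the probabilities are made up purely for illustration and are not from the slides:

# P(raining | dark cloud) via Bayes' rule, with illustrative numbers
p_rain = 0.2                  # prior P(H)
p_cloud_given_rain = 0.9      # likelihood P(E|H)
p_cloud = 0.4                 # evidence P(E)

posterior = p_cloud_given_rain * p_rain / p_cloud
print posterior               # 0.45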

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V| C P(Ci) ni P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6   T 0.5   15 0.238095238 0.19047619  0.333333333  0.047619048 0.095238095 0.095238095
6   E 0.5   12 0.055555556 0.111111111 0.111111111  0.333333333 0.277777778 0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism) \times P(bomb|Terrorism) \times P(kidnap|Terrorism) \times P(music|Terrorism) \times P(movie|Terrorism) \times P(tv|Terrorism)

= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1

= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment) \times P(bomb|Entertainment) \times P(kidnap|Entertainment) \times P(music|Entertainment) \times P(movie|Entertainment) \times P(tv|Entertainment)

= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1

= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
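The worked example above can be reproduced in a few lines of Python. This is a minimal sketch written directly from the formulas on these slides (multinomial model, Laplace smoothing, base-10 log posteriors); it should rank Terrorism above Entertainment for the test document:

from __future__ import division
import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
test = [2, 1, 2, 0, 0, 1]

n_docs = sum(len(docs) for docs in train.values())
for c, docs in train.items():
    prior = len(docs) / n_docs
    n_c = sum(sum(d) for d in docs)                     # total word count in class c
    # Laplace-smoothed P(w|c) for every vocabulary word
    likelihood = [(sum(d[j] for d in docs) + 1) / (n_c + len(vocab))
                  for j in range(len(vocab))]
    # log posterior: log P(c) + sum_j count_j * log P(w_j|c)
    log_post = math.log10(prior) + sum(test[j] * math.log10(likelihood[j])
                                       for j in range(len(vocab)))
    print c, log_post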

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t \in {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining
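Both event models are also available in scikit-learn. A minimal sketch (not from the slides) on the toy counts above: MultinomialNB consumes the raw counts, while BernoulliNB binarizes them into presence/absence, mirroring the two document representations shown earlier.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# rows = D0..D5 from the training table, columns = kill, bomb, kidnap, music, movie, tv
X = np.array([[2, 1, 3, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 2, 2, 2]])
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
test = [[2, 1, 2, 0, 0, 1]]

print MultinomialNB(alpha=1.0).fit(X, y).predict(test)   # multinomial model on raw counts
print BernoulliNB(binarize=0.5).fit(X, y).predict(test)  # Bernoulli model on presence/absence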

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
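The linear decision rule itself is just the sign of \langle w, x \rangle + b. A minimal sketch with a made-up weight vector and bias (illustrative numbers only, not a trained model):

import numpy as np

w = np.array([0.4, -0.2])   # hypothetical weight vector
b = 0.1                     # hypothetical bias

def predict(x):
    # +1 if the point falls on the positive side of the hyperplane <w, x> + b = 0, else -1
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.5]))    # lands on the positive side for these values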

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
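A quick way to sanity-check either pipeline is to score it on the data it was fit on; this is optimistic (training accuracy), so a held-out split would be needed for an honest estimate. A minimal sketch, reusing the classifier and training_data objects from the slides above:

# mean accuracy on the training documents themselves -- a rough sanity check only
print classifier.score(training_data.data, training_data.target)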

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining
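Such a table can also be computed programmatically; a minimal sketch with scikit-learn's confusion_matrix helper (note its convention is rows = actual and columns = predicted, the transpose of the layout above; the label vectors here are made up for illustration):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 1, 1, 0, 0]

print confusion_matrix(y_actual, y_predicted)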

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Error Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = \frac{FP}{FP + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55

Jaganadh G Elements of Text Mining
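All of the metrics above follow directly from the four confusion-matrix counts; a minimal sketch that recomputes them from TP = 562, FP = 77, FN = 225, TN = 436:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)                     # positive predictive value
recall      = TP / (TP + FN)                     # sensitivity / hit rate
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fallout     = FP / (FP + TN)                     # false positive rate
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, error_rate
print fallout, npv, specificity, fdr, mcc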

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
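scikit-learn can compute the ROC points and the AUC directly from true labels and classifier scores; a minimal sketch with made-up labels and scores (a real run would use held-out labels and, e.g., decision_function or predict_proba output):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # hypothetical true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print auc(fpr, tpr)    # area under the ROC curve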

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()

text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"

terms = extractor(text)
terms = [term[0] for term in terms]

print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 49: Elements of Text Mining Part - I

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 50: Elements of Text Mining Part - I

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of a word association:

I(w_i, w_j) = log2[ P(w_i, w_j) / (P(w_i) P(w_j)) ] ≈ log2[ N C(w_i, w_j) / (C(w_i) C(w_j)) ]

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

Example:
I(strong, tea) = log2[ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2[ N C(strong, tea) / (C(strong) C(tea)) ]

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of a word association:

t(w_i, w_j) = [ mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ] / √[ σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) ]
            ≈ [ C(w_i, w_j) − (1/N) C(w_i) C(w_j) ] / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

Example:
t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / √C(strong, tea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
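As a quick usage check (our own addition, not on the original slide), both helpers can be run on the toy sentence from the bigram examples:

words = "John likes to watch movies Mary likes too".lower().split()
print mutual_info(words)   # bigram -> mutual information score
print tscore(words)        # bigram -> t-score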

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E_1, E_2, ..., E_n) = P(E_1, E_2, ..., E_n | H) × P(H) / P(E_1, E_2, ..., E_n)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, ..., E_n) = P(E_1|H) × P(E_2|H) × ... × P(E_n|H) × P(H) / P(E_1, E_2, ..., E_n)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
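A minimal sketch of how these smoothed estimates could be computed for the toy corpus; the count table layout and the helper name cond_probs are our own illustration, not part of the original tutorial:

from __future__ import division

vocab = ["kill", "bomb", "kidnap", "music", "movie", "tv"]
train_counts = {
    "Terrorism":     [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0]],
    "Entertainment": [[0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]],
}

def cond_probs(doc_rows, vocab_size):
    # P(w|c) with add-one (Laplace) smoothing
    totals = [sum(col) for col in zip(*doc_rows)]  # per-word counts in the class
    n_c = sum(totals)                              # total word count in the class
    return [(t + 1) / (n_c + vocab_size) for t in totals]

for label, rows in train_counts.items():
    print label, [round(p, 4) for p in cond_probs(rows, len(vocab))]
# e.g. P(kill|Terrorism) = (4+1)/(15+6) = 0.2381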

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1..|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
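A small sketch of the same log-space scoring in Python; the dictionaries below just restate the smoothed estimates and test-document counts from the slides, and the helper name log_posterior is ours:

from __future__ import division
import math

priors = {"Terrorism": 0.5, "Entertainment": 0.5}
cond = {  # Laplace-smoothed P(w|c), rounded as on the slides
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_posterior(label):
    score = math.log10(priors[label])
    for word, count in test_doc.items():
        score += count * math.log10(cond[label][word])
    return score

scores = dict((label, log_posterior(label)) for label in priors)
print scores                       # Terrorism ~ -4.24, Entertainment ~ -6.63
print max(scores, key=scores.get)  # Terrorism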

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
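A tiny illustration (ours) of that switch from counts to presence/absence, reusing the count rows from the earlier table; in scikit-learn a comparable setup is, as far as we know, provided by BernoulliNB, which binarizes its input by default:

counts = [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0],   # Terrorism rows
          [0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]]   # Entertainment rows
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print row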

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model such as TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, either (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
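Just to make the decision rule concrete, a toy sketch (ours, with a made-up weight vector and bias) of predicting the label as the sign of ⟨w, x⟩ + b:

def predict(w, b, x):
    # sign of the decision value <w, x> + b
    decision = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if decision > 0 else -1

w, b = [0.4, -0.2, 0.7], -0.1          # hypothetical values, for illustration only
print predict(w, b, [1.0, 0.0, 1.0])   # +1
print predict(w, b, [0.0, 2.0, 0.0])   # -1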

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
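The snippet above only predicts one hand-written sentence. To get a rough quality estimate, one could hold out part of the corpus before fitting and use the pipeline's score method; a minimal sketch under that assumption (the naive slice below is ours, for illustration only):

train_docs, test_docs = training_data.data[200:], training_data.data[:200]
train_y, test_y = training_data.target[200:], training_data.target[:200]

_ = classifier.fit(train_docs, train_y)
print classifier.score(test_docs, test_y)  # mean accuracy on the held-out documents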

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining
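A short follow-up sketch (ours, following the same streamhacker recipe): evaluate on the remaining quarter of the corpus and inspect the most informative features:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()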

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                 Actual
                        Positive             Negative
Predicted  Positive     True Positive (TP)   False Positive (FP)
           Negative     False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √[ (TP + FP)(TP + FN)(TN + FP)(TN + FN) ]

MCC = (562 × 436 − 77 × 225) / √[ (562 + 77)(562 + 225)(436 + 77)(436 + 225) ] = 0.55

Jaganadh G Elements of Text Mining
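A compact sketch (our own) that recomputes the metrics above from the worked confusion matrix (TP = 562, FP = 77, FN = 225, TN = 436):

from __future__ import division
import math

tp, fp, fn, tn = 562, 77, 225, 436

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)            # positive predictive value
recall      = tp / (tp + fn)            # sensitivity / hit rate
error_rate  = (fp + fn) / (tp + fp + fn + tn)
fall_out    = fp / (fp + tn)            # false positive rate
npv         = tn / (tn + fn)
specificity = tn / (fp + tn)
fdr         = fp / (fp + tp)
f1          = 2 * precision * recall / (precision + recall)
mcc         = (tp * tn - fp * fn) / math.sqrt(
                  (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print accuracy, precision, recall, f1, mcc  # ~0.77, 0.88, 0.71, 0.79, 0.55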

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
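A hedged sketch of computing the ROC points and AUC with scikit-learn's metrics module; y_true and y_score below are made-up toy values, not from the slides:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)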

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 51: Elements of Text Mining Part - I

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 52: Elements of Text Mining Part - I

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi, wj) = log2 [ P(wi, wj) / (P(wi) P(wj)) ] ≈ log2 [ N C(wi, wj) / (C(wi) C(wj)) ]

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

I(strong, tea) = log2 [ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2 [ N C(strong, tea) / (C(strong) C(tea)) ]

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        w1, w2 = gram.split()
        # I(w1, w2) = log2( N * C(w1, w2) / (C(w1) * C(w2)) )
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = [ mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ] / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ [ C(wi, wj) − (1/N) C(wi) C(wj) ] / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        w1, w2 = gram.split()
        # t ~ ( C(w1, w2) - (1/N) * C(w1) * C(w2) ) / sqrt( C(w1, w2) )
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[w1] * wordcount[w2]) / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6   T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
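A minimal Python sketch of the same Laplace estimate; the per-class word totals below are the ones implied by the model table above:

from __future__ import division

# per-class word totals implied by the model table above
counts = {
    "Terrorism":     {"kill": 4, "bomb": 3, "kidnap": 6, "music": 0, "movie": 1, "tv": 1},
    "Entertainment": {"kill": 0, "bomb": 1, "kidnap": 1, "music": 5, "movie": 4, "tv": 1},
}
V = 6  # vocabulary size

def laplace_prob(word, cls):
    n_c = sum(counts[cls].values())          # total word frequency in the class
    return (counts[cls][word] + 1) / (n_c + V)

print laplace_prob("kill", "Terrorism")      # 5/21 = 0.2380...
print laplace_prob("kill", "Entertainment")  # 1/18 = 0.0555...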

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1

To find the posterior probability:

P(ci | W) = P(ci) × ∏ (j = 1 to |V|) P(wj | ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^−5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^−7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5110 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
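A small sketch of the same log-space scoring in Python (the probabilities are the Laplace-smoothed values from the worked example above):

from __future__ import division
import math

cond_prob = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}
prior = {"Terrorism": 0.5, "Entertainment": 0.5}
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_score(cls):
    # sum of log probabilities instead of a product of probabilities
    score = math.log10(prior[cls])
    for word, freq in test_doc.items():
        score += freq * math.log10(cond_prob[cls][word])
    return score

for cls in prior:
    print cls, log_score(cls)   # Terrorism ~ -4.24, Entertainment ~ -6.63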

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
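In scikit-learn the same presence/absence representation can be produced by passing binary=True to CountVectorizer; a minimal sketch (the toy documents are made up just to show the binary features):

from sklearn.feature_extraction.text import CountVectorizer

docs = ("kill bomb kidnap kidnap kidnap tv",   # toy terrorism-like doc
        "music music movie tv")                # toy entertainment-like doc

# binary=True clips every non-zero term count to 1 (presence/absence)
vectorizer = CountVectorizer(analyzer='word', binary=True)
X = vectorizer.fit_transform(docs)

print vectorizer.get_feature_names()
print X.todense()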

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
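A tiny sketch of how a learned weight vector and bias are used at prediction time; the numbers here are hypothetical, just to show the sign rule:

import numpy as np

# hypothetical weight vector and bias for a 2-dimensional toy problem
w = np.array([0.5, -1.0])
b = 0.25

def predict(x):
    # the sign of the decision function <w, x> + b gives the class label
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))   # +1
print predict(np.array([0.0, 2.0]))   # -1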

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual Positive        Actual Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
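A short sketch that reproduces this number (and the precision, recall, F1 and error-rate figures used on the following slides) directly from the confusion-matrix cells:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
error     = (FP + FN) / (TP + FP + FN + TN)

print accuracy, precision, recall, f1, error
# 0.7676..., 0.8794..., 0.7141..., 0.7882..., 0.2323...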

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
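When the gold labels and predictions are available as arrays, scikit-learn's metrics module can compute several of these measures directly; a minimal sketch with toy labels (not the movie-review data):

from sklearn.metrics import confusion_matrix, precision_score, \
    recall_score, f1_score, matthews_corrcoef

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # toy gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predictions

print confusion_matrix(y_true, y_pred)
print precision_score(y_true, y_pred)
print recall_score(y_true, y_pred)
print f1_score(y_true, y_pred)
print matthews_corrcoef(y_true, y_pred)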

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
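A minimal sketch of computing the ROC points and the AUC with scikit-learn, using toy scores (any real-valued classifier scores or probabilities would do):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # toy gold labels
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7, 0.6, 0.2]   # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print fpr                 # false positive rates (X axis)
print tpr                 # true positive rates (Y axis)
print auc(fpr, tpr)       # area under the ROC curve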

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Page 54: Elements of Text Mining Part - I

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies', 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N\,C(strong, tea)}{C(strong)\,C(tea)}$

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2) # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [ wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words ]
    [ gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams ]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) /
            math.log(2))
    return minfo

Jaganadh G Elements of Text Mining
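A quick usage sketch for the function above (a sketch only; it assumes the ngrams and mutual_info definitions from the preceding slides are already in scope, and the toy sentence gives purely illustrative scores):

words = "John likes to watch movies Mary likes too".lower().split()
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True):
    print gram, round(mi[gram], 3)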

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{\mathrm{mean}(P(w_i, w_j)) - \mathrm{mean}(P(w_i))\,\mathrm{mean}(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2) # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [ wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words ]
    [ gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams ]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
            math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining
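And a matching usage sketch (again assuming the ngrams and tscore functions from the slides above are defined; the output values are only illustrative for the toy sentence):

words = "John likes to watch movies Mary likes too".lower().split()
ts = tscore(words)
for gram in sorted(ts, key=ts.get, reverse=True):
    print gram, round(ts[gram], 3)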

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier or a classifier function $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event ($H$) can be predicted based on some evidence ($E$) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let $H$ be the event of raining and $E$ be the evidence of dark cloud, then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining
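To make the rule concrete, a tiny numeric sketch of the rain/dark-cloud example; all three input probabilities here are made-up values, used only to show the arithmetic:

# Hypothetical numbers, not from the slides
p_raining = 0.2                 # P(H): prior probability of rain
p_cloud_given_raining = 0.9     # P(E|H)
p_cloud = 0.3                   # P(E)

# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
p_raining_given_cloud = p_cloud_given_raining * p_raining / p_cloud
print p_raining_given_cloud     # 0.6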

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naïve Bayes model for the training set will look like

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15   0.238095238  0.19047619   0.333333333   0.047619048  0.095238095  0.095238095
     E  0.5    12   0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
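The same add-one estimates as a minimal Python sketch; the counts and totals are the ones worked out on the slides above (the helper name laplace is mine):

from __future__ import division

V = 6                    # vocabulary size
n_terrorism = 15         # total word count in the Terrorism class
n_entertainment = 12     # total word count in the Entertainment class

def laplace(count, n):
    # Add-one smoothed estimate P(w|C) = (count + 1) / (n + |V|)
    return (count + 1) / (n + V)

print laplace(4, n_terrorism)       # P(kill|Terrorism)     = 5/21 = 0.2380...
print laplace(0, n_entertainment)   # P(kill|Entertainment) = 1/18 = 0.0555...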

Testing the NB model

Our test document is

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1     2       0      0     1

To find the posterior probability

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

$P(Terrorism \mid W) = P(Terrorism) \times P(kill \mid Terrorism) \times P(bomb \mid Terrorism) \times P(kidnap \mid Terrorism) \times P(music \mid Terrorism) \times P(movie \mid Terrorism) \times P(tv \mid Terrorism)$

$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$

$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}$

$P(Entertainment \mid W) = P(Entertainment) \times P(kill \mid Entertainment) \times P(bomb \mid Entertainment) \times P(kidnap \mid Entertainment) \times P(music \mid Entertainment) \times P(movie \mid Entertainment) \times P(tv \mid Entertainment)$

$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$

$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}$

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

$P(Terrorism \mid W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
$= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)$
$= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437$

$P(Entertainment \mid W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
$= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)$
$= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289$

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
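A minimal log-space sketch of the same comparison (base-10 logarithms, matching the slide's numbers; the probability values are hard-coded from the worked example above):

import math

def log_score(prior, cond_probs, counts):
    # sum of log10(prior) and count-weighted log10 conditional probabilities
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(cond_probs, counts))

counts = [2, 1, 2, 0, 0, 1]   # kill, bomb, kidnap, music, movie, tv in the test doc

terrorism = log_score(0.5, [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952], counts)
entertainment = log_score(0.5, [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111], counts)

print terrorism, entertainment   # roughly -4.24 and -6.63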

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
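A small sketch of that conversion from count vectors to binary presence vectors; the nested-list layout is mine, with one row per training document in the order of the table above:

count_vectors = [
    [2, 1, 3, 0, 0, 1],   # D0
    [1, 1, 1, 0, 0, 0],   # D1
    [1, 1, 2, 0, 1, 0],   # D2
    [0, 1, 0, 2, 1, 1],   # D3
    [0, 0, 1, 1, 1, 0],   # D4
    [0, 0, 0, 2, 2, 2],   # D5
]

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]

for row in binary_vectors:
    print row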

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
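A minimal sketch of the linear decision rule described above; the weight vector and bias here are hand-picked (hypothetical), not learned from data:

def predict(w, b, x):
    # classify x by the sign of the decision function f(x) = <w, x> + b
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f > 0 else -1

w = [0.4, -0.7]   # hypothetical weight vector
b = 0.1           # hypothetical bias

print predict(w, b, [2.0, 0.5])   # 1  (positive side of the hyperplane)
print predict(w, b, [0.5, 2.0])   # -1 (negative side of the hyperplane)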

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of the squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/" # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/" # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining
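A natural follow-up (and the reason nltk.classify.util is imported above) is to hold out the remaining quarter of the features and measure accuracy; a short sketch assuming the variables from the code above are in scope:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)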

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like

                      Actual
                      Positive               Negative
Predicted  Positive   True Positive (TP)     False Positive (FP)
           Negative   False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Accuracy = (562+436)/(562+77+225+436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP+FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP+FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Positive Precision = 562/(562+77) = 0.87

Negative Precision = 436/(225+436) = 0.65

Positive Recall = 562/(562+225) = 0.71

Negative Recall = 436/(77+436) = 0.84

Jaganadh G Elements of Text Mining
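A short Python sketch that recomputes these numbers from the confusion matrix above (the slides truncate to two decimal places, so the last digit can differ slightly from a rounded value):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)       # 0.7677
positive_precision = TP / (TP + FP)              # 0.8795
negative_precision = TN / (TN + FN)              # 0.6596
positive_recall = TP / (TP + FN)                 # 0.7141
negative_recall = TN / (TN + FP)                 # 0.8499

print accuracy, positive_precision, negative_precision
print positive_recall, negative_recall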

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$Error\ Rate = \frac{FP+FN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Error Rate = (77+225)/(562+77+225+436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Fall-out = 77/(77+436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1\ Score = 2 \times \frac{precision \times recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71)/(0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84)/(0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP+FP}$

Positive predictive value = 562/(562+77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN+FN}$

NPV = 436/(436+225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP+TN}$

Specificity = 436/(77+436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP+TP}$

FDR = 77/(77+562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

MCC = (562 × 436 − 77 × 225) / √((562+77)(562+225)(436+77)(436+225)) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
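For completeness, a hedged sketch of how an ROC curve and AUC are typically computed with scikit-learn; the labels and scores below are made up, and in a real pipeline the scores would come from a classifier's decision_function or predict_proba:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                      # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # made-up classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)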

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12 1809 the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12 1809 the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 55: Elements of Text Mining Part - I

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 56: Elements of Text Mining Part - I

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        minfo[gram] = math.log(len(words) * gramcount[gram] /
                               (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo
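For example, running it on the toy document from the n-gram slides (a small usage sketch; it assumes the ngrams and mutual_info functions above are already defined):

words = "John likes to watch movies Mary likes too".lower().split()
scores = mutual_info(words)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
# bigrams built only from words that occur once (e.g. 'to watch') score higher
# than bigrams containing the repeated word 'likes'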

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        tsc[gram] = (gramcount[gram] -
                     (1 / len(words)) * wordcount[w1] * wordcount[w2]) / \
                    math.sqrt(gramcount[gram])
    return tsc
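And a quick check on the same toy document (again assuming ngrams and tscore are already in scope):

print(tscore("John likes to watch movies Mary likes too".lower().split()))
# every bigram occurs only once in this tiny corpus, so all t-scores stay below 1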

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes the training set $T$ and emits the learned classification function $\gamma$: $\Gamma(T) = \gamma$.

$\gamma(c_i, d_j) = 1$ if $d_j$ belongs to $c_i$, $0$ otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: $P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event ($H$) can be predicted based on some evidence ($E$) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H \mid E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6   T  0.5    15  0.238095238  0.190476190  0.333333333   0.047619048   0.095238095   0.095238095
     E  0.5    12  0.055555556  0.111111111  0.111111111   0.333333333   0.277777778   0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring, given a class

Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
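A minimal sketch of the smoothed estimate, plugging in the numbers used above (4 occurrences of "kill" in the Terrorism class, 0 in Entertainment, class totals n_T = 15 and n_E = 12, |V| = 6 as stated in the model slide):

def laplace(count_w_c, n_c, vocab_size):
    # (count of word in class + 1) / (total words in class + |V|)
    return (count_w_c + 1.0) / (n_c + vocab_size)

print(laplace(4, 15, 6))   # P(kill|Terrorism)     = 5/21 ~ 0.2380
print(laplace(0, 12, 6))   # P(kill|Entertainment) = 1/18 ~ 0.0555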

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1     2       0      0     1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.
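The same computation can be scripted in a few lines (a sketch: the conditional probabilities are the Laplace-smoothed counts from the model table written as fractions, and the exponents come from the word counts of the test document):

probs_t = {'kill': 5/21.0, 'bomb': 4/21.0, 'kidnap': 7/21.0,
           'music': 1/21.0, 'movie': 2/21.0, 'tv': 2/21.0}
probs_e = {'kill': 1/18.0, 'bomb': 2/18.0, 'kidnap': 2/18.0,
           'music': 6/18.0, 'movie': 5/18.0, 'tv': 2/18.0}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    score = prior
    for word, count in doc.items():
        score *= cond_probs[word] ** count
    return score

print(posterior(0.5, probs_t, test_doc))   # ~5.7e-05 -> Terrorism
print(posterior(0.5, probs_e, test_doc))   # ~2.4e-07 (the slide, rounding intermediate values, gets 2.27e-07)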

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
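The same trick in code is a small change to the earlier sketch: add logarithms instead of multiplying raw probabilities (this reuses the probs_t, probs_e and test_doc dictionaries from the previous snippet; math.log10 matches the base-10 logs used above):

import math

def log_posterior(prior, cond_probs, doc):
    score = math.log10(prior)
    for word, count in doc.items():
        score += count * math.log10(cond_probs[word])
    return score

print(log_posterior(0.5, probs_t, test_doc))   # ~ -4.24
print(log_posterior(0.5, probs_e, test_doc))   # ~ -6.63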

The section on Naïve Bayes Classification is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$ and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a

a Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
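For instance, the count vectors from the multinomial table can be binarized in one line (a tiny sketch; the two rows shown are D0 and D1 from the training table):

counts = [[2, 1, 3, 0, 0, 1],   # D0
          [1, 1, 1, 0, 0, 0]]   # D1
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
print(binary)   # [[1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 0]]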

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
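These quantities can be inspected directly with scikit-learn's linear SVM (a sketch on made-up 2-D points; coef_, intercept_ and support_vectors_ are standard sklearn attributes):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [6.0, 6.5], [7.0, 6.0], [6.5, 7.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b of the decision hyperplane <w, x> + b = 0
print(clf.support_vectors_)        # the training points that sit closest to the margin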

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
             Positive  Negative  Total
Predicted
  Positive      562        77      639
  Negative      225       436      661
  Total         787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Using the same confusion matrix as above:

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
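The same numbers can be verified with a few lines of Python (the four cells of the confusion matrix above; the slides truncate the results to two decimals):

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

accuracy = (tp + tn) / (tp + fp + fn + tn)     # ~0.768
pos_precision = tp / (tp + fp)                 # ~0.880
neg_precision = tn / (tn + fn)                 # ~0.660
pos_recall = tp / (tp + fn)                    # ~0.714
neg_recall = tn / (tn + fp)                    # ~0.850
print(accuracy, pos_precision, neg_precision, pos_recall, neg_recall)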

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

Using the same confusion matrix as above:

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

Using the same confusion matrix as above:

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
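The remaining scores follow from the same four counts (a continuation of the earlier snippet; it reuses tp, fp, fn, tn and the precision/recall variables defined there):

error_rate = (fp + fn) / (tp + fp + fn + tn)                              # ~0.232
fall_out = fp / (fp + tn)                                                 # ~0.150
f1_pos = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)    # ~0.788
npv = tn / (tn + fn)                                                      # ~0.660
specificity = tn / (fp + tn)                                              # ~0.850
fdr = fp / (fp + tp)                                                      # ~0.120
mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5   # ~0.551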

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
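A minimal sketch with scikit-learn's metrics module (roc_curve and auc are standard sklearn functions; the labels and scores below are made up for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # area under the ROC curve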

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining



Page 58: Elements of Text Mining Part - I

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier or a classifier function $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have

$P(raining|dark\ cloud) = \frac{P(dark\ cloud|raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6   T 0.5   15 0.238095238  0.19047619  0.33333333   0.047619048 0.095238095 0.095238095
    E 0.5   12 0.05555556   0.11111111  0.11111111   0.33333333  0.27777778  0.11111111

|V| = the size of the vocabulary = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
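The same add-one estimates are easy to reproduce in Python. A minimal sketch (the counts 4 and 0, the class totals 15 and 12, and the vocabulary size 6 are taken from the worked example above; the helper name laplace is just for illustration):

def laplace(count, class_total, vocab_size):
    # Laplace (add-one) smoothed P(w|c)
    return (count + 1.0) / (class_total + vocab_size)

print(laplace(4, 15, 6))   # P(kill|Terrorism)     -> 0.2380...
print(laplace(0, 12, 6))   # P(kill|Entertainment) -> 0.0555...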

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
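This log-space scoring is easy to code. A minimal sketch (the smoothed probabilities are copied from the model table above and hard-coded purely for illustration; base-10 logarithms are used so the output matches the numbers above):

import math

p_w_given_c = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}
prior = {"Terrorism": 0.5, "Entertainment": 0.5}
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_posterior(cls):
    # log P(c) + sum over words of freq * log P(w|c)
    score = math.log10(prior[cls])
    for word, freq in test_doc.items():
        score += freq * math.log10(p_w_given_c[cls][word])
    return score

for cls in prior:
    print("%s %.4f" % (cls, log_posterior(cls)))  # Terrorism ~ -4.24, Entertainment ~ -6.63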

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
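If you want to try this event model directly, scikit-learn ships a BernoulliNB estimator. A minimal sketch (assuming scikit-learn is installed) on the binarized toy matrix above; column order: kill, bomb, kidnap, music, movie, tv:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ["Terrorism"] * 3 + ["Entertainment"] * 3

clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 1, 1, 0, 0, 1]]))  # a kill/bomb/kidnap style test document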

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $(x_i, y_i)_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10. This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # boolean (presence/absence) features, as in the multi-variate model
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

$Accuracy = \frac{562+436}{562+77+225+436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP+FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP+FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
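These values are straightforward to reproduce in code. A minimal sketch using the confusion-matrix counts from the table above:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)
pos_precision = TP / (TP + FP)
neg_precision = TN / (TN + FN)
pos_recall = TP / (TP + FN)
neg_recall = TN / (TN + FP)

print("accuracy      %.3f" % accuracy)        # 0.768
print("pos precision %.3f" % pos_precision)   # 0.879
print("neg precision %.3f" % neg_precision)   # 0.660
print("pos recall    %.3f" % pos_recall)      # 0.714
print("neg recall    %.3f" % neg_recall)      # 0.850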

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of instances that were classified wrongly.

$ErrorRate = \frac{FP+FN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

$ErrorRate = \frac{77+225}{562+77+225+436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

$Fall\text{-}out = \frac{77}{77+436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

$F1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a. http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = $2 \cdot \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

F1 Score Negative = $2 \cdot \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP+FP}$

$Positive\ predictive\ value = \frac{562}{562+77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN+FN}$

$NPV = \frac{436}{436+225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP+TN}$

$Specificity = \frac{436}{77+436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. As a classifier metric, FDR is the proportion of predicted positives that are actually false positives.

$FDR = \frac{FP}{FP+TP}$

$FDR = \frac{77}{77+562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
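The same numbers can be checked with scikit-learn's metrics module. A minimal sketch (the label vectors below are rebuilt so that they reproduce the confusion matrix above; this reconstruction is only for illustration):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

# 1 = Positive, 0 = Negative; laid out so that TP=562, FP=77, FN=225, TN=436
y_true = [1] * 562 + [0] * 77 + [1] * 225 + [0] * 436
y_pred = [1] * 562 + [1] * 77 + [0] * 225 + [0] * 436

print(accuracy_score(y_true, y_pred))      # ~0.768
print(precision_score(y_true, y_pred))     # ~0.879
print(recall_score(y_true, y_pred))        # ~0.714
print(f1_score(y_true, y_pred))            # ~0.788
print(matthews_corrcoef(y_true, y_pred))   # ~0.551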

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
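scikit-learn can also compute the ROC curve and the AUC directly from predicted scores. A minimal, self-contained sketch (the labels and scores below are made-up toy values, not taken from the slides):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]                       # toy labels
y_score = [0.9, 0.8, 0.7, 0.65, 0.3, 0.6, 0.5, 0.4, 0.2, 0.1]  # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print("FPR %.2f  TPR %.2f" % (f, t))   # points of the ROC curve (X = FPR, Y = TPR)
print(roc_auc_score(y_true, y_score))      # area under the curve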

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # chunk each POS-tagged sentence; binary=True labels chunks simply as NE
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # on NLTK 3.x use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 59: Elements of Text Mining Part - I

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 60: Elements of Text Mining Part - I

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
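A small sketch (not in the slides) of how the count table could be binarized in Python:

# rows D0..D5, columns kill, bomb, kidnap, music, movie, tv
count_vectors = [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
                 [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]
binary_vectors = [[1 if freq > 0 else 0 for freq in doc] for doc in count_vectors]
print(binary_vectors)   # reproduces the presence/absence table above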

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, such as TF-IDF, in the multinomial Naive Bayes model.
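scikit-learn ships estimators for both event models (MultinomialNB and BernoulliNB); a toy sketch on the running example, where the 0/1 class labels are my own encoding:

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

X = np.array([[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]])
y = np.array([0, 0, 0, 1, 1, 1])          # 0 = Terrorism, 1 = Entertainment
test = np.array([[2, 1, 2, 0, 0, 1]])     # the test document Dt

print(MultinomialNB(alpha=1.0).fit(X, y).predict(test))             # multinomial event model
print(BernoulliNB(alpha=1.0, binarize=0.5).fit(X, y).predict(test))  # Bernoulli event model
# both print [0], i.e. Terrorism, for this toy data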

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
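A minimal sketch (mine, with made-up 2-D points) of the decision rule sign(⟨w, x⟩ + b) using scikit-learn's LinearSVC:

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 2.5],         # yi = +1 side
              [-2.0, -1.0], [-1.0, -2.5], [-3.0, -2.0]])  # yi = -1 side
y = np.array([1, 1, 1, -1, -1, -1])

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

x_new = np.array([1.0, 1.0])
print(np.sign(np.dot(w, x_new) + b))   # sign of <w, x> + b gives the class
print(svm.predict([x_new]))            # the same decision through the estimator API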

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

# bag of 1- to 3-grams, English stop words removed, lowercased
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
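Neither of the two listings above measures accuracy on unseen data; one possible follow-up (my own sketch, reusing the classifier, dir_data and categories variables defined in the listings above) is to hold out part of the corpus:

from sklearn.datasets import load_files
try:
    from sklearn.model_selection import train_test_split    # newer scikit-learn
except ImportError:
    from sklearn.cross_validation import train_test_split   # older releases

data = load_files(dir_data, categories=categories, shuffle=True)
train_X, test_X, train_y, test_y = train_test_split(data.data, data.target,
                                                    test_size=0.25)
classifier.fit(train_X, train_y)
print(classifier.score(test_X, test_y))   # mean accuracy on the held-out 25%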

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK.12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # boolean "word present" features (multi-variate Bernoulli style)
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                         Positive              Negative
Predicted   Positive     True Positive (TP)    False Positive (FP)
            Negative     False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) ≈ 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Positive Precision = 562 / (562 + 77) ≈ 0.87

Negative Precision = 436 / (225 + 436) ≈ 0.65

Positive Recall = 562 / (562 + 225) ≈ 0.71

Negative Recall = 436 / (77 + 436) ≈ 0.84
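These numbers are easy to recompute; a short sketch (mine) from the confusion-matrix counts:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)    # positive precision
recall    = TP / (TP + FN)    # positive recall
print(accuracy, precision, recall)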

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) ≈ 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Fall-out = 77 / (77 + 436) ≈ 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) ≈ 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) ≈ 0.73
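A one-line check (my sketch) of the F1 values from the precision and recall figures above:

def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(0.87, 0.71), 2))   # positive F1 ~ 0.78
print(round(f1(0.65, 0.84), 2))   # negative F1 ~ 0.73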

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) ≈ 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) ≈ 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) ≈ 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) ≈ 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) ≈ 0.55
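The MCC value can be checked with a few lines of Python (my sketch):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))   # ~ 0.55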

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
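As an illustration only (the labels and scores below are made up, not from the slides), AUC can be computed with scikit-learn's sklearn.metrics.roc_auc_score:

from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]                    # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # predicted score for the positive class
print(roc_auc_score(y_true, y_score))                 # ~ 0.94 for this toy ranking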

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # use tree.label() in newer NLTK releases
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print(entities)
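With the default NLTK models installed, this typically prints entities such as ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln'], though the exact output depends on the tagger and chunker versions.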

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]   # keep only the term strings
print(terms)

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Page 62: Elements of Text Mining Part - I

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                 Positive               Negative
Predicted
  Positive   True Positive (TP)    False Positive (FP)
  Negative   False Negative (FN)   True Negative (TN)
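
As a quick aside (not on the original slide), such a table can be produced from predicted and actual labels with scikit-learn. Note that sklearn.metrics.confusion_matrix puts actual classes on the rows and predicted classes on the columns, i.e. the transpose of the layout above:

# Sketch: building a confusion matrix from label lists (assumes scikit-learn)
from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# labels fixes the row/column order: rows = actual, columns = predicted
print confusion_matrix(actual, predicted, labels=['pos', 'neg'])
# [[2 1]
#  [1 2]]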

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we actually identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Positive Precision = 562 / (562 + 77)  = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall    = 562 / (562 + 225) = 0.71
Negative Recall    = 436 / (77 + 436)  = 0.84
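
The same figures can be recomputed in a few lines of Python; a minimal sketch using the counts from the table above:

# Sketch: recomputing the slide's metrics from the confusion-matrix counts
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy           = (TP + TN) / (TP + FP + FN + TN)
positive_precision = TP / (TP + FP)
negative_precision = TN / (TN + FN)
positive_recall    = TP / (TP + FN)
negative_recall    = TN / (TN + FP)

print accuracy, positive_precision, negative_precision, positive_recall, negative_recall
# -> 0.768 0.879 0.660 0.714 0.850 (the slides show these truncated to two decimals)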

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that the classifier got wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure: the proportion of negative test results that are true negatives.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, it is the proportion of predicted positives that are actually false positives.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
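
A quick check of this value in Python (a sketch using the same counts as above):

# Sketch: Matthews Correlation Coefficient for the worked example
import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)  # 0.55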

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
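
With scikit-learn both the ROC points and the AUC can be obtained from a classifier's scores. A minimal sketch (not from the slides; labels and scores are made up for illustration, and a reasonably recent scikit-learn is assumed):

# Sketch: ROC curve points and AUC from true labels and predicted scores
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels (1 = positive)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # x = FPR, y = TPR for the plot
print roc_auc_score(y_true, y_score)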

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    # chunk each sentence and collect the named-entity (NE) subtrees
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # on NLTK 3 use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]  # keep only the term strings, drop the counts
print terms

13http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 63: Elements of Text Mining Part - I

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 64: Elements of Text Mining Part - I

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 65: Elements of Text Mining Part - I

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier or a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining|dark cloud) = P(dark cloud|raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H|E1, E2, ..., En) = P(E1, E2, ..., En|H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H|E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining
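A tiny numeric sketch of the rule above, using made-up probabilities for the raining/dark-cloud example (the numbers below are assumptions, not taken from the slides):

# assumed numbers: P(raining) = 0.3, P(dark cloud | raining) = 0.9, P(dark cloud) = 0.4
p_h = 0.3          # prior P(H): probability of raining
p_e_given_h = 0.9  # likelihood P(E|H): dark cloud given raining
p_e = 0.4          # evidence P(E): probability of dark cloud

p_h_given_e = (p_e_given_h * p_h) / p_e
print p_h_given_e  # posterior P(H|E) = 0.675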

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.23809524  0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
     E  0.5    12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary words = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
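A minimal sketch in plain Python that rebuilds Laplace-smoothed class-conditional probabilities from the toy training table above; build_model and its variable names are illustrative, not from any library, and the class word total is taken as the sum of all counts in that class.

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']

train = [
    ([2, 1, 3, 0, 0, 1], 'Terrorism'),
    ([1, 1, 1, 0, 0, 0], 'Terrorism'),
    ([1, 1, 2, 0, 1, 0], 'Terrorism'),
    ([0, 1, 0, 2, 1, 1], 'Entertainment'),
    ([0, 0, 1, 1, 1, 0], 'Entertainment'),
    ([0, 0, 0, 2, 2, 2], 'Entertainment'),
]

def build_model(train, vocab):
    model = {}
    classes = set(label for _, label in train)
    for c in classes:
        docs = [counts for counts, label in train if label == c]
        word_totals = [sum(col) for col in zip(*docs)]   # per-word counts in class c
        n_c = sum(word_totals)                           # total word count in class c
        prior = float(len(docs)) / len(train)            # P(c)
        # Laplace smoothing: add 1 to every word count and |V| to the denominator
        cond = dict((w, (word_totals[i] + 1.0) / (n_c + len(vocab)))
                    for i, w in enumerate(vocab))
        model[c] = (prior, cond)
    return model

model = build_model(train, vocab)
prior_t, cond_t = model['Terrorism']
print prior_t          # 0.5
print cond_t['kill']   # 5/21 = 0.238095..., matching the P(kill|Terrorism) example above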

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
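A minimal sketch of the log-space scoring described above; the dictionaries below simply copy the smoothed probabilities and priors from the earlier slides.

import math

# smoothed conditional probabilities copied from the earlier slides
cond = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}

# the test document Dt from the previous slide
test_counts = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(c):
    score = math.log10(prior[c])
    for word, freq in test_counts.items():
        score += freq * math.log10(cond[c][word])
    return score

scores = dict((c, log_score(c)) for c in prior)
print scores                        # about -4.24 for Terrorism and -6.63 for Entertainment
print max(scores, key=scores.get)   # 'Terrorism'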

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
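As a sketch of the same idea with a library (assuming scikit-learn is installed), sklearn's BernoulliNB can be fit directly on the binary document space shown above; the matrix below just copies that table.

from sklearn.naive_bayes import BernoulliNB

# binary document space from the table above (rows D0..D5, columns = vocabulary)
X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])   # binarized test document Dt -> ['Terrorism']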

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
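A minimal sketch of the decision rule ⟨w, x⟩ + b described above; the weight vector, bias and input below are made-up values, only to show how the sign of f(x) yields the label.

def decision(w, x, b):
    # f(x) = <w, x> + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [0.4, -0.7], 0.1   # assumed weight vector and bias
x = [1.0, 2.0]            # an input feature vector
f = decision(w, x, b)
label = +1 if f > 0 else -1
print f, label            # -0.9 -1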

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive             Negative
Predicted  Positive    True Positive (TP)   False Positive (FP)
           Negative    False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55

Jaganadh G Elements of Text Mining
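A short sketch that recomputes the metrics from the preceding slides for the confusion matrix TP = 562, FP = 77, FN = 225, TN = 436:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
fallout     = FP / (FP + TN)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc
# 0.768..., 0.879..., 0.714..., 0.788..., 0.551... (cf. 0.76, 0.87, 0.71, 0.78, 0.55 above)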

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
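A minimal sketch (assuming scikit-learn) of computing ROC points and the AUC for a toy set of true labels and hypothetical classifier scores; the labels and scores below are made up for illustration only.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                         # actual labels (made up)
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # X axis = FPR, Y axis = TPR
print zip(fpr, tpr)
print roc_auc_score(y_true, y_score)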


Page 66: Elements of Text Mining Part - I

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 67: Elements of Text Mining Part - I

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                  Actual
                          Positive                Negative
Predicted   Positive      True Positive (TP)      False Positive (FP)
            Negative      False Negative (FN)     True Negative (TN)
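As a minimal sketch (the label lists below are invented, just to show the call), scikit-learn can build such a table from actual and predicted labels; note that sklearn.metrics.confusion_matrix puts actual classes in rows and predicted classes in columns, i.e. transposed relative to the layout above.

from sklearn.metrics import confusion_matrix

# Hypothetical gold labels and classifier predictions
y_true = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))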

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we actually identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
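These figures can be reproduced with a few lines of Python; a minimal sketch using the TP, FP, FN, TN counts from the table above (the slides truncate to two decimals, so the printed values may differ in the last digit):

# Counts from the confusion matrix above
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)
positive_precision = TP / (TP + FP)
negative_precision = TN / (TN + FN)
positive_recall = TP / (TP + FN)
negative_recall = TN / (TN + FP)

print(accuracy)            # about 0.77
print(positive_precision)  # about 0.88
print(negative_precision)  # about 0.66
print(positive_recall)     # about 0.71
print(negative_recall)     # about 0.85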

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of items that were classified incorrectly.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
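The same computation as a minimal Python sketch, reusing the precision and recall values from the worked example above:

def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.87, 0.71), 2))  # positive class: 0.78
print(round(f1_score(0.65, 0.84), 2))  # negative class: 0.73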

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, it is the proportion of predicted positives that are actually false positives.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
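A minimal sketch of the same calculation in Python, using the counts from the running example:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))  # 0.55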

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
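A minimal sketch with scikit-learn's roc_curve and auc; the true labels and scores below are made-up values, only to illustrate the calls:

from sklearn.metrics import roc_curve, auc

# Hypothetical gold labels (1 = positive) and classifier scores for the positive class
y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates per threshold
print(auc(fpr, tpr))  # area under the ROC curve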

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 70: Elements of Text Mining Part - I

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
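
A short sketch (assuming the word-count table from the earlier multinomial example) of how the count vectors can be binarized into this Bernoulli representation; the variable names are made up for illustration.

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
count_vectors = [
    [2, 1, 3, 0, 0, 1],  # D0
    [1, 1, 1, 0, 0, 0],  # D1
    [1, 1, 2, 0, 1, 0],  # D2
    [0, 1, 0, 2, 1, 1],  # D3
    [0, 0, 1, 1, 1, 0],  # D4
    [0, 0, 0, 2, 2, 2],  # D5
]

# Replace each frequency with 1 if the word occurs at least once, else 0
binary_vectors = [[1 if count > 0 else 0 for count in doc] for doc in count_vectors]
for doc in binary_vectors:
    print(doc)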

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial Naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called the support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
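
A tiny sketch of the linear decision rule sign(⟨w, x⟩ + b); the weight vector and bias below are made-up values, only to illustrate the sign test, not a trained model.

# Linear decision rule for a (hypothetical, already-trained) linear SVM
w = [0.4, -0.2, 0.7]   # weight vector, would normally be learned from the training set
b = -0.1               # bias

def predict(x):
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b  # <w, x> + b
    return +1 if activation > 0 else -1

print(predict([1.0, 0.5, 0.2]))   # falls on the +1 side of the hyperplane
print(predict([0.0, 0.9, 0.0]))   # falls on the -1 side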

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

# Count word uni-, bi- and trigram features, dropping English stop words
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Boolean (presence/absence) features, as in the multi-variate model
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted  Positive     True Positive (TP)    False Positive (FP)
           Negative     False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
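
A brief sketch (not from the slides) that recomputes these figures from the confusion matrix above; the slide values are these numbers truncated to two decimal places.

# Confusion matrix values from the table above
TP, FP, FN, TN = 562, 77, 225, 436

accuracy      = float(TP + TN) / (TP + FP + FN + TN)  # ~0.768
pos_precision = float(TP) / (TP + FP)                 # ~0.879
neg_precision = float(TN) / (TN + FN)                 # ~0.660
pos_recall    = float(TP) / (TP + FN)                 # ~0.714
neg_recall    = float(TN) / (TN + FP)                 # ~0.850

for name, value in [('accuracy', accuracy), ('pos precision', pos_precision),
                    ('neg precision', neg_precision), ('pos recall', pos_recall),
                    ('neg recall', neg_recall)]:
    print("%s: %.3f" % (name, value))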

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that were wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. (a)

F1 Score = 2 × (precision × recall) / (precision + recall)

(a) http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55
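
A companion sketch that recomputes the remaining rates and the MCC from the same confusion matrix, as a quick check on the arithmetic above.

import math

TP, FP, FN, TN = 562, 77, 225, 436

error_rate  = float(FP + FN) / (TP + FP + FN + TN)  # ~0.23
fall_out    = float(FP) / (FP + TN)                 # ~0.15
specificity = float(TN) / (FP + TN)                 # ~0.85
fdr         = float(FP) / (FP + TP)                 # ~0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print("MCC: %.2f" % mcc)  # 0.55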

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
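
A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the labels and scores below are toy values made up for illustration.

from sklearn.metrics import roc_curve, auc

# Toy ground-truth labels and classifier scores (made up for illustration)
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates per threshold
print(auc(fpr, tpr))  # area under the ROC curve, between 0 and 1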

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # Sentence-split, POS-tag and chunk named entities (binary=True gives plain "NE" labels)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008

Jaganadh G Elements of Text Mining

Page 71: Elements of Text Mining Part - I

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 72: Elements of Text Mining Part - I

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 73: Elements of Text Mining Part - I

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a: A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
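
The binary table above can be handed directly to a Bernoulli Naive Bayes implementation. A minimal sketch with scikit-learn's BernoulliNB; the test vector at the end is an assumption, chosen to mirror the earlier test document.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: kill, bomb, kidnap, music, movie, tv (the binary table above).
X = np.array([
    [1, 1, 1, 0, 0, 1],   # D0 Terrorism
    [1, 1, 1, 0, 0, 0],   # D1 Terrorism
    [1, 1, 1, 0, 1, 0],   # D2 Terrorism
    [0, 1, 0, 1, 1, 1],   # D3 Entertainment
    [0, 0, 1, 1, 1, 0],   # D4 Entertainment
    [0, 0, 0, 1, 1, 1],   # D5 Entertainment
])
y = ["Terrorism"] * 3 + ["Entertainment"] * 3

model = BernoulliNB()
model.fit(X, y)

# A test document that contains kill, bomb, kidnap and tv (presence/absence only).
print(model.predict([[1, 1, 1, 0, 0, 1]]))   # ['Terrorism']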

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial Naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
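
A small sketch of the decision function f(x) = ⟨w, x⟩ + b, fitted with scikit-learn's LinearSVC on a tiny 2-D data set; both the points and the labels are invented for illustration.

import numpy as np
from sklearn.svm import LinearSVC

# Tiny linearly separable toy data (invented).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# The sign of <w, x> + b gives the predicted label.
x_new = np.array([1.0, 2.0])
print(np.dot(w, x_new) + b)              # positive, so class +1
print(svm.decision_function([x_new]))    # the same quantity via the estimator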

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way. If there is a clear blank space available on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane marks the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks,11 as sketched after the footnotes below.

10: This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11: http://www.statsoft.com/textbook/support-vector-machines
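
A minimal sketch of the kernel trick with scikit-learn's SVC, on an XOR-style toy layout that no straight line can separate; the data and the gamma/C values are assumptions chosen only for illustration.

import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: the two classes cannot be split by a single straight line.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

print(linear.score(X, y))  # below 1.0: no linear boundary fits XOR
print(rbf.score(X, y))     # 1.0: the RBF kernel separates the classes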

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Every word present in the document becomes a boolean feature.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

12: This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
               Positive                Negative
Predicted
  Positive     True Positive (TP)     False Positive (FP)
  Negative     False Negative (FN)    True Negative (TN)
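
For reference, scikit-learn can build this table from lists of predicted and actual labels. A small sketch with invented labels; note that confusion_matrix puts the actual classes on the rows, i.e. the transpose of the layout above.

from sklearn.metrics import confusion_matrix

# Invented toy labels, just to show the call.
actual    = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "pos", "pos", "neg", "neg"]

print(confusion_matrix(actual, predicted, labels=["pos", "neg"]))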

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
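
These numbers are easy to verify from the raw counts; a minimal sketch (the slide's values are truncated rather than rounded, so the exact figures differ slightly in the last digit):

tp, fp, fn, tn = 562, 77, 225, 436

accuracy = (tp + tn) / float(tp + fp + fn + tn)
pos_precision = tp / float(tp + fp)
pos_recall = tp / float(tp + fn)
pos_f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)

print(accuracy)        # ~ 0.768 (reported as 0.76)
print(pos_precision)   # ~ 0.879 (reported as 0.87)
print(pos_recall)      # ~ 0.714 (reported as 0.71)
print(pos_f1)          # ~ 0.788 (reported as 0.78)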

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that are wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a: http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
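
The value above can be checked directly from the counts; a short sketch:

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(mcc)   # ~ 0.55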

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
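
A minimal sketch of computing the ROC points and the AUC with scikit-learn, using invented labels and scores purely to show the API:

from sklearn.metrics import roc_curve, roc_auc_score

# Invented true labels and classifier scores.
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # ROC points: FPR on X, TPR on Y
print(roc_auc_score(y_true, y_score))  # area under that curve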

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)), binary=True)
              for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13: http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 74: Elements of Text Mining Part - I

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 75: Elements of Text Mining Part - I

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 76: Elements of Text Mining Part - I

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # each word becomes a boolean "presence" feature (multi-variate Bernoulli style)
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual Positive        Actual Negative
Predicted Positive      True Positive (TP)     False Positive (FP)
Predicted Negative      False Negative (FN)    True Negative (TN)
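The same kind of table can be produced with scikit-learn (a sketch with made-up labels); note that confusion_matrix puts the actual classes on the rows and the predicted classes on the columns, the transpose of the layout above:

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# row/column order follows the labels argument: here 'pos' first, then 'neg'
print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))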

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
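The same figures can be recomputed directly from the counts in the table (a small sketch; the trailing .0 keeps the divisions in floating point under Python 2):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

print(TP / (TP + FP))   # positive precision
print(TN / (TN + FN))   # negative precision
print(TP / (TP + FN))   # positive recall
print(TN / (TN + FP))   # negative recall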

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions the classifier got wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
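A quick check of the value (a sketch using only the standard library and the counts from the table above):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))   # 0.55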

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
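A sketch of both ideas with scikit-learn (the labels and scores below are made up; in practice y_score would come from a classifier's decision_function or predict_proba):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)                             # x coordinates of the ROC curve
print(tpr)                             # y coordinates of the ROC curve
print(roc_auc_score(y_true, y_score))  # area under that curve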

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # chunk each sentence and keep only the subtrees labelled NE
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]  # keep only the term strings
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 77: Elements of Text Mining Part - I

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 78: Elements of Text Mining Part - I

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 79: Elements of Text Mining Part - I

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.


Page 81: Elements of Text Mining Part - I

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 82: Elements of Text Mining Part - I

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 83: Elements of Text Mining Part - I

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 84: Elements of Text Mining Part - I

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual Positive   Actual Negative   Total
Predicted Positive      562               77                639
Predicted Negative      225               436               661
Total                   787               513               1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
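The same per-class figures in code; a minimal sketch reusing the counts above. For the Negative class the roles of the two categories are swapped, so TN acts as its "true positive":

TP, FP, FN, TN = 562, 77, 225, 436

pos_precision = float(TP) / (TP + FP)   # Positive Precision
pos_recall = float(TP) / (TP + FN)      # Positive Recall
neg_precision = float(TN) / (TN + FN)   # Negative Precision
neg_recall = float(TN) / (TN + FP)      # Negative Recall

print pos_precision, pos_recall, neg_precision, neg_recall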

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that are wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive      562               77                639
Predicted Negative      225               436               661
Total                   787               513               1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive      562               77                639
Predicted Negative      225               436               661
Total                   787               513               1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73
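A one-line check of the harmonic mean; a minimal sketch using the rounded precision and recall values from the previous slides:

def f1(p, r):
    # harmonic mean of precision and recall
    return 2.0 * p * r / (p + r)

print round(f1(0.87, 0.71), 2)   # 0.78 for the Positive class
print round(f1(0.65, 0.84), 2)   # 0.73 for the Negative class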

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 * 436 - 77 * 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
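All of these quantities come straight from the four cells of the confusion matrix. A minimal sketch collecting them in one place, reusing the counts of the running example:

import math

TP, FP, FN, TN = 562, 77, 225, 436

fall_out = float(FP) / (FP + TN)        # false positive rate
specificity = float(TN) / (FP + TN)     # true negative rate
npv = float(TN) / (TN + FN)             # negative predictive value
fdr = float(FP) / (FP + TP)             # false discovery rate
mcc = (TP * TN - FP * FN) / math.sqrt(
    float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))

print fall_out, specificity, npv, fdr, mcc   # mcc is about 0.55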

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
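A minimal sketch of how the curve and its area can be computed with scikit-learn's roc_curve and auc; y_true and y_score below are toy values standing in for actual labels and classifier scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)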

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W Church and Robert L Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 85: Elements of Text Mining Part - I

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 86: Elements of Text Mining Part - I

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 87: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 88: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 89: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 90: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child " \
       "of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)              # each item is a tuple whose first element is the term
terms = [term[0] for term in terms]  # keep just the term strings
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 92: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 93: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 94: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 95: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 96: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 97: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 98: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 99: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 100: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 101: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining
