
Elements of Text Mining, Part I
Basics of Text Mining with Python

Jaganadh G
http://jaganadhg.in
CC BY-SA

Jaganadh G Elements of Text Mining

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.

Tokenizing text with Python

import re

def tokenize(text):
    tokenizer = re.compile(r'\W+')
    return tokenizer.split(text.lower())

doc = "John likes to watch movies. Mary likes too."
words = tokenize(doc)
print words

Jaganadh G Elements of Text Mining


Twokenization

The rise of social media introduced new orthographic patterns in digital text. A typical example is a tweet, where people use abbreviated forms of words, emoticons, hash-tags, etc. Generic text tokenization techniques won't yield good results in separating words in social media text like tweets. A good social media tokenizer has to take care of emoticons, hash-tags, shortened URLs, etc.

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweet, pc=True):
    twokenizer = Tokenizer(preserve_case=pc)
    return twokenizer.tokenize(tweet)

tweet = "RT @USER Relevant 2 clinical text > Recursive neural networks Deep Learning Natural Language Processing #NLProc http://t.co/"
twokens = twokenize(tweet)

1. https://bitbucket.org/jaganadhg/twittertokenize

Jaganadh G Elements of Text Mining

Sentence Tokenization

Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of . ? ! (and maybe ; : --)

Move the boundary after following quotation marks, if any.

Disqualify a period boundary in the following circumstances:

If it is preceded by a known abbreviation of a sort that does not normally occur word-finally, but is commonly followed by a capitalized proper name, such as Prof. or vs.
If it is preceded by a known abbreviation and not followed by an uppercase word. This will deal correctly with most usages of abbreviations like etc. or Jr. which can occur sentence-medially or finally.

Disqualify a boundary with a ? or ! if:

It is followed by a lowercase letter (or a known name).

Regard other putative sentence boundaries as sentence boundaries.

2. Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
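A rough Python sketch of these heuristics (a minimal illustration only; the abbreviation list is made up and far from complete, and the NLTK punkt tokenizer shown on the next slide is the practical option):

import re

# Illustrative abbreviation list; a real system needs a much larger one.
ABBREVS = set(['prof', 'dr', 'mr', 'mrs', 'vs', 'etc', 'jr'])

def naive_sent_tokenize(text):
    sents = []
    start = 0
    # Step 1: putative boundaries after . ? ! optionally followed by a quote
    for match in re.finditer(r'[.?!]["\']?\s+', text):
        end = match.end()
        before = text[start:match.start()].split()
        prev_word = before[-1].lower().rstrip('.') if before else ''
        next_char = text[end:end + 1]
        # Step 2: disqualify boundaries after known abbreviations,
        # or when the next word does not start with an uppercase letter
        if prev_word in ABBREVS or (next_char and not next_char.isupper()):
            continue
        sents.append(text[start:end].strip())
        start = end
    sents.append(text[start:].strip())
    return [s for s in sents if s]

print naive_sent_tokenize("Prof. Smith arrived. He met Dr. Jones etc. They talked.")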

Jaganadh G Elements of Text Mining

Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltk.data import load

tokenizer = load('tokenizers/punkt/english.pickle')

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

sents = tokenizer.tokenize(text)

for sent in sents:
    print sent

Jaganadh G Elements of Text Mining

Counting Words

Word Count - Python

def word_count(text):
    words = tokenize(text)
    word_freq = dict([(word, words.count(word)) for word
        in set(words)])
    return word_freq

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

wc = word_count(text)

for word, count in wc.items():
    print word, "\t", count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

wl = word_length(text)

for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), i.e. simply the total number of words.

Then p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."

wp = word_propo(text)

for word, propo in wp.items():
    print word, "\t\t", propo

Jaganadh G Elements of Text Mining


Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the
count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / float(len(set(words)))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining


Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()

topwords = top_words(text, n=50)

for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    nw = len(tfw)  # number of words to plot
    y = []
    for item in range(nw):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, nw + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    nw = len(tfw)
    y = []
    for item in range(nw):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()

plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list 3.

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=.2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()

dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3. Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from the GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text, called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()

create_tag_cloud(gpl)

4. https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a. Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

     w1   w2   w3   ...  wn
w1   m11  m12  m13  ...  m1n
w2   m21  m22  m23  ...  m2n
w3   m31  m32  m33  ...  m3n
...
wn   mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four
associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a. http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

a. Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5. http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency - inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. where tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)

V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| × F}

where |D| is the cardinality of the document space:

M_{|D| × F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the
normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm
(magnitude or length) of the vector v in the L^p space (Lebesgue spaces). a

a. http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)

With the p-norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

which can be written compactly as:

‖v‖_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6. http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v̂_d1 = v_d1 / ‖v_d1‖_2

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂_d1 now has an L2 norm of ‖v̂_d1‖_2 = 1.0.
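This normalization can be checked with a few lines of NumPy (a minimal sketch using the example vector v_d1 above):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.linalg.norm(v_d1)      # L2 norm = sqrt(8), about 2.8284
v_hat = v_d1 / norm              # the normalized vector
print v_hat                      # ~ [0, 0, 0.354, 0.707, 0.354, 0.354, 0.354]
print np.linalg.norm(v_hat)      # 1.0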

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2 / 1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
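These values can be reproduced with a short Python sketch (a minimal illustration; the two term-frequency vectors and the vocabulary order are the ones from the example above, and math.log is the natural logarithm, matching the numbers on this slide):

import math

# Term-frequency vectors for d1 and d2 in vocabulary order:
# football, games, john, likes, mary, movies, watch
tf = [[0, 0, 1, 2, 1, 1, 1],
      [1, 1, 1, 1, 0, 0, 1]]

n_docs = len(tf)
n_terms = len(tf[0])

# document frequency: in how many documents each term occurs
df = [sum(1 for doc in tf if doc[t] > 0) for t in range(n_terms)]

# idf(t) = log(|D| / (1 + df(t))), as defined on the previous slides
idf = [math.log(n_docs / float(1 + df[t])) for t in range(n_terms)]
print idf      # [0.0, 0.0, -0.405..., -0.405..., 0.0, 0.0, 0.0]

# tf-idf weights: tf(t, d) * idf(t)
tfidf = [[doc[t] * idf[t] for t in range(n_terms)] for doc in tf]
print tfidf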

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| × F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1) ]
[ tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]

×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tfidf = M_tfidf / ‖M_tfidf‖_2 (each document row is divided by its L2 norm)

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7. http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a. http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

Example:

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

Example:

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories 8.

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8. http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier (or classifier function) γ that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or
Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be
predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the
evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the
evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds. Then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C              P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
Terrorism      0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
Entertainment  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class:
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a
uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill  bomb  kidnap  music  movie  tv   C
Dt         2     1     2       0      0      1    ?

To find the posterior probability:

P(ci | W) = P(ci) × Π_{j=1}^{|V|} P(wj | ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism | W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² ×
P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment | W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ ×
P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will
train the classifier with thousands of documents. In such cases the conditional probability values
will be too low for the CPU to handle. This problem is called underflow. To resolve the
problem we can take the logarithm of the probabilities:

P(Terrorism | W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment | W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism".
From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes
Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
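A small Python sketch of this calculation (a minimal illustration of multinomial Naïve Bayes with Laplace smoothing in log space; the per-class word totals follow the worked example above, and log base 10 is used so the scores match the slide):

from __future__ import division
import math

# Per-class word totals (kill, bomb, kidnap, music, movie, tv) as used
# in the worked example: n_Terrorism = 15, n_Entertainment = 12.
counts = {
    'Terrorism':     [4, 3, 6, 0, 1, 1],
    'Entertainment': [0, 1, 1, 5, 4, 1],
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
V = 6                                 # vocabulary size
test_doc = [2, 1, 2, 0, 0, 1]         # the test document Dt

for cls, wc in counts.items():
    n = sum(wc)
    score = math.log10(prior[cls])    # log prior
    for freq, c in zip(test_doc, wc):
        p_w_c = (c + 1) / (n + V)     # Laplace-smoothed P(w | class)
        score += freq * math.log10(p_w_c)
    print cls, score                  # ~ -4.24 (Terrorism), ~ -6.63 (Entertainment)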

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt
from the vocabulary. Dimension t of the vector for document di is written Bit, and is
either 0 or 1, indicating whether word wt occurs at least once in the document. With
such a document representation we make the naive Bayes assumption: that the
probability of each word occurring in a document is independent of the occurrence of
other words in a document. a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and
Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
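A minimal scikit-learn sketch of this setup (an illustration only, using BernoulliNB on the binary document vectors from the table above; alpha=1.0 is Laplace smoothing):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary document vectors (kill, bomb, kidnap, music, movie, tv) from the table above
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

clf = BernoulliNB(alpha=1.0).fit(X, y)

# Test document Dt, reduced to presence/absence of each vocabulary word
print clf.predict([[1, 1, 1, 0, 0, 1]])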

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency
information in documents. In the multinomial model a document is an ordered sequence of word events
drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the
probability of each word event in a document is independent of the word's context and position in the
document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple
bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial
naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training
set, attempts to generalize, and makes correct predictions on new data. Consider a
training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the
corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our
input feature vectors are linearly separable, that is, there exists a function
f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can
separate positive and negative examples, but all of them are not equal. SVM tries to
find the particular hyperplane that maximizes the margin. The vectors which are closest to
the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we
have to use kernel tricks to find soft margins.
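A tiny sketch of this idea on linearly separable 2-D points (a minimal illustration with scikit-learn's LinearSVC; the points are made up):

import numpy as np
from sklearn.svm import LinearSVC

# Two linearly separable clusters of made-up 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],      # class -1
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = LinearSVC().fit(X, y)
print clf.coef_, clf.intercept_                # the learned w and b of <w, x> + b = 0
print clf.predict([[1.2, 1.1], [5.8, 5.9]])    # [-1  1]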

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are
mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that
it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance
from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything
on one side of the separating hyperplane belongs to one category and everything on the other side
belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter
(squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin.
The figures closest to the separating hyperplane are known as support vectors 10. If the data is not
linearly separable we have to use kernel tricks 11.

10. This is just a non-theoretical definition, "just to get an idea only". For more, refer to
http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python and sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python and NLTK 12:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more
detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of
an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A
confusion matrix for the classifier will look like:

                         Actual
                         Positive                Negative
Predicted   Positive     True Positive (TP)      False Positive (FP)
            Negative     False Negative (FN)     True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent
to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
Total                    787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is
the number of correct results divided by the number of all returned results, and r is the
number of correct results divided by the number of results that should have been
returned. The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a. http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
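These metrics are easy to compute directly from the confusion matrix counts; a minimal sketch using the numbers from the slides above:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)        # 0.76
precision_pos = TP / (TP + FP)                    # 0.87
recall_pos = TP / (TP + FN)                       # 0.71
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)   # 0.78

print accuracy, precision_pos, recall_pos, f1_pos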

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that
are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance
of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis
testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews correlation coefficient (MCC) is used in machine learning as a measure
of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
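A quick check of this value in Python, with the same counts as above:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc    # ~ 0.55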

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to binary classification problems. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
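With scikit-learn, the ROC curve and AUC can be computed from true labels and classifier scores (a minimal sketch; the labels and scores below are made up):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 1, 0, 0, 1, 0, 0]                      # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # predicted scores (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points of the ROC curve
print auc(fpr, tpr)                                    # 0.875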

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I.
http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II.
http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational
Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification.
http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification.
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text
Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on
March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing
Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information
Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 2: Elements of Text Mining Part - I

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining

Tokenizig text with Python

import re

def tokenize(text)

tokenizer = recompile(rsquoWrsquo)

return tokenizersplit(textlower())

doc = John likes to watch movies Mary likes too

words = tokenize(doc)

print words

Jaganadh G Elements of Text Mining

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining

Tokenizig text with Python

import re

def tokenize(text)

tokenizer = recompile(rsquoWrsquo)

return tokenizersplit(textlower())

doc = John likes to watch movies Mary likes too

words = tokenize(doc)

print words

Jaganadh G Elements of Text Mining

Twokenization

Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweetpc=True)

twokenizer = Tokenizer(preserve_case=pc)

return twokenizertokenize(tweet)

tweet = RT USER Relevant 2 clinical text gt Recursive neural networks

Deep Learning Natural Language Processing NLProc httptco

twokens = tokenize(tweet)

1httpsbitbucketorgjaganadhgtwittertokenize

Jaganadh G Elements of Text Mining

Sentence Tokenization

Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of ( and maybe - )

Move the boundary after following quotation marks if any

Disqualify a period boundary in the following circumstances

If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally

Disqualify a boundary with a or if

It is followed by a lowercase letter (or a known name)

Regard other putative sentence boundaries as sentence boundaries

2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Jaganadh G Elements of Text Mining

Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltkdata import load

tokenizer = load(rsquotokenizerspunktenglishpicklersquo)

text = How can this be implemented There are a lot of subtleties

such as dot being used in abbreviations

sents = tokenizertokenize(text)

for sent in sents

print sent

Jaganadh G Elements of Text Mining

Counting Words

Word Count - Python

def word_count(text)

words = tokenize(text)

word_freq = dict([(word wordscount(word)) for word

in set(words)])

return word_freq

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wc = word_count(text)

for wordcount in wcitems()

print word t t count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text)

words = tokenize(text)

word_length =

[word_length__setitem__(len(word)1 +

word_lengthget(len(word)0)) for word in words]

return word_length

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wl = word_length(text)

for length count in wlitems()

print There are d words of length d (count length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), i.e. simply the total number of words in C.
So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, '\t:\t', propo

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding Word Type Ratio

from __future__ import division  # float division, as in the earlier slides

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print '%s \t %d' % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list. 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word wi co-occurs with word wj within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...   ...   ...   ...   ...   ...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
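
The returned matrix behaves like a nested dictionary, so individual co-occurrence counts can be read off directly; a quick usage check on the toy corpus above:

print ccm['w1']['w2']   # 1, 'w1' and 'w2' share the first sub-list
print ccm['w2']['w1']   # 1, the matrix is symmetric
print ccm['w1']['w4']   # 0, these two words never share a context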

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted again.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words
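
If NLTK is installed, its bundled English stop word list can be used instead of maintaining a hand-written one (the next slide shows such a hand-written list); a minimal sketch, assuming the NLTK stopwords corpus has been downloaded and reusing the tokenize function from the earlier slides:

from nltk.corpus import stopwords

def stop_filter_nltk(words):
    stops = set(stopwords.words('english'))   # NLTK's built-in English stop word list
    return [word for word in words if word not in stops]

print stop_filter_nltk(tokenize("this is a simple test of the filter"))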

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) =
    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python, just as a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. count(d) such that tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} =
    [ 0 0 1 2 1 1 1
      1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||p, where v̂ is the unit vector, or the normalized vector, v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||2
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
      = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
      = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
      = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2-norm of ||v̂(d1)||2 = 1.0
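
The same normalization can be checked numerically with NumPy; a small sketch (added here, not from the original slides) that reproduces the numbers above:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm, sqrt(8)
print v_hat                                  # [ 0.  0.  0.3535  0.7071  0.3535  0.3535  0.3535 ]
print np.linalg.norm(v_hat, ord=2)           # 1.0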

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. count(d) such that tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games)    = log( 2 / (1+1) ) = 0.0
idf(john)     = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1+1) ) = 0.0
idf(movies)   = log( 2 / (1+1) ) = 0.0
idf(watch)    = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
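
A few lines of Python (a sketch added here, not from the original slides) reproduce these idf values from the two toy documents and the vocabulary index used earlier:

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1 + df))    # log(|D| / (1 + df)), as defined above

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]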

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1)
  tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]
x
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from a list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification: performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification: a mixture between supervised and unsupervised classification; some documents or parts of documents are labeled by external assistance

unsupervised document classification: entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) =
    1 if dj belongs to ci
    0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the a priori probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5     15   0.238095238   0.19047619   0.33333333     0.047619048   0.095238095   0.095238095
E   0.5     12   0.05555556    0.11111111   0.11111111     0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total number of word frequencies in each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building the Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill   bomb   kidnap   music   movie   tv   C
Dt         2      1      2        0       0       1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
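
The whole worked example can be condensed into a few lines of Python; a sketch added here (not part of the original slides) that evaluates the two classes in log space, with the smoothed conditional probabilities hard-coded from the model table above (small differences from the slide's figures are only rounding):

import math

# smoothed P(w|c) values for (kill, bomb, kidnap, music, movie, tv), from the model table
p_w = {'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
       'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test = [2, 1, 2, 0, 0, 1]   # word frequencies of the test document Dt

for label in p_w:
    log_score = math.log10(prior[label])
    for freq, p in zip(test, p_w[label]):
        log_score += freq * math.log10(p)
    print label, round(log_score, 4)
# Terrorism      roughly -4.24
# Entertainment  roughly -6.63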

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
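
In scikit-learn this event model corresponds to BernoulliNB, which binarizes the term counts internally; a minimal sketch added here (not from the original slides), using two tiny made-up documents in place of the training table:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ('kill bomb kidnap tv', 'music movie tv')   # toy 'terrorism' and 'entertainment' texts
labels = ['Terrorism', 'Entertainment']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(docs)
clf = BernoulliNB(binarize=0.0)                    # counts > 0 are treated as 1
clf.fit(X, labels)
print clf.predict(vectorizer.transform(['bomb kidnap threat']))   # should favour 'Terrorism'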

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
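
The decision rule itself is just the sign of ⟨w, x⟩ + b; a tiny NumPy sketch added here as an illustration, with a hand-picked w and b rather than learned ones:

import numpy as np

w = np.array([1.0, -1.0])          # hypothetical weight vector
b = -0.5                           # hypothetical bias
X = np.array([[2.0, 0.5],          # a few 2-d feature vectors
              [0.0, 1.5]])

scores = X.dot(w) + b              # <w, x> + b for every row of X
labels = np.sign(scores)           # +1 on one side of the hyperplane, -1 on the other
print scores                       # [ 1. -2. ]
print labels                       # [ 1. -1. ]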

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                         Positive               Negative
Predicted   Positive     True Positive (TP)     False Positive (FP)
            Negative     False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
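
All of the figures above follow mechanically from the four confusion matrix cells; a small Python sketch added here that recomputes them for TP = 562, FP = 77, FN = 225, TN = 436 (the small differences from the slides are only rounding):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

print 'accuracy   ', (TP + TN) / (TP + FP + FN + TN)      # ~0.768
print 'error rate ', (FP + FN) / (TP + FP + FN + TN)      # ~0.232
print 'precision  ', TP / (TP + FP)                       # ~0.879
print 'recall     ', TP / (TP + FN)                       # ~0.714
print 'fall-out   ', FP / (FP + TN)                       # ~0.150
print 'specificity', TN / (FP + TN)                       # ~0.850
print 'NPV        ', TN / (TN + FN)                       # ~0.660
print 'FDR        ', FP / (FP + TP)                       # ~0.120
p, r = TP / (TP + FP), TP / (TP + FN)
print 'F1         ', 2 * p * r / (p + r)                  # ~0.788
print 'MCC        ', (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))        # ~0.551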

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
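
scikit-learn ships helpers for both the curve and the area; a short sketch added here, with made-up labels and scores, that computes the ROC points and the AUC:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]     # predicted scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print 'FPR:', fpr
print 'TPR:', tpr
print 'AUC:', auc(fpr, tpr)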

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 3: Elements of Text Mining Part - I

Tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words phrases symbolsor other meaningful elements called tokens The list of tokens becomes input for furtherprocessing such as parsing or text mining

Tokenizig text with Python

import re

def tokenize(text)

tokenizer = recompile(rsquoWrsquo)

return tokenizersplit(textlower())

doc = John likes to watch movies Mary likes too

words = tokenize(doc)

print words

Jaganadh G Elements of Text Mining

Twokenization

Rise of social media introduced new orthographic patterns in digital text Typical example is a tweet wherepeople use abbreviated forms of words emoticons hash-tags etc Generic text tokenization techniques wont yieldgood result in separating words in social media text like tweets A good social media tokenizer has to take care ofemoticons hash-tags shortened urls etc

Social media tokenization with Python using happyfuntokenizing 1

from happyfuntokenizing import Tokenizer

def twokenize(tweetpc=True)

twokenizer = Tokenizer(preserve_case=pc)

return twokenizertokenize(tweet)

tweet = RT USER Relevant 2 clinical text gt Recursive neural networks

Deep Learning Natural Language Processing NLProc httptco

twokens = tokenize(tweet)

1httpsbitbucketorgjaganadhgtwittertokenize

Jaganadh G Elements of Text Mining

Sentence Tokenization

Heuristic sentence boundary detection algorithm 2

Place putative sentence boundaries after all occurrences of ( and maybe - )

Move the boundary after following quotation marks if any

Disqualify a period boundary in the following circumstances

If it is preceded by a known abbreviation of a sort that does not normally occur wordfinally but is commonly followed by a capitalized proper name such as Prof or vsIf it is preceded by a known abbreviation and not followed by an uppercase word Thiswill deal correctly with most usages of abbreviations like etc or Jr which can occursentence medially or finally

Disqualify a boundary with a or if

It is followed by a lowercase letter (or a known name)

Regard other putative sentence boundaries as sentence boundaries

2Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Jaganadh G Elements of Text Mining

Sentence Tokenization

Sentence Tokenization with Python and NLTK

from nltkdata import load

tokenizer = load(rsquotokenizerspunktenglishpicklersquo)

text = How can this be implemented There are a lot of subtleties

such as dot being used in abbreviations

sents = tokenizertokenize(text)

for sent in sents

print sent

Jaganadh G Elements of Text Mining

Counting Words

Word Count - Python

def word_count(text)

words = tokenize(text)

word_freq = dict([(word wordscount(word)) for word

in set(words)])

return word_freq

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wc = word_count(text)

for wordcount in wcitems()

print word t t count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text)

words = tokenize(text)

word_length =

[word_length__setitem__(len(word)1 +

word_lengthget(len(word)0)) for word in words]

return word_length

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wl = word_length(text)

for length count in wlitems()

print There are d words of length d (count length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words

so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C

Finding Word Proportion

from __future__ import division

def word_propo(text)

words = tokenize(text)

wc = word_count(text)

propo = dict([(word wc[word]len(words)) for word

in set(words)])

return propo

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wp = word_propo(text)

for word propo in wpitems()

print word tt propo

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words

so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C

Finding Word Proportion

from __future__ import division

def word_propo(text)

words = tokenize(text)

wc = word_count(text)

propo = dict([(word wc[word]len(words)) for word

in set(words)])

return propo

text = How can this be implemented There are a lot of

subtleties such as dot being used in abbreviations

wp = word_propo(text)

for word propo in wpitems()

print word tt propo

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times mi co-occurs with word mj within a specific context: a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

     w1    w2    w3    ...   wn
w1   m11   m12   m13   ...   m1n
w2   m21   m22   m23   ...   m2n
w3   m31   m32   m33   ...   m3n
...
wn   mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(dictionary built after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n comes from older scikit-learn releases; newer versions use ngram_range=(1, 1)
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just a toy sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)
where 't' is a term in document 'd' and tf(t, d) is how many times the term 't' is present in 'd':
tf(t, d) = Σ_{x ∈ d} fr(x, t)
where
fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and
idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)
fr(x, t) is a simple function:
fr(x, t) = 1 if x = t, 0 otherwise
Example: tf('john', d1) = 1
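A direct translation of this definition into Python, as a small sketch (the document is assumed to be a list of lowercased tokens):

def tf(term, doc_tokens):
    # tf(t, d): how many times term 't' occurs in document 'd'
    return sum(1 for token in doc_tokens if token == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2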

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:
v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))
To represent 'd1' and 'd2' as vectors:
v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))
which evaluates to:
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)
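The same vectors can be produced with a few lines of Python (a sketch; the vocabulary order is assumed to be the vocabulary index shown earlier):

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
d1 = "john likes to watch movies mary likes too".split()
d2 = "john also likes to watch football games".split()

def doc_vector(tokens, vocab):
    # one tf entry per vocabulary term, in vocabulary-index order
    return [tokens.count(term) for term in vocab]

print doc_vector(d1, vocab)   # [0, 0, 1, 2, 1, 1, 1]
print doc_vector(d2, vocab)   # [1, 1, 1, 1, 0, 0, 1]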

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space and F is the size of the feature (vocabulary) space:

M_{|D|×F} =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue space).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)
With norm p: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)
It can be simplified as:
‖v‖_p = (Σ_{i=1}^{n} |vi|^p)^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v̂(d1) = v(d1) / ‖v(d1)‖_2
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8
v̂(d1) = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂(d1) has an L2 norm ‖v̂(d1)‖_2 = 1.0
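The same normalization can be checked with NumPy (a minimal sketch):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
norm = np.linalg.norm(v_d1)              # L2 norm, sqrt(8)
v_d1_normalized = v_d1 / norm
print v_d1_normalized                    # [0. 0. 0.3535... 0.7071... ...]
print np.linalg.norm(v_d1_normalized)    # 1.0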

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games)    = log( 2 / (1+1) ) = 0.0
idf(john)     = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1+1) ) = 0.0
idf(movies)   = log( 2 / (1+1) ) = 0.0
idf(watch)    = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
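A small Python sketch of the same idf formula (the corpus here is the two toy documents with stop words removed, which is an assumption made for illustration):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    # number of documents in which the term appears
    doc_count = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1.0 + doc_count))

print idf('john', docs)      # -0.405..., term present in both documents
print idf('football', docs)  # 0.0, term present in one document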

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
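In NumPy this product is just the term-frequency matrix times a diagonal idf matrix (a sketch using the numbers computed above):

import numpy as np

tf_matrix = np.array([[0, 0, 1, 2, 1, 1, 1],
                      [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf_vector = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])
tf_idf = np.dot(tf_matrix, np.diag(idf_vector))   # element-wise tf x idf per column
print tf_idf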

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# vocabulary_ maps each term to its column index in idf_
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies', 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies', 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
                      (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
                      / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
                     wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
                    / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.0555556    0.11111111  0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
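A small sketch of the same Laplace-smoothed estimate in Python (the word counts per class are the ones taken from the training table above):

def cond_prob(word_count_in_class, total_words_in_class, vocab_size):
    # Laplace (add-one) smoothed P(w|c)
    return (word_count_in_class + 1.0) / (total_words_in_class + vocab_size)

print cond_prob(4, 15, 6)   # P(kill|Terrorism)     = 5/21 ~ 0.2380
print cond_prob(0, 12, 6)   # P(kill|Entertainment) = 1/18 ~ 0.0555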

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
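A sketch of the same log-space scoring in Python (the priors, conditional probabilities and test counts are copied from the slides above, using base-10 logarithms as in the worked example):

import math

# P(w|c), in the order kill, bomb, kidnap, music, movie, tv
cond_probs = {'Terrorism':     [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
              'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

for label in priors:
    score = math.log10(priors[label])
    for count, p in zip(test_doc, cond_probs[label]):
        score += count * math.log10(p)
    print label, score   # Terrorism ~ -4.24, Entertainment ~ -6.63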

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
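Converting the multinomial count vectors of the earlier training table into this Bernoulli representation is a one-liner (a minimal sketch):

count_vectors = [[2, 1, 3, 0, 0, 1],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 2, 0, 1, 0],
                 [0, 1, 0, 2, 1, 1],
                 [0, 0, 1, 1, 1, 0],
                 [0, 0, 0, 2, 2, 2]]
# keep only presence/absence of each word
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]
for row in binary_vectors:
    print row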

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial Naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, either (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK:12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
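A small helper to compute these scores directly from the four confusion matrix cells (a sketch; the numbers are the ones from the table above):

def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print scores(562, 77, 225, 436)   # (0.767..., 0.879..., 0.714..., 0.788...)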

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
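With scikit-learn the ROC curve and AUC can be computed from true labels and predicted scores (a sketch; the labels and scores below are made-up illustration values):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]    # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)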

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
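A quick usage sketch (not in the original slides; the sentence is made up):

words = "this is a simple text and it has some stop words in it".split()
print(stop_filter(words))   # ['simple', 'text', 'stop', 'words']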

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
# note: newer scikit-learn releases use ngram_range instead of min_n
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
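As a minimal sketch (assuming d1 is the stop-word-filtered token list of the first example document), tf can be computed by counting matching tokens:

def tf(term, doc_tokens):
    # tf(t, d): how many times the term t occurs in document d
    return sum(1 for token in doc_tokens if token == term)

d1 = ['john', 'likes', 'watch', 'movies', 'mary', 'likes']
print(tf('john', d1))    # 1
print(tf('likes', d1))   # 2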

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D|×F}

where |D| is the cardinality of the document space.

M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector (unit vector) is obtained as v / ||v||_p, where v is the vector to be normalized and ||v||_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ||v|| = √(v1² + v2² + v3² + ... + vn²)

With norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

normalized v(d1) = v(d1) / ||v(d1)||_2
                 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)
                 = (0, 0, 1, 2, 1, 1, 1) / √8
                 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)
                 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v(d1) has an L2 norm of 1.0.
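The same normalization can be checked with NumPy (a rough sketch, not from the original slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm, sqrt(8)
print(v_norm)                                 # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print(np.linalg.norm(v_norm, ord=2))          # 1.0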

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. where tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
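A small sketch reproducing these idf values (assuming docs holds the stop-word-filtered token lists of d1 and d2 from the earlier slides):

import math

vocabulary = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)      # |{d : t in d}|
    return math.log(len(docs) / float(1 + df))

for term in vocabulary:
    print(term + ' ' + str(idf(term, docs)))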

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf =

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

×

[ idf(t1)     0        0        0        0        0        0     ]
[    0     idf(t2)     0        0        0        0        0     ]
[    0        0     idf(t3)     0        0        0        0     ]
[    0        0        0     idf(t4)     0        0        0     ]
[    0        0        0        0     idf(t5)     0        0     ]
[    0        0        0        0        0     idf(t6)     0     ]
[    0        0        0        0        0        0     idf(t7)  ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]
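As a rough sketch with NumPy (the count and idf vectors are hard-coded from the earlier slides), the weight matrix is the count matrix multiplied by the diagonal idf matrix:

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])

tf_idf = np.dot(tf, np.diag(idf))   # multiply by the diagonal idf matrix
print(tf_idf)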

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
# note: newer scikit-learn releases use ngram_range instead of min_n
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from a list of words:

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
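A quick usage sketch (assuming the ngrams function from the previous slide is in scope; the sentence is made up):

words = "she likes strong tea and he likes strong coffee".split()
mi = mutual_info(words)
print(mi['strong tea'])   # log2(N * C(strong, tea) / (C(strong) * C(tea)))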

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or classifier function, γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0                2     1     3       0      0     1   Terrorism
D1                1     1     1       0      0     0   Terrorism
D2                1     1     2       0      1     0   Terrorism
D3                0     1     0       2      1     1   Entertainment
D4                0     0     1       1      1     0   Entertainment
D5                0     0     0       2      2     2   Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15   0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12   0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt            2     1     2       0      0     1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 ×
                 P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 ×
                     P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.
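As a rough sketch (not part of the original slides), the worked example above can be reproduced in a few lines of Python; the smoothed conditional probabilities are copied from the model table:

p_terrorism = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
               'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entertainment = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                   'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    score = prior
    for word, count in doc.items():
        score *= cond_probs[word] ** count
    return score

print(posterior(0.5, p_terrorism, test_doc))      # about 5.7e-05
print(posterior(0.5, p_entertainment, test_doc))  # about 2.3e-07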

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
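A self-contained sketch of the same log-space computation (the numbers are taken from the slides above):

import math

log_p_terrorism = (math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904)
                   + 2 * math.log10(0.3333) + math.log10(0.0952))
log_p_entertainment = (math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111)
                       + 2 * math.log10(0.1111) + math.log10(0.1111))
print(log_p_terrorism)       # about -4.24
print(log_p_entertainment)   # about -6.63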

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
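In scikit-learn this presence/absence setup corresponds to BernoulliNB; a rough sketch (not from the original slides) of producing 0/1 features with CountVectorizer(binary=True):

from sklearn.feature_extraction.text import CountVectorizer

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
vectorizer = CountVectorizer(binary=True, stop_words='english')  # presence/absence instead of counts
print(vectorizer.fit_transform(docs).toarray())

The resulting matrix can be fed to sklearn.naive_bayes.BernoulliNB in the same way raw counts are fed to MultinomialNB.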

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, either (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
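A toy sketch of the linear decision function f(x) = ⟨w, x⟩ + b; the weight vector and bias here are hand-picked for illustration, not learned:

import numpy as np

w = np.array([1.0, -1.0])   # weight vector
b = 0.5                     # bias

def predict(x):
    # the sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([2.0, 0.5])))   # +1
print(predict(np.array([0.0, 3.0])))   # -1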

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                           Actual
                     Positive              Negative
Predicted  Positive  True Positive (TP)    False Positive (FP)
           Negative  False Negative (FN)   True Negative (TN)
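If scikit-learn is available, the same table can be produced from lists of actual and predicted labels with sklearn.metrics.confusion_matrix (a sketch with made-up labels; note that scikit-learn puts actual classes on the rows and predicted classes on the columns):

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))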

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive  Negative  Total
Predicted  Positive    562        77      639
           Negative    225       436      661
Total                  787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
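All of the measures above can be recomputed from the four cells of the example confusion matrix; a minimal sketch:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # ~0.76
precision   = TP / (TP + FP)                                   # ~0.87
recall      = TP / (TP + FN)                                   # ~0.71
f1          = 2 * precision * recall / (precision + recall)    # ~0.78
fallout     = FP / (FP + TN)                                   # ~0.15
specificity = TN / (FP + TN)                                   # ~0.84
npv         = TN / (TN + FN)                                   # ~0.65
fdr         = FP / (FP + TP)                                   # ~0.12
mcc         = (TP * TN - FP * FN) / math.sqrt(
                  (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # ~0.55

print(accuracy, precision, recall, f1, mcc)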

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
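A rough sketch (assuming scikit-learn; the labels and scores below are made up): roc_curve takes the true binary labels and a score or probability for the positive class, and auc integrates the resulting curve:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))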

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, also called the Euclidean norm.a

It is the most common norm used to measure the length of a vector, with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}(d_1) = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}(d_1) = \frac{\vec{v}(d_1)}{\|\vec{v}(d_1)\|_2} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}(d_1) = \left(\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}\right)$

$\hat{v}(d_1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}(d_1)$ has L2-norm $\|\hat{v}(d_1)\|_2 = 1.0$.

Jaganadh G Elements of Text Mining
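
The same normalization is a one-liner with numpy; this is just an illustrative check, not part of the original slides:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_d1_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm
print(v_d1_hat)                   # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print(np.linalg.norm(v_d1_hat))   # 1.0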

IDF

IDF

$\mathrm{idf}(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present, i.e. where $\mathrm{tf}(t, d) \neq 0$.

Jaganadh G Elements of Text Mining

Finding IDF

$\mathrm{idf}(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

$\mathrm{idf}(\text{football}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{games}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{john}) = \log \frac{2}{1+2} = -0.40546510810816444$
$\mathrm{idf}(\text{likes}) = \log \frac{2}{1+2} = -0.40546510810816444$
$\mathrm{idf}(\text{mary}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{movies}) = \log \frac{2}{1+1} = 0.0$
$\mathrm{idf}(\text{watch}) = \log \frac{2}{1+1} = 0.0$

$\mathrm{idf}(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)$

Jaganadh G Elements of Text Mining
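
These idf values are easy to reproduce; the short sketch below hard-codes the two stop-word-filtered token lists (an assumption, just for illustration) and applies the same 1 + |{d : t in d}| smoothing:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / float(1 + df))

for term in vocab:
    print('%s => %f' % (term, idf(term, docs)))
# 'john' and 'likes' get log(2/3) = -0.4054...; the rest get log(2/2) = 0.0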

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{\mathrm{idf}}$

$\begin{bmatrix} \mathrm{tf}(t_1, d_1) & \mathrm{tf}(t_2, d_1) & \cdots & \mathrm{tf}(t_7, d_1) \\ \mathrm{tf}(t_1, d_2) & \mathrm{tf}(t_2, d_2) & \cdots & \mathrm{tf}(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} \mathrm{idf}(t_1) & 0 & \cdots & 0 \\ 0 & \mathrm{idf}(t_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathrm{idf}(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining


TF-IDF weight

$\begin{bmatrix} \mathrm{tf}(t_1, d_1)\,\mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_1)\,\mathrm{idf}(t_2) & \cdots & \mathrm{tf}(t_7, d_1)\,\mathrm{idf}(t_7) \\ \mathrm{tf}(t_1, d_2)\,\mathrm{idf}(t_1) & \mathrm{tf}(t_2, d_2)\,\mathrm{idf}(t_2) & \cdots & \mathrm{tf}(t_7, d_2)\,\mathrm{idf}(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{\mathrm{tfidf}} = \frac{M_{\mathrm{tfidf}}}{\|M_{\mathrm{tfidf}}\|_2}$

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')

docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')

bow = vectorizer.fit_transform(docs)

freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")

tfd = tfidf.fit(freq_term_matrix)

print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)

print tf_idf_matrix.todense()

for w, f in vectorizer.vocabulary_.items():   # word -> column index
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
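
Current scikit-learn also offers TfidfVectorizer, which folds the counting and tf-idf steps of the snippet above into a single object; a hedged, present-day sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                             stop_words='english', norm='l2')
tf_idf_matrix = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)
print(tf_idf_matrix.todense())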

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$I(\text{strong}, \text{tea}) = \log_2 \frac{P(\text{strong}, \text{tea})}{P(\text{strong})\,P(\text{tea})} \approx \log_2 \frac{N\,C(\text{strong}, \text{tea})}{C(\text{strong})\,C(\text{tea})}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo

Jaganadh G Elements of Text Mining
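
A possible way to call it (assuming the ngrams helper from the earlier slide is in scope; the sample sentence is made up):

words = "strong tea is strong and strong tea is good".lower().split()
scores = mutual_info(words)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print('%s => %.3f' % (gram, score))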

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{\mathrm{mean}(P(w_i, w_j)) - \mathrm{mean}(P(w_i))\,\mathrm{mean}(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$t(\text{strong}, \text{tea}) = \frac{C(\text{strong}, \text{tea}) - \frac{1}{N}\,C(\text{strong})\,C(\text{tea})}{\sqrt{C(\text{strong}, \text{tea})}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
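
NLTK ships ready-made association measures, including PMI and Student's t, so the hand-rolled functions above can be cross-checked against it; a small sketch, assuming NLTK is installed:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bigram_measures = BigramAssocMeasures()
words = "strong tea is strong and strong tea is good".lower().split()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.pmi, 3))        # top bigrams by PMI
print(finder.nbest(bigram_measures.student_t, 3))  # top bigrams by t-score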

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

$P(\text{raining} \mid \text{dark cloud}) = \frac{P(\text{dark cloud} \mid \text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$

For multiple evidences:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15   0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12   0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt        2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final scores you can observe that the values are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
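
The whole worked example can be wrapped up in a few lines of Python. The sketch below is illustrative only: it hard-codes the toy count matrix from the slides, applies the same Laplace smoothing, and scores the test document in log space to avoid underflow (the per-class totals are recomputed from the matrix, so individual probabilities may differ slightly from the slide figures, but the winning class is the same: Terrorism).

import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
test_doc = [2, 1, 2, 0, 0, 1]

def train_nb(train):
    priors, cond = {}, {}
    total_docs = sum(len(docs) for docs in train.values())
    for label, docs in train.items():
        priors[label] = len(docs) / float(total_docs)
        word_totals = [sum(col) for col in zip(*docs)]
        n = sum(word_totals)
        # Laplace smoothing: (count + 1) / (n + |V|)
        cond[label] = [(c + 1) / float(n + len(vocab)) for c in word_totals]
    return priors, cond

def classify(doc, priors, cond):
    scores = {}
    for label in priors:
        score = math.log(priors[label])          # log space prevents underflow
        for count, p in zip(doc, cond[label]):
            score += count * math.log(p)
        scores[label] = score
    return scores

priors, cond = train_nb(train)
print(classify(test_doc, priors, cond))   # 'Terrorism' gets the higher (less negative) score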

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining
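
scikit-learn implements both event models as BernoulliNB and MultinomialNB; a small, purely illustrative sketch on the toy matrix from the slides:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# rows D0..D5, columns [kill, bomb, kidnap, music, movie, tv]
X = [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
     [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
test = [[2, 1, 2, 0, 0, 1]]

print(MultinomialNB().fit(X, y).predict(test))            # multinomial model: uses the counts
print(BernoulliNB(binarize=0.5).fit(X, y).predict(test))  # Bernoulli model: presence/absence only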

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
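
After fitting a linear SVM, scikit-learn exposes exactly this w, b and decision function; the 2-D points below are invented purely to illustrate:

from sklearn.svm import LinearSVC

X = [[2.0, 1.0], [3.0, 2.5], [2.5, 3.0],        # class +1
     [-2.0, -1.0], [-3.0, -2.5], [-1.5, -2.0]]  # class -1
y = [1, 1, 1, -1, -1, -1]

clf = LinearSVC().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]        # weight vector and bias of the hyperplane
print(w)
print(b)
print(clf.decision_function([[1.0, 1.0]]))    # <w, x> + b; the sign picks the side
print(clf.predict([[1.0, 1.0]]))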

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography
11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"   # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive               Negative
Predicted   Positive    True Positive (TP)     False Positive (FP)
            Negative    False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining
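
In practice the table is rarely filled in by hand; scikit-learn's metrics module computes it (and the derived scores below) from gold and predicted labels. A tiny sketch with made-up labels:

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

y_true = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']   # gold labels (toy data)
y_pred = ['pos', 'neg', 'neg', 'pos', 'neg', 'pos']   # predictions from some classifier

print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))   # precision, recall and F1 per class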

Evaluating Performance of a Classifier

Accuracy of a Classifier

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$\mathrm{Precision} = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$\mathrm{Recall} = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$\mathrm{Error\ Rate} = \frac{FP + FN}{TP + FP + FN + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$\mathrm{Fall\text{-}out} = \frac{FP}{FP + TN}$

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$\mathrm{PPV} = \frac{TP}{TP + FP}$

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$\mathrm{NPV} = \frac{TN}{TN + FN}$

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$\mathrm{Specificity} = \frac{TN}{FP + TN}$

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$\mathrm{FDR} = \frac{FP}{FP + TP}$

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

MCC = (562 × 436 - 77 × 225) / √((562+77)(562+225)(436+77)(436+225)) = 0.55

Jaganadh G Elements of Text Mining
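
A two-line check of that arithmetic in plain Python (just to verify the 0.55):

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
print(round(mcc, 2))   # 0.55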

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
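
scikit-learn can compute both the curve and the area from the true labels and the classifier's scores; the labels and scores below are toy values for illustration only:

from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual binary labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]   # predicted scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                   # area under the ROC curve
print(roc_auc_score(y_true, y_score))  # the same value, computed directly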

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining



Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = [ mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ] / sqrt( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ [ C(w_i, w_j) - (1/N) C(w_i) C(w_j) ] / sqrt( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = [ C(strong, tea) - (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
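A quick usage sketch for the two association measures above. It assumes the ngrams, mutual_info and tscore functions from the previous slides are already defined, and the toy sentence is made up purely for illustration; it simply ranks the bigrams of the text by mutual information and prints the t-score next to each.

words = "strong tea is better than strong coffee".lower().split()
mi = mutual_info(words)
ts = tscore(words)
# rank bigrams by mutual information, highest first
for gram, score in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
    print gram, score, ts[gram]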

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning
algorithm we wish to learn a classifier, or classifier function, γ that maps documents to
classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification
function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = [ P(E|H) × P(H) ] / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be
predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the
evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the
evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = [ P(dark cloud | raining) × P(raining) ] / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = [ P(E1, E2, ..., En | H) × P(H) ] / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = [ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ] / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619   0.333333333   0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a
uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
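The Laplace-smoothed estimates above are easy to reproduce programmatically. This is only a minimal sanity-check sketch, assuming the counts of 'kill', the class totals (15 and 12) and the vocabulary size 6 from the worked example; it is not a full classifier.

from __future__ import division

V = 6                               # vocabulary size
n_terror, n_entertain = 15, 12      # total word frequency per class, as on the slide

# Laplace (add-one) smoothing: (count + 1) / (n_class + |V|)
p_kill_terror = (2 + 1 + 1 + 1) / (n_terror + V)        # 'kill' counts in D0, D1, D2, plus 1
p_kill_entertain = (0 + 0 + 0 + 1) / (n_entertain + V)  # 'kill' never occurs in D3, D4, D5

print p_kill_terror       # 0.2380...
print p_kill_entertain    # 0.0555...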

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1      2       0      0     1  ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) ×
P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) ×
P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will
train the classifier with thousands of documents. In such cases the conditional probability values
will be too low for the CPU to handle. This problem is called Underflow. To resolve the
problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism".
From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes
Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
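A compact sketch of the log-space trick, assuming the smoothed conditional probabilities and the test-document word counts from the slides above (all values copied from the worked example). Taking logarithms turns the long product into a sum, which avoids underflow.

import math

# smoothed P(w|class) from the model table, and the test document word counts
p_terror = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
            'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entertain = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
               'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}
prior = 0.5

def log_posterior(cond_probs):
    # log P(c) + sum over words of count * log P(w|c)
    return math.log10(prior) + sum(n * math.log10(cond_probs[w])
                                   for w, n in test_doc.items())

print log_posterior(p_terror)      # about -4.24
print log_posterior(p_entertain)   # about -6.63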

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt
from the vocabulary. Dimension t of the vector for document di is written Bit, and is
either 0 or 1, indicating whether word wt occurs at least once in the document. With
such a document representation we make the naive Bayes assumption that the
probability of each word occurring in a document is independent of the occurrence of
other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and
Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
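For completeness, a minimal sketch of the Bernoulli setup with scikit-learn's BernoulliNB on two made-up toy documents (not the author's pipeline). With the default binarize=0.0, any count greater than zero becomes a 1, which matches the presence/absence table above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["kill bomb kidnap tv", "music movie tv music"]   # toy documents
labels = [0, 1]                                          # 0 = Terrorism, 1 = Entertainment

vectorizer = CountVectorizer(analyzer="word")
X = vectorizer.fit_transform(docs)

clf = BernoulliNB(binarize=0.0)   # counts > 0 are mapped to 1 (presence/absence)
clf.fit(X, labels)
print clf.predict(vectorizer.transform(["bomb kidnap"]))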

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency
information in documents. In the multinomial model a document is an ordered sequence of word events,
drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the
probability of each word event in a document is independent of the word's context and position in the
document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of
words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training
set, attempts to generalize, and makes correct predictions on new data. Consider a
training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the
corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our
input feature vectors are linearly separable, that is, there exists a function
f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can
separate positive and negative examples, but all of them are not equal. SVM tries to
find the particular hyperplane that maximizes the margin. The vectors closest to
the maximum margin hyperplane are called support vectors. If the data is not linearly separable we
have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
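A tiny illustrative sketch of the decision rule above (not the training procedure): given an already-learned weight vector w and bias b, a new point is classified by the sign of ⟨w, x⟩ + b. The numbers here are made up purely for illustration.

import numpy as np

w = np.array([0.4, -0.7])   # hypothetical learned weight vector
b = 0.1                     # hypothetical learned bias

def predict(x):
    # sign of the decision function <w, x> + b
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.2]))   # falls on the +1 side of the hyperplane
print predict(np.array([0.0, 1.0]))   # falls on the -1 side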

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are
mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that
it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance
from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything
on one side of the separating hyperplane belongs to one category and everything on the other side
belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter
(square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin.
The figures closest to the separating hyperplane are known as support vectors.10 If the data is not
linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to
http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more
detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of
an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A
confusion matrix for the classifier will look like:

                             Actual
                     Positive              Negative
Predicted  Positive  True Positive (TP)    False Positive (FP)
           Negative  False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to
"hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent
to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive    562         77       639
           Negative    225        436       661
           Total       787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is
the number of correct results divided by the number of all returned results, and r is the
number of correct results divided by the number of results that should have been
returned. The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining
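The numbers above are simple enough to verify in a few lines of Python. This is just a sanity-check sketch using the same confusion matrix counts (TP=562, FP=77, FN=225, TN=436) as the slides; small differences from the printed values are only rounding.

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)          # positive precision
recall = TP / (TP + FN)             # positive recall
f1 = 2 * precision * recall / (precision + recall)
error_rate = (FP + FN) / (TP + FP + FN + TN)
fallout = FP / (FP + TN)

# accuracy, positive precision, positive recall, F1, error rate, fall-out
print accuracy, precision, recall, f1, error_rate, fallout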

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that
are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance
of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis
testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure
of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
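A small illustrative sketch of plotting an ROC curve and computing AUC with scikit-learn. The true labels and predicted scores below are made-up toy values, not output from the classifiers built earlier.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # toy ground-truth labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # toy predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], ls='dotted')   # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()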

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13:

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I,
http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II,
http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational
Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification,
http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification,
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text
Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on
March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing
Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information
Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 7: Elements of Text Mining Part - I

Counting Words

Word Count - Python

def word_count(text):
    words = tokenize(text)  # tokenize() from the earlier slide
    word_freq = dict([(word, words.count(word)) for word
        in set(words)])
    return word_freq

text = ("How can this be implemented? There are a lot of "
    "subtleties, such as dot being used in abbreviations.")
wc = word_count(text)
for word, count in wc.items():
    print word, "\t\t", count

Jaganadh G Elements of Text Mining

Finding Word Length

Word Length

def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = ("How can this be implemented? There are a lot of "
    "subtleties, such as dot being used in abbreviations.")
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), simply the total number of words.

So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = ("How can this be implemented? There are a lot of "
    "subtleties, such as dot being used in abbreviations.")
wp = word_propo(text)
for word, propo in wp.items():
    print word, "\t\t", propo

Jaganadh G Elements of Text Mining


Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the total distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 15 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the
count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = ("I shot an elephant in my pajamas. He saw the fine "
    "fat trout in the brook.")
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining


Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words on the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = []
    for item in range(len(tfw)):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it
aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total
number of unique words in a corpus. A cell mij contains the number of times mi
co-occurs with word mj within a specific context, a natural unit such as a sentence or a
certain window of m words. Note that the upper and lower triangles of the matrix are
identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in
Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

       w1    w2    w3   ...  wn
w1     m11   m12   m13  ...  m1n
w2     m21   m22   m23  ...  m2n
w3     m31   m32   m33  ...  m3n
...
wn     mn1   mn2   mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence


Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four
associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no defined set of stop words available; different applications and research groups use different sets of stop words.
Generally, stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words.
Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language
processing and information retrieval (IR). In this model a text (such as a sentence or a
document) is represented as an un-ordered collection of words, disregarding grammar
and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating a Bag of Words with Python and sklearn 5:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining

Bag of Words

Creating a Bag of Words with plain Python. Just for a sample :-)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects
how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present (i.e. tf(t, d) ≠ 0),
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
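A direct, minimal translation of the tf definition above into Python; it assumes documents are given as plain token lists, and is only meant to mirror the formula.

def fr(x, t):
    # indicator function: 1 if token x equals term t, else 0
    return 1 if x == t else 0

def tf(t, d):
    # term frequency: sum the indicator over every token in the document
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2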

Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_|D|×F,
where |D| is the cardinality of the document space:

M_|D|×F = [ 0 0 1 2 1 1 1
            1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_norm = v / ||v||_p, where v_norm is the unit vector (the
normalized vector), v is the vector to be normalized, and ||v||_p is the norm
(magnitude or length) of the vector v in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1_normalized = v_d1 / ||v_d1||_2
= (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
= (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
= (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1_normalized has an L2-norm of 1.0.

Jaganadh G Elements of Text Mining
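The same normalization can be checked with numpy; a small sketch, assuming the document vector from the slide above.

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)

l2 = np.sqrt(np.sum(v_d1 ** 2))   # equivalently: np.linalg.norm(v_d1)
v_norm = v_d1 / l2

print v_norm                      # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print np.linalg.norm(v_norm)      # 1.0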

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present (i.e. tf(t, d) ≠ 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games) = log(2 / (1 + 1)) = 0.0
idf(john) = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary) = log(2 / (1 + 1)) = 0.0
idf(movies) = log(2 / (1 + 1)) = 0.0
idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
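These values are easy to reproduce in Python; a small sketch assuming the two-document toy corpus and the idf definition used on this slide (natural logarithm, with 1 added to the document frequency).

from __future__ import division
import math

docs = [
    "john likes to watch movies mary likes too".split(),
    "john also likes to watch football games".split(),
]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1 + df))

for t in vocab:
    print t, idf(t, docs)
# 'john' and 'likes' appear in both documents, so their idf is log(2/3) = -0.4054...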

TF-IDF weight

Finding the TF-IDF weight:

M_|D|×F × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining
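The matrix product above, written out with numpy for the toy tf vectors and the idf values computed earlier; a sketch for illustration only.

import numpy as np

M = np.array([[0, 0, 1, 2, 1, 1, 1],     # tf vector of d1
              [1, 1, 1, 1, 0, 0, 1]],    # tf vector of d2
             dtype=float)
idf = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])

M_idf = np.diag(idf)        # idf values on the diagonal
tfidf = np.dot(M, M_idf)    # M_|D|xF x M_idf

print tfidf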

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
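The same log-space trick is easy to sketch in Python. The probabilities and the test-document counts below are the smoothed values from the slides above; math.log10 is used so the numbers match the base-10 logs of the worked example.

import math

test_counts = [2, 1, 2, 0, 0, 1]    # kill bomb kidnap music movie tv
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertain = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]

def log_score(prior, cond_probs, counts):
    # log P(c) + sum over words of count * log P(w|c)
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(cond_probs, counts))

print log_score(0.5, p_terrorism, test_counts)   # ~ -4.24
print log_score(0.5, p_entertain, test_counts)   # ~ -6.63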

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
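The switch from frequencies to presence/absence is a one-line transformation. A minimal sketch (the rows below are the Terrorism rows copied from the frequency table above, as an assumption):

freq_rows = {'D0': [2, 1, 3, 0, 0, 1],
             'D1': [1, 1, 1, 0, 0, 0],
             'D2': [1, 1, 2, 0, 1, 0]}

# replace each count with 1 if the word occurs at least once, else 0
binary_rows = dict((doc, [1 if c > 0 else 0 for c in row])
    for doc, row in freq_rows.items())

print binary_rows['D0']   # [1, 1, 1, 0, 0, 1]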

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

<w, xi> + b > 0 for yi = +1
<w, xi> + b < 0 for yi = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
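A toy sketch of the linear decision rule f(x) = <w, x> + b with NumPy; the weight vector and bias here are made-up numbers, not a trained model, just to show how the sign of the decision function assigns the label.

import numpy as np

w = np.array([0.4, -0.2, 0.7])   # assumed weight vector
b = -0.1                         # assumed bias

def predict(x):
    # the sign of <w, x> + b decides the class
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.0, 1.0]))   # +1 side of the hyperplane
print predict(np.array([0.0, 2.0, 0.0]))   # -1 side of the hyperplane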

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK.12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive             Negative
Predicted  Positive    True Positive (TP)   False Positive (FP)
           Negative    False Negative (FN)  True Negative (TN)
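If you already have predicted and actual labels, the matrix can be built with sklearn. A small sketch (the two label lists are made up for illustration); note that sklearn places actual classes on the rows and predicted classes on the columns, which is the transpose of the layout shown above.

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

print confusion_matrix(actual, predicted, labels=['pos', 'neg'])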

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
Total                  787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

The false discovery rate (FDR) is the proportion of positive predictions that are false positives. FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
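All of the evaluation measures in this section can be reproduced from the four confusion-matrix cells. A minimal sketch using the counts from the running example (TP=562, FP=77, FN=225, TN=436):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # 0.76
error_rate  = (FP + FN) / (TP + FP + FN + TN)                  # 0.23
precision   = TP / (TP + FP)                                   # 0.87
recall      = TP / (TP + FN)                                   # 0.71
fall_out    = FP / (FP + TN)                                   # 0.15
f1          = 2 * precision * recall / (precision + recall)    # 0.78
npv         = TN / (TN + FN)                                   # 0.65
specificity = TN / (FP + TN)                                   # 0.84
fdr         = FP / (FP + TP)                                   # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))             # 0.55

print accuracy, precision, recall, f1, mcc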

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
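ROC and AUC need a score (for example a class probability) per test example rather than a hard label. A minimal sketch with sklearn; the labels and scores below are made-up illustrative values.

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under that curve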

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 8: Elements of Text Mining Part - I

Finding Word Length

Word Length

def word_length(text):
    words = tokenize(text)
    word_length = {}
    [word_length.__setitem__(len(word), 1 +
        word_length.get(len(word), 0)) for word in words]
    return word_length

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wl = word_length(text)
for length, count in wl.items():
    print "There are %d words of length %d" % (count, length)

Jaganadh G Elements of Text Mining

Word Proportion

Word Proportion

Let C be a corpus where (w1, w2, w3, ..., wn) are the words.
SIZE(C) = length(tokens(C)), simply the total number of words.

So p(wi, C) = f(wi, C) / SIZE(C), where f(wi, C) is the frequency of wi in C.

Finding Word Proportion

from __future__ import division

def word_propo(text):
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, "\t\t", propo

Jaganadh G Elements of Text Mining


Words Types and Ratio

Words and Types

Words are valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C.

Finding Word Type Ratio

def word_type_ratio(text):
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining


Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    npoints = len(tfw)   # renamed so it does not shadow the numpy alias
    y = []
    for item in range(npoints):
        y = y + [tfw[item][1]]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, npoints + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot the top 50 words from a text. This plot shows the words on the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    npoints = len(tfw)   # renamed so it does not shadow the numpy alias
    y = []
    for item in range(npoints):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

def dispersion_plot(text, words):
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word wi co-occurs with word wj within a specific context: a natural unit such as a sentence, or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1   w2   w3   ...  wn
w1    m11  m12  m13  ...  m1n
w2    m21  m22  m23  ...  m2n
w3    m31  m32  m33  ...  m3n
...
wn    mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, the associated words are again plotted.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary indices, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python. Just for sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) = how many times the term 't' is present in 'd':

tf(t, d) = Σ x∈d fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0, and
|D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ x∈d fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
v(d1) = (0, 0, 1, 2, 1, 1, 1)
v(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M|D|×F, where |D| is the cardinality of the document space:

M|D|×F =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_norm = v / ||v||p, where v_norm is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the norm: ||v||p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be written compactly as:

||v||p = ( Σ i=1..n |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, also called the Euclidean norm. a
It is the common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v_norm(d1) = v(d1) / ||v(d1)||2
= (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
= (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
= (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_norm(d1) has an L2 norm ||v_norm(d1)||2 = 1.0
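The same normalization can be reproduced with NumPy, as a quick check of the numbers above:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1)   # divide by the Euclidean (L2) norm

print v_norm                  # [ 0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535 ]
print np.linalg.norm(v_norm)  # 1.0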

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); for example, log(2/1) = 0.69314718

idf(football) = log(2/(1+1)) = 0.0
idf(games) = log(2/(1+1)) = 0.0
idf(john) = log(2/(1+2)) = -0.40546510810816444
idf(likes) = log(2/(1+2)) = -0.40546510810816444
idf(mary) = log(2/(1+1)) = 0.0
idf(movies) = log(2/(1+1)) = 0.0
idf(watch) = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
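The toy idf values can be verified with a few lines of Python. This sketch assumes the two example documents and the vocabulary order used above; math.log is the natural logarithm, matching the -0.4054... figures.

import math

docs = [['john', 'likes', 'to', 'watch', 'movies', 'mary', 'likes', 'too'],
    ['john', 'also', 'likes', 'to', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1.0 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]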

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|×F × Midf

where M|D|×F is the term-frequency matrix

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

and Midf is the diagonal matrix diag(idf(t1), idf(t2), ..., idf(t7)):

[ idf(t1) 0       0       0       0       0       0
  0       idf(t2) 0       0       0       0       0
  0       0       idf(t3) 0       0       0       0
  0       0       0       idf(t4) 0       0       0
  0       0       0       0       idf(t5) 0       0
  0       0       0       0       0       idf(t6) 0
  0       0       0       0       0       0       idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = (mean(P(wi, wj)) - mean(P(wi)) mean(P(wj))) / sqrt(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
≈ (C(wi, wj) - (1/N) C(wi) C(wj)) / sqrt(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

                         Actual
                         Positive                Negative
Predicted   Positive     True Positive (TP)      False Positive (FP)
            Negative     False Negative (FN)     True Negative (TN)
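As a minimal sketch (assuming scikit-learn is available; the labels below are toy values), such a table can be produced from gold and predicted label vectors with sklearn.metrics.confusion_matrix. Note that scikit-learn prints actual classes as rows and predicted classes as columns, the transpose of the layout above:

from sklearn.metrics import confusion_matrix

# toy gold labels and predictions; 1 = Positive, 0 = Negative
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print confusion_matrix(y_true, y_pred, labels=[1, 0])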

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
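A quick check in Python (not from the original slides; the counts are those in the table above, and the slides truncate results to two decimal places):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print (TP + TN) / (TP + FP + FN + TN)   # 0.7676...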

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
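The same per-class figures can be reproduced with a few lines of Python (a quick check, not part of the original slides):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print 'positive precision', TP / (TP + FP)   # 0.8795
print 'negative precision', TN / (TN + FN)   # 0.6596
print 'positive recall   ', TP / (TP + FN)   # 0.7141
print 'negative recall   ', TN / (TN + FP)   # 0.8499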

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that are wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                         Actual
                         Positive    Negative    Total
Predicted   Positive     562         77          639
            Negative     225         436         661
            Total        787         513         1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
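A quick check of the two F1 values using the precision and recall figures from the earlier slides (not part of the original deck):

p_pos, r_pos = 0.87, 0.71
p_neg, r_neg = 0.65, 0.84
print 2 * p_pos * r_pos / (p_pos + r_pos)   # ~0.78
print 2 * p_neg * r_neg / (p_neg + r_neg)   # ~0.73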

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, the FDR is the proportion of predicted positives that are actually negative.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12
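Error rate, fall-out, NPV, specificity and FDR are all simple ratios over the same confusion matrix, so one small sketch (not from the original slides) covers them:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436
print 'error rate ', (FP + FN) / (TP + FP + FN + TN)   # 0.2323
print 'fall-out   ', FP / (FP + TN)                    # 0.1501
print 'NPV        ', TN / (TN + FN)                    # 0.6596
print 'specificity', TN / (FP + TN)                    # 0.8499
print 'FDR        ', FP / (FP + TP)                    # 0.1205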

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
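The MCC value can be verified directly from the counts (a quick check, not part of the original slides):

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc   # ~0.55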

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
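As a minimal sketch of how such a curve can be drawn (assuming scikit-learn and matplotlib are available; y_test and scores below are toy values standing in for gold labels and classifier scores such as predict_proba()[:, 1] or decision_function() output):

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# toy gold labels and scores for the positive class
y_test = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]
fpr, tpr, thresholds = roc_curve(y_test, scores)
plt.plot(fpr, tpr, 'bo-', ls='dotted')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()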

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
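A minimal sketch of computing the AUC from the same toy labels and scores used in the ROC example above (sklearn.metrics.auc integrates the curve; newer scikit-learn releases also offer roc_auc_score):

from sklearn.metrics import roc_curve, auc

y_test = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]
fpr, tpr, _ = roc_curve(y_test, scores)
print auc(fpr, tpr)   # 0.875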

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Page 10: Elements of Text Mining Part - I

Word Proportion

Word Proportion

Let C be a corpus where (w1 w2 w3 wn) are the wordsSIZE(C) = length(tokens(C)) simply total number of words

so p(wi C) = f(wiC)SIZE(C) where f(wi C) is the frequency of wi in C

Finding Word Proportion

from __future__ import division

def word_propo(text):
    # proportion of each distinct word: its frequency / total number of tokens
    # tokenize() and word_count() are defined in earlier slides
    words = tokenize(text)
    wc = word_count(text)
    propo = dict([(word, wc[word] / len(words)) for word
        in set(words)])
    return propo

text = "How can this be implemented? There are a lot of subtleties, such as dot being used in abbreviations."
wp = word_propo(text)
for word, propo in wp.items():
    print word, '\t', propo

Jaganadh G Elements of Text Mining

Words Types and Ratio

Words and Types

Words are the valid tokens from a given corpus C. Types are the distinct tokens in a given corpus C.
Example:
C = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
There are 16 words and 14 types in C.
words = ('i', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas', 'he', 'saw', 'the', 'fine', 'fat', 'trout', 'in', 'the', 'brook')
types = ('shot', 'i', 'saw', 'elephant', 'brook', 'fine', 'an', 'fat', 'in', 'my', 'the', 'he', 'pajamas', 'trout')

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C) / C(T), where WC(C) is the total number of words in corpus C and C(T) is the count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text):
    # ratio of total tokens to distinct tokens (types)
    words = tokenize(text)
    ratio = len(words) / len(set(words))
    return ratio

text = "I shot an elephant in my pajamas. He saw the fine fat trout in the brook."
ratio = word_type_ratio(text)
print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    # sort the word-frequency dictionary by count and keep the n most frequent
    wordfreq = word_count(text)
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
        reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print '%s\t%d' % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    # plot the frequency of the top 20 words against their rank
    tfw = top_words(text, n=20)
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    # plot top 50 word frequencies and label each point with the word itself
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    # bar chart (histogram) of the top 50 word frequencies
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    # x = offset of each matching token, y = index of the matched target word
    wordst = tokenize(text)
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites, or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    # build tags from the first 100 tokens and render them to a PNG image
    words = tokenize(text)
    doc = ' '.join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word mi co-occurs with word mj within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

       w1    w2    w3   ...   wn
w1    m11   m12   m13   ...  m1n
w2    m21   m22   m23   ...  m2n
w3    m31   m32   m33   ...  m3n
...
wn    mn1   mn2   mn3   ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    # symmetric co-occurrence counts for every word pair inside each document
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure] Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its own associated words are plotted again.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    # drop common English function words from a token list
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary (after removing stop words):
{'football': 0, 'games': 1, 'john': 2, 'likes': 3, 'mary': 4, 'movies': 5, 'watch': 6}

Document vectors:
[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

        football  games  john  likes  mary  movies  watch
doc1        0       0      1     2      1      1      1
doc2        1       1      1     1      0      0      1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    # vocabulary = union of all tokens, minus stop words and single characters
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) x idf(t)

where 't' is a term in document 'd',
and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = sum over x in d of fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present and tf(t, d) != 0,
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = sum over x in d of fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example:
tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| x F},
where |D| is the cardinality of the document space:

M_{|D| x F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v^ = v / ||v||_p, where v^ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( sum over i = 1..n of |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v^(d1) = v(d1) / ||v(d1)||_2
       = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
       = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
       = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
       = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v^(d1) has an L2 norm ||v^(d1)||_2 = 1.0
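A quick numeric check of the normalization above can be done with NumPy (a small sketch, assuming numpy is available):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm, sqrt(8)
print v_norm                           # [0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535]
print np.linalg.norm(v_norm)           # 1.0, as expected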

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present and tf(t, d) != 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti in d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
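The numbers above are easy to reproduce in a couple of lines of Python (a minimal sketch using the same two-document corpus):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'also', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

for t in vocab:
    df = sum(1 for d in docs if t in d)          # document frequency of t
    print t, math.log(len(docs) / (1.0 + df))    # idf as defined on the slide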

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} x M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
x
[ idf(t1)      0        0        0        0        0        0
     0      idf(t2)     0        0        0        0        0
     0         0     idf(t3)     0        0        0        0
     0         0        0     idf(t4)     0        0        0
     0         0        0        0     idf(t5)     0        0
     0         0        0        0        0     idf(t6)     0
     0         0        0        0        0        0     idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1) x idf(t1)  tf(t2,d1) x idf(t2)  tf(t3,d1) x idf(t3)  tf(t4,d1) x idf(t4)  tf(t5,d1) x idf(t5)  tf(t6,d1) x idf(t6)  tf(t7,d1) x idf(t7)
  tf(t1,d2) x idf(t1)  tf(t2,d2) x idf(t2)  tf(t3,d2) x idf(t3)  tf(t4,d2) x idf(t4)  tf(t5,d2) x idf(t5)  tf(t6,d2) x idf(t6)  tf(t7,d2) x idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ~ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ~ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    # all contiguous n-token sequences, joined back into strings
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ~ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ~ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    # frequency of each word and of each bigram
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( var(P(wi, wj)) + var(P(wi)) var(P(wj)) )
          ~ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents <d, c>, where <d, c> belongs to D x C, using a learning algorithm we wish to learn a classifier (or classifier function) gamma that maps documents to classes: gamma : D -> C.
A supervised learning algorithm Gamma takes a training set T and emits a learned classification function gamma: Gamma(T) = gamma

gamma(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) x P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the a priori (prior) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) x P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) x P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) x P(E2|H) x ... x P(En|H) x P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will be:

|V| = 6

C   P(Ci)  ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5    15   0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E   0.5    12   0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1      2       0      0     1   ?

To find the posterior probability:

P(ci | W) = P(ci) x product over j = 1..|V| of P(wj | ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism | W) = P(Terrorism) x P(kill|Terrorism)^2 x P(bomb|Terrorism)^1 x P(kidnap|Terrorism)^2 x P(music|Terrorism)^0 x P(movie|Terrorism)^0 x P(tv|Terrorism)^1
= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5

P(Entertainment | W) = P(Entertainment) x P(kill|Entertainment)^2 x P(bomb|Entertainment)^1 x P(kidnap|Entertainment)^2 x P(music|Entertainment)^0 x P(movie|Entertainment)^0 x P(tv|Entertainment)^1
= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1
= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism | W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment | W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
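The same log-space computation can be written in a few lines of Python; this is only a sketch that hard-codes the smoothed probabilities worked out above:

import math

# smoothed P(word | class) in the order (kill, bomb, kidnap, music, movie, tv)
p_terror = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertain = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_counts = [2, 1, 2, 0, 0, 1]   # word counts of the test document

def log_posterior(prior, probs, counts):
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(probs, counts))

print log_posterior(0.5, p_terror, test_counts)      # about -4.24
print log_posterior(0.5, p_entertain, test_counts)   # about -6.63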

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t in {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
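If you want to try this event model in code, scikit-learn ships a Bernoulli Naive Bayes estimator. The sketch below feeds it the binarized toy table above; the class encoding (1 = Terrorism, 0 = Entertainment) is my own choice, not from the slides.

from sklearn.naive_bayes import BernoulliNB

# rows follow the table above: (kill, bomb, kidnap, music, movie, tv)
X = [[1, 1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 1]]
y = [1, 1, 1, 0, 0, 0]   # 1 = Terrorism, 0 = Entertainment

clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])   # expected: [1] (Terrorism)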

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi is in R^p (input feature vector) and yi in {+1, -1} is the corresponding label. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w in R^p (weight vector) and b in R (bias), such that:

<w, xi> + b > 0 for yi = +1
<w, xi> + b < 0 for yi = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # binary (presence/absence) features, one per word
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                 Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
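scikit-learn can build the confusion matrix and the accuracy directly from predicted and true labels; the snippet below is a small sketch where y_true and y_pred are toy placeholders for your own evaluation data:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predictions

cm = confusion_matrix(y_true, y_pred)
print cm
# accuracy = (TP + TN) / total
print float(cm[1][1] + cm[0][0]) / cm.sum()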

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual
                      Positive  Negative  Total
Predicted  Positive     562        77      639
           Negative     225       436      661
Total                   787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 x (precision x recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP x TN - FP x FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 x 436 - 77 x 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
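The arithmetic is easy to check in Python (plain math, no extra libraries):

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print round(mcc, 2)   # 0.55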

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
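Plotting an ROC curve with scikit-learn and matplotlib takes only a few lines; the scores below are toy values standing in for a real classifier's decision scores:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # toy ground truth
y_score = [0.9, 0.8, 0.35, 0.6, 0.4, 0.7, 0.3, 0.1]  # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, 'go-', label='ROC (AUC = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'b--')      # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()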

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 11: Elements of Text Mining Part - I

Words Types and Ratio

Words and Types

Words and valid tokens from a given corpus CTypes are total distinct tokens in a given corpus CExampleC = rdquoI shot an elephant in my pajamas He saw the fine fat trout in the brookrdquoThere are 16 words and 15 types in Cwords = (rsquoirsquo rsquoshotrsquo rsquoanrsquo rsquoelephantrsquo rsquoinrsquo rsquomyrsquo rsquopajamasrsquo rsquohersquo rsquosawrsquo rsquothersquo rsquofinersquo rsquofatrsquo rsquotroutrsquo rsquoinrsquo rsquothersquorsquobrookrsquo)types = (rsquoshotrsquo rsquoirsquo rsquosawrsquo rsquoelephantrsquo rsquobrookrsquo rsquofinersquo rsquoanrsquo rsquofatrsquo rsquoinrsquo rsquomyrsquo rsquothersquo rsquohersquo rsquopajamasrsquo rsquotroutrsquo)

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. [a]

Analyzing text by looking only at the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. [a]

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')

bow = vectorizer.fit_transform(docs)

print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python. Just for sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from an earlier slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\text{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd', and $tf(t, d)$ is how many times the term 't' is present in 'd':

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the count of documents in which 't' is present (i.e. $tf(t, d) \neq 0$), and $|D|$ is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple function:

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Example: $tf(\text{'john'}, d_1) = 1$

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

$\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors:

$\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$
$\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$

Jaganadh G Elements of Text Mining
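As a quick illustration of the slide above, here is a small sketch of my own (not from the deck) that builds the two term-frequency vectors from the vocabulary index; the variable names are assumptions.

docs = ('john likes to watch movies mary likes too'.split(),
        'john also likes to watch football games'.split())
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

# tf(t, d) for every vocabulary term, in index order
vectors = [[doc.count(term) for term in vocab] for doc in docs]
print(vectors)  # [[0, 0, 1, 2, 1, 1, 1], [1, 1, 1, 1, 0, 0, 1]]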

Vector Space Matrix

The document vectors can be represented as matrix

$M_{|D| \times F}$

where $|D|$ is the cardinality of the document space:

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector (the normalized vector), $\vec{v}$ is the vector going to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces). [a]

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. [6]

Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$

Length of vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

With the p-norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$

It can be simplified as

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. [a] It is a common norm used to measure the length of a vector, with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$

$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)$

$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}_{d_1}$ has an L2-norm $\|\hat{v}_{d_1}\|_2 = 1.0$

Jaganadh G Elements of Text Mining
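A minimal sketch of my own (not from the slides) to verify the L2 normalization worked out above:

import math

v_d1 = [0, 0, 1, 2, 1, 1, 1]
l2 = math.sqrt(sum(x * x for x in v_d1))      # sqrt(8)
v_hat = [x / l2 for x in v_d1]
print(v_hat)                                  # [0.0, 0.0, 0.3535..., 0.7071..., ...]
print(math.sqrt(sum(x * x for x in v_hat)))   # 1.0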

IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the count of documents in which 't' is present (i.e. $tf(t, d) \neq 0$).

Jaganadh G Elements of Text Mining

Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

$idf(football) = \log \frac{2}{1+1} = 0.0$

$idf(games) = \log \frac{2}{1+1} = 0.0$

$idf(john) = \log \frac{2}{1+2} = -0.40546510810816444$

$idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444$

$idf(mary) = \log \frac{2}{1+1} = 0.0$

$idf(movies) = \log \frac{2}{1+1} = 0.0$

$idf(watch) = \log \frac{2}{1+1} = 0.0$

$idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)$

Jaganadh G Elements of Text Mining
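The idf values above can be reproduced with a few lines of Python. This is my own sketch (not from the deck), using the same 1 + |{d : t ∈ d}| smoothing as the slide:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

idf = [math.log(len(docs) / float(1 + sum(1 for d in docs if t in d)))
       for t in vocab]
print(idf)  # [0.0, 0.0, -0.405..., -0.405..., 0.0, 0.0, 0.0]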

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1,d_1) & tf(t_2,d_1) & tf(t_3,d_1) & tf(t_4,d_1) & tf(t_5,d_1) & tf(t_6,d_1) & tf(t_7,d_1) \\ tf(t_1,d_2) & tf(t_2,d_2) & tf(t_3,d_2) & tf(t_4,d_2) & tf(t_5,d_2) & tf(t_6,d_2) & tf(t_7,d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF weight

$\begin{bmatrix} tf(t_1,d_1) \times idf(t_1) & tf(t_2,d_1) \times idf(t_2) & \cdots & tf(t_7,d_1) \times idf(t_7) \\ tf(t_1,d_2) \times idf(t_1) & tf(t_2,d_2) \times idf(t_2) & \cdots & tf(t_7,d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$

Jaganadh G Elements of Text Mining
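Putting the pieces together, here is a small sketch of my own (assuming numpy is available; not code from the deck) that multiplies the term-frequency matrix by the idf diagonal and then L2-normalizes each row, mirroring the matrices above:

import numpy as np

M_tf = np.array([[0, 0, 1, 2, 1, 1, 1],
                 [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.4054651, -0.4054651, 0.0, 0.0, 0.0])

M_tfidf = M_tf * idf                       # same as M_tf.dot(np.diag(idf))
norms = np.linalg.norm(M_tfidf, axis=1, keepdims=True)
norms[norms == 0] = 1.0                    # guard against all-zero rows
M_tfidf_l2 = M_tfidf / norms
print(M_tfidf_l2)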

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')

docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')

bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()

for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. [a]

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()

ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()

itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) \cdot C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong) \cdot C(tea)}$

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \cdot mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \cdot \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. [8]

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier, or a classifier function $\gamma$, that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have

$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T  0.5    15   0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E  0.5    12   0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
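A small sketch of my own (not from the slides) that computes Laplace-smoothed conditional probabilities P(w|c) straight from the training count table; the structure of the `train` dictionary is an assumption.

train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']

cond_prob = {}
for label, rows in train.items():
    totals = [sum(col) for col in zip(*rows)]     # per-word counts in the class
    n = sum(totals)                               # total word frequency in the class
    cond_prob[label] = dict((w, (c + 1.0) / (n + len(vocab)))
                            for w, c in zip(vocab, totals))

print(cond_prob['Terrorism']['kill'])    # 5/21 = 0.238...
print(cond_prob['Terrorism']['music'])   # 1/21 = 0.0476...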

Testing the NB model

Our test document is:

Test Doc   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0     1    ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

log P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

log P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
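A log-space classification sketch of my own (math.log10 is used to match the base-10 figures above; the probability table simply reuses the slide's rounded values):

import math

priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
cond_prob = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

scores = {}
for label in priors:
    score = math.log10(priors[label])
    for word, count in test_doc.items():
        score += count * math.log10(cond_prob[label][word])
    scores[label] = score

print(scores)                       # {'Terrorism': -4.24..., 'Entertainment': -6.62...}
print(max(scores, key=scores.get))  # 'Terrorism'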

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. [a]

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
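A one-liner sketch of my own showing how the earlier count table is reduced to presence/absence for the multi-variate Bernoulli model:

counts = [[2, 1, 3, 0, 0, 1],
          [1, 1, 1, 0, 0, 0],
          [1, 1, 2, 0, 1, 0],
          [0, 1, 0, 2, 1, 1],
          [0, 0, 1, 1, 1, 0],
          [0, 0, 0, 2, 2, 2]]

binary = [[1 if c > 0 else 0 for c in row] for row in counts]
print(binary)  # matches the 0/1 table above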

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. [10] If the data is not linearly separable we have to use kernel tricks. [11]

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK [12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                     Positive              Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
          Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

Positive Precision = $\frac{562}{562 + 77} = 0.87$

Negative Precision = $\frac{436}{225 + 436} = 0.65$

Positive Recall = $\frac{562}{562 + 225} = 0.71$

Negative Recall = $\frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$Error\ Rate = \frac{FP + FN}{TP + FP + FN + TN}$

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

$Error\ Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP + TN}$

                     Actual
               Positive  Negative  Total
Predicted Positive  562        77    639
          Negative  225       436    661
Total               787       513   1300

$Fall\text{-}out = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = $2 \cdot \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

F1 Score (Negative) = $2 \cdot \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$

Jaganadh G Elements of Text Mining
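All of the scores above can be reproduced from the four confusion-matrix cells. This is my own sketch, not code from the deck:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(accuracy, precision, recall, f1)                 # 0.76..., 0.87..., 0.71..., 0.78...
print(error_rate, fall_out, npv, specificity, fdr, mcc)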

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
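As a small illustration of my own (assuming scikit-learn is installed; the labels and scores below are made-up toy data), roc_curve and auc from sklearn.metrics compute the FPR/TPR points and the area for a binary problem:

from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.7]    # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(auc(fpr, tpr))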

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., w_{i−1}) ≈ P(wi | w_{i−2}, w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi, wj) = log₂( P(wi, wj) / (P(wi) P(wj)) ) ≈ log₂( N·C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log₂( P(strong, tea) / (P(strong) P(tea)) ) ≈ log₂( N·C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo
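A quick usage sketch (not on the original slide), assuming the ngrams and mutual_info functions from the previous slides are already defined; the same pattern works for the tscore function shown later:

words = "strong tea and strong coffee keep me awake".lower().split()
scores = mutual_info(words)
for gram in sorted(scores, key=scores.get, reverse=True):
    print gram, scores[gram]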

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi))·mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi))·σ²(P(wj)) )

≈ ( C(wi, wj) − (1/N)·C(wi)·C(wj) ) / √( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N)·C(strong)·C(tea) ) / √( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.⁸

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci
            0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)
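A small numeric sketch of Bayes' rule with a single evidence; the numbers here are made up purely for illustration and are not from the slides:

# Hypothetical numbers: P(raining) = 0.3, P(dark cloud | raining) = 0.9, P(dark cloud) = 0.4
p_raining = 0.3
p_cloud_given_raining = 0.9
p_cloud = 0.4
p_raining_given_cloud = p_cloud_given_raining * p_raining / p_cloud
print p_raining_given_cloud   # 0.675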

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest score.
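A minimal Python sketch (not part of the original slides) of the same multinomial Naïve Bayes computation, Laplace smoothing included. It recounts the class word totals directly from the toy training table, so the Entertainment figures can differ slightly from the rounded values printed on the slide, but the decision is the same:

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
test = [2, 1, 2, 0, 0, 1]   # the test document Dt

n_docs = sum(len(d) for d in train.values())
for c, docs in train.items():
    prior = len(docs) / n_docs
    word_totals = [sum(col) for col in zip(*docs)]           # per-word counts in class c
    n_c = sum(word_totals)
    score = prior
    for i, freq in enumerate(test):
        p_w_c = (word_totals[i] + 1) / (n_c + len(vocab))    # Laplace smoothing
        score *= p_w_c ** freq
    print c, score
# Terrorism gets the larger score, so Dt is labelled 'Terrorism'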

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2·log(0.2380) + 1·log(0.1904) + 2·log(0.3333) + 0·log(0.0476) + 0·log(0.0952) + 1·log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437
P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2·log(0.0555) + 1·log(0.1111) + 2·log(0.1111) + 0·log(0.3333) + 0·log(0.2777) + 1·log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
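A small standalone sketch (my own addition) that repeats the log-space computation with the rounded probabilities from the table above:

from __future__ import division
import math

probs = {'Terrorism':     [0.5, 0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
         'Entertainment': [0.5, 0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
freqs = [2, 1, 2, 0, 0, 1]   # test-document counts for kill, bomb, kidnap, music, movie, tv

for c, p in probs.items():
    prior, conditionals = p[0], p[1:]
    log_score = math.log10(prior) + sum(f * math.log10(pc)
                                        for f, pc in zip(freqs, conditionals))
    print c, log_score
# Terrorism -> about -4.24, Entertainment -> about -6.63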

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.ᵃ

ᵃ A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
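A minimal sketch (my own illustration, not from the slides) of turning the count table into the Bernoulli presence/absence representation:

import numpy as np

counts = np.array([[2, 1, 3, 0, 0, 1],
                   [1, 1, 1, 0, 0, 0],
                   [1, 1, 2, 0, 1, 0],
                   [0, 1, 0, 2, 1, 1],
                   [0, 0, 1, 1, 1, 0],
                   [0, 0, 0, 2, 2, 2]])
bernoulli = (counts > 0).astype(int)   # 1 if the word occurs at least once, else 0
print bernoulli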

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
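A tiny sketch (illustrative only, with made-up numbers) of the linear decision function f(x) = ⟨w, x⟩ + b; the sign of f(x) gives the predicted label:

import numpy as np

w = np.array([0.4, -1.2, 0.7])   # hypothetical weight vector
b = -0.1                          # hypothetical bias
x = np.array([1.0, 0.0, 2.0])     # a new feature vector

f = np.dot(w, x) + b
label = 1 if f > 0 else -1
print f, label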

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.¹⁰ If the data is not linearly separable we have to use kernel tricks.¹¹

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted  Positive     True Positive (TP)    False Positive (FP)
           Negative     False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.ᵃ

F1 Score = 2 × (precision × recall) / (precision + recall)

ᵃ http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) )

= 0.55
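A small sketch (not on the slides) that recomputes the metrics above from the same confusion matrix (TP = 562, FP = 77, FN = 225, TN = 436):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)          # positive predictive value
recall      = TP / (TP + FN)          # sensitivity / hit rate
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)          # false positive rate
f1          = 2 * precision * recall / (precision + recall)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc         = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc
# roughly 0.76, 0.87, 0.71, 0.78, 0.55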

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
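A hedged sklearn sketch (my own addition, not from the slides) of plotting an ROC curve and computing AUC from predicted scores; y_true and y_scores here are made-up values:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                       # actual labels (hypothetical)
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]     # classifier scores (hypothetical)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print "AUC:", auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()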

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 13: Elements of Text Mining Part - I

Words Types and Ratio

Word Type Ratio

WTR(C) = WC(C)C(T ) where WC(C) is total number of words in corpus C and C(T ) is the

count of types in corpus C

Finding Word Type Ratio

def word_type_ratio(text)

words = tokenize(text)

ratio = len(words) len(set(words))

return ratio

text = I shot an elephant in my pajamas He saw the fine

fat trout in the brook

ratio = word_type_ratio(text)

print ratio

Jaganadh G Elements of Text Mining

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(textn=50)

wordfreq = word_count(text)

topwords = sorted(wordfreqiteritems() key = itemgetter(1)

reverse=True)[n]

return topwords

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

topwords = top_words(textn=50)

for word count in topwords

print s t d (wordcount)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np

import matplotlibpyplot as plt

def plot_freq(text)

tfw = top_words(text n= 20)

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

pltplot(xyrsquoborsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

pltxlabel(Word Ranking)

pltylabel(Word Frequency)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final scores you can observe that the values are scaled nicely.
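A minimal sketch of this log-space scoring in Python, plugging in the smoothed probabilities from the slides (the variable names and layout are mine):

from __future__ import division
import math

# Laplace-smoothed P(w|c) from the slide, and the test-document term counts
cond = {'T': [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
        'E': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
prior = {'T': 0.5, 'E': 0.5}
test_counts = [2, 1, 2, 0, 0, 1]        # kill, bomb, kidnap, music, movie, tv

def log_posterior(c):
    # sum of logs instead of a product of probabilities, to avoid underflow
    score = math.log10(prior[c])
    for p, n in zip(cond[c], test_counts):
        score += n * math.log10(p)
    return score

print log_posterior('T')   # about -4.24
print log_posterior('E')   # about -6.63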

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
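A tiny sketch of this binarization step in plain Python (the variable names are mine):

# term-count vectors from the multinomial example
count_vectors = [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
                 [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]

# Bernoulli representation: 1 if the word occurs at least once, else 0
bernoulli_vectors = [[1 if c > 0 else 0 for c in doc] for doc in count_vectors]

for vec in bernoulli_vectors:
    print vec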

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation; we can also use a smoothed bag of words model, like TF-IDF, in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

<w, xi> + b > 0 for yi = +1
<w, xi> + b < 0 for yi = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
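A small sketch of such a linear decision function with NumPy; the weight vector and bias below are made-up illustrative values, not learned ones:

import numpy as np

# made-up weight vector and bias for a 2-dimensional toy problem
w = np.array([0.4, -0.7])
b = 0.1

def predict(x):
    # the sign of <w, x> + b decides the class (+1 or -1)
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))    # falls on the +1 side of the hyperplane
print predict(np.array([-1.0, 1.5]))   # falls on the -1 side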

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                          Positive                Negative
Predicted   Positive      True Positive (TP)      False Positive (FP)
            Negative      False Negative (FN)     True Negative (TN)
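In practice such a table can be produced directly; a short sketch using scikit-learn's confusion_matrix on made-up labels (note that sklearn puts the actual classes on the rows):

from sklearn.metrics import confusion_matrix

# made-up gold labels and predictions, for illustration only
y_true = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# rows are actual classes, columns are predicted classes
print confusion_matrix(y_true, y_pred, labels=['pos', 'neg'])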

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436)

= 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of instances classified incorrectly:

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                           Actual
                       Positive   Negative   Total
Predicted  Positive       562         77       639
           Negative       225        436       661
           Total          787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55
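All of the metrics in this section can be reproduced from the four counts; a small sketch in plain Python using the numbers of the running example (variable names are mine):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)                 # also positive predictive value
recall      = TP / (TP + FN)                 # also sensitivity / hit rate
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)                 # false positive rate
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)                 # true negative rate
fdr         = FP / (FP + TP)
f1          = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc   # compare with the preceding slides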

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
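A sketch of how the ROC curve and AUC are typically computed with scikit-learn, on made-up labels and scores:

from sklearn.metrics import roc_curve, auc

# made-up true labels and classifier scores, for illustration only
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under that curve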

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 14: Elements of Text Mining Part - I

Finding top N words

Python code to find top N words from a text

from operator import itemgetter

def top_words(text, n=50):
    wordfreq = word_count(text)       # word_count() as defined earlier
    topwords = sorted(wordfreq.iteritems(), key=itemgetter(1),
                      reverse=True)[:n]
    return topwords

text = open('gpl-2.0.txt', 'r').read()
topwords = top_words(text, n=50)
for word, count in topwords:
    print "%s \t %d" % (word, count)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 20 words from a text

import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)        # top_words() from the previous slide
    x = range(len(tfw))
    y = [item[1] for item in tfw]      # word frequencies
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[i][1] for i in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)            # tokenize() from the earlier slides
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified.

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical dispersion plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text:

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times word mi co-occurs with word mj within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

        w1    w2    w3    ...   wn
  w1    m11   m12   m13   ...   m1n
  w2    m21   m22   m23   ...   m2n
  w3    m31   m32   m33   ...   m3n
  ...
  wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure: Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted in turn.]

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only looking at the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

~V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

~V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

~V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
~V_d1 = (0, 0, 1, 2, 1, 1, 1)

~V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| x F}

where |D| is the cardinality of the document space:

M_{|D| x F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v / ||~v||_p, where v is the unit vector (the normalized vector), ~v is the vector to be normalized, and ||~v||_p is the norm (magnitude or length) of the vector ~v in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm.6
Non-normalized vector ~v = (v1, v2, v3, ..., vn)
Length of vector ~v = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With norm: ||~v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||~v||_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

~v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1 = ~v_d1 / ||~v_d1||_2

v_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

v_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

v_d1 = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

v_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1 has an L2 norm ||v_d1||_2 = 1.0
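The same normalization can be checked quickly with NumPy (a sketch using the toy document vector above):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)     # divide by the L2 norm

print v_hat                             # [0. 0. 0.3535... 0.7071... ...]
print np.linalg.norm(v_hat)             # 1.0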

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents d in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0

idf(games) = log(2 / (1 + 1)) = 0.0

idf(john) = log(2 / (1 + 2)) = -0.40546510810816444

idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444

idf(mary) = log(2 / (1 + 1)) = 0.0

idf(movies) = log(2 / (1 + 1)) = 0.0

idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} × M_idf

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1)
  tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

= [ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
    tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()

for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too" \
        .lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = (mean(P(wi, wj)) - mean(P(wi)) mean(P(wj))) / sqrt(σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)))
          ≈ (C(wi, wj) - (1/N) C(wi) C(wj)) / sqrt(C(wi, wj))

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = (C(strong, tea) - (1/N) C(strong) C(tea)) / sqrt(C(strong, tea))

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) \
            / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = (P(E|H) × P(H)) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds. Then we have:

P(raining | dark cloud) = (P(dark cloud | raining) × P(raining)) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = (P(E1, E2, ..., En | H) × P(H)) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = (P(E1|H) × P(E2|H) × ... × P(En|H) × P(H)) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55

Jaganadh G Elements of Text Mining
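A quick Python check of the MCC value above (a small sketch added here; it is not from the original slides):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)   # 0.55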

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
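A small sketch, assuming scikit-learn is available (not part of the original slides), showing how ROC points and the AUC can be computed from a binary classifier's scores; the labels and scores below are made-up toy values:

from sklearn.metrics import roc_curve, auc

# hypothetical true labels and predicted scores for a binary classifier
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print auc(fpr, tpr)                                 # area under the curve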

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # newer NLTK versions use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 15: Elements of Text Mining Part - I

Plotting top N words

Python code to plot top 20 words from a text

import numpy as np
import matplotlib.pyplot as plt

def plot_freq(text):
    tfw = top_words(text, n=20)   # top_words() was defined in an earlier slide
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    plt.plot(x, y, 'bo', ls='dotted')
    plt.xticks(range(0, len(tfw) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    plt.xlabel("Word Ranking")
    plt.ylabel("Word Frequency")
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)   # top_words() was defined in an earlier slide
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    y = [tfw[item][1] for item in range(len(tfw))]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)   # top_words() was defined in an earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)   # tokenize() was defined in an earlier slide
    points = [(x, y) for x in range(len(wordst))
              for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3. Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)   # tokenize() was defined in an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4. https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times word m_i co-occurs with word m_j within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a. Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1    w2    w3    ...  wn
w1    m11   m12   m13   ...  m1n
w2    m21   m22   m23   ...  m2n
w3    m31   m32   m33   ...  m3n
...
wn    mn1   mn2   mn3   ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

(Figure) Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and the top four associated words; for each associated word, its associated words are plotted again.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a. http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.

d2: John also likes to watch football games.

Vocabulary (after removing stop words):
{'football': 0, 'games': 1, 'john': 2, 'likes': 3, 'mary': 4, 'movies': 5, 'watch': 6}

d1 -> [0, 0, 1, 2, 1, 1, 1]
d2 -> [1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a. Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# note: recent scikit-learn versions use ngram_range=(1, 1) instead of min_n=1
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5. http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just for sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
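A minimal Python sketch of tf(t, d) as defined above (added for illustration; it is not from the original slides):

def tf(term, doc_tokens):
    # tf(t, d): how many times term t occurs in the tokenized document d
    return sum(1 for token in doc_tokens if token == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)    # 1
print tf('likes', d1)   # 2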

Document Vector

To create a document vector space

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M_{|D| x F}

where |D| is the cardinality of the document space.

M_{|D| x F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_norm = v / ||v||_p, where v_norm is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue space). a

a. http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector v: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the general p-norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6. http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a. http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1_norm = v_d1 / ||v_d1||_2

          = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

          = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

          = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

          = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1_norm has L2-norm ||v_d1_norm||_2 = 1.0

Jaganadh G Elements of Text Mining
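The same normalization can be checked quickly with NumPy (a sketch added here; it is not part of the original slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the L2 norm
print v_norm                                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536]
print np.linalg.norm(v_norm, ord=2)           # 1.0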

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); for example, log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0

idf(games) = log(2 / (1 + 1)) = 0.0

idf(john) = log(2 / (1 + 2)) = -0.40546510810816444

idf(likes) = log(2 / (1 + 2)) = -0.40546510810816444

idf(mary) = log(2 / (1 + 1)) = 0.0

idf(movies) = log(2 / (1 + 1)) = 0.0

idf(watch) = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
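A short Python sketch of the idf definition above, checked on two of the terms (added for illustration; it is not from the original slides):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(len(docs) / (1 + df))    # idf(t) = log(|D| / (1 + df))

print round(idf('john', docs), 4)       # -0.4055 (appears in both documents)
print round(idf('football', docs), 4)   #  0.0    (appears in one document)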

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1)
  tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7)
  tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# note: recent scikit-learn versions use ngram_range=(1, 1) instead of min_n=1
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# print each vocabulary term with its idf weight
for word, idx in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[idx])

7. http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a. http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)   # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )

          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus, C(wi, wj) is the frequency of the bigram wi wj, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)   # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8. http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

(Figure) A supervised document classification pipeline. 9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior (a priori) probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining
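A tiny numeric sketch of Bayes' rule (added for illustration; the probability values below are made-up assumptions, not from the original slides):

# hypothetical values: P(dark cloud | raining), P(raining), P(dark cloud)
p_e_given_h = 0.8
p_h = 0.3
p_e = 0.4

p_h_given_e = p_e_given_h * p_h / p_e   # Bayes' rule
print p_h_given_e                       # 0.6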

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv)   (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class.
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
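A small Python sketch (added for illustration; it is not part of the original slides) that reproduces the worked example: it computes the log-posterior of the test document for each class from the smoothed conditional probabilities above.

import math

# smoothed class-conditional probabilities P(w|c) from the model above,
# in the order (kill, bomb, kidnap, music, movie, tv)
p_w_terror = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_w_entert = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
prior = 0.5
test_counts = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

def log_posterior(p_words, counts, prior):
    # log P(c|W) ~ log P(c) + sum_j count_j * log P(w_j|c)
    return math.log10(prior) + sum(c * math.log10(p) for p, c in zip(p_words, counts))

print log_posterior(p_w_terror, test_counts, prior)   # about -4.24  -> Terrorism
print log_posterior(p_w_entert, test_counts, prior)   # about -6.63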

Naıve Bayes Classifier

There are two different ways to set up a Naive Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a. Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
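A one-line sketch of this binarization (an illustration added here, not from the original slides), turning term-frequency vectors from the earlier table into presence/absence vectors:

freq_vectors = [[2, 1, 3, 0, 0, 1],   # D0 term frequencies (kill, bomb, kidnap, music, movie, tv)
                [0, 1, 0, 2, 1, 1]]   # D3 term frequencies

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary_vectors = [[1 if count > 0 else 0 for count in vec] for vec in freq_vectors]
print binary_vectors   # [[1, 1, 1, 0, 0, 1], [0, 1, 0, 1, 1, 1]]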

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10. This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict() returns an array; take its first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict() returns an array; take its first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining


Page 16: Elements of Text Mining Part - I

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting top N words

Python code to plot top 50 words from a text This plot will show words in the plot

import numpy as np

import matplotlibpyplot as plt

def plot_freq_tag(text)

tfw = top_words(text n= 50)

words = [tfw[i][0] for i in range(len(tfw))]

x = range(len(tfw))

np = len(tfw)

y = []

for item in range(np)

y = y + [tfw[item][1]]

fig = pltfigure()

ax = figadd_subplot(111xlabel=Word Rankylabel=Word Freqquncy)

axset_title(rsquoTop 50 wordsrsquo)

axplot(x y rsquogo-rsquols=rsquodottedrsquo)

pltxticks(range(0 len(words) + 1 1))

pltyticks(range(0 max(y) + 1 10))

for i label in enumerate(words)

plttext (x[i] y[i] label rotation=45)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_freq(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np

import matplotlibpyplot as plt

def plot_hist(text)

tw = top_words(text)

words = [tw[i][0] for i in range(len(tw))]

freq = [tw[j][1] for j in range(len(tw))]

pos = nparange(len(words))

width = 10

ax = pltaxes(frameon=True)

axset_xticks(pos)

axset_yticks(range(0max(freq)10))

axset_xticklabels(wordsrotation=rsquoverticalrsquofontsize=9)

pltbar(posfreqwidth color=rsquobrsquo)

pltshow()

text = open(rsquogpl-20txtrsquorsquorrsquo)read()

plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
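A quick usage sketch (reusing the tokenize function defined earlier in these slides):

doc = "John likes to watch movies and Mary likes movies too"
print(stop_filter(tokenize(doc)))
# expected: ['john', 'likes', 'watch', 'movies', 'mary', 'likes', 'movies']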

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text using only the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
'likes': 3, 'john': 2, 'mary': 4}

(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just a toy sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf\text{-}idf(t, d) = tf(t, d) \times idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = \sum_{x \in d} fr(x, t)

where

fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the number of documents in which 't' is present (i.e. tf(t, d) \neq 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = \sum_{x \in d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

\vec{v}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))

To represent 'd1' and 'd2' as vectors:

\vec{v}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))

\vec{v}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to
\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)
\vec{v}_{d_2} = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M_{|D| \times F}

where |D| is the cardinality of the document space and F is the number of features (the vocabulary size):

M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as \hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}, where \hat{v} is the unit vector (the normalized vector), \vec{v} is the vector to be normalized, and \|\vec{v}\|_p is the norm (magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, \ldots, v_n)
Length of the vector: \|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}

With the p-norm: \|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}

It can be written compactly as

\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, obtained with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)

\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}

\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}

\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}

\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)

\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector \hat{v}_{d_1} has an L2 norm \|\hat{v}_{d_1}\|_2 = 1.0.
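A quick numpy sketch (not part of the original slides) that reproduces this normalization:

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_d1_hat = v_d1 / np.linalg.norm(v_d1, ord=2)    # divide by the L2 norm, sqrt(8)
print(v_d1_hat)                                  # [0. 0. 0.3536 0.7071 0.3536 0.3536 0.3536] (approx.)
print(np.linalg.norm(v_d1_hat, ord=2))           # 1.0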

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}

where |\{d : t \in d\}| is the number of documents in which 't' is present (i.e. tf(t, d) \neq 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}, e.g. \log \frac{2}{1} = 0.69314718

idf(football) = \log \frac{2}{1+1} = 0.0
idf(games) = \log \frac{2}{1+1} = 0.0
idf(john) = \log \frac{2}{1+2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1+1} = 0.0
idf(movies) = \log \frac{2}{1+1} = 0.0
idf(watch) = \log \frac{2}{1+1} = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
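A short plain-Python sketch (using the natural log, as above) that reproduces these values:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

idf = [math.log(len(docs) / (1.0 + sum(1 for d in docs if t in d))) for t in vocab]
print(zip(vocab, idf))
# football, games, mary, movies, watch -> 0.0; john, likes -> -0.405465...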

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| \times F} \times M_{idf}

\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}

Jaganadh G Elements of Text Mining


TF-IDF weight

\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) \, C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong) \, C(tea)}

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo
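A quick usage sketch on the toy sentence (assuming the ngrams function above is in scope):

words = "John likes to watch movies Mary likes too".lower().split()
mi = mutual_info(words)
print(mi['john likes'])   # log2(8 * 1 / (1 * 2)) = 2.0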

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) \, C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) \, C(tea)}{\sqrt{C(strong, tea)}}

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents;

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance;

unsupervised document classification is entirely executed without reference to external information.

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, \ldots, c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, \ldots, d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (classifier function) \gamma that maps documents to classes: \gamma : D \rightarrow C.
A supervised learning algorithm \Gamma takes a training set T and emits a learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:

P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

With the independence assumption we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt        2     1     2       0      0      1   -

To find the posterior probability:

P(c_i | W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j | c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) x P(kill|Terrorism)^2 x P(bomb|Terrorism)^1 x P(kidnap|Terrorism)^2 x P(music|Terrorism)^0 x P(movie|Terrorism)^0 x P(tv|Terrorism)^1

= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1

= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5

P(Entertainment|W) = P(Entertainment) x P(kill|Entertainment)^2 x P(bomb|Entertainment)^1 x P(kidnap|Entertainment)^2 x P(music|Entertainment)^0 x P(movie|Entertainment)^0 x P(tv|Entertainment)^1

= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1

= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7

The document is classified as "Terrorism" because that class got the highest value.
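A small plain-Python sketch (a toy illustration, using the smoothed conditional probabilities exactly as listed in the model table above) that roughly reproduces these figures:

# Laplace-smoothed conditional probabilities, as listed in the model table
p_terror = {'kill': 5/21., 'bomb': 4/21., 'kidnap': 7/21.,
            'music': 1/21., 'movie': 2/21., 'tv': 2/21.}
p_entert = {'kill': 1/18., 'bomb': 2/18., 'kidnap': 2/18.,
            'music': 6/18., 'movie': 5/18., 'tv': 2/18.}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    score = prior
    for word, count in doc.items():
        score *= cond_probs[word] ** count
    return score

print(posterior(0.5, p_terror, test_doc))  # ~5.7e-05
print(posterior(0.5, p_entert, test_doc))  # ~2.3e-07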

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation; we can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.
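A minimal scikit-learn sketch (a toy illustration built on the terrorism/entertainment counts from the worked example; not part of the original slides) contrasting the two event models. Note that sklearn's BernoulliNB binarizes the counts internally, mirroring the presence/absence representation above:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB
import numpy as np

# rows: D0..D5; columns: kill, bomb, kidnap, music, movie, tv
X = np.array([[2, 1, 3, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 2, 2, 2]])
y = ['T', 'T', 'T', 'E', 'E', 'E']
test = np.array([[2, 1, 2, 0, 0, 1]])   # the test document Dt

mnb = MultinomialNB(alpha=1.0).fit(X, y)              # uses the counts themselves
bnb = BernoulliNB(alpha=1.0, binarize=0.5).fit(X, y)  # uses only presence/absence
print(mnb.predict(test), bnb.predict(test))           # both should predict ['T']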

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, either (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
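A minimal scikit-learn sketch (a toy 2-D linearly separable set, not from the slides) showing the learned weight vector w and bias b:

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = [1, 1, 1, -1, -1, -1]

svm = LinearSVC(C=1.0).fit(X, y)
print(svm.coef_, svm.intercept_)                 # w and b of the separating hyperplane
print(svm.predict([[2.0, 2.0], [-2.0, -2.0]]))   # expected: [ 1 -1]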

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK. 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)
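A quick scikit-learn sketch (toy labels, not from the slides) that produces such a matrix; note that sklearn puts the actual classes on the rows and the predicted classes on the columns:

from sklearn.metrics import confusion_matrix

y_true = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
y_pred = ['pos', 'pos', 'neg', 'neg', 'neg', 'pos']
print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))
# [[2 1]
#  [1 2]]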

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                     Actual
                     Positive   Negative   Total
Predicted  Positive  562        77         639
           Negative  225        436        661
           Total     787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 x (precision x recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
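A plain-Python sketch (not part of the original slides) that reproduces the evaluation numbers above from the same confusion matrix:

import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / float(TP + FP + FN + TN)           # ~0.77
precision   = TP / float(TP + FP)                            # ~0.88
recall      = TP / float(TP + FN)                            # ~0.71
f1          = 2 * precision * recall / (precision + recall)  # ~0.79
error_rate  = (FP + FN) / float(TP + FP + FN + TN)           # ~0.23
fallout     = FP / float(FP + TN)                            # ~0.15
npv         = TN / float(TN + FN)                            # ~0.66
specificity = TN / float(FP + TN)                            # ~0.85
fdr         = FP / float(FP + TP)                            # ~0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))      # ~0.55
print(accuracy, precision, recall, f1, mcc)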

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
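A short scikit-learn sketch (toy scores, not from the slides) computing an ROC curve and its AUC:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 1, 0, 0, 0]                  # actual labels
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]     # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                         # expected: about 0.89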

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning :: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning :: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 17: Elements of Text Mining Part - I

Plotting top N words

Python code to plot the top 50 words from a text. This plot will show the words in the plot.

import numpy as np
import matplotlib.pyplot as plt

def plot_freq_tag(text):
    tfw = top_words(text, n=50)   # top_words function from an earlier slide
    words = [tfw[i][0] for i in range(len(tfw))]
    x = range(len(tfw))
    npoints = len(tfw)
    y = []
    for item in range(npoints):
        y = y + [tfw[item][1]]
    fig = plt.figure()
    ax = fig.add_subplot(111, xlabel="Word Rank", ylabel="Word Frequency")
    ax.set_title('Top 50 words')
    ax.plot(x, y, 'go-', ls='dotted')
    plt.xticks(range(0, len(words) + 1, 1))
    plt.yticks(range(0, max(y) + 1, 10))
    for i, label in enumerate(words):
        plt.text(x[i], y[i], label, rotation=45)
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_freq_tag(text)

Jaganadh G Elements of Text Mining

Top N Words

Plot of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)   # top_words function from an earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list. 3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)   # tokenize function from an earlier slide
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud. 4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)   # tokenize function from an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = log2 [ P(w_i, w_j) / (P(w_i) P(w_j)) ] ≈ log2 [ N C(w_i, w_j) / (C(w_i) C(w_j)) ]

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

I(strong, tea) = log2 [ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2 [ N C(strong, tea) / (C(strong) C(tea)) ]

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) /
            math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = [ mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ] / sqrt( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ [ C(w_i, w_j) − (1/N) C(w_i) C(w_j) ] / sqrt( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
            math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining
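A quick usage sketch for the two association measures above. This assumes the ngrams, mutual_info, and tscore functions from the previous slides are already defined in the same session; the sample sentence is only an illustration.

words = "strong tea is better than weak tea she said".lower().split()

mi = mutual_info(words)   # bigram -> mutual information
ts = tscore(words)        # bigram -> t-score

# rank bigrams by mutual information, strongest association first
for gram, score in sorted(mi.items(), key=lambda x: x[1], reverse=True):
    print gram, round(score, 3), round(ts[gram], 3)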

Document Classification

Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or classifier function, γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = [ P(E|H) × P(H) ] / P(E)
The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = [ P(dark cloud | raining) × P(raining) ] / P(dark cloud)
For multiple evidences:
P(H | E1, E2, ..., En) = [ P(E1, E2, ..., En | H) × P(H) ] / P(E1, E2, ..., En)
With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = [ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ] / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining
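As a quick numeric illustration of the rule (the probability values below are made up purely for this example, not taken from any data set):

# hypothetical numbers, only to illustrate Bayes' rule
p_rain = 0.2              # P(H): prior probability of raining
p_cloud_given_rain = 0.9  # P(E|H): probability of dark cloud when raining
p_cloud = 0.3             # P(E): overall probability of dark cloud

# P(H|E) = P(E|H) x P(H) / P(E)
p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print p_rain_given_cloud  # 0.6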

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6   T 0.5   15  0.238095238 0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
    E 0.5   12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
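The Laplace-smoothed estimates above are easy to check with a couple of lines of Python. This is only a sketch of the arithmetic on this slide, using the counts as stated (|V| = 6, n_terrorism = 15, n_entertainment = 12, and the frequency of "kill" in each class):

V = 6  # vocabulary size

def laplace(word_count_in_class, total_words_in_class, vocab_size=V):
    # add-one (Laplace) smoothed estimate of P(w|c)
    return (word_count_in_class + 1.0) / (total_words_in_class + vocab_size)

print laplace(4, 15)  # P(kill|Terrorism)     = 5/21 = 0.238...
print laplace(0, 12)  # P(kill|Entertainment) = 1/18 = 0.055...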

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1

To find the posterior probability:

P(c_i|W) = P(c_i) × ∏_{j=1}^{|V|} P(w_j|c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak:
suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
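The log-space arithmetic above can be reproduced with a short script; this is only a sketch using the smoothed probabilities from the model table, base-10 logarithms, and the test document's word counts (2, 1, 2, 0, 0, 1):

import math

# smoothed P(w|c), in vocabulary order: kill, bomb, kidnap, music, movie, tv
p_terrorism     = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_counts     = [2, 1, 2, 0, 0, 1]  # word frequencies in the test document
prior = 0.5

def log_score(cond_probs, counts, prior):
    # log P(c) + sum over words of count * log P(w|c)
    return math.log10(prior) + sum(c * math.log10(p)
                                   for p, c in zip(cond_probs, counts))

print log_score(p_terrorism, test_counts, prior)      # about -4.24
print log_score(p_entertainment, test_counts, prior)  # about -6.63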

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
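Converting the multinomial count vectors into the Bernoulli (presence/absence) representation is a one-liner; a small sketch using the document vectors from the training table:

# term-frequency vectors in vocabulary order: kill, bomb, kidnap, music, movie, tv
count_vectors = [
    [2, 1, 3, 0, 0, 1],  # D0
    [1, 1, 1, 0, 0, 0],  # D1
    [1, 1, 2, 0, 1, 0],  # D2
    [0, 1, 0, 2, 1, 1],  # D3
    [0, 0, 1, 1, 1, 0],  # D4
    [0, 0, 0, 2, 2, 2],  # D5
]

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary_vectors = [[1 if c > 0 else 0 for c in doc] for doc in count_vectors]
for doc in binary_vectors:
    print doc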

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}, i = 1..n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, whether (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
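The prediction rule itself is just the sign of ⟨w, x⟩ + b. A tiny sketch with a hand-picked weight vector and bias (the numbers are illustrative only, not a trained model):

# hypothetical weight vector and bias for a 2-dimensional feature space
w = [2.0, -1.0]
b = -0.5

def predict(x):
    # sign of the decision function <w, x> + b
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score > 0 else -1

print predict([1.0, 0.5])  # 2*1.0 - 1*0.5 - 0.5 =  1.0 -> +1
print predict([0.0, 1.0])  #        -1.0    - 0.5 = -1.5 -> -1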

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edges of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
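Before moving on to the evaluation metrics in the next slides, either pipeline can be sanity-checked on held-out documents. A rough sketch reusing the classifier and training_data objects from the slide above (a crude slice-based split; load_files was called with shuffle=True, but in practice you would cross-validate):

# keep the last 200 documents aside for testing
train_docs, test_docs = training_data.data[:-200], training_data.data[-200:]
train_y, test_y = training_data.target[:-200], training_data.target[-200:]

classifier.fit(train_docs, train_y)
predicted = classifier.predict(test_docs)

correct = sum(1 for p, t in zip(predicted, test_y) if p == t)
print "held-out accuracy:", correct / float(len(test_y))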

Practice Time

Let's try to build a Multi-variate Naïve Bayes classifier with Python NLTK12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/
text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                            Actual
                            Positive                Negative
Predicted    Positive       True Positive (TP)      False Positive (FP)
             Negative       False Negative (FN)     True Negative (TN)

Jaganadh G Elements of Text Mining
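With scikit-learn the same table can be produced directly from true and predicted labels (sklearn.metrics.confusion_matrix); a small sketch with toy label lists. Note that sklearn orders rows/columns by label value, with rows as actual classes and columns as predicted classes:

from sklearn.metrics import confusion_matrix

# toy example: 1 = Positive, 0 = Negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]

print confusion_matrix(y_true, y_pred)
# [[2 2]
#  [1 3]]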

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                            Actual
                            Positive    Negative    Total
Predicted    Positive       562         77          639
             Negative       225         436         661
             Total          787         513         1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive F1 Score = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

Negative F1 Score = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining
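All of the scores in this section are simple functions of the four confusion-matrix cells, so they can be computed in one place. A small sketch that reproduces the running example (TP = 562, FP = 77, FN = 225, TN = 436; the printed values match the worked figures above up to rounding):

import math

def classifier_scores(tp, fp, fn, tn):
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return {
        "accuracy":  (tp + tn) / float(tp + fp + fn + tn),
        "precision": precision,
        "recall":    recall,
        "f1":        2 * precision * recall / (precision + recall),
        "mcc":       (tp * tn - fp * fn) /
                     math.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

for name, value in classifier_scores(562, 77, 225, 436).items():
    print name, value
# accuracy 0.76..., precision 0.87..., recall 0.71..., f1 0.78..., mcc 0.55...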

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
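scikit-learn can compute the ROC points and the AUC directly from true labels and classifier scores (sklearn.metrics.roc_curve and sklearn.metrics.auc); a minimal sketch with toy scores. In a real experiment the scores would come from decision_function or predict_proba of the classifier.

from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print "AUC:", auc(fpr, tpr)  # 0.75 for this toy data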

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting terms with Python topia.termextract13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way. If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, exactly equal in distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                          Positive               Negative
Predicted   Positive      True Positive (TP)     False Positive (FP)
            Negative      False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
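The same arithmetic as a small Python sketch, using the counts from the table above (values match the slides up to rounding):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)          # ~0.768
precision = TP / (TP + FP)                          # ~0.879
recall = TP / (TP + FN)                             # ~0.714
f1 = 2 * precision * recall / (precision + recall)  # ~0.789
print accuracy, precision, recall, f1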

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
            Total         787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
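A minimal sketch with scikit-learn; the labels and scores below are invented just to show the API. roc_curve returns the false positive and true positive rates at each threshold, and auc integrates the curve:

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])                      # actual classes (toy)
y_score = np.array([0.9, 0.8, 0.65, 0.6, 0.55, 0.4, 0.3, 0.1])   # predicted scores (toy)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)
# plt.plot(fpr, tpr) would draw the ROC curve itself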

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 19: Elements of Text Mining Part - I

Plotting histogram of top N words

Python code to plot histogram of top 50 words

import numpy as np
import matplotlib.pyplot as plt

def plot_hist(text):
    tw = top_words(text)  # top_words() from an earlier slide
    words = [tw[i][0] for i in range(len(tw))]
    freq = [tw[j][1] for j in range(len(tw))]
    pos = np.arange(len(words))
    width = 1.0
    ax = plt.axes(frameon=True)
    ax.set_xticks(pos)
    ax.set_yticks(range(0, max(freq), 10))
    ax.set_xticklabels(words, rotation='vertical', fontsize=9)
    plt.bar(pos, freq, width, color='b')
    plt.show()

text = open('gpl-2.0.txt', 'r').read()
plot_hist(text)

Jaganadh G Elements of Text Mining

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows the position of a word in a given text. I have created some Python code to generate a lexical dispersion plot with reference to a given word list.3

import matplotlib.pyplot as plt

def dispersion_plot(text, words):
    wordst = tokenize(text)  # tokenize() from an earlier slide
    points = [(x, y) for x in range(len(wordst))
        for y in range(len(words)) if wordst[x] == words[y]]
    if points:
        x, y = zip(*points)
    else:
        x = y = ()
    plt.plot(x, y, "go", scalex=2)
    plt.yticks(range(len(words)), words, color="b")
    plt.ylim(-1, len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

gpl = open('gpl-2.0.txt', 'r').read()
dispersion_plot(gpl, ['software', 'license', 'copy', 'gnu', 'program', 'free'])

3 Code taken from http://nltk.googlecode.com/svn/trunk/doc/api/nltk.draw.dispersion-pysrc.html and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation for text data, typically used to depict keyword meta-data (tags) on websites or to visualize free form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud.4 Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)  # tokenize() from an earlier slide
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times m_i co-occurs with word m_j within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation.a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

       w1    w2    w3   ...   wn
w1     m11   m12   m13  ...   m1n
w2     m21   m22   m23  ...   m2n
w3     m31   m32   m33  ...   m3n
...
wn     mn1   mn2   mn3  ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)
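For instance, individual cells of the returned matrix can be read directly (using the toy corpus above):

print ccm['w1']['w2']  # 1: 'w1' and 'w2' co-occur once
print ccm['w1']['w4']  # 0: they never appear in the same corpus entry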

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n is a parameter of the scikit-learn API of the time
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents where 't' is present and tf(t, d) ≠ 0,
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V_dn = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V_d1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
V_d2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| × F}, where |D| is the cardinality of the document space:

M_{|D| × F} = [ 0 0 1 2 1 1 1
                1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ‖v‖_p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ‖v‖_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ‖v‖ = √(v1² + v2² + v3² + ... + vn²)

With norm: ‖v‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v‖_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v_d1 = (0, 0, 1, 2, 1, 1, 1)

v̂_d1 = v_d1 / ‖v_d1‖_2

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v̂_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂_d1 has an L2-norm ‖v̂_d1‖_2 = 1.0
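The same computation, sketched with NumPy (np.linalg.norm computes the L2 norm by default):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)   # divide by sqrt(8)
print v_hat                           # ~(0, 0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
print np.linalg.norm(v_hat)           # 1.0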

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents where 't' is present and tf(t, d) ≠ 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) ), e.g. log(2/1) = 0.69314718

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
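These values can be reproduced with a few lines of Python (a sketch; log here is the natural logarithm, as in the slide):

import math

D = 2  # number of documents
df = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
    'mary': 1, 'movies': 1, 'watch': 2}  # document frequency of each term

for term in sorted(df):
    print "%s => %f" % (term, math.log(float(D) / (1 + df[term])))
# e.g. idf(john) = log(2/3) = -0.405465, idf(football) = log(2/2) = 0.0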

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| × F} × M_idf

[ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
  tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]
×
[ idf(t1)  0        0        0        0        0        0
  0        idf(t2)  0        0        0        0        0
  0        0        idf(t3)  0        0        0        0
  0        0        0        idf(t4)  0        0        0
  0        0        0        0        idf(t5)  0        0
  0        0        0        0        0        idf(t6)  0
  0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2
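Row-wise, this is what the normalization does; a NumPy sketch using the tf-idf values from the worked example above (scikit-learn's TfidfTransformer(norm='l2') applies the same row normalization as its final step):

import numpy as np

tf_idf = np.array([[0.0, 0.0, -0.4054, -0.8109, 0.0, 0.0, 0.0],
    [0.0, 0.0, -0.4054, -0.4054, 0.0, 0.0, 0.0]])
row_norms = np.linalg.norm(tf_idf, axis=1, keepdims=True)  # L2 norm of each document row
print tf_idf / row_norms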

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is a parameter of the scikit-learn API of the time
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )
            ≈ ( C(w_i, w_j) - (1/N) C(w_i) C(w_j) ) / √( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed. P(H) is called the prior probability: the probability of an event before the evidence is observed. P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
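A toy numeric sketch (all probabilities below are invented, just to show the arithmetic of the rule with two pieces of evidence):

p_e1_given_h = 0.8   # P(dark cloud | raining), invented
p_e2_given_h = 0.7   # P(strong wind | raining), invented
p_h = 0.3            # P(raining), invented
p_e1_e2 = 0.2        # P(dark cloud, strong wind), invented

posterior = (p_e1_given_h * p_e2_given_h * p_h) / p_e1_e2
print posterior      # 0.84 = P(raining | dark cloud, strong wind)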

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill bomb kidnap music movie tv   C
Dt          2    1    2      0     0     1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{V} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
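The whole worked example can be reproduced with a short script; this sketch takes the Laplace-smoothed P(w|C) values straight from the model table above and uses log10 so the numbers match the slide:

import math

# order of terms: kill, bomb, kidnap, music, movie, tv
p_w_given_c = {'Terrorism': [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952],
    'Entertainment': [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test = [2, 1, 2, 0, 0, 1]  # word frequencies in the test document Dt

for label in p_w_given_c:
    log_score = math.log10(priors[label])
    for freq, p in zip(test, p_w_given_c[label]):
        log_score += freq * math.log10(p)
    print "%s => %f" % (label, log_score)
# Terrorism => about -4.24, Entertainment => about -6.63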

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining


Page 20: Elements of Text Mining Part - I

Histogram

Histogram of top 50 words from GPL v2 (without filtering stop words)

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image, make_tags
from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    # tokenize() is the function defined in the Tokenization section
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
                     fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()
create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N × N matrix, where N corresponds to the total number of unique words in a corpus. A cell mij contains the number of times mi co-occurs with word mj within a specific context (a natural unit such as a sentence, or a certain window of m words). Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

        w1    w2    w3   ...   wn
  w1   m11   m12   m13   ...   m1n
  w2   m21   m22   m23   ...   m2n
  w3   m31   m32   m33   ...   m3n
  ...
  wn   mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top word and its top four associated words; for each associated word, its associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining
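Instead of maintaining such a list by hand, the same filtering can also be done with the stop word corpus that ships with NLTK. This is only a hedged alternative sketch (it assumes the NLTK stopwords corpus has been downloaded; the list it uses is not identical to the one above).

from nltk.corpus import stopwords

def stop_filter_nltk(words):
    stops = set(stopwords.words('english'))   # requires nltk.download('stopwords')
    return [word for word in words if word not in stops]

print stop_filter_nltk("john likes to watch movies".split())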

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag-of-words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n belongs to the older (2012-era) scikit-learn API used in these slides;
# newer releases use ngram_range instead
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just for a sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    # tokenize() is the function defined in the Tokenization section
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games.")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd' and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = ∑x∈d fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. tf(t, d) ≠ 0;
|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = ∑x∈d fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
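A quick way to sanity-check the definition is to compute the raw term frequencies of the two running example documents in plain Python. This is only an illustrative sketch; the documents and the term order are the ones used throughout these slides.

docs = {'d1': "john likes to watch movies mary likes too".split(),
        'd2': "john also likes to watch football games".split()}
terms = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

for name, doc in sorted(docs.items()):
    tf = [doc.count(t) for t in terms]   # tf(t, d) = number of occurrences of t in d
    print name, tf
# d1 [0, 0, 1, 2, 1, 1, 1]
# d2 [1, 1, 1, 1, 0, 0, 1]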

Document Vector

To create a document vector space

V(dn) = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )

To represent 'd1' and 'd2' as vectors:

V(d1) = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V(d2) = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:
V(d1) = (0, 0, 1, 2, 1, 1, 1)
V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M|D|×F, where |D| is the cardinality of the document space:

M|D|×F = [ 0 0 1 2 1 1 1
           1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||p, where v̂ is the unit vector (the normalized vector), v is the vector to be normalized, and ||v||p is the norm (magnitude or length) of the vector v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm 6.
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = √( v1² + v2² + v3² + ... + vn² )

With norm: ||v||p = ( |v1|^p + |v2|^p + |v3|^p + ... + |vn|^p )^(1/p)

It can be simplified as:

||v||p = ( ∑i=1..n |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v(d1) / ||v(d1)||2

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / √8

v̂(d1) = ( 0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8 )

v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂(d1) has an L2-norm ||v̂(d1)||2 = 1.0

Jaganadh G Elements of Text Mining
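The same normalization can be checked numerically. A minimal sketch with NumPy (NumPy is an assumption here; it is not used elsewhere in these slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)      # divide by the L2 norm
print v_hat                              # matches the hand computation above (up to rounding)
print np.linalg.norm(v_hat)              # 1.0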

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present, i.e. tf(t, d) ≠ 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ),  e.g. log(2/1) = 0.69314718

idf(football) = log( 2 / (1+1) ) = 0.0
idf(games)    = log( 2 / (1+1) ) = 0.0
idf(john)     = log( 2 / (1+2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1+2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1+1) ) = 0.0
idf(movies)   = log( 2 / (1+1) ) = 0.0
idf(watch)    = log( 2 / (1+1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
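These values can be reproduced with a few lines of Python. A small sketch following the slide's formula (natural log, with the +1 in the denominator); the toy documents are the running d1/d2 example:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for d in docs if term in d)        # |{d : t in d}|
    return math.log(len(docs) / float(1 + df))    # log(|D| / (1 + df))

print [round(idf(t, docs), 4) for t in vocab]
# [0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0]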

TF-IDF weight

Finding TF-IDF weight

M|D|×F × Midf

M|D|×F = [ tf(t1, d1) tf(t2, d1) tf(t3, d1) tf(t4, d1) tf(t5, d1) tf(t6, d1) tf(t7, d1)
           tf(t1, d2) tf(t2, d2) tf(t3, d2) tf(t4, d2) tf(t5, d2) tf(t6, d2) tf(t7, d2) ]

Midf = diag( idf(t1), idf(t2), idf(t3), idf(t4), idf(t5), idf(t6), idf(t7) )

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining
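A hedged NumPy sketch of this matrix product, using the tf matrix and the (rounded) idf vector from the running example; NumPy is an assumption here, the slides themselves compute this step with scikit-learn below.

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],
               [1, 1, 1, 1, 0, 0, 1]], dtype=float)
idf = np.array([0.0, 0.0, -0.4055, -0.4055, 0.0, 0.0, 0.0])

tfidf = tf.dot(np.diag(idf))     # M|D|xF x Midf
print tfidf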

TF-IDF Normalization

L2 Normalization

M(tf-idf) = M(tf-idf) / ||M(tf-idf)||2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n belongs to the older (2012-era) scikit-learn API used in these slides
vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) \
            / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / √( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )

          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / √C(wi, wj)

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will be like:

|V|   C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)    P(movie|Ci)   P(tv|Ci)
 6    T    0.5    15   0.238095238   0.19047619   0.33333333     0.047619048    0.095238095   0.095238095
      E    0.5    12   0.05555556    0.11111111   0.11111111     0.33333333     0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n(terrorism) = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n(entertainment) = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1) / 15 = 4/15
P(kill|Entertainment) = (0+0+0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation, assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv    C
Dt           2     1      2       0      0     1    ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏j=1..|V| P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)² × P(bomb|Terrorism)¹ × P(kidnap|Terrorism)² × P(music|Terrorism)⁰ × P(movie|Terrorism)⁰ × P(tv|Terrorism)¹

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)² × P(bomb|Entertainment)¹ × P(kidnap|Entertainment)² × P(music|Entertainment)⁰ × P(movie|Entertainment)⁰ × P(tv|Entertainment)¹

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/FileDL/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
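The whole worked example (training counts, Laplace smoothing and log-space scoring) can be reproduced with a short script. This is only an illustrative sketch: it recomputes every probability from the raw counts in the training table, so the printed scores may differ slightly from the rounded figures used in the hand calculation above.

from __future__ import division
import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
         'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]}
test = [2, 1, 2, 0, 0, 1]                            # the test document Dt

n_docs = sum(len(docs) for docs in train.values())
for label, docs in train.items():
    prior = len(docs) / n_docs                       # P(Ci)
    word_totals = [sum(col) for col in zip(*docs)]   # per-word counts in this class
    n = sum(word_totals)                             # ni
    log_post = math.log10(prior)                     # log-space to prevent underflow
    for i, count in enumerate(test):
        p_w = (word_totals[i] + 1) / (n + len(vocab))   # Laplace smoothing
        log_post += count * math.log10(p_w)
    print label, round(log_post, 4)
# the class with the larger (less negative) log score wins: Terrorism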

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
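For completeness, a hedged scikit-learn sketch of the multi-variate Bernoulli setup: CountVectorizer(binary=True) produces the 0/1 document vectors described above and BernoulliNB fits that event model. The toy documents and labels below are made up for illustration; they are not the slide's training table.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs   = ['kill bomb kidnap kidnap kidnap tv',      # toy terrorism-like doc
          'music movie tv music',                   # toy entertainment-like doc
          'kill bomb kidnap movie']
labels = ['terrorism', 'entertainment', 'terrorism']

vectorizer = CountVectorizer(binary=True)            # presence/absence features
X = vectorizer.fit_transform(docs)
clf = BernoulliNB().fit(X, labels)
print clf.predict(vectorizer.transform(['kill bomb tv']))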

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and of rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks 11.

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python and sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                    Positive               Negative
Predicted Positive  True Positive (TP)     False Positive (FP)
          Negative  False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall    = 562 / (562 + 225) = 0.71
Negative Recall    = 436 / (77 + 436)  = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                    Actual
                    Positive   Negative   Total
Predicted Positive    562         77       639
          Negative    225        436       661
Total                 787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining
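The evaluation numbers above can be recomputed from the confusion matrix with a small plain-Python sketch; the printed values agree with the worked examples up to rounding.

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436        # confusion matrix from the slides

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
error     = (FP + FN) / (TP + FP + FN + TN)
fallout   = FP / (FP + TN)
f1        = 2 * precision * recall / (precision + recall)

print "accuracy  %.2f" % accuracy
print "precision %.2f" % precision
print "recall    %.2f" % recall
print "error     %.2f" % error
print "fall-out  %.2f" % fallout
print "F1        %.2f" % f1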

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / √( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining
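A quick check of the MCC value from the same confusion matrix; a small sketch in plain Python:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)   # 0.55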

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis, and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
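scikit-learn can compute the ROC curve and AUC from true labels and predicted scores. A minimal hedged sketch; the labels and scores below are made-up illustrative values, not output from the classifiers built earlier.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]                        # hypothetical gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.8, 0.2, 0.7, 0.3]   # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC =", auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.show()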

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln."
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 21: Elements of Text Mining Part - I

Lexical Dispersion Plot

A lexical dispersion plot shows position of a word in given text I have created somePython code to generate lexical dispersion plot with reference to a given word list3

def dispersion_plot(textwords)

wordst = tokenize(text)

points = [(xy) for x in range(len(wordst))

for y in range(len(words)) if wordst[x] == words[y]]

if points

xy = zip(points)

else

x = y = ()

pltplot(xygoscalex=2)

pltyticks(range(len(words))wordscolor=b)

pltylim(-1len(words))

plttitle(Lexical Dispersion Plot)

pltxlabel(Word Offset)

pltshow()

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

dispersion_plot(gpl[rsquosoftwarersquorsquolicensersquorsquocopyrsquorsquognursquorsquoprogramrsquorsquofreersquo])

3Code taken fromhttpnltkgooglecodecomsvntrunkdocapinltkdrawdispersion-pysrchtml and modified

Jaganadh G Elements of Text Mining

Lexical Dispersion Plot

Lexical Dispersion Plot plot from GPL text

Jaganadh G Elements of Text Mining

Tag Cloud

A tag cloud (word cloud or weighted list in visual design) is a visual representation for text datatypically used to depict keyword meta-data (tags) on websites or to visualize free form textThere is an interesting python tool to generate tag clouds from text called pytagcloud 4 herecomes an example of creating tag cloud from first 100 words from GPL text

from pytagcloud import create_tag_image make_tags

from pytagcloudlangcounter import get_tag_counts

def create_tag_cloud(text)

words = tokenize(text)

doc = join(d for d in words[100])

tags = make_tags(get_tag_counts(doc) maxsize=80)

create_tag_image(tags rsquogplpngrsquo size=(900 600)

fontname=rsquoPhilosopherrsquo)

gpl = open(rsquogpl-20txtrsquorsquorrsquo)read()

create_tag_cloud(gpl)

4httpsgithubcomatizoPyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) \, P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) \, C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) \, P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) \, C(tea)}

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) \, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) \, C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) \, C(tea)}{\sqrt{C(strong, tea)}}

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, \ldots, c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, \ldots, d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier or a classifier function \gamma that maps documents to classes: \gamma : D \rightarrow C
A supervised learning algorithm \Gamma takes a training set T and emits a learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}
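
Read as code, \gamma is just an indicator over (class, document) pairs. A hypothetical Python sketch, not from the original slides; 'learned_classifier' is a stand-in for whatever function D -> C a learning algorithm produces.

def learned_classifier(document):
    # stub classifier, for illustration only
    return 'terrorism' if 'bomb' in document.lower() else 'entertainment'

def gamma(c_i, d_j):
    return 1 if learned_classifier(d_j) == c_i else 0

print(gamma('terrorism', 'A bomb scare was reported'))      # 1
print(gamma('entertainment', 'A bomb scare was reported'))  # 0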

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have
P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:
P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (the vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
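
The full set of smoothed estimates can be reproduced with a short script. This is a sketch rather than the original author's code; the counts are taken from the training table above. The Terrorism row reproduces the slide's numbers exactly (the Entertainment counts in the table sum to 14 rather than the 12 used on the slides, so those values come out slightly different).

from __future__ import division

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}

for c, docs in train.items():
    n_c = sum(sum(doc) for doc in docs)           # total word count in the class
    for i, w in enumerate(vocab):
        count_w = sum(doc[i] for doc in docs)     # frequency of w in the class
        p = (count_w + 1) / (n_c + len(vocab))    # Laplace-smoothed P(w|c)
        print('P(%s|%s) = %.4f' % (w, c, p))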

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc  kill  bomb  kidnap  music  movie  tv  C
Dt        2     1     2       0      0      1   ?

To find the posterior probability:

P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism) \times P(bomb|Terrorism) \times P(kidnap|Terrorism) \times P(music|Terrorism) \times P(movie|Terrorism) \times P(tv|Terrorism)

= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1

= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment) \times P(bomb|Entertainment) \times P(kidnap|Entertainment) \times P(music|Entertainment) \times P(movie|Entertainment) \times P(tv|Entertainment)

= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1

= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437
P(Entertainment|W) = log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
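
The same computation in log space is a one-liner per class. Here is a sketch (assuming base-10 logarithms, as in the worked example above); summing logs avoids multiplying many tiny probabilities.

import math

# Laplace-smoothed P(w|class) from the slides, in the order
# kill, bomb, kidnap, music, movie, tv
p_terrorism     = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
prior = 0.5
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt

def log_score(prior, cond_probs, counts):
    # log P(c) + sum_j count_j * log P(w_j|c)
    return math.log10(prior) + sum(n * math.log10(p)
                                   for n, p in zip(counts, cond_probs))

print('Terrorism:     %.4f' % log_score(prior, p_terrorism, test_doc))
print('Entertainment: %.4f' % log_score(prior, p_entertainment, test_doc))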

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t \in \{1, \ldots, |V|\}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_{it}, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
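
The binarization itself is a one-liner; a small sketch (not from the slides) converting the count table used earlier into this presence/absence representation:

counts = [[2, 1, 3, 0, 0, 1],
          [1, 1, 1, 0, 0, 0],
          [1, 1, 2, 0, 1, 0],
          [0, 1, 0, 2, 1, 1],
          [0, 0, 1, 1, 1, 0],
          [0, 0, 0, 2, 2, 2]]

# 1 if the word occurs at least once in the document, else 0
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print(row)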

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set \{(x_i, y_i)\}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in \{1, -1\} is the corresponding label, either (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
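
A minimal numeric sketch of the decision rule (the weight vector and bias below are made-up values, purely for illustration):

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias

def predict(x):
    # the sign of <w, x> + b decides on which side of the hyperplane x falls
    return +1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.5])))   # +1
print(predict(np.array([0.0, 1.0])))   # -1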

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way. If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84
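
These numbers are easy to verify with a few lines of Python. A small sketch (not from the slides) using the confusion-matrix counts from the table; the printed values agree with the slides up to rounding.

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)
pos_precision = TP / (TP + FP)
neg_precision = TN / (TN + FN)
pos_recall = TP / (TP + FN)
neg_recall = TN / (TN + FP)

print('Accuracy           : %.4f' % accuracy)
print('Positive precision : %.4f' % pos_precision)
print('Negative precision : %.4f' % neg_precision)
print('Positive recall    : %.4f' % pos_recall)
print('Negative recall    : %.4f' % neg_recall)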

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Error Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = \frac{FP}{FP + TN}

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score (Negative) = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
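
In practice these scores are rarely computed by hand; scikit-learn's metrics module covers most of them. A small sketch (y_true and y_pred are toy label lists, not data from the slides):

from sklearn.metrics import (confusion_matrix, precision_score,
    recall_score, f1_score, matthews_corrcoef)

# toy gold labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
print('Precision: %.2f' % precision_score(y_true, y_pred))
print('Recall   : %.2f' % recall_score(y_true, y_pred))
print('F1 score : %.2f' % f1_score(y_true, y_pred))
print('MCC      : %.2f' % matthews_corrcoef(y_true, y_pred))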

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
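
A small sketch with scikit-learn (the labels and scores below are made up, purely for illustration): roc_curve returns the false and true positive rates at each score threshold, and auc integrates the resulting curve.

from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC = %.2f' % auc(fpr, tpr))   # 0.75 for this toy example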

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008

Jaganadh G Elements of Text Mining


classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 23: Elements of Text Mining Part - I

Tag Cloud

A tag cloud (word cloud, or weighted list in visual design) is a visual representation of text data, typically used to depict keyword meta-data (tags) on websites or to visualize free-form text. There is an interesting Python tool to generate tag clouds from text called pytagcloud 4. Here comes an example of creating a tag cloud from the first 100 words of the GPL text.

from pytagcloud import create_tag_image, make_tags

from pytagcloud.lang.counter import get_tag_counts

def create_tag_cloud(text):
    words = tokenize(text)
    doc = " ".join(d for d in words[:100])
    tags = make_tags(get_tag_counts(doc), maxsize=80)
    create_tag_image(tags, 'gpl.png', size=(900, 600),
        fontname='Philosopher')

gpl = open('gpl-2.0.txt', 'r').read()

create_tag_cloud(gpl)

4 https://github.com/atizo/PyTagCloud

Jaganadh G Elements of Text Mining

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicons, ontologies, etc. In general it aims to find similarities between word pairs.

A word co-occurrence matrix is a square N x N matrix, where N corresponds to the total number of unique words in a corpus. A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context, a natural unit such as a sentence or a certain window of m words. Note that the upper and lower triangles of the matrix are identical, since co-occurrence is a symmetric relation. a

a Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Jaganadh G Elements of Text Mining

Word co-occurrence

      w1    w2    w3    ...   wn
w1    m11   m12   m13   ...   m1n
w2    m21   m22   m23   ...   m2n
w3    m31   m32   m33   ...   m3n
...
wn    mn1   mn2   mn3   ...   mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure] Word co-occurrence visualization from 100 positive movie reviews (words such as film, movie, story, like, one, chan, hong, czech, sverak, woody, dating, tango, martial, toy, zdenek, julie, first). The plot shows the top first word and its top four associated words; for each associated word, its own associated words are plotted in turn.

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In linguistics these words are called function words. Words like 'a', 'an', and 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process, because the frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

each entry of the vectors refers to the count of
the corresponding entry in the dictionary a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) = 0 if t is 'football'
         1 if t is 'games'
         2 if t is 'john'
         3 if t is 'likes'
         4 if t is 'mary'
         5 if t is 'movies'
         6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

# min_n comes from the older scikit-learn API used in this deck;
# recent versions use ngram_range=(1, 1) instead
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')

docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')

bow = vectorizer.fit_transform(docs)

print vectorizer.vocabulary_

print bow.toarray()

5 http://scikit-learn.org

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_index = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_index.keys()]
        for tokens in token_list]
    print vocab_index
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.")

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) x idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = sum over x in d of fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present (i.e. tf(t, d) != 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = sum over x in d of fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
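A minimal pure-Python sketch of these two definitions (the function names tf and fr are illustrative, not taken from the deck's code):

def fr(x, t):
    # indicator function: 1 if token x equals term t, else 0
    return 1 if x == t else 0

def tf(t, d):
    # term frequency: sum the indicator over all tokens of the document
    return sum(fr(x, t) for x in d)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2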

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to

V(d1) = (0, 0, 1, 2, 1, 1, 1)

V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| x F}

where |D| is the cardinality of the document space:

M_{|D| x F} = [ 0 0 1 2 1 1 1 ]
              [ 1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v_hat = v / ||v||_p, where v_hat is the unit (normalized) vector, v is the vector to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue space). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of vector: ||v|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

In general, for the p-norm: ||v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be written more compactly as

||v||_p = ( sum for i = 1 to n of |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, with p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v(d1) = (0, 0, 1, 2, 1, 1, 1)

v_hat(d1) = v(d1) / ||v(d1)||_2

= (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

= (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

= (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

= (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_hat(d1) has L2-norm ||v_hat(d1)||_2 = 1.0
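The same normalization can be checked with a short NumPy sketch (the vector and the expected output come from the slide above; this is just an illustration):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
l2 = np.sqrt((v_d1 ** 2).sum())    # Euclidean (L2) norm, sqrt(8)
v_hat = v_d1 / l2                  # normalized vector
print v_hat                        # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print np.sqrt((v_hat ** 2).sum())  # 1.0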

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t in d}|) )

where |{d : t in d}| is the number of documents in which 't' is present (i.e. tf(t, d) != 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i in d}|) ) = log(2/1) = 0.69314718

idf(football) = log(2/(1+1)) = 0.0

idf(games)    = log(2/(1+1)) = 0.0

idf(john)     = log(2/(1+2)) = -0.40546510810816444

idf(likes)    = log(2/(1+2)) = -0.40546510810816444

idf(mary)     = log(2/(1+1)) = 0.0

idf(movies)   = log(2/(1+1)) = 0.0

idf(watch)    = log(2/(1+1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} x M_idf

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

x

[ idf(t1)     0        0        0        0        0        0     ]
[    0     idf(t2)     0        0        0        0        0     ]
[    0        0     idf(t3)     0        0        0        0     ]
[    0        0        0     idf(t4)     0        0        0     ]
[    0        0        0        0     idf(t5)     0        0     ]
[    0        0        0        0        0     idf(t6)     0     ]
[    0        0        0        0        0        0     idf(t7)  ]

Jaganadh G Elements of Text Mining


TF-IDF weight

The product gives the tf-idf weight matrix:

[ tf(t1,d1)*idf(t1)  tf(t2,d1)*idf(t2)  tf(t3,d1)*idf(t3)  tf(t4,d1)*idf(t4)  tf(t5,d1)*idf(t5)  tf(t6,d1)*idf(t6)  tf(t7,d1)*idf(t7) ]
[ tf(t1,d2)*idf(t1)  tf(t2,d2)*idf(t2)  tf(t3,d2)*idf(t3)  tf(t4,d2)*idf(t4)  tf(t5,d2)*idf(t5)  tf(t6,d2)*idf(t6)  tf(t7,d2)*idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2   (each document row is divided by its L2 norm)
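Putting the last few slides together, here is a small NumPy sketch (illustrative only; the idf values are the ones computed in the "Finding IDF" slide) that builds the tf matrix for d1 and d2, applies the idf weights and L2-normalizes each row:

import numpy as np

tf = np.array([[0, 0, 1, 2, 1, 1, 1],     # d1
               [1, 1, 1, 1, 0, 0, 1]],    # d2
              dtype=float)
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])
tfidf = tf * idf                           # same result as M_tf x diag(idf)
norms = np.sqrt((tfidf ** 2).sum(axis=1, keepdims=True))
tfidf_l2 = tfidf / norms                   # each row now has unit L2 norm
print tfidf_l2

Note that production implementations such as scikit-learn use a smoothed, non-negative variant of idf, so their numbers will differ from this toy calculation.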

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')

docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')

bow = vectorizer.fit_transform(docs)

freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")

tfd = tfidf.fit(freq_term_matrix)

print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)

print tf_idf_matrix.todense()

for word, index in vectorizer.vocabulary_.items():
    print '%r => %r' % (word, tfd.idf_[index])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters; however, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) is approximated by P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"

words = d1.lower().split()

ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]

bigrams = [" ".join(bigram) for bigram in ibigrams]

print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) is approximated by P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"

words = d1.lower().split()

itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]

trigrams = [" ".join(trigram) for trigram in itrigrams]

print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)

trigrams = ngrams(words, n=3)

print bigrams

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) which is approximated by log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) which is approximated by log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo
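For instance, applied to the running example sentence (an illustrative usage of the function above; output omitted):

words = "john likes to watch movies mary likes too".split()
scores = mutual_info(words)
for gram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print gram, score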

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(w_i, w_j) = ( mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j)) ) / sqrt( var(P(w_i, w_j)) + var(P(w_i)) var(P(w_j)) )

which is approximated by ( C(w_i, w_j) - (1/N) C(w_i) C(w_j) ) / sqrt( C(w_i, w_j) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents, or parts of documents, are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents <d, c>, where <d, c> belongs to D x C, using a learning algorithm we wish to learn a classifier (or classifier function) gamma that maps documents to classes: gamma : D -> C.
A supervised learning algorithm Gamma takes a training set T and emits the learned classification function gamma: Gamma(T) = gamma

gamma(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) x P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have

P(raining | dark cloud) = P(dark cloud | raining) x P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) x P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as:

P(H | E1, E2, ..., En) = P(E1|H) x P(E2|H) x ... x P(En|H) x P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| = 6

C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15

P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   (class to be predicted)

To find the posterior probability:

P(ci|W) = P(ci) x product for j = 1 to |V| of P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) x P(kill|Terrorism) x P(bomb|Terrorism) x P(kidnap|Terrorism) x P(music|Terrorism) x P(movie|Terrorism) x P(tv|Terrorism)

= 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1

= 0.5 x 0.0566 x 0.1904 x 0.1110 x 1 x 1 x 0.0952
= 5.7 x 10^-5

P(Entertainment|W) = P(Entertainment) x P(kill|Entertainment) x P(bomb|Entertainment) x P(kidnap|Entertainment) x P(music|Entertainment) x P(movie|Entertainment) x P(tv|Entertainment)

= 0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1

= 0.5 x 0.0030 x 0.1111 x 0.0123 x 1 x 1 x 0.1111
= 2.27 x 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 x 0.0555^2 x 0.1111^1 x 0.1111^2 x 0.3333^0 x 0.2777^0 x 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
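A small Python sketch of this log-space scoring, using the smoothed conditional probabilities and base-10 logarithms from the worked example (the numbers are taken from the slides, so this is an illustration rather than a general-purpose implementation):

import math

# P(w|class) after Laplace smoothing, from the worked example above
cond_prob = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(cls):
    # log P(c) + sum over words of count(w) * log P(w|c):
    # sums of logs instead of products of tiny probabilities, so no underflow
    score = math.log10(prior[cls])
    for word, count in test_doc.items():
        score += count * math.log10(cond_prob[cls][word])
    return score

for cls in prior:
    print cls, round(log_score(cls), 4)   # roughly -4.24 and -6.63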

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t in {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word. A runnable sketch using this representation follows below.
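If you want to try this event model in code, scikit-learn's BernoulliNB works directly on such a binary term matrix. The sketch below is illustrative: the matrix is typed in by hand to mirror the table above.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# columns: kill, bomb, kidnap, music, movie, tv  (presence/absence only)
X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']

model = BernoulliNB(binarize=None)   # input is already 0/1
model.fit(X, y)
print model.predict([[1, 1, 1, 0, 0, 1]])   # -> ['Terrorism']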

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag-of-words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}, i = 1, ..., n, where x_i in R^p (input feature vector) and y_i in {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = <w, x> + b, where w in R^p (weight vector) and b in R (bias), such that

<w, x_i> + b > 0 for y_i = +1
<w, x_i> + b < 0 for y_i = -1

<w, x> + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
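A tiny NumPy sketch of this decision rule (w, b and the points are made up for illustration; they are not learned from data here):

import numpy as np

w = np.array([1.0, -1.0])   # weight vector (assumed, not learned)
b = -0.5                    # bias (assumed)

def predict(x):
    # the sign of <w, x> + b decides which side of the hyperplane x falls on
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))   # +1 side
print predict(np.array([0.0, 1.0]))   # -1 side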

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, defines the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']

training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline

from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']

training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12

import nltk.classify.util

from nltk.classify import NaiveBayesClassifier

from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')

posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]

posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4

poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"

words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                    Positive               Negative
Predicted Positive  True Positive (TP)     False Positive (FP)
          Negative  False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436)

= 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Positive Precision = 562 / (562 + 77)

= 0.87

Negative Precision = 436 / (225 + 436)

= 0.65

Positive Recall = 562 / (562 + 225)

= 0.71

Negative Recall = 436 / (77 + 436)

= 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                    Actual
                    Positive  Negative  Total
Predicted Positive  562       77        639
          Negative  225       436       661
Total               787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0. a

F1 Score = 2 x (precision x recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 x (0.87 x 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 x (0.65 x 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = (TP x TN - FP x FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 x 436 - 77 x 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) )

= 0.55
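All of the metrics above can be reproduced from the four confusion-matrix cells. A short sketch using the numbers from these slides (variable names are illustrative):

import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / float(TP + FP + FN + TN)
precision   = TP / float(TP + FP)            # positive predictive value
recall      = TP / float(TP + FN)            # sensitivity / hit rate
f1          = 2 * precision * recall / (precision + recall)
fallout     = FP / float(FP + TN)            # false positive rate
npv         = TN / float(TN + FN)
specificity = TN / float(FP + TN)
fdr         = FP / float(FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))

print accuracy, precision, recall, f1
print fallout, npv, specificity, fdr, mcc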

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
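With scikit-learn, the ROC curve and AUC can be computed from true labels and predicted scores. A minimal, self-contained sketch (the labels and scores below are made up for illustration):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                        # actual classes
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # x = FPR, y = TPR
print auc(fpr, tpr)                                # area under the ROC curve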

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories, such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()

text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 24: Elements of Text Mining Part - I

Tag Cloud

Tag cloud from GPL text

Jaganadh G Elements of Text Mining

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    # Remove common English stop words from a list of tokens
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(dictionary built after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =
  0 if t is 'football'
  1 if t is 'games'
  2 if t is 'john'
  3 if t is 'likes'
  4 if t is 'mary'
  5 if t is 'movies'
  6 if t is 'watch'

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python. Just for a sample :-(

def bag_of_words(docs):
    # tokenize() is the function defined on the Tokenization slide
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present and tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple indicator function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
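As a quick illustration, here is a minimal Python sketch of this indicator-style term frequency (the tokenized document d1 is the example used throughout):

def tf(term, doc_tokens):
    # tf(t, d): how many tokens x in d satisfy x == t
    return sum(1 for token in doc_tokens if token == term)

d1 = "john likes to watch movies mary likes too".split()
print tf('john', d1)   # 1
print tf('likes', d1)  # 2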

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

Vd_n = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

Vd1 = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
Vd2 = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

Vd1 = (0, 0, 1, 2, 1, 1, 1)
Vd2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D| x F}, where |D| is the cardinality of the document space:

M_{|D| x F} =
[ 0 0 1 2 1 1 1
  1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v⃗ / ||v⃗||_p, where v̂ is the unit (normalized) vector, v⃗ is the vector to be normalized, and ||v⃗||_p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of the vector: ||v⃗|| = sqrt(v1^2 + v2^2 + v3^2 + ... + vn^2)

With the p-norm: ||v⃗||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||v⃗||_p = ( Σ_{i=1..n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v⃗d1 = (0, 0, 1, 2, 1, 1, 1)

v̂d1 = v⃗d1 / ||v⃗d1||_2
    = (0, 0, 1, 2, 1, 1, 1) / sqrt(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)
    = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
    = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))
    = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v̂d1 has an L2 norm ||v̂d1||_2 = 1.0.
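The same calculation in a minimal Python sketch (plain Python, no external libraries):

import math

def l2_normalize(vec):
    # Divide each component by the Euclidean (L2) length of the vector
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

v_d1 = [0, 0, 1, 2, 1, 1, 1]
print l2_normalize(v_d1)
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]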

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents in which 't' is present and tf(t, d) ≠ 0.

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) )

For example, log(2 / 1) = 0.69314718

idf(football) = log( 2 / (1 + 1) ) = 0.0
idf(games)    = log( 2 / (1 + 1) ) = 0.0
idf(john)     = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1 + 1) ) = 0.0
idf(movies)   = log( 2 / (1 + 1) ) = 0.0
idf(watch)    = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
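These values can be reproduced with a few lines of Python (a minimal sketch; the tokenized documents follow the running example, with stop words already removed):

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    # Document frequency: number of documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1.0 + df))

for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print term, idf(term, docs)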

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| x F} × M_idf

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1)
  tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]
x
[ idf(t1) 0       0       0       0       0       0
  0       idf(t2) 0       0       0       0       0
  0       0       idf(t3) 0       0       0       0
  0       0       0       idf(t4) 0       0       0
  0       0       0       0       idf(t5) 0       0
  0       0       0       0       0       idf(t6) 0
  0       0       0       0       0       0       idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7)
  tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for word, idx in vectorizer.vocabulary_.items():
    # vocabulary_ maps each word to its column index in idf_
    print '%r => %r' % (word, tfd.idf_[idx])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    # Generate n-grams (as space-joined strings) from a list of words
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    # Pointwise mutual information for each bigram in the word list
    grams = ngrams(words, n=2)  # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo
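A quick usage sketch (a made-up toy sentence; it assumes the ngrams and mutual_info functions above are in scope):

words = "she likes strong tea he prefers strong coffee".lower().split()
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True):
    print gram, mi[gram]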

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) - mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) - (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

Example:
t(strong, tea) = ( C(strong, tea) - (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    # t-score for each bigram in the word list
    grams = ngrams(words, n=2)  # ngrams function from the previous slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[w1] * wordcount[w2]) / math.sqrt(gramcount[gram])
    return tsc
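Usage mirrors the mutual information example (again a toy sentence, assuming ngrams and tscore are defined as above):

words = "John likes to watch movies Mary likes too".lower().split()
ts = tscore(words)
for gram, score in ts.items():
    print gram, score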

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification is a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule:

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv)  (vocabulary)
The pre-processed documents for training will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              2     1     3       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     2       0      1      0    Terrorism
D3              0     1     0       2      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       2      2      2    Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class

Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply the Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
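The same smoothed estimates in a minimal Python sketch (counts and totals taken from the training table and the slide above):

from __future__ import division

V = 6                 # vocabulary size
n_terrorism = 15      # total word count in the Terrorism class
n_entertainment = 12  # total word count in the Entertainment class, as used above

count_kill_T = 2 + 1 + 1   # occurrences of 'kill' in D0, D1, D2
count_kill_E = 0 + 0 + 0   # occurrences of 'kill' in D3, D4, D5

# Laplace (add-one) smoothing: P(w|c) = (count(w, c) + 1) / (n_c + |V|)
print (count_kill_T + 1) / (n_terrorism + V)       # 5/21 = 0.238...
print (count_kill_E + 1) / (n_entertainment + V)   # 1/18 = 0.0555...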

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1..|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in the document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs   kill  bomb  kidnap  music  movie  tv   C
D0              1     1     1       0      0      1    Terrorism
D1              1     1     1       0      0      0    Terrorism
D2              1     1     1       0      1      0    Terrorism
D3              0     1     0       1      1      1    Entertainment
D4              0     0     1       1      1      0    Entertainment
D5              0     0     0       1      1      1    Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict returns an array; take the first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM classifier with Python sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
# predict returns an array; take the first element to index target_names
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a multi-variate Naïve Bayes classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Boolean (presence/absence) feature dictionary
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive  Negative  Total
Predicted  Positive    562       77        639
           Negative    225       436       661
           Total       787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 - 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
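All of the metrics above can be reproduced from the example confusion matrix with a short Python sketch (plain Python; the numbers come from the table used throughout this section):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)    # 0.76
precision   = TP / (TP + FP)                     # 0.87
recall      = TP / (TP + FN)                     # 0.71
f1          = 2 * precision * recall / (precision + recall)  # 0.78
error_rate  = (FP + FN) / (TP + FP + FN + TN)    # 0.23
fallout     = FP / (FP + TN)                     # 0.15
npv         = TN / (TN + FN)                     # 0.65
specificity = TN / (FP + TN)                     # 0.84
fdr         = FP / (FP + TP)                     # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55

print accuracy, precision, recall, f1, mcc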

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
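A minimal sketch of computing an ROC curve and its AUC with scikit-learn (the labels and scores below are made-up toy values; roc_curve and auc come from sklearn.metrics):

from sklearn.metrics import roc_curve, auc

y_true   = [1, 1, 0, 1, 0, 0, 1, 0]                     # actual labels
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # classifier scores for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print auc(fpr, tpr)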

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    # Collect named entities from all sentences in the text
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 25: Elements of Text Mining Part - I

Word co-occurrence

Word co-occurrence analysis is used to construct lexicon ontologies etc In general itaims to find similarities between word pairs

A word co-occurrence matrix is a square of N timesN matrix where N corresponds totalnumber of unique words in a corpus A cell mij contains the number of times mi

co-occur with word mj with in a specific contextmdash a natural unit such as a sentence or acertain window of m words Note that the upper and lower triangles of the matrix areidentical since co-occurrence is a symmetric relation a

aJimmy LinScalable Language Processing Algorithms for the Masses A Case Study inComputing Word Co-occurrence Matrices with MapReducewwwaclweborganthologyD08-1044

Jaganadh G Elements of Text Mining

Word co-occourance

w1 w2 w3 wnw1 m11 m12 m13 m1n

w2 m21 m22 m23 m2n

w3 m31 m32 m33 m3n

wn mn1 mn2 mn3 mnnShape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

def cooccurrence_matrix_corpus(corpus)

matrix = defaultdict(lambda defaultdict(int))

for corpora in corpus

for i in xrange(len(corpora)-1)

for j in xrange(i+1 len(corpora))

word1 word2 = [corpora[i]corpora[j]]

matrix[word1][word2] += 1

matrix[word2][word1] += 1

return matrix

corpus = [[rsquow1rsquorsquow2rsquorsquow3rsquo][rsquow4rsquorsquow5rsquorsquow6rsquo]]

ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier or a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ.

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

$P(raining|dark\ cloud) = \frac{P(dark\ cloud|raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs  kill  bomb  kidnap  music  movie  tv  C
D0             2     1     3       0      0      1   Terrorism
D1             1     1     1       0      0      0   Terrorism
D2             1     1     2       0      1      0   Terrorism
D3             0     1     0       2      1      1   Entertainment
D4             0     0     1       1      1      0   Entertainment
D5             0     0     0       2      2      2   Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
6    E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
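The same Laplace estimate can be written as a small Python sketch (the word counts and ni values are taken from the slides above; the function and variable names are only illustrative):

from __future__ import division

V = 6  # vocabulary size
class_totals = {'Terrorism': 15, 'Entertainment': 12}  # ni values from the model table
kill_counts = {'Terrorism': 4, 'Entertainment': 0}     # raw counts of 'kill' per class

def laplace(word_count, cls):
    return (word_count + 1) / (class_totals[cls] + V)

print laplace(kill_counts['Terrorism'], 'Terrorism')          # 5/21 = 0.2380...
print laplace(kill_counts['Entertainment'], 'Entertainment')  # 1/18 = 0.0555...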

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
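The same log-space scoring can be sketched in a few lines of Python (base-10 logarithms and the Laplace-smoothed probabilities from the model table; the helper name is only illustrative):

import math

def log_score(prior, cond_probs, counts):
    # counts[i] = how many times the i-th vocabulary word occurs in the test document
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(cond_probs, counts))

test_counts = [2, 1, 2, 0, 0, 1]  # kill, bomb, kidnap, music, movie, tv
print log_score(0.5, [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952], test_counts)  # ~ -4.24
print log_score(0.5, [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111], test_counts)  # ~ -6.63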

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
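With scikit-learn the same presence/absence representation can be obtained by passing binary=True to the vectorizer (a minimal sketch; the two documents are made-up stand-ins for the training table):

from sklearn.feature_extraction.text import CountVectorizer

docs = ('kill bomb kidnap kidnap kidnap tv kill',  # illustrative Terrorism-style doc
        'music movie tv music music')              # illustrative Entertainment-style doc
vectorizer = CountVectorizer(analyzer='word', binary=True)  # 1 if the word occurs, else 0
print vectorizer.fit_transform(docs).toarray()
print vectorizer.vocabulary_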

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ Rp (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, exactly equidistant from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

           Actual
           Positive             Negative
Predicted
Positive   True Positive (TP)   False Positive (FP)
Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
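These values can be reproduced with a few lines of Python (a small sketch using the confusion-matrix counts above):

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436  # counts from the confusion matrix above

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.76
pos_precision = tp / (tp + fp)              # 0.87
neg_precision = tn / (fn + tn)              # 0.65
pos_recall = tp / (tp + fn)                 # 0.71
neg_recall = tn / (fp + tn)                 # 0.84
print accuracy, pos_precision, neg_precision, pos_recall, neg_recall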

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

           Actual
           Positive  Negative  Total
Predicted
Positive   562       77        639
Negative   225       436       661
Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
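With scikit-learn, the ROC curve and AUC can be computed from the true labels and the classifier scores (a minimal sketch; y_true and y_score are small made-up arrays used only to show the calls):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels (illustrative)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # predicted scores (illustrative)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)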

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # tree.label() in newer NLTK versions
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 26: Elements of Text Mining Part - I

Word co-occurrence

     w1   w2   w3   ...  wn
w1   m11  m12  m13  ...  m1n
w2   m21  m22  m23  ...  m2n
w3   m31  m32  m33  ...  m3n
...
wn   mn1  mn2  mn3  ...  mnn

Shape of a co-occurrence matrix

Jaganadh G Elements of Text Mining

Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = corpora[i], corpora[j]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining

Word co-occurrence

[Figure: word co-occurrence visualization from 100 positive movie reviews, with nodes such as "film", "movie", "story", "like" and "chan". The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn.]

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words available; different applications and research groups use different sets of stop words.
Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
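An equivalent filter can be built from NLTK's ready-made stop word list instead of maintaining the list by hand (a small sketch; it assumes the NLTK stopwords corpus has been downloaded):

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
words = "john likes to watch movies mary likes too".split()
print [word for word in words if word not in stops]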

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1, 'likes': 3, 'john': 2, 'mary': 4}
(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1: John likes to watch movies. Mary likes too.
d2: John also likes to watch football games.

Vocabulary Index

$VI(t) = \begin{cases} 0 & \text{if } t \text{ is 'football'} \\ 1 & \text{if } t \text{ is 'games'} \\ 2 & \text{if } t \text{ is 'john'} \\ 3 & \text{if } t \text{ is 'likes'} \\ 4 & \text{if } t \text{ is 'mary'} \\ 5 & \text{if } t \text{ is 'movies'} \\ 6 & \text{if } t \text{ is 'watch'} \end{cases}$

Jaganadh G Elements of Text Mining

Bag of Words

        football  games  john  likes  mary  movies  watch
doc1    0         0      1     2      1     1       1
doc2    1         1      1     1      0     0       1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')  # older scikit-learn; newer versions use ngram_range=(1, 1)
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python (just a toy sample)

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from the Tokenization slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\textrm{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd' and $tf(t, d)$ is how many times the term 't' is present in 'd':

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = 1$ if $x = t$, and $0$ otherwise.

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present and $tf(t, d) \neq 0$, and $|D|$ is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple indicator function:

$fr(x, t) = 1$ if $x = t$, and $0$ otherwise.

Example: $tf('john', d1) = 1$

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

$\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors:

$\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$

$\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to:

$\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix $M_{|D| \times F}$, where $|D|$ is the cardinality of the document space:

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit (normalized) vector, $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of the vector: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

The general p-norm is $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$,

which can be simplified as

$\|\vec{v}\|_p = \left( \sum_{i=1}^{n} |\vec{v}_i|^p \right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d_1} = \frac{\vec{v}_{d_1}}{\|\vec{v}_{d_1}\|_2}$

$\hat{v}_{d_1} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}} = \frac{(0, 0, 1, 2, 1, 1, 1)}{\sqrt{8}}$

$\hat{v}_{d_1} = \left( \frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}} \right)$

$\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$

Now our normalized vector $\hat{v}_{d_1}$ has an L2 norm of $\|\hat{v}_{d_1}\|_2 = 1.0$.
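The same normalization takes a couple of lines with numpy (a small sketch; numpy is not used elsewhere in these slides):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)  # divide by the L2 norm, sqrt(8)
print v_hat                          # [0. 0. 0.3535... 0.7071... 0.3535... 0.3535... 0.3535...]
print np.linalg.norm(v_hat)          # 1.0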

Jaganadh G Elements of Text Mining

IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents in which 't' is present and $tf(t, d) \neq 0$.

Jaganadh G Elements of Text Mining

Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

idf(football) = log(2 / (1 + 1)) = 0.0
idf(games)    = log(2 / (1 + 1)) = 0.0
idf(john)     = log(2 / (1 + 2)) = -0.40546510810816444
idf(likes)    = log(2 / (1 + 2)) = -0.40546510810816444
idf(mary)     = log(2 / (1 + 1)) = 0.0
idf(movies)   = log(2 / (1 + 1)) = 0.0
idf(watch)    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
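A small Python sketch reproducing the idf values above (natural logarithm, as in the slide; the token lists are the two toy documents with stop words removed):

from __future__ import division
import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return math.log(len(docs) / (1 + df))

for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    print term, idf(term, docs)  # john and likes get -0.4054..., the rest 0.0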

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\textrm{-}idf} = \frac{M_{tf\textrm{-}idf}}{\|M_{tf\textrm{-}idf}\|_2}$

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')  # older scikit-learn; newer versions use ngram_range=(1, 1)
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words.

Bigrams are sequences of two words.

Trigrams are sequences of three words.

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i|w_1, w_2, \ldots, w_{i-1}) \approx P(w_i|w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i|w_1, w_2, \ldots, w_{i-1}) \approx P(w_i|w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
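The metrics above do not have to be computed by hand for every experiment. Below is a minimal sketch of how they can be obtained with scikit-learn's metrics module; the y_true and y_pred arrays are made-up labels for illustration, not the movie review counts from the worked example.

from sklearn.metrics import (confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score, matthews_corrcoef)

# made-up gold labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print confusion_matrix(y_true, y_pred)
print "Accuracy :", accuracy_score(y_true, y_pred)
print "Precision:", precision_score(y_true, y_pred)
print "Recall   :", recall_score(y_true, y_pred)
print "F1 Score :", f1_score(y_true, y_pred)
print "MCC      :", matthews_corrcoef(y_true, y_pred)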

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Word co-occurrence

Finding co-occurrence matrix with Python

from collections import defaultdict

def cooccurrence_matrix_corpus(corpus):
    matrix = defaultdict(lambda: defaultdict(int))
    for corpora in corpus:
        for i in xrange(len(corpora) - 1):
            for j in xrange(i + 1, len(corpora)):
                word1, word2 = [corpora[i], corpora[j]]
                matrix[word1][word2] += 1
                matrix[word2][word1] += 1
    return matrix

corpus = [['w1', 'w2', 'w3'], ['w4', 'w5', 'w6']]
ccm = cooccurrence_matrix_corpus(corpus)

Jaganadh G Elements of Text Mining
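A quick way to check the result is to look up a pair of words from the toy corpus above; the counts below are what the function produces for it (a usage sketch, not part of the original slide).

# every pair inside the same document was seen once
print ccm['w1']['w2']   # 1
print ccm['w2']['w1']   # 1 (the matrix is symmetric)
print ccm['w1']['w4']   # 0 ('w1' and 'w4' never co-occur)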

Word co-occurrence

[Figure: Word co-occurrence visualization from 100 positive movie reviews. The plot shows the top first word and its top four associated words; for each associated word, its associated words are plotted in turn. Words appearing in the plot include: film, movie, like, story, one, first, chan, hong, martial, toy, woody, dating, julie, czech, sverak, zdenek, tango.]

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining
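A short usage example for the filter above; the input tokens are assumed to come from a whitespace split or the tokenize() function shown earlier. NLTK also ships a ready-made English stop word list (nltk.corpus.stopwords) which can be used instead of a hand-written list.

words = "john likes to watch movies and mary likes movies too".split()
print stop_filter(words)
# ['john', 'likes', 'watch', 'movies', 'mary', 'likes', 'movies']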

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}

(vocabulary index, after removing stop words)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of the corresponding entry in the dictionary. a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) =
    0 if t is 'football'
    1 if t is 'games'
    2 if t is 'john'
    3 if t is 'likes'
    4 if t is 'mary'
    5 if t is 'movies'
    6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with plain Python. Just a sample :-(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]  # tokenize() from the earlier slide
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
                       in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies. Mary likes too.",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the number of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining
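The term-frequency definition above can be spelled out in a few lines of plain Python, using the two example documents and the vocabulary from the Bag of Words slides; tf() below is an illustrative helper, not code from the original deck.

d1 = "john likes to watch movies mary likes too".split()
d2 = "john also likes to watch football games".split()
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def tf(term, doc):
    # tf(t, d) = number of times 't' occurs in 'd'
    return doc.count(term)

print [tf(t, d1) for t in vocab]   # [0, 0, 1, 2, 1, 1, 1]
print [tf(t, d2) for t in vocab]   # [1, 1, 1, 1, 0, 0, 1]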

Document Vector

To create a document vector space

V(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

V(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))

V(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:

V(d1) = (0, 0, 1, 2, 1, 1, 1)

V(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M_|D|×F

where |D| is the cardinality of the document space

M_|D|×F = [ 0 0 1 2 1 1 1
            1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v / ||~v||_p, where v is the unit vector (the normalized vector), ~v is the vector going to be normalized, and ||~v||_p is the norm (magnitude or length) of the vector ~v in the Lp space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: ~v = (v1, v2, v3, ..., vn)
Length of vector: ||~v|| = √(v1^2 + v2^2 + v3^2 + ... + vn^2)

With norm: ||~v||_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

||~v||_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 Norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

~v_d1 = (0, 0, 1, 2, 1, 1, 1)

v_d1 = ~v_d1 / ||~v_d1||_2

v_d1 = (0, 0, 1, 2, 1, 1, 1) / √(0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2)

v_d1 = (0, 0, 1, 2, 1, 1, 1) / √8

v_d1 = (0/√8, 0/√8, 1/√8, 2/√8, 1/√8, 1/√8, 1/√8)

v_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector v_d1 has L2-norm ||v_d1||_2 = 1.0

Jaganadh G Elements of Text Mining
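The same normalization can be checked with a few lines of NumPy; this is only a verification sketch, NumPy is not used elsewhere in the slides.

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_norm = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm, sqrt(8)

print v_norm                            # [0. 0. 0.3535... 0.7071... 0.3535... ...]
print np.linalg.norm(v_norm)            # 1.0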

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the number of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0)

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = log( |D| / (1 + |{d : t_i ∈ d}|) )    (for example, log(2/1) = 0.69314718)

idf(football) = log( 2 / (1 + 1) ) = 0.0

idf(games) = log( 2 / (1 + 1) ) = 0.0

idf(john) = log( 2 / (1 + 2) ) = −0.40546510810816444

idf(likes) = log( 2 / (1 + 2) ) = −0.40546510810816444

idf(mary) = log( 2 / (1 + 1) ) = 0.0

idf(movies) = log( 2 / (1 + 1) ) = 0.0

idf(watch) = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, −0.40546510810816444, −0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining
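The idf values above can be reproduced with a few lines of Python; math.log is the natural logarithm, which matches the numbers on the slide, and the doc_count dictionary is just the document frequency of each vocabulary term in d1 and d2 (an illustrative sketch, not code from the original deck).

import math

D = 2  # number of documents
# document frequency of each vocabulary term in (d1, d2)
doc_count = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
             'mary': 1, 'movies': 1, 'watch': 1}

for term in ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']:
    idf = math.log(float(D) / (1 + doc_count[term]))
    print term, idf
# john and likes get log(2/3) = -0.405465..., every other term gets log(2/2) = 0.0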

TF-IDF weight

Finding TF-IDF weight

M_|D|×F × M_idf =

[ tf(t1,d1) tf(t2,d1) tf(t3,d1) tf(t4,d1) tf(t5,d1) tf(t6,d1) tf(t7,d1) ]
[ tf(t1,d2) tf(t2,d2) tf(t3,d2) tf(t4,d2) tf(t5,d2) tf(t6,d2) tf(t7,d2) ]

        ×

[ idf(t1)    0       0       0       0       0       0    ]
[   0     idf(t2)    0       0       0       0       0    ]
[   0        0    idf(t3)    0       0       0       0    ]
[   0        0       0    idf(t4)    0       0       0    ]
[   0        0       0       0    idf(t5)    0       0    ]
[   0        0       0       0       0    idf(t6)    0    ]
[   0        0       0       0       0       0    idf(t7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
                             min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfd.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
# map each vocabulary term (term -> column index) to its idf weight
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining
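As a side note, scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step. The parameter names below follow current scikit-learn releases, so treat this as an approximate sketch if you are on an older version.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games')
tfidf_vec = TfidfVectorizer(stop_words='english', norm='l2', use_idf=True)
tf_idf_matrix = tfidf_vec.fit_transform(docs)
print tfidf_vec.get_feature_names()
print tf_idf_matrix.todense()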

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words) - 2 + 1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i−1}) ≈ P(w_i | w_{i−2}, w_{i−1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words) - 3 + 1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words) - n + 1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(w_i, w_j) = log2( P(w_i, w_j) / (P(w_i) P(w_j)) ) ≈ log2( N C(w_i, w_j) / (C(w_i) C(w_j)) )

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / \
            math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = ( mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ) / √( σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) )

            ≈ ( C(w_i, w_j) − (1/N) C(w_i) C(w_j) ) / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / √C(strong, tea)

Jaganadh G Elements of Text Mining


t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev. slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior (a priori) probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will look like:

|V|   C   P(Ci)   ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)   P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6    T   0.5     15   0.238095238   0.19047619   0.33333333     0.047619048   0.095238095   0.095238095
      E   0.5     12   0.05555556    0.11111111   0.11111111     0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc   kill   bomb   kidnap   music   movie   tv   C
Dt         2      1      2        0       0       1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= –0.3010 – 1.2468 – 0.7203 – 0.9543 + 0 + 0 – 1.0213
= –4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= –0.3010 – 2.511 – 0.9542 – 1.9085 + 0 + 0 – 0.9542
= –6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
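The whole worked example can be reproduced with a short, self-contained Python sketch; the counts below are copied from the training table, and the log-space scoring mirrors the underflow discussion. This is an illustrative re-implementation, not code from the original deck.

import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
# word counts per class, taken from the training table
counts = {
    'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
    'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 1},
}
priors = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_posterior(cls):
    n = sum(counts[cls].values())   # 15 for Terrorism, 12 for Entertainment
    score = math.log10(priors[cls])
    for word, freq in test_doc.items():
        p = (counts[cls][word] + 1.0) / (n + len(vocab))   # Laplace smoothing
        score += freq * math.log10(p)
    return score

for cls in priors:
    print cls, log_posterior(cls)
# Terrorism scores higher (about -4.24 vs -6.63), so the test document is labeled Terrorism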

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining
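For what it is worth, scikit-learn exposes both event models: BernoulliNB corresponds to the multi-variate Bernoulli model and MultinomialNB to the multinomial model. A minimal sketch on the two toy documents follows; the feature matrix is the bag-of-words count matrix from the earlier slides, and the class labels are made up for illustration.

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# rows = documents d1 and d2, columns = vocabulary counts
X = [[0, 0, 1, 2, 1, 1, 1],
     [1, 1, 1, 1, 0, 0, 1]]
y = ['class_a', 'class_b']   # made-up labels

bern = BernoulliNB()    # counts are binarized to presence/absence internally
mult = MultinomialNB()  # works directly on the counts
bern.fit(X, y)
mult.fit(X, y)
print bern.predict([[0, 0, 1, 1, 1, 0, 1]])
print mult.predict([[0, 0, 1, 1, 1, 0, 1]])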

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, whether (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
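A single train/predict run says little about quality; a quick way to estimate the accuracy of the pipeline above is k-fold cross-validation. A rough sketch follows, reusing the classifier and training_data objects from the previous slide; note that the import path is sklearn.model_selection in recent scikit-learn releases, while older releases used sklearn.cross_validation.

from sklearn.model_selection import cross_val_score

# 'classifier' and 'training_data' are the pipeline and corpus loaded above
scores = cross_val_score(classifier, training_data.data,
                         training_data.target, cv=5)
print scores
print "Mean accuracy:", scores.mean()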

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 28: Elements of Text Mining Part - I

Word co-occurrence

hong

toy zdenek

czech

movie

julie

like

chan

story

tango

one

sverak

martial

woody

dating

film

first

Word co-occurrence visualization from 100 positive movie reviews The plot show top first word and top four

associated words For each associated words again associated words are plotted

Jaganadh G Elements of Text Mining

Stop Words

Stop Words

In computing stop words are words which are filtered out prior to or after processing ofnatural language data (text) In terms of linguistics these words are called as functionwords Words like rsquoarsquo rsquoanrsquo rsquothersquo are examples for stop words There is no defined set ofstop words available Different applications and research groups uses different sets o stopwordsGenerally stop words are omitted in text mining process The frequency of stop wordswill be very high in any corpus compared to content wordsPointers to some good stopword list is available at httpenwikipediaorgwikiStop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words)

stops = [rsquoirsquo rsquomersquo rsquomyrsquo rsquomyselfrsquo rsquowersquo rsquoourrsquo rsquooursrsquo rsquoourselvesrsquo rsquoyoursquo

rsquoyourrsquo rsquoyoursrsquo rsquoyourselfrsquo rsquoyourselvesrsquo rsquohersquo rsquohimrsquo rsquohisrsquo rsquohimselfrsquo rsquoshersquo

rsquoherrsquo rsquohersrsquo rsquoherselfrsquo rsquoitrsquo rsquoitsrsquo rsquoitselfrsquorsquotheyrsquo rsquothemrsquo rsquotheirrsquo rsquotheirsrsquo

rsquothemselvesrsquo rsquowhatrsquo rsquowhichrsquo rsquowhorsquo rsquowhomrsquo rsquothisrsquo rsquothatrsquo rsquothesersquo rsquothosersquo

rsquoamrsquo rsquoisrsquo rsquoarersquo rsquowasrsquo rsquowerersquorsquobersquo rsquobeenrsquo rsquobeingrsquo rsquohaversquo rsquohasrsquo rsquohadrsquo

rsquohavingrsquo rsquodorsquo rsquodoesrsquo rsquodidrsquo rsquodoingrsquo rsquoarsquo rsquoanrsquo rsquothersquo rsquoandrsquo rsquobutrsquo rsquoifrsquo rsquoorrsquo rsquobecausersquo

rsquoasrsquo rsquountilrsquo rsquowhilersquo rsquoofrsquo rsquoatrsquo rsquobyrsquo rsquoforrsquorsquowithrsquo rsquoaboutrsquo rsquoagainstrsquo rsquobetweenrsquo rsquointorsquo

rsquothroughrsquo rsquoduringrsquo rsquobeforersquo rsquoafterrsquo rsquoaboversquo rsquobelowrsquo rsquotorsquo rsquofromrsquo rsquouprsquo rsquodownrsquo rsquoinrsquo rsquooutrsquo

rsquoonrsquo rsquooffrsquo rsquooverrsquo rsquounderrsquo rsquoagainrsquo rsquofurtherrsquorsquothenrsquo rsquooncersquo rsquoherersquo rsquotherersquo rsquowhenrsquo rsquowherersquo

rsquowhyrsquo rsquohowrsquo rsquoallrsquo rsquoanyrsquo rsquobothrsquo rsquoeachrsquo rsquofewrsquo rsquomorersquo rsquomostrsquo rsquootherrsquo rsquosomersquo rsquosuchrsquo rsquonorsquo

rsquonorrsquorsquonotrsquo rsquoonlyrsquo rsquoownrsquo rsquosamersquo rsquosorsquo rsquothanrsquo rsquotoorsquo rsquoveryrsquo rsquosrsquo rsquotrsquo rsquocanrsquo rsquowillrsquo rsquojustrsquo

rsquodonrsquo rsquoshouldrsquo rsquonowrsquo]

stopless = [word in words if word not in stops]

return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining
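
If NLTK is available, the same n-grams can also be produced with its helper function nltk.util.ngrams (shown here as an alternative sketch, not something the slides rely on):

from nltk.util import ngrams as nltk_ngrams

words = "John likes to watch movies Mary likes too".lower().split()
print [' '.join(g) for g in nltk_ngrams(words, 2)]   # bigrams
print [' '.join(g) for g in nltk_ngrams(words, 3)]   # trigrams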

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) C(tea)}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining
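
A quick usage sketch, reusing the example sentence from the earlier slides (the expected value for 'john likes' follows directly from the formula: log2(8 x 1 / (1 x 2)) = 2.0):

words = "John likes to watch movies Mary likes too".lower().split()
minfo = mutual_info(words)
print minfo['john likes']   # 2.0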

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining
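
And a matching usage sketch for the t-score (for 'john likes': (1 - (1/8) x 1 x 2) / sqrt(1) = 0.75):

words = "John likes to watch movies Mary likes too".lower().split()
tsc = tscore(words)
print tsc['john likes']   # 0.75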

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(\text{raining} | \text{dark cloud}) = \frac{P(\text{dark cloud} | \text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$
For multiple evidences:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naïve Bayes Model for the training set will be like:

|V| | C | P(Ci) | ni | P(kill|Ci) | P(bomb|Ci) | P(kidnap|Ci) | P(music|Ci) | P(movie|Ci) | P(tv|Ci)
6 | T | 0.5 | 15 | 0.238095238 | 0.19047619 | 0.33333333 | 0.047619048 | 0.095238095 | 0.095238095
6 | E | 0.5 | 12 | 0.055555556 | 0.11111111 | 0.11111111 | 0.33333333 | 0.27777778 | 0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15

P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
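
A small sketch of the same Laplace-smoothed estimates in Python (the per-class word counts are taken from the training table above; the variable and function names are mine):

from __future__ import division

counts = {'Terrorism':     {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1},
          'Entertainment': {'kill': 0, 'bomb': 1, 'kidnap': 1, 'music': 5, 'movie': 4, 'tv': 3}}
V = 6  # vocabulary size

def cond_prob(word, cls):
    n = sum(counts[cls].values())          # 15 for Terrorism, 12 for Entertainment
    return (counts[cls][word] + 1) / (n + V)

print cond_prob('kill', 'Terrorism')       # 5/21 = 0.2380...
print cond_prob('kill', 'Entertainment')   # 1/18 = 0.0555...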

Testing the NB model

Our test document is:

Test Docs | kill | bomb | kidnap | music | movie | tv | C
Dt | 2 | 1 | 2 | 0 | 0 | 1 |

To find the posterior probability:

$P(c_i | W) = P(c_i) \times \prod_{j=1}^{V} P(w_j | c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability scores assigned to the test document are very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
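
A minimal log-space scoring sketch in Python, using base-10 logarithms as on the slide (the smoothed probabilities are copied from the model table; the variable and function names are mine):

import math

p_terror = {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
            'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952}
p_entert = {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
            'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(prior, cond_probs, doc):
    # sum of logs instead of product of probabilities, to avoid underflow
    score = math.log10(prior)
    for word, freq in doc.items():
        score += freq * math.log10(cond_probs[word])
    return score

print log_score(0.5, p_terror, test_doc)   # about -4.24
print log_score(0.5, p_entert, test_doc)   # about -6.63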

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
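
A toy sketch of the linear decision rule described above, sign(⟨w, x⟩ + b), with NumPy; the weight vector and bias below are made-up values, only to illustrate the prediction step, not something learned from data:

import numpy as np

w = np.array([0.4, -0.2, 0.1])   # hypothetical learned weight vector
b = -0.05                        # hypothetical learned bias

def predict(x):
    # returns +1 or -1 depending on which side of the hyperplane x falls
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.5, 0.2]))   # +1 for this made-up point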

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adopted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                              Actual
                    Positive                Negative
Predicted
  Positive    True Positive (TP)     False Positive (FP)
  Negative    False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining
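
The evaluation measures discussed on the next few slides all derive from these four cells. A small sketch, using the counts from the worked example that follows (the layout and names are mine):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)                   # positive predictive value
recall    = TP / (TP + FN)                   # sensitivity / hit rate
fallout   = FP / (FP + TN)                   # false positive rate
f1        = 2 * precision * recall / (precision + recall)
print accuracy, precision, recall, fallout, f1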

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

$Precision = \frac{TP}{TP + FP}$

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

$Recall = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP + TN}$

                 Actual
           Positive   Negative   Total
Predicted
  Positive    562        77        639
  Negative    225       436        661
  Total       787       513       1300

$Fall\text{-}out = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

$F_1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 29: Elements of Text Mining Part - I

Stop Words

Stop Words

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). In terms of linguistics these words are called function words. Words like 'a', 'an', 'the' are examples of stop words. There is no single defined set of stop words; different applications and research groups use different sets of stop words. Generally stop words are omitted in the text mining process. The frequency of stop words will be very high in any corpus compared to content words. Pointers to some good stop word lists are available at http://en.wikipedia.org/wiki/Stop_words

Jaganadh G Elements of Text Mining

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
    'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
    'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order. a

Analyzing text by only analyzing the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
'likes': 3, 'john': 2, 'mary': 4}

after removing stopwords

[0, 0, 1, 2, 1, 1, 1]

[1, 1, 1, 1, 0, 0, 1]

each entry of the vectors refers to the count of

the corresponding entry in the dictionary

a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

$VI(t) = \begin{cases} 0 & \text{if } t \text{ is 'football'} \\ 1 & \text{if } t \text{ is 'games'} \\ 2 & \text{if } t \text{ is 'john'} \\ 3 & \text{if } t \text{ is 'likes'} \\ 4 & \text{if } t \text{ is 'mary'} \\ 5 & \text{if } t \text{ is 'movies'} \\ 6 & \text{if } t \text{ is 'watch'} \end{cases}$

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

$tf\text{-}idf(t) = tf(t, d) \times idf(t)$

where 't' is a term in document 'd';
$tf(t, d)$: how many times the term 't' is present in 'd'

$tf(t, d) = \sum_{x \in d} fr(x, t)$

where

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and $idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$
where $|\{d : t \in d\}|$ is the number of documents where 't' is present and $tf(t, d) \neq 0$;
$|D|$ is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

$tf(t, d) = \sum_{x \in d} fr(x, t)$

$fr(x, t)$ is a simple function:

$fr(x, t) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$

Example: $tf('john', d_1) = 1$

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

$\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), \ldots, tf(t_n, d_n))$

To represent 'd1' and 'd2' as vectors:

$\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))$

$\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))$

which evaluates to
$\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)$
$\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)$

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

$M_{|D| \times F}$

where $|D|$ is the cardinality of the document space

$M_{|D| \times F} = \begin{bmatrix} 0 & 0 & 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as $\hat{v} = \frac{\vec{v}}{\|\vec{v}\|_p}$, where $\hat{v}$ is the unit vector, or the normalized vector, $\vec{v}$ is the vector to be normalized, and $\|\vec{v}\|_p$ is the norm (magnitude or length) of the vector $\vec{v}$ in the $L^p$ space (Lebesgue space). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: $\vec{v} = (v_1, v_2, v_3, \ldots, v_n)$
Length of vector $\vec{v}$: $\|\vec{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + \cdots + v_n^2}$

With norm: $\|\vec{v}\|_p = (|v_1|^p + |v_2|^p + |v_3|^p + \cdots + |v_n|^p)^{1/p}$

It can be simplified as

$\|\vec{v}\|_p = \left(\sum_{i=1}^{n} |v_i|^p\right)^{1/p}$

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

$\vec{v}_{d1} = (0, 0, 1, 2, 1, 1, 1)$

$\hat{v}_{d1} = \frac{\vec{v}_{d1}}{\|\vec{v}_{d1}\|_2}$

$\hat{v}_{d1} = \frac{(0,0,1,2,1,1,1)}{\sqrt{0^2+0^2+1^2+2^2+1^2+1^2+1^2}}$

$\hat{v}_{d1} = \frac{(0,0,1,2,1,1,1)}{\sqrt{8}}$

$\hat{v}_{d1} = \left(\frac{0}{\sqrt{8}}, \frac{0}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{2}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}, \frac{1}{\sqrt{8}}\right)$

$\hat{v}_{d1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)$
Now our normalized vector $\hat{v}_{d1}$ has an L2 norm $\|\hat{v}_{d1}\|_2 = 1.0$

Jaganadh G Elements of Text Mining

IDF

IDF

$idf(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|}$

where $|\{d : t \in d\}|$ is the number of documents where 't' is present and $tf(t, d) \neq 0$

Jaganadh G Elements of Text Mining

Finding IDF

$idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i \in d\}|} = \log \frac{2}{1} = 0.69314718$

idf(football) = log 2/(1+1) = 0.0

idf(games) = log 2/(1+1) = 0.0

idf(john) = log 2/(1+2) = -0.40546510810816444

idf(likes) = log 2/(1+2) = -0.40546510810816444

idf(mary) = log 2/(1+1) = 0.0

idf(movies) = log 2/(1+1) = 0.0

idf(watch) = log 2/(1+1) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

$M_{|D| \times F} \times M_{idf}$

$\begin{bmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) & tf(t_5, d_1) & tf(t_6, d_1) & tf(t_7, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) & tf(t_5, d_2) & tf(t_6, d_2) & tf(t_7, d_2) \end{bmatrix} \times \begin{bmatrix} idf(t_1) & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & idf(t_2) & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & idf(t_3) & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & idf(t_4) & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & idf(t_5) & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & idf(t_6) & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining


TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & tf(t_3, d_1) \times idf(t_3) & tf(t_4, d_1) \times idf(t_4) & tf(t_5, d_1) \times idf(t_5) & tf(t_6, d_1) \times idf(t_6) & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & tf(t_3, d_2) \times idf(t_3) & tf(t_4, d_2) \times idf(t_4) & tf(t_5, d_2) \times idf(t_5) & tf(t_6, d_2) \times idf(t_6) & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \approx \log_2 \frac{N \, C(w_i, w_j)}{C(w_i) C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} \approx \log_2 \frac{N \, C(strong, tea)}{C(strong) C(tea)}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\, mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\, \sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\, w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(\text{raining} | \text{dark cloud}) = \frac{P(\text{dark cloud} | \text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$
For multiple evidences:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n | H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H | E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naïve Bayes Model for the training set will be like:

|V| | C | P(Ci) | ni | P(kill|Ci) | P(bomb|Ci) | P(kidnap|Ci) | P(music|Ci) | P(movie|Ci) | P(tv|Ci)
6 | T | 0.5 | 15 | 0.238095238 | 0.19047619 | 0.33333333 | 0.047619048 | 0.095238095 | 0.095238095
6 | E | 0.5 | 12 | 0.055555556 | 0.11111111 | 0.11111111 | 0.33333333 | 0.27777778 | 0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15

P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs | kill | bomb | kidnap | music | movie | tv | C
Dt | 2 | 1 | 2 | 0 | 0 | 1 |

To find the posterior probability:

$P(c_i | W) = P(c_i) \times \prod_{j=1}^{V} P(w_j | c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability scores assigned to the test document are very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In the model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive                Negative
Predicted   Positive    True Positive (TP)      False Positive (FP)
            Negative    False Negative (FN)     True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84
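These figures can be verified directly from the confusion matrix counts; a small sketch, assuming the TP/FP/FN/TN values from the table above.

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

accuracy = (tp + tn) / (tp + fp + fn + tn)
pos_precision = tp / (tp + fp)
neg_precision = tn / (tn + fn)
pos_recall = tp / (tp + fn)
neg_recall = tn / (tn + fp)

# matches the values worked out above (up to rounding)
print("accuracy %.2f" % accuracy)
print("positive precision %.2f recall %.2f" % (pos_precision, pos_recall))
print("negative precision %.2f recall %.2f" % (neg_precision, neg_recall))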

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive       562         77       639
            Negative       225        436       661
Total                      787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
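In practice these metrics are rarely computed by hand; a short sketch, assuming scikit-learn's metrics module and hypothetical label/score arrays y_true, y_pred and y_score.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # hypothetical gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # hypothetical predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]    # hypothetical positive-class scores

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))   # area under the ROC curve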

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 30: Elements of Text Mining Part - I

Stop Words Filter

def stop_filter(words):
    stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
        'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
        'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
        'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
        'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
        'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
        'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
        'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
        'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
        'don', 'should', 'now']
    stopless = [word for word in words if word not in stops]
    return stopless
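A quick usage sketch, assuming the tokenize function defined on the earlier slide is in scope:

doc = "John likes to watch movies Mary likes too"
words = tokenize(doc)
print(stop_filter(words))   # stop words such as 'to' and 'too' are dropped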

Jaganadh G Elements of Text Mining

Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model a text (such as a sentence or a document) is represented as an un-ordered collection of words, disregarding grammar and even word order.a

Analyzing text by looking only at the frequency of words is called the bag of words model.

a http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1: John likes to watch movies Mary likes too
d2: John also likes to watch football games

{'football': 0, 'watch': 6, 'movies': 5, 'games': 1,
 'likes': 3, 'john': 2, 'mary': 4}
(after removing stopwords)

[0, 0, 1, 2, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 1]

Each entry of the vectors refers to the count of
the corresponding entry in the dictionary.a

a Example taken from http://en.wikipedia.org/wiki/Bag_of_words_model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

VI(t) = 0 if t is 'football'
        1 if t is 'games'
        2 if t is 'john'
        3 if t is 'likes'
        4 if t is 'mary'
        5 if t is 'movies'
        6 if t is 'watch'

Jaganadh G Elements of Text Mining


Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1),
                             stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
print vectorizer.vocabulary_
print bow.toarray()

5 http://scikit-learn.org/
Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
           for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
     "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and
tf(t, d) = how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )
where |{d : t ∈ d}| is the count of documents in which 't' is present (tf(t, d) ≠ 0),
and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
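The same count can be written as a couple of lines of Python; a small sketch, assuming documents are plain lists of lower-cased tokens.

def tf(term, doc):
    # term frequency: how many tokens x in doc are equal to the term
    return sum(1 for x in doc if x == term)

d1 = "john likes to watch movies mary likes too".split()
print(tf('john', d1))    # 1
print(tf('likes', d1))   # 2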

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

V_dn = ( tf(t1, dn), tf(t2, dn), ..., tf(tn, dn) )

To represent 'd1' and 'd2' as vectors:

V_d1 = ( tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1) )
V_d2 = ( tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2) )

which evaluates to:
V_d1 = (0, 0, 1, 2, 1, 1, 1)
V_d2 = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix M_{|D|×F}, where |D| is the cardinality of the document space:

M_{|D|×F} = [ 0 0 1 2 1 1 1 ]
            [ 1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v / ||v||_p, where v̂ is the unit vector (the normalized vector), v is the vector going to be normalized, and ||v||_p is the norm (magnitude or length) of the vector v in the L^p space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v = (v1, v2, v3, ..., vn)
Length of the vector: ||v|| = sqrt( v1^2 + v2^2 + v3^2 + ... + vn^2 )
With the general norm: ||v||_p = ( |v1|^p + |v2|^p + |v3|^p + ... + |vn|^p )^(1/p)
It can be simplified as:
||v||_p = ( Σ_{i=1}^{n} |vi|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html
Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a
It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

V_d1 = (0, 0, 1, 2, 1, 1, 1)
v̂_d1 = V_d1 / ||V_d1||_2
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt( 0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 )
v̂_d1 = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)
v̂_d1 = ( 0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8) )
v̂_d1 = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)
Now our normalized vector v̂_d1 has an L2 norm ||v̂_d1||_2 = 1.0
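The same normalization in a few lines of Python, a sketch using only the math module:

import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

print(l2_normalize([0, 0, 1, 2, 1, 1, 1]))
# [0.0, 0.0, 0.3535..., 0.7071..., 0.3535..., 0.3535..., 0.3535...]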

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents in which 't' is present (tf(t, d) ≠ 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ); for example log(2/1) = 0.69314718

idf(football) = log( 2 / (1 + 1) ) = 0.0
idf(games)    = log( 2 / (1 + 1) ) = 0.0
idf(john)     = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(likes)    = log( 2 / (1 + 2) ) = -0.40546510810816444
idf(mary)     = log( 2 / (1 + 1) ) = 0.0
idf(movies)   = log( 2 / (1 + 1) ) = 0.0
idf(watch)    = log( 2 / (1 + 1) ) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
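A small sketch that reproduces these idf values for the two-document example, assuming the documents are given as stop-word-filtered token lists:

import math

docs = [['john', 'likes', 'watch', 'movies', 'mary', 'likes'],
        ['john', 'likes', 'watch', 'football', 'games']]
vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)   # document frequency
    return math.log(len(docs) / (1.0 + df))

for term in vocab:
    print("%s %s" % (term, idf(term, docs)))
# 'john' and 'likes' occur in both documents: log(2/3) = -0.405...
# every other term occurs in one document:   log(2/2) = 0.0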

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf

[ tf(t1, d1)  tf(t2, d1)  tf(t3, d1)  tf(t4, d1)  tf(t5, d1)  tf(t6, d1)  tf(t7, d1) ]
[ tf(t1, d2)  tf(t2, d2)  tf(t3, d2)  tf(t4, d2)  tf(t5, d2)  tf(t6, d2)  tf(t7, d2) ]

multiplied by the diagonal idf matrix

diag( idf(t1), idf(t2), idf(t3), idf(t4), idf(t5), idf(t6), idf(t7) )

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t1, d1)×idf(t1)  tf(t2, d1)×idf(t2)  tf(t3, d1)×idf(t3)  tf(t4, d1)×idf(t4)  tf(t5, d1)×idf(t5)  tf(t6, d1)×idf(t6)  tf(t7, d1)×idf(t7) ]
[ tf(t1, d2)×idf(t1)  tf(t2, d2)×idf(t2)  tf(t3, d2)×idf(t3)  tf(t4, d2)×idf(t4)  tf(t5, d2)×idf(t5)  tf(t6, d2)×idf(t6)  tf(t7, d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ||M_tf-idf||_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1),
                             stop_words='english')
docs = ('John likes to watch movies Mary likes too',
        'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm='l2')
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org/
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or a classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = ( P(E|H) × P(H) ) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.

P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = ( P(dark cloud | raining) × P(raining) ) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = ( P(E1, E2, ..., En | H) × P(H) ) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as:
P(H | E1, E2, ..., En) = ( P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) ) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes model for the training set will look like:

|V| = 6

C   P(Ci)  ni   P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
T   0.5    15   0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
E   0.5    12   0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class, for example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
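A short sketch of the same Laplace (add-one) estimate as a function; the counts are the ones from the toy training table above.

def laplace_smoothed(word_count_in_class, class_word_total, vocab_size):
    # (count + 1) / (class total + |V|)
    return (word_count_in_class + 1.0) / (class_word_total + vocab_size)

V = 6
print(laplace_smoothed(4, 15, V))   # P(kill|Terrorism)     = 5/21 = 0.2380...
print(laplace_smoothed(0, 12, V))   # P(kill|Entertainment) = 1/18 = 0.0555...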

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill   bomb   kidnap   music   movie   tv   C
Dt           2      1       2        0       0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
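A minimal sketch of the same log-space scoring in Python, using the smoothed probabilities and test-document counts from the worked example (base-10 logs, as above):

import math

# smoothed P(w|class), in the order kill, bomb, kidnap, music, movie, tv
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_doc = [2, 1, 2, 0, 0, 1]   # word counts in the test document
prior = 0.5

def log_score(prior, probs, counts):
    return math.log10(prior) + sum(c * math.log10(p) for p, c in zip(probs, counts))

print(log_score(prior, p_terrorism, test_doc))      # about -4.24 -> Terrorism
print(log_score(prior, p_entertainment, test_doc))  # about -6.63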

The section on the Naive Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word
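Binarizing the original count table is a one-liner per row; a small sketch with the six training documents:

count_rows = [[2, 1, 3, 0, 0, 1],   # D0
              [1, 1, 1, 0, 0, 0],   # D1
              [1, 1, 2, 0, 1, 0],   # D2
              [0, 1, 0, 2, 1, 1],   # D3
              [0, 0, 1, 1, 1, 0],   # D4
              [0, 0, 0, 2, 2, 2]]   # D5

binary_rows = [[1 if c > 0 else 0 for c in row] for row in count_rows]
for row in binary_rows:
    print(row)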

Jaganadh G Elements of Text Mining


Page 31: Elements of Text Mining Part - I

Bag of Words

The bag-of-words model is a simplifying representation used in natural languageprocessing and information retrieval (IR) In this model a text (such as a sentence or adocument) is represented as an un-ordered collection of words disregarding grammarand even word order a

Analyzing text by only analyzing frequency of words is called as bag of words model

ahttpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
        'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, i in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
 'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)P(w_j)} \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i)C(w_j)}$

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)P(tea)} \approx \log_2 \frac{N \cdot C(strong, tea)}{C(strong)C(tea)}$

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division

import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[gram.split()[0]] *
             wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}$

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division

import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] *
            wordcount[gram.split()[1]]) / math.sqrt(gramcount[gram])
    return tsc
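A quick usage sketch for the two association measures above (the input sentence here is an invented example, and it reuses the ngrams helper defined earlier):

words = "strong tea is strong and strong tea is good".lower().split()
print mutual_info(words)   # mutual information score for each bigram, e.g. 'strong tea'
print tscore(words)        # t-score for each bigram, e.g. 'strong tea'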

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set T of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or a classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set T and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of a dark cloud; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$
For multiple evidences:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$
With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V| C P(Ci) ni P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci) P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6   T 0.5   15 0.238095238  0.19047619  0.33333333   0.047619048  0.095238095  0.095238095
    E 0.5   12 0.05555556   0.11111111  0.11111111   0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
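A quick check of the smoothed estimates in Python (a throwaway sketch; the counts are taken from the training table above):

V = 6
print (2 + 1 + 1 + 1) / float(15 + V)   # P(kill|Terrorism)     = 0.2380...
print (0 + 0 + 0 + 1) / float(12 + V)   # P(kill|Entertainment) = 0.0555...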

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.
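These two scores are easy to verify in Python. A small sketch using the rounded intermediate values shown above:

p_T = 0.5 * 0.0566 * 0.1904 * 0.1110 * 1 * 1 * 0.0952
p_E = 0.5 * 0.0030 * 0.1111 * 0.0123 * 1 * 1 * 0.1111
print p_T   # ~5.7e-05
print p_E   # ~2.27e-07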

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
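The log trick from the underflow discussion above, written out in Python (a minimal sketch; log10 is used so the figures match the base-10 logs in the slide):

from math import log10

log_p_T = log10(0.5) + 2*log10(0.2380) + log10(0.1904) + 2*log10(0.3333) + log10(0.0952)
log_p_E = log10(0.5) + 2*log10(0.0555) + log10(0.1111) + 2*log10(0.1111) + log10(0.1111)
print log_p_T   # ~ -4.24
print log_p_E   # ~ -6.63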

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.
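A tiny sketch of this binarization step (the count vectors below are copied from the frequency table of the earlier multinomial example):

counts = {'D0': [2, 1, 3, 0, 0, 1],
          'D1': [1, 1, 1, 0, 0, 0]}
binary = dict((doc, [1 if c > 0 else 0 for c in vec])
              for doc, vec in counts.items())
print binary   # D0 -> [1, 1, 1, 0, 0, 1], D1 -> [1, 1, 1, 0, 0, 0]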

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in R^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether ($y_i = +1$) or ($y_i = -1$). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in R^p$ (weight vector) and $b \in R$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP+FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP+FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$Error\ Rate = \frac{FP+FN}{TP+FP+FN+TN}$

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP+TN}$

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP+FP}$

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN+FN}$

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP+TN}$

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP+TP}$

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
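All of the evaluation figures above can be reproduced from the four confusion matrix cells; the values printed by this small plain-Python sketch match the slides up to rounding:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
print (TP + TN) / (TP + FP + FN + TN)        # accuracy
print TP / (TP + FP), TP / (TP + FN)         # positive precision, positive recall
print (FP + FN) / (TP + FP + FN + TN)        # error rate
print FP / (FP + TN)                         # fall-out (false positive rate)
print TN / (TN + FN), TN / (FP + TN)         # NPV, specificity
print FP / (FP + TP)                         # false discovery rate
print (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # MCC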

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 32: Elements of Text Mining Part - I

Bag of Words

Example

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

rsquofootballrsquo 0 rsquowatchrsquo 6 rsquomoviesrsquo 5 rsquogamesrsquo 1

rsquolikesrsquo 3 rsquojohnrsquo 2 rsquomaryrsquo 4

after removing stopwords

[0 0 1 2 1 1 1]

[1 1 1 1 0 0 1]

each entry of the vectors refers to count of

the corresponding entry in the dictionary

a

aExample taken from httpenwikipediaorgwikiBag of words model

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84
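These figures are easy to reproduce with a few lines of Python. A minimal sketch using the counts from the table above:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
print (TP + TN) / (TP + FP + FN + TN)  # accuracy, 0.76
print TP / (TP + FP)                   # positive precision, 0.87
print TN / (TN + FN)                   # negative precision, 0.65
print TP / (TP + FN)                   # positive recall, 0.71
print TN / (TN + FP)                   # negative recall, 0.84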

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of items that were classified incorrectly:

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Error Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = \frac{FP}{FP + TN}

                      Actual Positive    Actual Negative    Total
Predicted Positive    562                77                 639
Predicted Negative    225                436                661
Total                 787                513                1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73
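Continuing the small sketch from the precision and recall slide, with the same values:

pos_p, pos_r = 0.87, 0.71
neg_p, neg_r = 0.65, 0.84
print 2 * pos_p * pos_r / (pos_p + pos_r)  # F1 positive, 0.78
print 2 * neg_p * neg_r / (neg_p + neg_r)  # F1 negative, 0.73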

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives:

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
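The same confusion-matrix counts give the MCC in a couple of lines of Python. A minimal sketch:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc  # roughly 0.55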

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
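scikit-learn can compute the ROC points and the AUC from gold labels and classifier scores. A minimal sketch with hypothetical labels and decision scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # hypothetical gold labels
y_score = [0.9, 0.4, 0.2, 0.8, 0.3, 0.6, 0.7, 0.1]  # hypothetical decision scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)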

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln."
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python topia.termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln."
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 34: Elements of Text Mining Part - I

Bag of Words

Documents

d1 John likes to watch movies Mary likes too

d2 John also likes to watch football games

Vocabulary Index

V I(t) =

0 if t is rsquofootballrsquo1 if t is rsquogamesrsquo2 if t is rsquojohnrsquo3 if t is rsquolikesrsquo4 if t is rsquomaryrsquo5 if t is rsquomoviesrsquo6 if t is rsquowatchrsquo

Jaganadh G Elements of Text Mining

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier or a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)
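As a toy illustration of the Γ and γ notation above (not from the original slides), the sketch below defines a trivial learning algorithm Γ that memorizes keyword-to-class associations from a labeled training set T and returns a classifier function γ; the data and helper names are made up for illustration only.

def gamma_learner(training_set):              # a toy Gamma : T -> gamma
    keyword_class = {}
    for doc, label in training_set:
        for word in doc.lower().split():
            keyword_class[word] = label       # remember the class each keyword was seen with
    def gamma(doc):                           # the learned gamma : D -> C
        votes = [keyword_class[w] for w in doc.lower().split() if w in keyword_class]
        return max(set(votes), key=votes.count) if votes else None
    return gamma

T = [("bomb blast kills many", "terrorism"),
     ("new movie music launch", "entertainment")]
gamma = gamma_learner(T)
print(gamma("music concert tonight"))         # entertainment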

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)
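A small numeric sketch (not from the slides) of the independence form of Bayes' rule above; all probability values below are made-up placeholders, only the arithmetic is the point.

p_h = 0.2                        # P(H): prior probability of raining (made up)
p_e_given_h = [0.9, 0.7]         # P(E1|H), P(E2|H): e.g. dark cloud, high humidity (made up)
p_e = [0.3, 0.5]                 # P(E1), P(E2): marginal evidence probabilities (made up)

numerator = p_h
for p in p_e_given_h:
    numerator *= p               # P(E1|H) x P(E2|H) x ... x P(H)

denominator = 1.0
for p in p_e:
    denominator *= p             # P(E1, E2, ...), also factored under independence

print(numerator / denominator)   # P(H | E1, E2) is about 0.84 for these numbers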

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V| C P(Ci) ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6   T 0.5   15  0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
6   E 0.5   12  0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
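The smoothed estimates above are easy to reproduce; a minimal sketch in Python, where class_counts simply restates the aggregated per-class word counts from the model slide.

vocab = ["kill", "bomb", "kidnap", "music", "movie", "tv"]
class_counts = {
    "Terrorism":     {"kill": 4, "bomb": 3, "kidnap": 6, "music": 0, "movie": 1, "tv": 1},
    "Entertainment": {"kill": 0, "bomb": 1, "kidnap": 1, "music": 5, "movie": 4, "tv": 1},
}

def p_word_given_class(word, cls):
    counts = class_counts[cls]
    n_c = sum(counts.values())                        # 15 for Terrorism, 12 for Entertainment
    return (counts[word] + 1.0) / (n_c + len(vocab))  # Laplace (add-one) estimation

print(p_word_given_class("kill", "Terrorism"))        # 5/21 = 0.238...
print(p_word_given_class("kill", "Entertainment"))    # 1/18 = 0.055...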

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1

To find the posterior probability:

P(ci|W) = P(ci) × ∏ j=1..V P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952 = 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111 = 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.
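The two products above can be checked with a few lines of Python; a sketch that recomputes both unnormalized posterior scores from the smoothed probabilities on the model slide.

test_counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}
prior = {"Terrorism": 0.5, "Entertainment": 0.5}
cond_prob = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}

for cls in ("Terrorism", "Entertainment"):
    score = prior[cls]
    for word, freq in test_counts.items():
        score *= cond_prob[cls][word] ** freq    # P(w|c) raised to the word's frequency
    print("%s %e" % (cls, score))                # about 5.7e-05 vs 2.3e-07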

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
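A log-space version of the same scoring (a sketch, not the author's code), summing base-10 logarithms instead of multiplying raw probabilities.

import math

test_counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}
cond_prob = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}

for cls in ("Terrorism", "Entertainment"):
    log_score = math.log10(0.5)                               # log of the prior P(c)
    for word, freq in test_counts.items():
        log_score += freq * math.log10(cond_prob[cls][word])  # freq * log P(w|c)
    print("%s %.4f" % (cls, log_score))                       # about -4.24 vs -6.63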

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.
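A tiny sketch of that count-to-presence conversion; doc_counts below is just the D0 row of the earlier frequency table.

doc_counts = {"kill": 2, "bomb": 1, "kidnap": 3, "music": 0, "movie": 0, "tv": 1}   # D0
bernoulli_doc = dict((word, 1 if count > 0 else 0)
                     for word, count in doc_counts.items())
print(bernoulli_doc)   # each word mapped to 1 (present) or 0 (absent)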

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1, ..., n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
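A minimal sketch of the linear decision rule f(x) = ⟨w, x⟩ + b described above; the weight vector and bias are made-up numbers, not a trained model.

w = [0.4, -0.3, 0.8]          # weight vector (made up)
b = -0.1                      # bias (made up)
x = [1.0, 0.5, 0.2]           # an input feature vector

f = sum(wi * xi for wi, xi in zip(w, x)) + b     # <w, x> + b
label = 1 if f > 0 else -1
print("f(x) = %.2f -> class %+d" % (f, label))   # f(x) = 0.31 -> class +1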

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margin of the scatter region of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edges of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive               Negative
Predicted  Positive     True Positive (TP)     False Positive (FP)
           Negative     False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
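The same numbers can be reproduced in a few lines of Python; a sketch computing accuracy, precision and recall for both classes from the confusion matrix above (printed values match the slide's figures up to rounding).

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / float(TP + FP + FN + TN)
pos_precision = TP / float(TP + FP)
neg_precision = TN / float(TN + FN)
pos_recall = TP / float(TP + FN)
neg_recall = TN / float(TN + FP)

print("accuracy %.2f" % accuracy)
print("positive precision %.2f recall %.2f" % (pos_precision, pos_recall))
print("negative precision %.2f recall %.2f" % (neg_precision, neg_recall))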

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
Total                   787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
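A sketch of the same F1 computation, reusing the precision and recall values from the slides above.

def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print("%.2f" % f1_score(0.87, 0.71))   # positive class, 0.78
print("%.2f" % f1_score(0.65, 0.84))   # negative class, 0.73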

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

MCC = (562 × 436 − 77 × 225) / sqrt( (562 + 77)(562 + 225)(436 + 77)(436 + 225) ) = 0.55
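A sketch recomputing the MCC from the same confusion matrix.

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print("%.2f" % mcc)   # 0.55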

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
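A small sketch of ROC and AUC with scikit-learn, assuming a reasonably recent version where sklearn.metrics.roc_curve and roc_auc_score are available; y_true and y_scores are made-up values.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]               # actual labels (made up)
y_scores = [0.1, 0.4, 0.35, 0.8]    # predicted probability of the positive class (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr)                               # false positive rate points (X axis)
print(tpr)                               # true positive rate points (Y axis)
print(roc_auc_score(y_true, y_scores))   # area under the ROC curve; 0.75 here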

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # tree.label() in newer NLTK versions
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 35: Elements of Text Mining Part - I

Bag of Words

football games john likes mary movies watch

doc1 0 0 1 2 1 1 1

doc2 1 1 1 1 0 0 1

Jaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 36: Elements of Text Mining Part - I

Bag of Words

Creating Bag of Words with Python and sklearn 5

from sklearnfeature_extractiontext import

CountVectorizer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

print vectorizervocabulary_

print bowtoarray()

5httpscikit-learnorgJaganadh G Elements of Text Mining

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs):
    # tokenize() is the simple regex tokenizer defined earlier in the deck
    stops = ['to', 'too', 'also']
    token_list = [tokenize(doc) for doc in docs]
    vocab = list(set(token_list[0]).union(*token_list))
    vocab = [v for v in vocab if v not in stops and len(v) > 1]
    vocab_idex = dict([(word, vocab.index(word)) for word
        in vocab])
    bow = [[tokens.count(word) for word in vocab_idex.keys()]
        for tokens in token_list]
    print vocab_idex
    for bag in bow:
        print bag

d = ("John likes to watch movies Mary likes too",
    "John also likes to watch football games")
bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tf-idf, term frequency-inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = \sum_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = \log \frac{|D|}{1 + |\{d : t ∈ d\}|}

where |\{d : t ∈ d\}| is the number of documents in which 't' is present, i.e. tf(t, d) ≠ 0, and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = \sum_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

\vec{V}_{d_n} = (tf(t_1, d_n), tf(t_2, d_n), ..., tf(t_n, d_n))

To represent 'd1' and 'd2' as vectors:

\vec{V}_{d_1} = (tf(t_1, d_1), tf(t_2, d_1), tf(t_3, d_1), tf(t_4, d_1), tf(t_5, d_1), tf(t_6, d_1), tf(t_7, d_1))
\vec{V}_{d_2} = (tf(t_1, d_2), tf(t_2, d_2), tf(t_3, d_2), tf(t_4, d_2), tf(t_5, d_2), tf(t_6, d_2), tf(t_7, d_2))

which evaluates to:

\vec{V}_{d_1} = (0, 0, 1, 2, 1, 1, 1)
\vec{V}_{d_2} = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D| × F}

where |D| is the cardinality of the document space.

M_{|D| × F} =
[ 0 0 1 2 1 1 1 ]
[ 1 1 1 1 0 0 1 ]
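The term-frequency vectors above can be reproduced with a few lines of plain Python. This is only a sketch: the ordering t1 ... t7 is assumed to be the alphabetical vocabulary (football, games, john, likes, mary, movies, watch) after stop word removal, which is consistent with the vectors shown.

vocab = ['football', 'games', 'john', 'likes', 'mary', 'movies', 'watch']
docs = ('john likes to watch movies mary likes too',
    'john also likes to watch football games')
# count each vocabulary term in each document
matrix = [[doc.split().count(term) for term in vocab] for doc in docs]
print matrix    # [[0, 0, 1, 2, 1, 1, 1], [1, 1, 1, 1, 0, 0, 1]]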

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as \hat{v} = \vec{v} / ||\vec{v}||_p, where \hat{v} is the unit vector (the normalized vector), \vec{v} is the vector going to be normalized, and ||\vec{v}||_p is the norm (magnitude or length) of the vector \vec{v} in the L^p space (Lebesgue spaces). a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm. 6
Non-normalized vector: \vec{v} = (v_1, v_2, v_3, ..., v_n)
Length of vector: ||\vec{v}|| = \sqrt{v_1^2 + v_2^2 + v_3^2 + ... + v_n^2}

With norm: ||\vec{v}||_p = (|v_1|^p + |v_2|^p + |v_3|^p + ... + |v_n|^p)^{1/p}

It can be simplified as:

||\vec{v}||_p = (\sum_{i=1}^{n} |\vec{v}_i|^p)^{1/p}

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm we apply here is the L2 norm, which is also called the Euclidean norm. a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

\vec{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1)

\hat{v}_{d_1} = \vec{v}_{d_1} / ||\vec{v}_{d_1}||_2

\hat{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1) / \sqrt{0^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}

\hat{v}_{d_1} = (0, 0, 1, 2, 1, 1, 1) / \sqrt{8}

\hat{v}_{d_1} = (0/\sqrt{8}, 0/\sqrt{8}, 1/\sqrt{8}, 2/\sqrt{8}, 1/\sqrt{8}, 1/\sqrt{8}, 1/\sqrt{8})

\hat{v}_{d_1} = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Now our normalized vector \hat{v}_{d_1} has an L2-norm ||\hat{v}_{d_1}||_2 = 1.0
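A quick way to check this arithmetic is with NumPy; this is just a sketch of the L2 normalization step, assuming NumPy is available.

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1, ord=2)   # divide by the Euclidean length
print v_hat                  # [ 0. 0. 0.3535 0.7071 0.3535 0.3535 0.3535 ]
print np.linalg.norm(v_hat)  # 1.0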

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = \log \frac{|D|}{1 + |\{d : t ∈ d\}|}

where |\{d : t ∈ d\}| is the number of documents in which 't' is present and tf(t, d) ≠ 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(t_i) = \log \frac{|D|}{1 + |\{d : t_i ∈ d\}|} = \log \frac{2}{1} = 0.69314718

idf(football) = \log \frac{2}{1+1} = 0.0
idf(games) = \log \frac{2}{1+1} = 0.0
idf(john) = \log \frac{2}{1+2} = -0.40546510810816444
idf(likes) = \log \frac{2}{1+2} = -0.40546510810816444
idf(mary) = \log \frac{2}{1+1} = 0.0
idf(movies) = \log \frac{2}{1+1} = 0.0
idf(watch) = \log \frac{2}{1+1} = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
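These idf values are easy to verify with the math module. A minimal sketch, with the document frequencies of the two example documents hard-coded:

from math import log

D = 2   # number of documents in the example corpus
df = {'football': 1, 'games': 1, 'john': 2, 'likes': 2,
      'mary': 1, 'movies': 1, 'watch': 1}
idf = dict((term, log(D / (1.0 + n))) for term, n in df.items())
print idf['john']   # -0.40546510810816444
print idf['mary']   # 0.0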

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D| × F} × M_{idf}

[ tf(t_1, d_1) tf(t_2, d_1) tf(t_3, d_1) tf(t_4, d_1) tf(t_5, d_1) tf(t_6, d_1) tf(t_7, d_1) ]
[ tf(t_1, d_2) tf(t_2, d_2) tf(t_3, d_2) tf(t_4, d_2) tf(t_5, d_2) tf(t_6, d_2) tf(t_7, d_2) ]

×

[ idf(t_1) 0 0 0 0 0 0 ]
[ 0 idf(t_2) 0 0 0 0 0 ]
[ 0 0 idf(t_3) 0 0 0 0 ]
[ 0 0 0 idf(t_4) 0 0 0 ]
[ 0 0 0 0 idf(t_5) 0 0 ]
[ 0 0 0 0 0 idf(t_6) 0 ]
[ 0 0 0 0 0 0 idf(t_7) ]

Jaganadh G Elements of Text Mining


TF-IDF weight

[ tf(t_1, d_1)×idf(t_1)  tf(t_2, d_1)×idf(t_2)  tf(t_3, d_1)×idf(t_3)  tf(t_4, d_1)×idf(t_4)  tf(t_5, d_1)×idf(t_5)  tf(t_6, d_1)×idf(t_6)  tf(t_7, d_1)×idf(t_7) ]
[ tf(t_1, d_2)×idf(t_1)  tf(t_2, d_2)×idf(t_2)  tf(t_3, d_2)×idf(t_3)  tf(t_4, d_2)×idf(t_4)  tf(t_5, d_2)×idf(t_5)  tf(t_6, d_2)×idf(t_6)  tf(t_7, d_2)×idf(t_7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_{tf-idf} = \frac{M_{tf-idf}}{||M_{tf-idf}||_2}

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application. a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining


Trigrams

Trigrams

P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association:

I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)} ≈ \log_2 \frac{N C(w_i, w_j)}{C(w_i) C(w_j)}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong) P(tea)} ≈ \log_2 \frac{N C(strong, tea)}{C(strong) C(tea)}

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))
            / math.log(2))
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association:

t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i)) mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i)) \sigma^2(P(w_j))}} ≈ \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus, C(w_i, w_j) is the frequency of the bigram w_i w_j, and N is the total number of words in the corpus.

t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N} C(strong) C(tea)}{\sqrt{C(strong, tea)}}

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]])
            / math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier, or a classifier function γ, that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

Main approaches in document classification are:

Naive Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naive Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: P(H|E) = \frac{P(E|H) × P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naive Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = \frac{P(dark cloud | raining) × P(raining)}{P(dark cloud)}

For multiple evidences:

P(H | E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n | H) × P(H)}{P(E_1, E_2, ..., E_n)}

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, ..., E_n) = \frac{P(E_1|H) × P(E_2|H) × ... × P(E_n|H) × P(H)}{P(E_1, E_2, ..., E_n)}
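To make the rule concrete, here is a small worked example with hypothetical numbers (these values are illustrative only, not taken from the slides). Suppose P(raining) = 0.3, P(dark cloud | raining) = 0.9, and P(dark cloud) = 0.5. Then

P(raining | dark cloud) = (0.9 × 0.3) / 0.5 = 0.54

so observing a dark cloud raises the probability of rain from 0.3 to 0.54.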

Jaganadh G Elements of Text Mining

Naive Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naive Bayes Model

Naive Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
6    T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.055555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naive Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1) / (15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1) / (12+|V|) = 1/18 = 0.0555
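The smoothed estimates for a whole class can be computed in a couple of lines. A minimal sketch for the Terrorism class, with the word counts read off the training table above:

V = 6   # vocabulary size
counts = {'kill': 4, 'bomb': 3, 'kidnap': 6, 'music': 0, 'movie': 1, 'tv': 1}  # Terrorism counts
n = sum(counts.values())   # 15
probs = dict((w, (c + 1.0) / (n + V)) for w, c in counts.items())
print probs['kill']    # 0.238095238...
print probs['music']   # 0.047619047...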

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(c_i | W) = P(c_i) × \prod_{j=1}^{|V|} P(w_j | c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document has been classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naive Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
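The whole worked example, including the log trick, fits in a few lines of Python. This is a sketch of the toy calculation only: the smoothed class-conditional probabilities are hard-coded from the slides, and base-10 logarithms are used to match the numbers above.

import math

# P(w|c) in the order kill, bomb, kidnap, music, movie, tv (from the slides)
p_terrorism = [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952]
p_entertainment = [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111]
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt
prior = 0.5

def log_posterior(prior, cond_probs, counts):
    # sum of logs instead of a product of tiny probabilities
    score = math.log10(prior)
    for p, n in zip(cond_probs, counts):
        score += n * math.log10(p)
    return score

print log_posterior(prior, p_terrorism, test_doc)      # ~ -4.2437
print log_posterior(prior, p_entertainment, test_doc)  # ~ -6.6289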

Jaganadh G Elements of Text Mining

Naive Bayes Classifier

There are two different ways to set up a Naive Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
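For reference, scikit-learn ships a Bernoulli Naive Bayes implementation that works directly on this kind of binary table. The sketch below is not from the slides; it simply fits sklearn's BernoulliNB on the six binarized training rows above and classifies the test document.

from sklearn.naive_bayes import BernoulliNB

# columns: kill, bomb, kidnap, music, movie, tv (0/1 presence)
X = [[1, 1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1, 1]]
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])   # test document: kill, bomb, kidnap, tv present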

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
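The decision rule itself is only a dot product and a sign check. A tiny sketch with made-up numbers (w, b, and x here are hypothetical, not learned from any data):

import numpy as np

w = np.array([0.4, -0.7])   # hypothetical weight vector
b = 0.1                     # hypothetical bias
x = np.array([1.0, 0.5])    # a new feature vector to classify
print np.sign(np.dot(w, x) + b)   # +1.0 -> positive class, -1.0 -> negative class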

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors. 10 If the data is not linearly separable we have to use kernel tricks. 11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool! I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong:

ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = \frac{FP}{FP + TN}

                      Actual
                      Positive  Negative  Total
Predicted  Positive   562       77        639
           Negative   225       436       661
           Total      787       513       1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × \frac{precision × recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × \frac{0.87 × 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 × \frac{0.65 × 0.84}{0.65 + 0.84} = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives:

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure:

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons:

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications:

MCC = \frac{TP × TN - FP × FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 × 436 - 77 × 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55
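All of the metrics above follow mechanically from the four confusion matrix cells, so they are easy to compute in one place. A small sketch using the worked example's numbers (TP=562, FP=77, FN=225, TN=436):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)           # 0.768
precision = TP / (TP + FP)                           # 0.879
recall = TP / (TP + FN)                              # 0.714
f1 = 2 * precision * recall / (precision + recall)   # 0.788
error_rate = (FP + FN) / (TP + FP + FN + TN)         # 0.232
fallout = FP / (FP + TN)                             # 0.150
specificity = TN / (FP + TN)                         # 0.850
npv = TN / (TN + FN)                                 # 0.660
fdr = FP / (FP + TP)                                 # 0.120
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # 0.551

print accuracy, precision, recall, f1, mcc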

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
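scikit-learn can compute both the ROC points and the AUC directly from true labels and classifier scores. The labels and scores below are hypothetical, just to show the call pattern:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # hypothetical gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # hypothetical classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)   # area under the ROC curve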

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 37: Elements of Text Mining Part - I

Bag of Words

Creating Bag of Words with Python Just for sample -(

def bag_of_words(docs)

stops = [rsquotorsquorsquotoorsquorsquoalsorsquo]

token_list = [tokenize(doc) for doc in docs]

vocab = list(set(token_list[0])union(token_list))

vocab =[v for v in vocab if v not in stops and len(v) gt 1]

vocab_idex = dict( [ ( word vocabindex(word) ) for word

in vocab] )

bow = [[tokenscount(word) for word in vocab_idexkeys()]

for tokens in token_list]

print vocab_idex

for bag in bow

print bag

d = (John likes to watch movies Mary likes too

John also likes to watch football games)

bag_of_words(d)

Jaganadh G Elements of Text Mining

TF-IDF

Tfndashidf term frequencyndashinverse document frequency is a numerical statistic which reflectshow important a word is to a document in a collection or corpus

TF-IDF

tf minus idf(t) = tf(t d)times idf(t)

where rsquotrsquo is a term in document rsquodrsquotf(t d) how many times the term rsquotrsquo is present in rsquodrsquo

tf(t d) =sumxisind

fr(x t)

where

fr(x t) =

1 if x = t0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)
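A minimal sketch (toy labels, not from the deck) of building such a confusion matrix with scikit-learn. Note that sklearn puts actual classes on the rows and predicted classes on the columns, i.e. the transpose of the layout above.

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos']

print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))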

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73
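The numbers on these evaluation slides can be reproduced directly from the four confusion-matrix counts; a small sketch in plain Python (not from the deck; the slides truncate the results to two decimal places):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy      = (TP + TN) / (TP + FP + FN + TN)   # 0.7677
pos_precision = TP / (TP + FP)                    # 0.8795
pos_recall    = TP / (TP + FN)                    # 0.7141
f1_positive   = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)  # 0.7882

print(accuracy)
print(pos_precision)
print(pos_recall)
print(f1_positive)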

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
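A sketch reproducing the MCC value from the same counts (for full label vectors, sklearn.metrics.matthews_corrcoef computes the same quantity):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(mcc)  # ~0.55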

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
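A minimal sketch (toy scores, not from the deck) of computing ROC points and the AUC with scikit-learn:

from sklearn.metrics import roc_curve, auc

y_true   = [1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(auc(fpr, tpr))  # area under the ROC curve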

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 38: Elements of Text Mining Part - I

TF-IDF

Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus.

TF-IDF

tf-idf(t, d) = tf(t, d) × idf(t)

where 't' is a term in document 'd', and tf(t, d) is how many times the term 't' is present in 'd':

tf(t, d) = Σ_{x ∈ d} fr(x, t)

where

fr(x, t) = 1 if x = t, 0 otherwise

Jaganadh G Elements of Text Mining

TF-IDF

TF-IDF

and

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0), and |D| is the cardinality of the document space.

Jaganadh G Elements of Text Mining

TF

TF

tf(t, d) = Σ_{x ∈ d} fr(x, t)

fr(x, t) is a simple function:

fr(x, t) = 1 if x = t, 0 otherwise

Example: tf('john', d1) = 1
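A quick sketch (plain Python, not part of the original slides) of this tf definition using collections.Counter on the example document d1:

from collections import Counter

d1 = "john likes to watch movies mary likes too".split()
tf_d1 = Counter(d1)
print(tf_d1['john'])   # 1
print(tf_d1['likes'])  # 2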

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space:

v⃗(dn) = (tf(t1, dn), tf(t2, dn), ..., tf(tn, dn))

To represent 'd1' and 'd2' as vectors:

v⃗(d1) = (tf(t1, d1), tf(t2, d1), tf(t3, d1), tf(t4, d1), tf(t5, d1), tf(t6, d1), tf(t7, d1))
v⃗(d2) = (tf(t1, d2), tf(t2, d2), tf(t3, d2), tf(t4, d2), tf(t5, d2), tf(t6, d2), tf(t7, d2))

which evaluates to:
v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)
v⃗(d2) = (1, 1, 1, 1, 0, 0, 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as a matrix

M_{|D|×F}

where |D| is the cardinality of the document space.

M_{|D|×F} = [ 0 0 1 2 1 1 1
              1 1 1 1 0 0 1 ]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v̂ = v⃗ / ‖v⃗‖_p, where v̂ is the unit vector (the normalized vector), v⃗ is the vector to be normalized, and ‖v⃗‖_p is the norm (magnitude or length) of the vector v⃗ in the Lp space (Lebesgue spaces).a

a http://en.wikipedia.org/wiki/Lp_space

Jaganadh G Elements of Text Mining

Vector Normalization

The length of a vector is calculated using the Euclidean norm.6
Non-normalized vector: v⃗ = (v1, v2, v3, ..., vn)
Length of vector: ‖v⃗‖ = sqrt(v1² + v2² + v3² + ... + vn²)

With the p-norm: ‖v⃗‖_p = (|v1|^p + |v2|^p + |v3|^p + ... + |vn|^p)^(1/p)

It can be simplified as:

‖v⃗‖_p = ( Σ_{i=1}^{n} |v_i|^p )^(1/p)

6 http://mathworld.wolfram.com/L2-Norm.html

Jaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is the L2 norm, which is also called the Euclidean norm.a

It is a common norm used to measure the length of a vector, where p = 2.

a http://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm

Jaganadh G Elements of Text Mining

Vector Normalization

v⃗(d1) = (0, 0, 1, 2, 1, 1, 1)

v̂(d1) = v⃗(d1) / ‖v⃗(d1)‖_2

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(0² + 0² + 1² + 2² + 1² + 1² + 1²)

v̂(d1) = (0, 0, 1, 2, 1, 1, 1) / sqrt(8)

v̂(d1) = (0/sqrt(8), 0/sqrt(8), 1/sqrt(8), 2/sqrt(8), 1/sqrt(8), 1/sqrt(8), 1/sqrt(8))

v̂(d1) = (0.0, 0.0, 0.3535, 0.7071, 0.3535, 0.3535, 0.3535)

Our normalized vector v̂(d1) now has an L2 norm of ‖v̂(d1)‖_2 = 1.0
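The same normalization can be checked with NumPy (a sketch, assuming NumPy is available):

import numpy as np

v_d1 = np.array([0, 0, 1, 2, 1, 1, 1], dtype=float)
v_hat = v_d1 / np.linalg.norm(v_d1)   # divide by the L2 norm, sqrt(8)
print(v_hat)                          # [0.  0.  0.3535...  0.7071...  0.3535... ...]
print(np.linalg.norm(v_hat))          # 1.0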

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log( |D| / (1 + |{d : t ∈ d}|) )

where |{d : t ∈ d}| is the count of documents 'd' in which 't' is present (i.e. tf(t, d) ≠ 0).

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log( |D| / (1 + |{d : ti ∈ d}|) ) = log(2/1) = 0.69314718

idf('football') = log(2 / (1 + 1)) = 0.0
idf('games')    = log(2 / (1 + 1)) = 0.0
idf('john')     = log(2 / (1 + 2)) = -0.40546510810816444
idf('likes')    = log(2 / (1 + 2)) = -0.40546510810816444
idf('mary')     = log(2 / (1 + 1)) = 0.0
idf('movies')   = log(2 / (1 + 1)) = 0.0
idf('watch')    = log(2 / (1 + 1)) = 0.0

idf(V) = (0.0, 0.0, -0.40546510810816444, -0.40546510810816444, 0.0, 0.0, 0.0)
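A sketch reproducing these idf values with the natural logarithm (plain Python, not from the deck):

import math

D = 2  # number of documents in the collection

def idf(docs_containing_term):
    return math.log(D / (1.0 + docs_containing_term))

print(idf(1))  # 'football', 'games', 'mary', 'movies', 'watch' -> 0.0
print(idf(2))  # 'john', 'likes' -> -0.4054651081081644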

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M_{|D|×F} × M_idf =

[ tf(t1,d1)  tf(t2,d1)  tf(t3,d1)  tf(t4,d1)  tf(t5,d1)  tf(t6,d1)  tf(t7,d1) ]
[ tf(t1,d2)  tf(t2,d2)  tf(t3,d2)  tf(t4,d2)  tf(t5,d2)  tf(t6,d2)  tf(t7,d2) ]

    ×

[ idf(t1)  0        0        0        0        0        0       ]
[ 0        idf(t2)  0        0        0        0        0       ]
[ 0        0        idf(t3)  0        0        0        0       ]
[ 0        0        0        idf(t4)  0        0        0       ]
[ 0        0        0        0        idf(t5)  0        0       ]
[ 0        0        0        0        0        idf(t6)  0       ]
[ 0        0        0        0        0        0        idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF weight

[ tf(t1,d1)×idf(t1)  tf(t2,d1)×idf(t2)  tf(t3,d1)×idf(t3)  tf(t4,d1)×idf(t4)  tf(t5,d1)×idf(t5)  tf(t6,d1)×idf(t6)  tf(t7,d1)×idf(t7) ]
[ tf(t1,d2)×idf(t1)  tf(t2,d2)×idf(t2)  tf(t3,d2)×idf(t3)  tf(t4,d2)×idf(t4)  tf(t5,d2)×idf(t5)  tf(t6,d2)×idf(t6)  tf(t7,d2)×idf(t7) ]

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

M_tf-idf = M_tf-idf / ‖M_tf-idf‖_2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n/max_n were replaced by ngram_range in newer scikit-learn versions
vectorizer = CountVectorizer(analyzer='word',
    min_n=1, stop_words='english')
docs = ('John likes to watch movies. Mary likes too.',
    'John also likes to watch football games.')
bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)
tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()
for w, f in vectorizer.vocabulary_.items():
    print '%r => %r' % (w, tfd.idf_[f])

7 http://scikit-learn.org

Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [' '.join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P(wi | w1, w2, ..., wi−1) ≈ P(wi | wi−2, wi−1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association:

I(wi, wj) = log2( P(wi, wj) / (P(wi) P(wj)) ) ≈ log2( N C(wi, wj) / (C(wi) C(wj)) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

I(strong, tea) = log2( P(strong, tea) / (P(strong) P(tea)) ) ≈ log2( N C(strong, tea) / (C(strong) C(tea)) )

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association:

t(wi, wj) = ( mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ) / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ ( C(wi, wj) − (1/N) C(wi) C(wj) ) / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj,
and N is the total number of words in the corpus.

t(strong, tea) = ( C(strong, tea) − (1/N) C(strong) C(tea) ) / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes, γ : D → C.
A supervised learning algorithm Γ takes the training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have:

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naive Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naive Bayes Model for the training set will be like:

|V|  C              P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
6    Terrorism      0.5    15   0.238095238  0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     Entertainment  0.5    12   0.05555556   0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs   kill  bomb  kidnap  music  movie  tv   C
Dt          2     1     2       0      0      1    ?

To find the posterior probability:

P(ci|W) = P(ci) × Π_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10⁻⁵

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10⁻⁷

The document is classified as "Terrorism" because that class got the highest score.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380² × 0.1904¹ × 0.3333² × 0.0476⁰ × 0.0952⁰ × 0.0952¹)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555² × 0.1111¹ × 0.1111² × 0.3333⁰ × 0.2777⁰ × 0.1111¹)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naive Bayes Classification is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf
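A sketch (not part of the original deck) that reproduces this worked example in log space, using the Laplace-smoothed conditional probabilities derived above and base-10 logarithms as on the slide:

import math

# P(w|Terrorism) and P(w|Entertainment) for kill, bomb, kidnap, music, movie, tv
p_terrorism     = [5/21.0, 4/21.0, 7/21.0, 1/21.0, 2/21.0, 2/21.0]
p_entertainment = [1/18.0, 2/18.0, 2/18.0, 6/18.0, 5/18.0, 2/18.0]
test_doc = [2, 1, 2, 0, 0, 1]   # word counts of the test document Dt
prior = 0.5

def log_score(cond_probs):
    return math.log10(prior) + sum(
        count * math.log10(p) for count, p in zip(test_doc, cond_probs))

print(log_score(p_terrorism))      # ~ -4.24
print(log_score(p_entertainment))  # ~ -6.63, so 'Terrorism' wins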

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to set up a Naive Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining


Page 39: Elements of Text Mining Part - I

TF-IDF

TF-IDF

andidf(t) = log |D|

1+|dtisind|where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0|D| is the cardinality of the document space

Jaganadh G Elements of Text Mining

TF

TF

tf(t d) =sumxisind

fr(x t)

fr(x t) is a simple function

fr(x t) =

1 if x = t0 otherwise

Exampletf(primejohnprime d1) = 1

Jaganadh G Elements of Text Mining

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes the training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

The main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event ($H$) can be predicted based on some evidence ($E$) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let $H$ be the event of raining and $E$ be the evidence of dark cloud; then we have
$P(\text{raining}|\text{dark cloud}) = \frac{P(\text{dark cloud}|\text{raining}) \times P(\text{raining})}{P(\text{dark cloud})}$

For multiple evidences:
$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:
$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
$C = (\text{terrorism}, \text{entertainment})$
$D = (D0, D1, D2, D3, D4, D5)$
$BoW = (\text{kill}, \text{bomb}, \text{kidnap}, \text{music}, \text{movie}, \text{tv})$ (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni P(kill|Ci)  P(bomb|Ci) P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6   T 0.5   15 0.238095238 0.19047619 0.33333333  0.047619048 0.095238095 0.095238095
    E 0.5   12 0.05555556  0.11111111 0.11111111  0.33333333  0.27777778  0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1) / 15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0) / 12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1) / (15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1) / (12 + |V|) = 1/18 = 0.0555
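
This estimation step is easy to reproduce in code. The sketch below is not from the original slides (the helper name train_multinomial_nb is mine); it computes the priors and the Laplace-smoothed conditionals directly from the toy training table.

from __future__ import division

# toy training table from the earlier slide: per-document term counts and labels
vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = [
    ([2, 1, 3, 0, 0, 1], 'Terrorism'),
    ([1, 1, 1, 0, 0, 0], 'Terrorism'),
    ([1, 1, 2, 0, 1, 0], 'Terrorism'),
    ([0, 1, 0, 2, 1, 1], 'Entertainment'),
    ([0, 0, 1, 1, 1, 0], 'Entertainment'),
    ([0, 0, 0, 2, 2, 2], 'Entertainment'),
]

def train_multinomial_nb(train, vocab):
    classes = set(label for _, label in train)
    priors, cond = {}, {}
    for c in classes:
        docs = [counts for counts, label in train if label == c]
        priors[c] = len(docs) / len(train)            # P(Ci)
        totals = [sum(col) for col in zip(*docs)]     # per-term counts in the class
        n_c = sum(totals)                             # ni
        # Laplace smoothing: P(w|c) = (count(w, c) + 1) / (ni + |V|)
        cond[c] = dict((w, (totals[j] + 1) / (n_c + len(vocab)))
                       for j, w in enumerate(vocab))
    return priors, cond

priors, cond = train_multinomial_nb(train, vocab)
print priors
print cond['Terrorism']['kill']   # 5/21 = 0.2380..., matching the slide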

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Doc kill bomb kidnap music movie tv C
Dt       2    1    2      0     0     1  ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

$P(\text{Terrorism}|W) = P(\text{Terrorism}) \times P(\text{kill}|\text{Terrorism}) \times P(\text{bomb}|\text{Terrorism}) \times P(\text{kidnap}|\text{Terrorism}) \times P(\text{music}|\text{Terrorism}) \times P(\text{movie}|\text{Terrorism}) \times P(\text{tv}|\text{Terrorism})$
$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$
$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}$

$P(\text{Entertainment}|W) = P(\text{Entertainment}) \times P(\text{kill}|\text{Entertainment}) \times P(\text{bomb}|\text{Entertainment}) \times P(\text{kidnap}|\text{Entertainment}) \times P(\text{music}|\text{Entertainment}) \times P(\text{movie}|\text{Entertainment}) \times P(\text{tv}|\text{Entertainment})$
$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$
$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}$

The document has been classified as "Terrorism" because it got the highest score.
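
A brief sketch that reproduces this arithmetic by plugging the smoothed probabilities from the model table and the test document's counts into the product formula (variable names are my own):

# smoothed conditional probabilities from the model table above
p_w = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(c):
    score = prior[c]
    for word, freq in test_doc.items():
        score *= p_w[c][word] ** freq   # each P(w|c) raised to the word's count
    return score

for c in ('Terrorism', 'Entertainment'):
    print "%-13s %.2e" % (c, posterior(c))   # roughly 5.7e-05 vs 2.3e-07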

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

$P(\text{Terrorism}|W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
$= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)$
$= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437$

$P(\text{Entertainment}|W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
$= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)$
$= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289$

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
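
The same numbers can be checked with a few lines of Python (base-10 logarithms, matching the values above):

import math

# sum of base-10 logs instead of a product of probabilities
log_terrorism = (math.log10(0.5) + 2 * math.log10(0.2380) + math.log10(0.1904)
                 + 2 * math.log10(0.3333) + math.log10(0.0952))
log_entertainment = (math.log10(0.5) + 2 * math.log10(0.0555) + math.log10(0.1111)
                     + 2 * math.log10(0.1111) + math.log10(0.1111))
print log_terrorism, log_entertainment   # about -4.24 and -6.63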

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
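
A small sketch of that switch from counts to presence/absence; in scikit-learn, BernoulliNB performs a similar binarization internally (via its binarize parameter), but here it is spelled out on the toy rows:

# term-frequency rows from the multinomial table (D0 .. D5)
freq_rows = [
    [2, 1, 3, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 2, 0, 1, 0],
    [0, 1, 0, 2, 1, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 2, 2, 2],
]
# Bernoulli representation: 1 if the word occurs at least once, 0 otherwise
bernoulli_rows = [[1 if freq > 0 else 0 for freq in row] for row in freq_rows]
for row in bernoulli_rows:
    print row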

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
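
A tiny numeric sketch of that decision rule; the weight vector and bias below are made up for illustration, not learned from data:

import numpy as np

w = np.array([0.4, -0.3, 0.8])   # hypothetical weight vector w in R^p
b = -0.5                         # hypothetical bias

def predict(x):
    # the sign of <w, x> + b decides which side of the hyperplane x falls on
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.2, 0.9]))   # +1: positive side
print predict(np.array([0.1, 0.9, 0.1]))   # -1: negative side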

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10. This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive             Negative
Predicted  Positive     True Positive (TP)   False Positive (FP)
           Negative     False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Accuracy} = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$\text{Precision} = \frac{TP}{TP + FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$\text{Recall} = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Positive Precision} = \frac{562}{562 + 77} = 0.87$

$\text{Negative Precision} = \frac{436}{225 + 436} = 0.65$

$\text{Positive Recall} = \frac{562}{562 + 225} = 0.71$

$\text{Negative Recall} = \frac{436}{77 + 436} = 0.84$
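
These ratios can be reproduced in a few lines; the sketch below simply recomputes them from the confusion-matrix counts (the values agree with the slides up to rounding):

from __future__ import division

tp, fp, fn, tn = 562, 77, 225, 436   # counts from the confusion matrix above

metrics = {
    'accuracy':      (tp + tn) / (tp + fp + fn + tn),
    'pos precision': tp / (tp + fp),
    'neg precision': tn / (tn + fn),
    'pos recall':    tp / (tp + fn),
    'neg recall':    tn / (tn + fp),
}
for name in sorted(metrics):
    print "%-14s %.4f" % (name, metrics[name])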

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$\text{Error Rate} = \frac{FP + FN}{TP + FP + FN + TN}$

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Error Rate} = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$\text{Fall-out} = \frac{FP}{FP + TN}$

                        Actual
                        Positive  Negative  Total
Predicted  Positive     562       77        639
           Negative     225       436       661
           Total        787       513       1300

$\text{Fall-out} = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

a. http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

$F_1 \text{ Score (Positive)} = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

$F_1 \text{ Score (Negative)} = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$\text{Positive predictive value} = \frac{TP}{TP + FP}$

$\text{Positive predictive value} = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$\text{Specificity} = \frac{TN}{FP + TN}$

$\text{Specificity} = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
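
A short sketch of how an ROC curve and its AUC are typically computed with scikit-learn; the labels and scores below are toy values made up for illustration:

from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores for the positive class
y_true   = [1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # points of the ROC curve
print "AUC = %.2f" % auc(fpr, tpr)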

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 41: Elements of Text Mining Part - I

Document Vector

To create a document vector space

V ~dn = (tf(t1 dn) tf(t2 dn) tf(tn dn))

To represent rsquod1rsquo and rsquod2rsquo as vectors

V ~d1 = (tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))

V ~d2 = (tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

which evaluates toV ~d1 = (0 0 1 2 1 1 1)

V ~d2 = (1 1 1 1 0 0 1)

Jaganadh G Elements of Text Mining

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

$\begin{bmatrix} tf(t_1, d_1) \times idf(t_1) & tf(t_2, d_1) \times idf(t_2) & \cdots & tf(t_7, d_1) \times idf(t_7) \\ tf(t_1, d_2) \times idf(t_1) & tf(t_2, d_2) \times idf(t_2) & \cdots & tf(t_7, d_2) \times idf(t_7) \end{bmatrix}$

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

$M_{tf\text{-}idf} = \frac{M_{tf\text{-}idf}}{\|M_{tf\text{-}idf}\|_2}$
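A numpy sketch (an assumption, not part of the original slides) of the whole TF-IDF step: multiply the count matrix by the diagonal idf matrix, then L2-normalize each row.

import numpy as np

tf = np.array([[0., 0., 1., 2., 1., 1., 1.],
               [1., 1., 1., 1., 0., 0., 1.]])
idf = np.array([0.0, 0.0, -0.40546511, -0.40546511, 0.0, 0.0, 0.0])

tf_idf = tf.dot(np.diag(idf))                 # M_{|D| x F} x M_idf
norms = np.sqrt((tf_idf ** 2).sum(axis=1))    # per-row L2 norms
norms[norms == 0] = 1.0                       # guard against all-zero rows
tf_idf_l2 = tf_idf / norms[:, np.newaxis]     # row-wise L2 normalization
print tf_idf_l2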

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_n is the old scikit-learn argument; newer versions use ngram_range=(1, 1)
vectorizer = CountVectorizer(analyzer='word', min_n=1, stop_words='english')

docs = ('John likes to watch movies Mary likes too',
    'John also likes to watch football games')

bow = vectorizer.fit_transform(docs)
freq_term_matrix = vectorizer.transform(docs)

tfidf = TfidfTransformer(norm="l2")
tfd = tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print tf_idf_matrix.todense()

for w, i in vectorizer.vocabulary_.items():   # term -> column index
    print '%r => %r' % (w, tfd.idf_[i])

7 http://scikit-learn.org
Jaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram could be any combination of letters. However, the items in question can be phonemes, syllables, letters, words or base pairs according to the application.a

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

a http://en.wikipedia.org/wiki/N-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
ibigrams = [words[x:x+2] for x in xrange(len(words)-2+1)]
bigrams = [" ".join(bigram) for bigram in ibigrams]
print bigrams

['john likes', 'likes to', 'to watch', 'watch movies',
'movies mary', 'mary likes', 'likes too']

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()

bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)

print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$I(strong, tea) = \log_2 \frac{P(strong\ tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N\,C(strong\ tea)}{C(strong)\,C(tea)}$

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i\,w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong\ tea) - \frac{1}{N}C(strong)\,C(tea)}{\sqrt{C(strong\ tea)}}$

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark clouds; then we have

$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows

$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in the class) / (number of all documents)
ni = the total word frequency in each class
nterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
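The smoothing step is easy to sketch as a one-line function; the numbers plugged in below are the ones used on this slide.

def laplace(word_count_in_class, total_words_in_class, vocab_size):
    # P(w|c) with add-one (Laplace) smoothing
    return (word_count_in_class + 1.0) / (total_words_in_class + vocab_size)

V = 6
print laplace(4, 15, V)    # P(kill|Terrorism)     = 5/21 ~ 0.2380
print laplace(0, 12, V)    # P(kill|Entertainment) = 1/18 ~ 0.0555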

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{|V|} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
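A sketch of the log-space scoring, reusing the rounded conditional probabilities from the model slide (so the scores come out close to the hand-computed values above):

import math

p_terrorism     = {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                   "music": 0.0476, "movie": 0.0952, "tv": 0.0952}
p_entertainment = {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                   "music": 0.3333, "movie": 0.2777, "tv": 0.1111}
prior = 0.5
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_score(prior, cond_probs, doc):
    # sum of log probabilities instead of a product, to prevent underflow
    score = math.log10(prior)
    for word, freq in doc.items():
        score += freq * math.log10(cond_probs[word])
    return score

print log_score(prior, p_terrorism, test_doc)      # ~ -4.24
print log_score(prior, p_entertainment, test_doc)  # ~ -6.63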

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption: that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
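For comparison, a minimal sketch (not part of the original slides) using scikit-learn's BernoulliNB, fed the binary table above (feature order: kill, bomb, kidnap, music, movie, tv):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]])
y = ["Terrorism", "Terrorism", "Terrorism",
     "Entertainment", "Entertainment", "Entertainment"]

clf = BernoulliNB(alpha=1.0)                 # add-one smoothing
clf.fit(X, y)
print clf.predict(np.array([[1, 1, 1, 0, 0, 1]]))   # the test document, binarized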

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, either $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive              Negative
Predicted  Positive    True Positive (TP)    False Positive (FP)
           Negative    False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

$Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP + FP}$

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP + FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
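The same numbers, computed directly in Python from the confusion matrix:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy           = (TP + TN) / (TP + FP + FN + TN)   # 0.76
positive_precision = TP / (TP + FP)                    # 0.87
positive_recall    = TP / (TP + FN)                    # 0.71
negative_precision = TN / (TN + FN)                    # 0.65
negative_recall    = TN / (TN + FP)                    # 0.84

print accuracy, positive_precision, positive_recall
print negative_precision, negative_recall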

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

$ErrorRate = \frac{FP + FN}{TP + FP + FN + TN}$

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

$ErrorRate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fallout = \frac{FP}{FP + TN}$

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

$Fallout = \frac{77}{77 + 436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP + FP}$

$Positive\ predictive\ value = \frac{562}{562 + 77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN + FN}$

$NPV = \frac{436}{436 + 225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP + TN}$

$Specificity = \frac{436}{77 + 436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP + TP}$

$FDR = \frac{77}{77 + 562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
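A minimal sketch with scikit-learn's roc_curve and auc; the labels and scores below are made-up values, used only to show the API:

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])                    # hypothetical actual labels
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.75, 0.2])   # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rates per threshold
print "AUC:", auc(fpr, tpr)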

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # newer NLTK versions use tree.label()
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 42: Elements of Text Mining Part - I

Vector Space Matrix

The document vectors can be represented as matrix

M|D|xF

where |D| is the cardinality of the document space

M|D|xF =

[0 0 1 2 1 1 11 1 1 1 0 0 1

]

Jaganadh G Elements of Text Mining

Vector Normalization

Normalized Vector

A normalized vector is represented as v = ~v~vp where v is is the unit vector or the

normalized vector the ~v is the vector going to be normalized and the ~vp is the norm(magnitude or length) of the vector ~v in the LP space (Lebesgue spaces) a

ahttpenwikipediaorgwikiLp space

Jaganadh G Elements of Text Mining

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a binary classifier, the FDR is the proportion of predicted positives that are actually negative.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
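
A small sketch, assuming the same confusion-matrix counts as before, that reproduces this value:

import math

TP, FP, FN, TN = 562, 77, 225, 436
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print mcc   # approximately 0.55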

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
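
A hedged sklearn sketch, not in the original slides: roc_curve from sklearn.metrics returns the false positive and true positive rates at each score threshold, and auc integrates them. The y_true labels and y_score values below are invented purely for illustration.

from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores, purely illustrative
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print 'AUC:', auc(fpr, tpr)   # 0.75 for these toy values

Plotting fpr on the X axis against tpr on the Y axis gives the ROC curve described above.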

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # POS tag each sentence and chunk it; binary=True labels chunks simply as 'NE'
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                # join the leaf tokens of the NE subtree into one entity string
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities
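
For the sample sentence this should print something like ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln']; the exact output depends on the NLTK version and the tagger and chunker models installed.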

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
# keep only the term string from each result tuple
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 44: Elements of Text Mining Part - I

Vector Normalization

Length of a vector is calculated using the Euclidean norm 6Non-normalized vector ~v = (v1 v2 v3 vn)Length of vector ~v =

radicv2

1 + v22 + v2

2 + + v2n

With norm ~v||p = (|v1|p + |v2|p + |v3|p + + |vn|p)1p

It can be simplified as

~v||p = (

nsumi=1

|~vi|p)1p

6httpmathworldwolframcomL2-NormhtmlJaganadh G Elements of Text Mining

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red squares and blue rectangles are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way. If there is a clear blank space on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.[10] If the data is not linearly separable we have to use kernel tricks.[11]

[10] This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

[11] http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

# predict() returns an array of labels, so take the first element
print training_data.target_names[classifier.predict(['This is a good one'])[0]]
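A possible follow-up (my own sketch, reusing the classifier pipeline and training_data loaded above): hold out part of the corpus and score the predictions, which connects to the evaluation metrics discussed later. In newer scikit-learn versions train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation.

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.metrics import accuracy_score, confusion_matrix

train_docs, test_docs, train_y, test_y = train_test_split(
    training_data.data, training_data.target, test_size=0.25)

classifier.fit(train_docs, train_y)
predicted = classifier.predict(test_docs)

print(accuracy_score(test_y, predicted))
print(confusion_matrix(test_y, predicted))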

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

# predict() returns an array of labels, so take the first element
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK [12]

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # every word is simply marked as present (multi-variate style features)
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

[12] This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                   Actual
                         Positive                Negative
Predicted   Positive     True Positive (TP)      False Positive (FP)
            Negative     False Negative (FN)     True Negative (TN)
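The metrics defined on the following slides can all be computed directly from these four counts. A small sketch (my own, not from the slides, using the worked example numbers TP=562, FP=77, FN=225, TN=436 that appear below):

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.76
precision = TP / (TP + FP)                          # 0.87 (positive predictive value)
recall = TP / (TP + FN)                             # 0.71 (sensitivity / hit rate)
error_rate = (FP + FN) / (TP + FP + FN + TN)        # 0.23
fall_out = FP / (FP + TN)                           # 0.15 (false positive rate)
f1 = 2 * precision * recall / (precision + recall)  # 0.78
npv = TN / (TN + FN)                                # 0.65
specificity = TN / (FP + TN)                        # 0.84
fdr = FP / (FP + TP)                                # 0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # 0.55

print("accuracy = %.2f, f1 = %.2f, mcc = %.2f" % (accuracy, f1, mcc))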

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that were wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive   Negative   Total
Predicted   Positive    562        77         639
            Negative    225        436        661
            Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. [a]

F1 Score = 2 × (precision × recall) / (precision + recall)

[a] http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure: the proportion of negative predictions that are true negatives.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In the classification setting, the FDR is the proportion of predicted positives that are actually negative.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
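As a sketch (hypothetical labels and scores, not from the slides), scikit-learn can compute the ROC points and the AUC directly:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # classifier scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # points of the ROC curve
print(auc(fpr, tpr))                                 # area under that curve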

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # binary=True labels chunks simply as NE instead of PERSON/ORGANIZATION/...
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python topia.termextract [13]

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]  # keep only the term strings, drop the counts
print terms

[13] http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 45: Elements of Text Mining Part - I

Vector Normalization

L2 Norm

The norm which we apply here is L2 Norm which is also called as Euclidean norm a

It is a common norm used to measure the length of a vector where p = 2

ahttpenwikipediaorgwikiNorm (mathematics)Euclidean norm

Jaganadh G Elements of Text Mining

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 46: Elements of Text Mining Part - I

Vector Normalization

v~d1 = (0 0 1 2 1 1 1)vd1 = ~v

~vp

vd1 = v~d1

v~d12vd1 = (0012111)radic

02+02+12+22+12+12+12

vd1 = (0012111)radic8

vd1 = ( 0radic8 0radic

8 1radic

8 2radic

8 1radic

8 1radic

8 1radic

8)

vd1 = (00 00 03535 07071 03535 03535 03535)Now our normalized vector vd1 has now a L2-norm vd12 = 10

Jaganadh G Elements of Text Mining

IDF

IDF

idf(t) = log |D|1+|dtisind|

where |d t isin d| is count(d) where rsquotrsquo is present and tf(t d) 6= 0

Jaganadh G Elements of Text Mining

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1

= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952 = 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1

= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111 = 2.27 × 10^-7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
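The same computation can be scripted. Below is a minimal sketch (our own helper, using the smoothed probabilities and priors from the toy model above) that scores the test document in log space, so the comparison stays numerically stable.

import math

# A minimal sketch of the log-space scoring used above.
# Probabilities are the Laplace-smoothed values from the toy model;
# the test document counts are kill=2, bomb=1, kidnap=2, music=0, movie=0, tv=1.
model = {
    'Terrorism':     {'prior': 0.5, 'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'prior': 0.5, 'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(class_name):
    params = model[class_name]
    score = math.log10(params['prior'])
    for word, count in test_doc.items():
        score += count * math.log10(params[word])
    return score

for c in model:
    print c, log_score(c)   # Terrorism ~ -4.24, Entertainment ~ -6.63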

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absence of the word.

Jaganadh G Elements of Text Mining
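For a library-backed version of this model, a Bernoulli Naïve Bayes can be fitted directly on the binary table above. The sketch below uses scikit-learn's BernoulliNB; the array layout mirrors the table, and the variable names are our own.

from sklearn.naive_bayes import BernoulliNB

# A minimal sketch: fit BernoulliNB on the binary document vectors above.
# Columns: kill, bomb, kidnap, music, movie, tv
X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']

clf = BernoulliNB()
clf.fit(X, y)
# Test document with kill, bomb, kidnap and tv present
print clf.predict([[1, 1, 1, 0, 0, 1]])   # ['Terrorism']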

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial Naïve Bayes model.

Jaganadh G Elements of Text Mining
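The toy count table can also be fed straight to scikit-learn's MultinomialNB, which implements this event model with add-one smoothing by default (alpha=1.0). The sketch below is our own arrangement of the counts from the training table.

from sklearn.naive_bayes import MultinomialNB

# A minimal sketch: multinomial NB on the raw count vectors from the training table.
# Columns: kill, bomb, kidnap, music, movie, tv
X = [[2, 1, 3, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 2, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 2, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 2, 2, 2]]   # D5 Entertainment
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']

clf = MultinomialNB(alpha=1.0)   # alpha=1.0 is add-one (Laplace) smoothing
clf.fit(X, y)

test_doc = [[2, 1, 2, 0, 0, 1]]  # the Dt test document from the slides
print clf.predict(test_doc)      # ['Terrorism']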

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (xi, yi), i = 1..n, where xi ∈ Rp (input feature vector) and yi ∈ {+1, -1} is the corresponding label, either (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ Rp (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
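A quick way to see w and b in practice is to fit a linear SVM on a tiny, linearly separable toy set. The sketch below uses scikit-learn's LinearSVC; the 2-D points are made up purely for illustration.

from sklearn.svm import LinearSVC

# A minimal sketch: fit a linear SVM on a tiny, linearly separable toy set.
# The 2-D points are made up for illustration only.
X = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],     # class -1
     [3.0, 3.0], [3.5, 4.0], [4.0, 3.5]]     # class +1
y = [-1, -1, -1, 1, 1, 1]

svm = LinearSVC()
svm.fit(X, y)

w = svm.coef_[0]        # weight vector w
b = svm.intercept_[0]   # bias b
print w, b
print svm.predict([[0.2, 0.4], [3.8, 3.2]])   # [-1  1]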

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the edges of the square and rectangle regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares vs blue rectangles). A line drawn along the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane marks the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted   Positive    True Positive (TP)    False Positive (FP)
            Negative    False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that were wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive   Negative   Total
Predicted   Positive   562        77         639
            Negative   225        436        661
            Total      787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure. It is the proportion of negative predictions that are true negatives.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

The false discovery rate (FDR) is the expected proportion of positive predictions that are false positives; FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55

Jaganadh G Elements of Text Mining
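All of the metrics above can be reproduced from the example confusion matrix in a few lines. The sketch below uses our own helper variables with TP = 562, FP = 77, FN = 225, TN = 436 from the slides.

import math

# A minimal sketch: the evaluation metrics above, computed from the example
# confusion matrix used in the slides (TP=562, FP=77, FN=225, TN=436).
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)          # positive predictive value
recall      = TP / (TP + FN)          # sensitivity / hit rate
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fall_out    = FP / (FP + TN)          # false positive rate
f1          = 2 * precision * recall / (precision + recall)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc         = (TP * TN - FP * FN) / math.sqrt(
                  (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc   # ~0.768 0.879 0.714 0.788 0.551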

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
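scikit-learn can compute both the ROC points and the AUC from true labels and predicted scores. The sketch below uses roc_curve and auc from sklearn.metrics; the labels and scores are made up purely for illustration.

from sklearn.metrics import roc_curve, auc

# A minimal sketch: ROC curve points and AUC from true labels and scores.
# y_true and y_score are made up for illustration only.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print fpr      # false positive rate at each threshold (X axis)
print tpr      # true positive rate at each threshold (Y axis)
print auc(fpr, tpr)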

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 48: Elements of Text Mining Part - I

Finding IDF

idf(ti) = log|D|

1 + |d ti isin d|= log

2

1= 069314718

idf(football) = log 21+1

= 00

idf(games) = log 21+1

= 00

idf(john) = log 21+2

= minus040546510810816444

idf(likes) = log 21+2

= minus040546510810816444

idf(mary) = log 21+1

= 00

idf(movies) = log 21+1

= 00

idf(watch) = log 21+1

= 00idf(V ) = (00-040546510810816444-040546510810816444000000)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents \langle d, c \rangle, where \langle d, c \rangle \in D \times C, using a learning algorithm we wish to learn a classifier (or classifier function) \gamma that maps documents to classes: \gamma : D \rightarrow C.
A supervised learning algorithm \Gamma takes the training set T and emits the learned classification function \gamma: \Gamma(T) = \gamma

\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidences (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = \frac{P(dark cloud | raining) \times P(raining)}{P(dark cloud)}

For multiple evidences:
P(H|E_1, E_2, ..., E_n) = \frac{P(E_1, E_2, ..., E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}

With the independence assumption we can rewrite Bayes' rule as follows:
P(H|E_1, E_2, ..., E_n) = \frac{P(E_1|H) \times P(E_2|H) \times ... \times P(E_n|H) \times P(H)}{P(E_1, E_2, ..., E_n)}

Jaganadh G Elements of Text Mining
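To make the rule concrete, here is a tiny numeric sketch; the probabilities are made up purely for illustration and are not from the slides:

# P(raining | dark cloud) via Bayes' rule, with illustrative numbers
p_rain = 0.2                  # prior P(H)
p_cloud_given_rain = 0.9      # likelihood P(E|H)
p_cloud = 0.4                 # evidence P(E)

posterior = p_cloud_given_rain * p_rain / p_cloud
print posterior               # 0.45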

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will be like:

|V| C P(Ci) ni P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6   T 0.5   15 0.238095238 0.19047619  0.333333333  0.047619048 0.095238095 0.095238095
6   E 0.5   12 0.055555556 0.111111111 0.111111111  0.333333333 0.277777778 0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2 + 1 + 1)/15 = 4/15
P(kill|Entertainment) = (0 + 0 + 0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2 + 1 + 1 + 1)/(15 + |V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0 + 0 + 0 + 1)/(12 + |V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) \times P(kill|Terrorism) \times P(bomb|Terrorism) \times P(kidnap|Terrorism) \times P(music|Terrorism) \times P(movie|Terrorism) \times P(tv|Terrorism)

= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1

= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}

P(Entertainment|W) = P(Entertainment) \times P(kill|Entertainment) \times P(bomb|Entertainment) \times P(kidnap|Entertainment) \times P(music|Entertainment) \times P(movie|Entertainment) \times P(tv|Entertainment)

= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1

= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents; in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
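The worked example above can be reproduced in a few lines of Python. This is a minimal sketch written directly from the formulas on these slides (multinomial model, Laplace smoothing, base-10 log posteriors); it should rank Terrorism above Entertainment for the test document:

from __future__ import division
import math

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
train = {
    'Terrorism':     [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0]],
    'Entertainment': [[0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]],
}
test = [2, 1, 2, 0, 0, 1]

n_docs = sum(len(docs) for docs in train.values())
for c, docs in train.items():
    prior = len(docs) / n_docs
    n_c = sum(sum(d) for d in docs)                     # total word count in class c
    # Laplace-smoothed P(w|c) for every vocabulary word
    likelihood = [(sum(d[j] for d in docs) + 1) / (n_c + len(vocab))
                  for j in range(len(vocab))]
    # log posterior: log P(c) + sum_j count_j * log P(w_j|c)
    log_post = math.log10(prior) + sum(test[j] * math.log10(likelihood[j])
                                       for j in range(len(vocab)))
    print c, log_post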

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t \in {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining
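Both event models are also available in scikit-learn. A minimal sketch (not from the slides) on the toy counts above: MultinomialNB consumes the raw counts, while BernoulliNB binarizes them into presence/absence, mirroring the two document representations shown earlier.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# rows = D0..D5 from the training table, columns = kill, bomb, kidnap, music, movie, tv
X = np.array([[2, 1, 3, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 0, 2, 2, 2]])
y = ['Terrorism', 'Terrorism', 'Terrorism',
     'Entertainment', 'Entertainment', 'Entertainment']
test = [[2, 1, 2, 0, 0, 1]]

print MultinomialNB(alpha=1.0).fit(X, y).predict(test)   # multinomial model on raw counts
print BernoulliNB(binarize=0.5).fit(X, y).predict(test)  # Bernoulli model on presence/absence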

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i \in R^p (input feature vector) and y_i \in {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = \langle w, x \rangle + b, where w \in R^p (weight vector) and b \in R (bias), such that

\langle w, x_i \rangle + b > 0 for y_i = +1
\langle w, x_i \rangle + b < 0 for y_i = -1

\langle w, x \rangle + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
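The linear decision rule itself is just the sign of \langle w, x \rangle + b. A minimal sketch with a made-up weight vector and bias (illustrative numbers only, not a trained model):

import numpy as np

w = np.array([0.4, -0.2])   # hypothetical weight vector
b = 0.1                     # hypothetical bias

def predict(x):
    # +1 if the point falls on the positive side of the hyperplane <w, x> + b = 0, else -1
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([1.0, 0.5]))    # lands on the positive side for these values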

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that they are linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
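A quick way to sanity-check either pipeline is to score it on the data it was fit on; this is optimistic (training accuracy), so a held-out split would be needed for an honest estimate. A minimal sketch, reusing the classifier and training_data objects from the slides above:

# mean accuracy on the training documents themselves -- a rough sanity check only
print classifier.score(training_data.data, training_data.target)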

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                     Actual
                     Positive             Negative
Predicted  Positive  True Positive (TP)   False Positive (FP)
           Negative  False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining
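Such a table can also be computed programmatically; a minimal sketch with scikit-learn's confusion_matrix helper (note its convention is rows = actual and columns = predicted, the transpose of the layout above; the label vectors here are made up for illustration):

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 1, 1, 0, 0]

print confusion_matrix(y_actual, y_predicted)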

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Accuracy = \frac{562 + 436}{562 + 77 + 225 + 436} = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = \frac{TP}{TP + FP}

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = \frac{TP}{TP + FN}

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Positive Precision = \frac{562}{562 + 77} = 0.87

Negative Precision = \frac{436}{225 + 436} = 0.65

Positive Recall = \frac{562}{562 + 225} = 0.71

Negative Recall = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = \frac{FP + FN}{TP + FP + FN + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Error Rate = \frac{77 + 225}{562 + 77 + 225 + 436} = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = \frac{FP}{FP + TN}

                     Actual
                     Positive  Negative  Total
Predicted  Positive  562       77        639
           Negative  225       436       661
           Total     787       513       1300

Fall-out = \frac{77}{77 + 436} = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 \times \frac{precision \times recall}{precision + recall}

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 \times \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78

F1 Score Negative = 2 \times \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = \frac{TP}{TP + FP}

Positive predictive value = \frac{562}{562 + 77} = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = \frac{TN}{TN + FN}

NPV = \frac{436}{436 + 225} = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = \frac{TN}{FP + TN}

Specificity = \frac{436}{77 + 436} = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = \frac{FP}{FP + TP}

FDR = \frac{77}{77 + 562} = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562 + 77)(562 + 225)(436 + 77)(436 + 225)}} = 0.55

Jaganadh G Elements of Text Mining
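All of the metrics above follow directly from the four confusion-matrix counts; a minimal sketch that recomputes them from TP = 562, FP = 77, FN = 225, TN = 436:

from __future__ import division
import math

TP, FP, FN, TN = 562, 77, 225, 436

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)                     # positive predictive value
recall      = TP / (TP + FN)                     # sensitivity / hit rate
f1          = 2 * precision * recall / (precision + recall)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
fallout     = FP / (FP + TN)                     # false positive rate
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, error_rate
print fallout, npv, specificity, fdr, mcc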

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
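scikit-learn can compute the ROC points and the AUC directly from true labels and classifier scores; a minimal sketch with made-up labels and scores (a real run would use held-out labels and, e.g., decision_function or predict_proba output):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                     # hypothetical true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print auc(fpr, tpr)    # area under the ROC curve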

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()

text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"

terms = extractor(text)
terms = [term[0] for term in terms]

print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 49: Elements of Text Mining Part - I

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 50: Elements of Text Mining Part - I

TF-IDF weight

Finding TF-IDF weight

M|D|timesF times Midf

[tf(t1 d1) tf(t2 d1) tf(t3 d1) tf(t4 d1) tf(t5 d1) tf(t6 d1) tf(t7 d1))tf(t1 d2) tf(t2 d2) tf(t3 d2) tf(t4 d2) tf(t5 d2) tf(t6 d2) tf(t7 d2))

]x

idf(t1) 0 0 0 0 0 00 idf(t2) 0 0 0 0 00 0 idf(t3) 0 0 0 00 0 0 idf(t4) 0 0 00 0 0 0 idf(t5) 0 00 0 0 0 0 idf(t6) 00 0 0 0 0 0 idf(t6)

Jaganadh G Elements of Text Mining

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure the strength of a word association:

I(w_i, w_j) = log2[ P(w_i, w_j) / (P(w_i) P(w_j)) ] ≈ log2[ N C(w_i, w_j) / (C(w_i) C(w_j)) ]

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

Example:
I(strong, tea) = log2[ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2[ N C(strong, tea) / (C(strong) C(tea)) ]

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]]))) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure the strength of a word association:

t(w_i, w_j) = [ mean(P(w_i, w_j)) − mean(P(w_i)) mean(P(w_j)) ] / √[ σ²(P(w_i, w_j)) + σ²(P(w_i)) σ²(P(w_j)) ]
            ≈ [ C(w_i, w_j) − (1/N) C(w_i) C(w_j) ] / √C(w_i, w_j)

where C(w_i) and C(w_j) are the respective frequencies of w_i and w_j in the corpus,
C(w_i, w_j) is the frequency of the bigram w_i w_j, and
N is the total number of words in the corpus.

Example:
t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / √C(strong, tea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) / \
            math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining
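As a quick usage check (our own addition, not on the original slide), both helpers can be run on the toy sentence from the bigram examples:

words = "John likes to watch movies Mary likes too".lower().split()
print mutual_info(words)   # bigram -> mutual information score
print tscore(words)        # bigram -> t-score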

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. 8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c_1, c_2, c_3, ..., c_m) be a set of pre-defined categories.
Let D = (d_1, d_2, d_3, ..., d_n) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(c_i, d_j) = 1 if d_j belongs to c_i, 0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule:

P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability: the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability: the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have

P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:

P(H | E_1, E_2, ..., E_n) = P(E_1, E_2, ..., E_n | H) × P(H) / P(E_1, E_2, ..., E_n)

With the independence assumption, we can rewrite Bayes' rule as follows:

P(H | E_1, E_2, ..., E_n) = P(E_1|H) × P(E_2|H) × ... × P(E_n|H) × P(H) / P(E_1, E_2, ..., E_n)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

The Naïve Bayes model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.238095238  0.19047619  0.33333333    0.047619048  0.095238095  0.095238095
     E  0.5    12  0.05555556   0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
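A minimal sketch of how these smoothed estimates could be computed for the toy corpus; the count table layout and the helper name cond_probs are our own illustration, not part of the original tutorial:

from __future__ import division

vocab = ["kill", "bomb", "kidnap", "music", "movie", "tv"]
train_counts = {
    "Terrorism":     [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0]],
    "Entertainment": [[0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]],
}

def cond_probs(doc_rows, vocab_size):
    # P(w|c) with add-one (Laplace) smoothing
    totals = [sum(col) for col in zip(*doc_rows)]  # per-word counts in the class
    n_c = sum(totals)                              # total word count in the class
    return [(t + 1) / (n_c + vocab_size) for t in totals]

for label, rows in train_counts.items():
    print label, [round(p, 4) for p in cond_probs(rows, len(vocab))]
# e.g. P(kill|Terrorism) = (4+1)/(15+6) = 0.2381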

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1   ?

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1..|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism)^2 × P(bomb|Terrorism)^1 × P(kidnap|Terrorism)^2 × P(music|Terrorism)^0 × P(movie|Terrorism)^0 × P(tv|Terrorism)^1
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment)^2 × P(bomb|Entertainment)^1 × P(kidnap|Entertainment)^2 × P(music|Entertainment)^0 × P(movie|Entertainment)^0 × P(tv|Entertainment)^1
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents, and in such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.511 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
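A small sketch of the same log-space scoring in Python; the dictionaries below just restate the smoothed estimates and test-document counts from the slides, and the helper name log_posterior is ours:

from __future__ import division
import math

priors = {"Terrorism": 0.5, "Entertainment": 0.5}
cond = {  # Laplace-smoothed P(w|c), rounded as on the slides
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_posterior(label):
    score = math.log10(priors[label])
    for word, count in test_doc.items():
        score += count * math.log10(cond[label][word])
    return score

scores = dict((label, log_posterior(label)) for label in priors)
print scores                       # Terrorism ~ -4.24, Entertainment ~ -6.63
print max(scores, key=scores.get)  # Terrorism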

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
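A tiny illustration (ours) of that switch from counts to presence/absence, reusing the count rows from the earlier table; in scikit-learn a comparable setup is, as far as we know, provided by BernoulliNB, which binarizes its input by default:

counts = [[2,1,3,0,0,1], [1,1,1,0,0,0], [1,1,2,0,1,0],   # Terrorism rows
          [0,1,0,2,1,1], [0,0,1,1,1,0], [0,0,0,2,2,2]]   # Entertainment rows
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
for row in binary:
    print row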

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model such as TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set (x_i, y_i), i = 1, ..., n, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, −1} is the corresponding label, either (y_i = +1) or (y_i = −1). To start with, we assume that our input feature vectors are linearly separable; that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
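Just to make the decision rule concrete, a toy sketch (ours, with a made-up weight vector and bias) of predicting the label as the sign of ⟨w, x⟩ + b:

def predict(w, b, x):
    # sign of the decision value <w, x> + b
    decision = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if decision > 0 else -1

w, b = [0.4, -0.2, 0.7], -0.1          # hypothetical values, for illustration only
print predict(w, b, [1.0, 0.0, 1.0])   # +1
print predict(w, b, [0.0, 2.0, 0.0])   # -1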

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.
11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining
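The snippet above only predicts one hand-written sentence. To get a rough quality estimate, one could hold out part of the corpus before fitting and use the pipeline's score method; a minimal sketch under that assumption (the naive slice below is ours, for illustration only):

train_docs, test_docs = training_data.data[200:], training_data.data[:200]
train_y, test_y = training_data.target[200:], training_data.target[:200]

_ = classifier.fit(train_docs, train_y)
print classifier.score(test_docs, test_y)  # mean accuracy on the held-out documents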

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]
negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)
sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining
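A short follow-up sketch (ours, following the same streamhacker recipe): evaluate on the remaining quarter of the corpus and inspect the most informative features:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()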

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                 Actual
                        Positive             Negative
Predicted  Positive     True Positive (TP)   False Positive (FP)
           Negative     False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                                 Actual
                        Positive   Negative   Total
Predicted  Positive     562        77         639
           Negative     225        436        661
           Total        787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78
F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √[ (TP + FP)(TP + FN)(TN + FP)(TN + FN) ]

MCC = (562 × 436 − 77 × 225) / √[ (562 + 77)(562 + 225)(436 + 77)(436 + 225) ] = 0.55

Jaganadh G Elements of Text Mining
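A compact sketch (our own) that recomputes the metrics above from the worked confusion matrix (TP = 562, FP = 77, FN = 225, TN = 436):

from __future__ import division
import math

tp, fp, fn, tn = 562, 77, 225, 436

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)            # positive predictive value
recall      = tp / (tp + fn)            # sensitivity / hit rate
error_rate  = (fp + fn) / (tp + fp + fn + tn)
fall_out    = fp / (fp + tn)            # false positive rate
npv         = tn / (tn + fn)
specificity = tn / (fp + tn)
fdr         = fp / (fp + tp)
f1          = 2 * precision * recall / (precision + recall)
mcc         = (tp * tn - fp * fn) / math.sqrt(
                  (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print accuracy, precision, recall, f1, mcc  # ~0.77, 0.88, 0.71, 0.79, 0.55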

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
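A hedged sketch of computing the ROC points and AUC with scikit-learn's metrics module; y_true and y_score below are made-up toy values, not from the slides:

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)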

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 51: Elements of Text Mining Part - I

TF-IDF weight

tf(t1 d1) times idf(t1) tf(t2 d1) times idf(t2) tf(t3 d1) times idf(t3)tf(t4 d1) times idf(t4 tf(t5 d1) times idf(t5 tf(t6 d1) times idf(t6 tf(t7 d1) times idf(t7)tf(t1 d2) times idf(t1) tf(t2 d2) times idf(t2) tf(t3 d2) times idf(t3)tf(t4 d2) times idf(t4 tf(t5 d2) times idf(t5 tf(t6 d2) times idf(t6 tf(t7 d2) times idf(t7)

Jaganadh G Elements of Text Mining

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 52: Elements of Text Mining Part - I

TF-IDF Normalization

L2 Normalization

Mtfminusidf =MtfminusidfMtfminusidf2

Jaganadh G Elements of Text Mining

TF-IDF

Practice with Python and sklearn 7

from sklearnfeature_extractiontext import

CountVectorizer TfidfTransformer

vectorizer = CountVectorizer(analyzer=rsquowordrsquo

min_n=1stop_words=rsquoenglishrsquo)

docs = (rsquoJohn likes to watch movies Mary

likes toorsquorsquoJohn also likes to watch football gamesrsquo)

bow = vectorizerfit_transform(docs)

freq_term_matrix =vectorizertransform(docs)

tfidf = TfidfTransformer(norm=l2)

tfd = tfidffit(freq_term_matrix)

print IDF tfidfidf_

tf_idf_matrix = tfidftransform(freq_term_matrix)

print tf_idf_matrixtodense()

for wf in zip(vectorizervocabulary_tfdidf_)

print rsquor =gt rrsquo (w tfdidf_[f])

7httpscikit-learnorgJaganadh G Elements of Text Mining

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [" ".join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies',
 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining


N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [" ".join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi, wj) = log2 [ P(wi, wj) / (P(wi) P(wj)) ] ≈ log2 [ N C(wi, wj) / (C(wi) C(wj)) ]

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

I(strong, tea) = log2 [ P(strong, tea) / (P(strong) P(tea)) ] ≈ log2 [ N C(strong, tea) / (C(strong) C(tea)) ]

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        w1, w2 = gram.split()
        # I(w1, w2) = log2( N * C(w1, w2) / (C(w1) * C(w2)) )
        minfo[gram] = math.log((len(words) * gramcount[gram]) /
            (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi, wj) = [ mean(P(wi, wj)) − mean(P(wi)) mean(P(wj)) ] / sqrt( σ²(P(wi, wj)) + σ²(P(wi)) σ²(P(wj)) )
          ≈ [ C(wi, wj) − (1/N) C(wi) C(wj) ] / sqrt( C(wi, wj) )

where C(wi) and C(wj) are the respective frequencies of wi and wj in the corpus,
C(wi, wj) is the frequency of the bigram wi wj, and
N is the total number of words in the corpus.

t(strong, tea) = [ C(strong, tea) − (1/N) C(strong) C(tea) ] / sqrt( C(strong, tea) )

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [wordcount.__setitem__(word, 1 + wordcount.get(word, 0)) for word in words]
    [gramcount.__setitem__(gram, 1 + gramcount.get(gram, 0)) for gram in grams]
    for gram in grams:
        w1, w2 = gram.split()
        # t ~ ( C(w1, w2) - (1/N) * C(w1) * C(w2) ) / sqrt( C(w1, w2) )
        tsc[gram] = (gramcount[gram] - (1 / len(words)) *
            wordcount[w1] * wordcount[w2]) / math.sqrt(gramcount[gram])
    return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier (or classifier function) γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits the learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining | dark cloud) = P(dark cloud | raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H | E1, E2, ..., En) = P(E1, E2, ..., En | H) × P(H) / P(E1, E2, ..., En)

With the independence assumption we can rewrite Bayes' rule as follows:
P(H | E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)    P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6   T  0.5    15  0.238095238   0.19047619   0.33333333    0.047619048   0.095238095   0.095238095
     E  0.5    12  0.05555556    0.11111111   0.11111111    0.33333333    0.27777778    0.11111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency of each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
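A minimal Python sketch of the same Laplace estimate; the per-class word totals below are the ones implied by the model table above:

from __future__ import division

# per-class word totals implied by the model table above
counts = {
    "Terrorism":     {"kill": 4, "bomb": 3, "kidnap": 6, "music": 0, "movie": 1, "tv": 1},
    "Entertainment": {"kill": 0, "bomb": 1, "kidnap": 1, "music": 5, "movie": 4, "tv": 1},
}
V = 6  # vocabulary size

def laplace_prob(word, cls):
    n_c = sum(counts[cls].values())          # total word frequency in the class
    return (counts[cls][word] + 1) / (n_c + V)

print laplace_prob("kill", "Terrorism")      # 5/21 = 0.2380...
print laplace_prob("kill", "Entertainment")  # 1/18 = 0.0555...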

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1

To find the posterior probability:

P(ci | W) = P(ci) × ∏ (j = 1 to |V|) P(wj | ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^−5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^−7

The document is classified as "Terrorism" because it got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= −0.3010 − 1.2468 − 0.7203 − 0.9543 + 0 + 0 − 1.0213
= −4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= −0.3010 − 2.5110 − 0.9542 − 1.9085 + 0 + 0 − 0.9542
= −6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
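A small sketch of the same log-space scoring in Python (the probabilities are the Laplace-smoothed values from the worked example above):

from __future__ import division
import math

cond_prob = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}
prior = {"Terrorism": 0.5, "Entertainment": 0.5}
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_score(cls):
    # sum of log probabilities instead of a product of probabilities
    score = math.log10(prior[cls])
    for word, freq in test_doc.items():
        score += freq * math.log10(cond_prob[cls][word])
    return score

for cls in prior:
    print cls, log_score(cls)   # Terrorism ~ -4.24, Entertainment ~ -6.63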

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
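In scikit-learn the same presence/absence representation can be produced by passing binary=True to CountVectorizer; a minimal sketch (the toy documents are made up just to show the binary features):

from sklearn.feature_extraction.text import CountVectorizer

docs = ("kill bomb kidnap kidnap kidnap tv",   # toy terrorism-like doc
        "music music movie tv")                # toy entertainment-like doc

# binary=True clips every non-zero term count to 1 (presence/absence)
vectorizer = CountVectorizer(analyzer='word', binary=True)
X = vectorizer.fit_transform(docs)

print vectorizer.get_feature_names()
print X.todense()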

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
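A tiny sketch of how a learned weight vector and bias are used at prediction time; the numbers here are hypothetical, just to show the sign rule:

import numpy as np

# hypothetical weight vector and bias for a 2-dimensional toy problem
w = np.array([0.5, -1.0])
b = 0.25

def predict(x):
    # the sign of the decision function <w, x> + b gives the class label
    return 1 if np.dot(w, x) + b > 0 else -1

print predict(np.array([2.0, 0.5]))   # +1
print predict(np.array([0.0, 2.0]))   # -1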

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal length from the margins of the scatter regions of squares and rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual Positive        Actual Negative
Predicted Positive    True Positive (TP)     False Positive (FP)
Predicted Negative    False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76
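A short sketch that reproduces this number (and the precision, recall, F1 and error-rate figures used on the following slides) directly from the confusion-matrix cells:

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
error     = (FP + FN) / (TP + FP + FN + TN)

print accuracy, precision, recall, f1, error
# 0.7676..., 0.8794..., 0.7141..., 0.7882..., 0.2323...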

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                      Actual Positive   Actual Negative   Total
Predicted Positive    562               77                639
Predicted Negative    225               436               661
Total                 787               513               1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
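When the gold labels and predictions are available as arrays, scikit-learn's metrics module can compute several of these measures directly; a minimal sketch with toy labels (not the movie-review data):

from sklearn.metrics import confusion_matrix, precision_score, \
    recall_score, f1_score, matthews_corrcoef

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # toy gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predictions

print confusion_matrix(y_true, y_pred)
print precision_score(y_true, y_pred)
print recall_score(y_true, y_pred)
print f1_score(y_true, y_pred)
print matthews_corrcoef(y_true, y_pred)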

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
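A minimal sketch of computing the ROC points and the AUC with scikit-learn, using toy scores (any real-valued classifier scores or probabilities would do):

from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # toy gold labels
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7, 0.6, 0.2]   # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print fpr                 # false positive rates (X axis)
print tpr                 # true positive rates (Y axis)
print auc(fpr, tpr)       # area under the ROC curve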

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child \
of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the \
second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Page 54: Elements of Text Mining Part - I

N-Grams

N-Gram

In the fields of computational linguistics and probability an n-gram is a contiguoussequence of n items from a given sequence of text or speech An n-gram could be anycombination of letters However the items in question can be phonemes syllablesletters words or base pairs according to the applicationa

Unigrams are single words

Bigrams are sequences of two words

Trigrams are sequences of three words

ahttpenwikipediaorgwikiN-gram

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$

Practice with Python

d1 = "John likes to watch movies Mary likes too"
words = d1.lower().split()
itrigrams = [words[x:x+3] for x in xrange(len(words)-3+1)]
trigrams = [' '.join(trigram) for trigram in itrigrams]
print trigrams

['john likes to', 'likes to watch', 'to watch movies', 'watch movies mary', 'movies mary likes', 'mary likes too']

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(words, n=2):
    grams = [' '.join(words[x:x+n]) for x in xrange(len(words)-n+1)]
    return grams

words = "John likes to watch movies Mary likes too".lower().split()
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
print bigrams
print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \approx \log_2 \frac{N\,C(w_i, w_j)}{C(w_i)\,C(w_j)}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus

$I(strong, tea) = \log_2 \frac{P(strong, tea)}{P(strong)\,P(tea)} \approx \log_2 \frac{N\,C(strong, tea)}{C(strong)\,C(tea)}$

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2) # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    [ wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words ]
    [ gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams ]
    for gram in grams:
        minfo[gram] = (math.log(len(words) * gramcount[gram] /
            (wordcount[gram.split()[0]] * wordcount[gram.split()[1]])) /
            math.log(2))
    return minfo

Jaganadh G Elements of Text Mining
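A quick usage sketch for the function above (a sketch only; it assumes the ngrams and mutual_info definitions from the preceding slides are already in scope, and the toy sentence gives purely illustrative scores):

words = "John likes to watch movies Mary likes too".lower().split()
mi = mutual_info(words)
for gram in sorted(mi, key=mi.get, reverse=True):
    print gram, round(mi[gram], 3)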

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{\mathrm{mean}(P(w_i, w_j)) - \mathrm{mean}(P(w_i))\,\mathrm{mean}(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining

t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2) # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    [ wordcount.__setitem__(word, 1 +
        wordcount.get(word, 0)) for word in words ]
    [ gramcount.__setitem__(gram, 1 +
        gramcount.get(gram, 0)) for gram in grams ]
    for gram in grams:
        tsc[gram] = ((gramcount[gram] - (1 / len(words)) *
            wordcount[gram.split()[0]] * wordcount[gram.split()[1]]) /
            math.sqrt(gramcount[gram]))
    return tsc

Jaganadh G Elements of Text Mining
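And a matching usage sketch (again assuming the ngrams and tscore functions from the slides above are defined; the output values are only illustrative for the toy sentence):

words = "John likes to watch movies Mary likes too".lower().split()
ts = tscore(words)
for gram in sorted(ts, key=ts.get, reverse=True):
    print gram, round(ts[gram], 3)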

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier or a classifier function $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event ($H$) can be predicted based on some evidence ($E$) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let $H$ be the event of raining and $E$ be the evidence of dark cloud, then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining
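To make the rule concrete, a tiny numeric sketch of the rain/dark-cloud example; all three input probabilities here are made-up values, used only to show the arithmetic:

# Hypothetical numbers, not from the slides
p_raining = 0.2                 # P(H): prior probability of rain
p_cloud_given_raining = 0.9     # P(E|H)
p_cloud = 0.3                   # P(E)

# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
p_raining_given_cloud = p_cloud_given_raining * p_raining / p_cloud
print p_raining_given_cloud     # 0.6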

Naıve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naïve Bayes model for the training set will look like

|V|  C  P(Ci)  ni   P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15   0.238095238  0.19047619   0.333333333   0.047619048  0.095238095  0.095238095
     E  0.5    12   0.055555556  0.111111111  0.111111111   0.333333333  0.277777778  0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
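The same add-one estimates as a minimal Python sketch; the counts and totals are the ones worked out on the slides above (the helper name laplace is mine):

from __future__ import division

V = 6                    # vocabulary size
n_terrorism = 15         # total word count in the Terrorism class
n_entertainment = 12     # total word count in the Entertainment class

def laplace(count, n):
    # Add-one smoothed estimate P(w|C) = (count + 1) / (n + |V|)
    return (count + 1) / (n + V)

print laplace(4, n_terrorism)       # P(kill|Terrorism)     = 5/21 = 0.2380...
print laplace(0, n_entertainment)   # P(kill|Entertainment) = 1/18 = 0.0555...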

Testing the NB model

Our test document is

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1     2       0      0     1

To find the posterior probability

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

$P(Terrorism \mid W) = P(Terrorism) \times P(kill \mid Terrorism) \times P(bomb \mid Terrorism) \times P(kidnap \mid Terrorism) \times P(music \mid Terrorism) \times P(movie \mid Terrorism) \times P(tv \mid Terrorism)$

$= 0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1$

$= 0.5 \times 0.0566 \times 0.1904 \times 0.1110 \times 1 \times 1 \times 0.0952 = 5.7 \times 10^{-5}$

$P(Entertainment \mid W) = P(Entertainment) \times P(kill \mid Entertainment) \times P(bomb \mid Entertainment) \times P(kidnap \mid Entertainment) \times P(music \mid Entertainment) \times P(movie \mid Entertainment) \times P(tv \mid Entertainment)$

$= 0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1$

$= 0.5 \times 0.0030 \times 0.1111 \times 0.0123 \times 1 \times 1 \times 0.1111 = 2.27 \times 10^{-7}$

The document has been classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called Underflow. To resolve the problem we can take the logarithm of the probabilities, like:

$P(Terrorism \mid W) = \log(0.5 \times 0.2380^2 \times 0.1904^1 \times 0.3333^2 \times 0.0476^0 \times 0.0952^0 \times 0.0952^1)$
$= \log(0.5) + 2\log(0.2380) + 1\log(0.1904) + 2\log(0.3333) + 0\log(0.0476) + 0\log(0.0952) + 1\log(0.0952)$
$= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213 = -4.2437$

$P(Entertainment \mid W) = \log(0.5 \times 0.0555^2 \times 0.1111^1 \times 0.1111^2 \times 0.3333^0 \times 0.2777^0 \times 0.1111^1)$
$= \log(0.5) + 2\log(0.0555) + 1\log(0.1111) + 2\log(0.1111) + 0\log(0.3333) + 0\log(0.2777) + 1\log(0.1111)$
$= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542 = -6.6289$

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
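A minimal log-space sketch of the same comparison (base-10 logarithms, matching the slide's numbers; the probability values are hard-coded from the worked example above):

import math

def log_score(prior, cond_probs, counts):
    # sum of log10(prior) and count-weighted log10 conditional probabilities
    return math.log10(prior) + sum(c * math.log10(p)
        for p, c in zip(cond_probs, counts))

counts = [2, 1, 2, 0, 0, 1]   # kill, bomb, kidnap, music, movie, tv in the test doc

terrorism = log_score(0.5, [0.2380, 0.1904, 0.3333, 0.0476, 0.0952, 0.0952], counts)
entertainment = log_score(0.5, [0.0555, 0.1111, 0.1111, 0.3333, 0.2777, 0.1111], counts)

print terrorism, entertainment   # roughly -4.24 and -6.63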

Naıve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$, and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will look like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
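A small sketch of that conversion from count vectors to binary presence vectors; the nested-list layout is mine, with one row per training document in the order of the table above:

count_vectors = [
    [2, 1, 3, 0, 0, 1],   # D0
    [1, 1, 1, 0, 0, 0],   # D1
    [1, 1, 2, 0, 1, 0],   # D2
    [0, 1, 0, 2, 1, 1],   # D3
    [0, 0, 1, 1, 1, 0],   # D4
    [0, 0, 0, 2, 2, 2],   # D5
]

# Bernoulli representation: 1 if the word occurs at least once, else 0
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]

for row in binary_vectors:
    print row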

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
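A minimal sketch of the linear decision rule described above; the weight vector and bias here are hand-picked (hypothetical), not learned from data:

def predict(w, b, x):
    # classify x by the sign of the decision function f(x) = <w, x> + b
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f > 0 else -1

w = [0.4, -0.7]   # hypothetical weight vector
b = 0.1           # hypothetical bias

print predict(w, b, [2.0, 0.5])   # 1  (positive side of the hyperplane)
print predict(w, b, [0.5, 2.0])   # -1 (negative side of the hyperplane)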

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the scatter regions of the squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and the bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/" # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/" # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining
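A natural follow-up (and the reason nltk.classify.util is imported above) is to hold out the remaining quarter of the features and measure accuracy; a short sketch assuming the variables from the code above are in scope:

testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)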

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like

                      Actual
                      Positive               Negative
Predicted  Positive   True Positive (TP)     False Positive (FP)
           Negative   False Negative (FN)    True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Accuracy = (562+436)/(562+77+225+436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP+FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP+FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Positive Precision = 562/(562+77) = 0.87

Negative Precision = 436/(225+436) = 0.65

Positive Recall = 562/(562+225) = 0.71

Negative Recall = 436/(77+436) = 0.84

Jaganadh G Elements of Text Mining
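A short Python sketch that recomputes these numbers from the confusion matrix above (the slides truncate to two decimal places, so the last digit can differ slightly from a rounded value):

from __future__ import division

TP, FP, FN, TN = 562, 77, 225, 436

accuracy = (TP + TN) / (TP + FP + FN + TN)       # 0.7677
positive_precision = TP / (TP + FP)              # 0.8795
negative_precision = TN / (TN + FN)              # 0.6596
positive_recall = TP / (TP + FN)                 # 0.7141
negative_recall = TN / (TN + FP)                 # 0.8499

print accuracy, positive_precision, negative_precision
print positive_recall, negative_recall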

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

$Error\ Rate = \frac{FP+FN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Error Rate = (77+225)/(562+77+225+436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Fall-out = 77/(77+436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

$F_1\ Score = 2 \times \frac{precision \times recall}{precision + recall}$

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71)/(0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84)/(0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP+FP}$

Positive predictive value = 562/(562+77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN+FN}$

NPV = 436/(436+225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP+TN}$

Specificity = 436/(77+436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

$FDR = \frac{FP}{FP+TP}$

FDR = 77/(77+562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

MCC = (562 × 436 − 77 × 225) / √((562+77)(562+225)(436+77)(436+225)) = 0.55

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
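For completeness, a hedged sketch of how an ROC curve and AUC are typically computed with scikit-learn; the labels and scores below are made up, and in a real pipeline the scores would come from a classifier's decision_function or predict_proba:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                      # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # made-up classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print auc(fpr, tpr)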

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12 1809 the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12 1809 the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 55: Elements of Text Mining Part - I

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 56: Elements of Text Mining Part - I

Bigrams

Bigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

ibigrams = [words[xx+2] for x in xrange(len(words)-2+1)]

bigrams = [ join(bigram) for bigram in ibigrams]

print bigrams

[rsquojohn likesrsquo rsquolikes torsquo rsquoto watchrsquo rsquowatch moviesrsquo

rsquomovies maryrsquo rsquomary likesrsquo rsquolikes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining


Mutual Information

from __future__ import division
import math

def mutual_info(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    minfo = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        minfo[gram] = math.log(len(words) * gramcount[gram] /
                               (wordcount[w1] * wordcount[w2])) / math.log(2)
    return minfo
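For example, running it on the toy document from the n-gram slides (a small usage sketch; it assumes the ngrams and mutual_info functions above are already defined):

words = "John likes to watch movies Mary likes too".lower().split()
scores = mutual_info(words)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
# bigrams built only from words that occur once (e.g. 'to watch') score higher
# than bigrams containing the repeated word 'likes'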

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

$t(w_i, w_j) = \frac{mean(P(w_i, w_j)) - mean(P(w_i))\,mean(P(w_j))}{\sqrt{\sigma^2(P(w_i, w_j)) + \sigma^2(P(w_i))\,\sigma^2(P(w_j))}} \approx \frac{C(w_i, w_j) - \frac{1}{N}\,C(w_i)\,C(w_j)}{\sqrt{C(w_i, w_j)}}$

where $C(w_i)$ and $C(w_j)$ are the respective frequencies of $w_i$ and $w_j$ in the corpus, $C(w_i, w_j)$ is the frequency of the bigram $w_i w_j$, and $N$ is the total number of words in the corpus.

$t(strong, tea) = \frac{C(strong, tea) - \frac{1}{N}\,C(strong)\,C(tea)}{\sqrt{C(strong, tea)}}$

Jaganadh G Elements of Text Mining


t-score

from __future__ import division
import math

def tscore(words):
    grams = ngrams(words, n=2)  # ngrams function from prev slide
    wordcount = {}
    gramcount = {}
    tsc = {}
    for word in words:
        wordcount[word] = 1 + wordcount.get(word, 0)
    for gram in grams:
        gramcount[gram] = 1 + gramcount.get(gram, 0)
    for gram in grams:
        w1, w2 = gram.split()
        tsc[gram] = (gramcount[gram] -
                     (1 / len(words)) * wordcount[w1] * wordcount[w2]) / \
                    math.sqrt(gramcount[gram])
    return tsc
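And a quick check on the same toy document (again assuming ngrams and tscore are already in scope):

print(tscore("John likes to watch movies Mary likes too".lower().split()))
# every bigram occurs only once in this tiny corpus, so all t-scores stay below 1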

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.8

Document classification tasks can be divided into three kinds:

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier (or classifier function) $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes the training set $T$ and emits the learned classification function $\gamma$: $\Gamma(T) = \gamma$.

$\gamma(c_i, d_j) = 1$ if $d_j$ belongs to $c_i$, $0$ otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem, or Bayes' rule.

Bayes' rule: $P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event ($H$) can be predicted based on some evidence ($E$) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H \mid E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
$P(raining \mid dark\ cloud) = \frac{P(dark\ cloud \mid raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption, we can rewrite Bayes' rule as follows:
$P(H \mid E_1, E_2, \ldots, E_n) = \frac{P(E_1 \mid H) \times P(E_2 \mid H) \times \cdots \times P(E_n \mid H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)   P(bomb|Ci)   P(kidnap|Ci)  P(music|Ci)   P(movie|Ci)   P(tv|Ci)
 6   T  0.5    15  0.238095238  0.190476190  0.333333333   0.047619048   0.095238095   0.095238095
     E  0.5    12  0.055555556  0.111111111  0.111111111   0.333333333   0.277777778   0.111111111

|V| = the number of vocabulary terms = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of a keyword occurring, given a class

Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
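A minimal sketch of the smoothed estimate, plugging in the numbers used above (4 occurrences of "kill" in the Terrorism class, 0 in Entertainment, class totals n_T = 15 and n_E = 12, |V| = 6 as stated in the model slide):

def laplace(count_w_c, n_c, vocab_size):
    # (count of word in class + 1) / (total words in class + |V|)
    return (count_w_c + 1.0) / (n_c + vocab_size)

print(laplace(4, 15, 6))   # P(kill|Terrorism)     = 5/21 ~ 0.2380
print(laplace(0, 12, 6))   # P(kill|Entertainment) = 1/18 ~ 0.0555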

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt          2     1     2       0      0     1   ?

To find the posterior probability:

$P(c_i \mid W) = P(c_i) \times \prod_{j=1}^{V} P(w_j \mid c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.
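The same computation can be scripted in a few lines (a sketch: the conditional probabilities are the Laplace-smoothed counts from the model table written as fractions, and the exponents come from the word counts of the test document):

probs_t = {'kill': 5/21.0, 'bomb': 4/21.0, 'kidnap': 7/21.0,
           'music': 1/21.0, 'movie': 2/21.0, 'tv': 2/21.0}
probs_e = {'kill': 1/18.0, 'bomb': 2/18.0, 'kidnap': 2/18.0,
           'music': 6/18.0, 'movie': 5/18.0, 'tv': 2/18.0}
test_doc = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def posterior(prior, cond_probs, doc):
    score = prior
    for word, count in doc.items():
        score *= cond_probs[word] ** count
    return score

print(posterior(0.5, probs_t, test_doc))   # ~5.7e-05 -> Terrorism
print(posterior(0.5, probs_e, test_doc))   # ~2.4e-07 (the slide, rounding intermediate values, gets 2.27e-07)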

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system still classifies the test document as "Terrorism". From the final probability scores you can observe that the values are scaled nicely.
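The same trick in code is a small change to the earlier sketch: add logarithms instead of multiplying raw probabilities (this reuses the probs_t, probs_e and test_doc dictionaries from the previous snippet; math.log10 matches the base-10 logs used above):

import math

def log_posterior(prior, cond_probs, doc):
    score = math.log10(prior)
    for word, count in doc.items():
        score += count * math.log10(cond_probs[word])
    return score

print(log_posterior(0.5, probs_t, test_doc))   # ~ -4.24
print(log_posterior(0.5, probs_e, test_doc))   # ~ -6.63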

The section on Naïve Bayes Classification is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary $V$, each dimension of the space $t$, $t \in \{1, \ldots, |V|\}$, corresponds to word $w_t$ from the vocabulary. Dimension $t$ of the vector for document $d_i$ is written $B_{it}$ and is either 0 or 1, indicating whether word $w_t$ occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in the document.a

a Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
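For instance, the count vectors from the multinomial table can be binarized in one line (a tiny sketch; the two rows shown are D0 and D1 from the training table):

counts = [[2, 1, 3, 0, 0, 1],   # D0
          [1, 1, 1, 0, 0, 0]]   # D1
binary = [[1 if c > 0 else 0 for c in row] for row in counts]
print(binary)   # [[1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 0]]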

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
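These quantities can be inspected directly with scikit-learn's linear SVM (a sketch on made-up 2-D points; coef_, intercept_ and support_vectors_ are standard sklearn attributes):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [6.0, 6.5], [7.0, 6.0], [6.5, 7.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b of the decision hyperplane <w, x> + b = 0
print(clf.support_vectors_)        # the training points that sit closest to the margin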

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of squares and rectangles, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (squares and rectangles), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                Actual
                      Positive              Negative
Predicted  Positive   True Positive (TP)    False Positive (FP)
           Negative   False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual
             Positive  Negative  Total
Predicted
  Positive      562        77      639
  Negative      225       436      661
  Total         787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Using the same confusion matrix as above:

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
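The same numbers can be verified with a few lines of Python (the four cells of the confusion matrix above; the slides truncate the results to two decimals):

tp, fp, fn, tn = 562.0, 77.0, 225.0, 436.0

accuracy = (tp + tn) / (tp + fp + fn + tn)     # ~0.768
pos_precision = tp / (tp + fp)                 # ~0.880
neg_precision = tn / (tn + fn)                 # ~0.660
pos_recall = tp / (tp + fn)                    # ~0.714
neg_recall = tn / (tn + fp)                    # ~0.850
print(accuracy, pos_precision, neg_precision, pos_recall, neg_recall)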

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

ErrorRate = (FP + FN) / (TP + FP + FN + TN)

Using the same confusion matrix as above:

ErrorRate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

Using the same confusion matrix as above:

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
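The remaining scores follow from the same four counts (a continuation of the earlier snippet; it reuses tp, fp, fn, tn and the precision/recall variables defined there):

error_rate = (fp + fn) / (tp + fp + fn + tn)                              # ~0.232
fall_out = fp / (fp + tn)                                                 # ~0.150
f1_pos = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)    # ~0.788
npv = tn / (tn + fn)                                                      # ~0.660
specificity = tn / (fp + tn)                                              # ~0.850
fdr = fp / (fp + tp)                                                      # ~0.120
mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5   # ~0.551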

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
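A minimal sketch with scikit-learn's metrics module (roc_curve and auc are standard sklearn functions; the labels and scores below are made up for illustration):

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # area under the ROC curve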

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining



Page 58: Elements of Text Mining Part - I

Trigrams

Trigrams

P (wi|w1 w2 wiminus1) asymp P (wi wiminus2 wiminus1)

Practice with Python

d1 = John likes to watch movies Mary likes too

words = d1lower()split()

itrigrams = [words[xx+3] for x in xrange(len(words)-3+1)]

trigrams = [ join(trigram) for trigram in itrigrams]

print trigrams

[rsquojohn likes torsquo rsquolikes to watchrsquo rsquoto watch moviesrsquo

rsquowatch movies maryrsquo rsquomovies mary likesrsquo rsquomary likes toorsquo]

Jaganadh G Elements of Text Mining

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let $C = (c_1, c_2, c_3, \ldots, c_m)$ be a set of pre-defined categories.
Let $D = (d_1, d_2, d_3, \ldots, d_n)$ be a set of documents to be classified.
Given a training set $T$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in D \times C$, using a learning algorithm we wish to learn a classifier or a classifier function $\gamma$ that maps documents to classes: $\gamma : D \rightarrow C$.
A supervised learning algorithm $\Gamma$ takes a training set $T$ and emits a learned classification function $\gamma$: $\Gamma(T) = \gamma$

$\gamma(c_i, d_j) = \begin{cases} 1 & \text{if } d_j \text{ belongs to } c_i \\ 0 & \text{otherwise} \end{cases}$

Main approaches in document classification are

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9. Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.

Bayes' rule: $P(H|E) = \frac{P(E|H) \times P(H)}{P(E)}$

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
$P(H)$ is called the prior probability. This is the probability of an event before the evidence is observed.
$P(H|E)$ is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have

$P(raining|dark\ cloud) = \frac{P(dark\ cloud|raining) \times P(raining)}{P(dark\ cloud)}$

For multiple evidences:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1, E_2, \ldots, E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

With the independence assumption we can rewrite Bayes' rule as follows:

$P(H|E_1, E_2, \ldots, E_n) = \frac{P(E_1|H) \times P(E_2|H) \times \cdots \times P(E_n|H) \times P(H)}{P(E_1, E_2, \ldots, E_n)}$

Jaganadh G Elements of Text Mining

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

The Naïve Bayes model for the training set will look like:

|V| C P(Ci) ni P(kill|Ci)   P(bomb|Ci)  P(kidnap|Ci) P(music|Ci) P(movie|Ci) P(tv|Ci)
6   T 0.5   15 0.238095238  0.19047619  0.33333333   0.047619048 0.095238095 0.095238095
    E 0.5   12 0.05555556   0.11111111  0.11111111   0.33333333  0.27777778  0.11111111

|V| = the size of the vocabulary = 6
P(Ci) = the prior probability of each class = number of documents in a class / number of all documents
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem we apply Laplace estimation by assuming a uniform distribution over all words, as follows:

P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380

P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555
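The same add-one estimates are easy to reproduce in Python. A minimal sketch (the counts 4 and 0, the class totals 15 and 12, and the vocabulary size 6 are taken from the worked example above; the helper name laplace is just for illustration):

def laplace(count, class_total, vocab_size):
    # Laplace (add-one) smoothed P(w|c)
    return (count + 1.0) / (class_total + vocab_size)

print(laplace(4, 15, 6))   # P(kill|Terrorism)     -> 0.2380...
print(laplace(0, 12, 6))   # P(kill|Entertainment) -> 0.0555...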

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document is:

Test Docs kill bomb kidnap music movie tv C
Dt        2    1    2      0     0     1  ?

To find the posterior probability:

$P(c_i|W) = P(c_i) \times \prod_{j=1}^{V} P(w_j|c_i)$

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)
= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)
= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities:

P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437

P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.5110 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.
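This log-space scoring is easy to code. A minimal sketch (the smoothed probabilities are copied from the model table above and hard-coded purely for illustration; base-10 logarithms are used so the output matches the numbers above):

import math

p_w_given_c = {
    "Terrorism":     {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                      "music": 0.0476, "movie": 0.0952, "tv": 0.0952},
    "Entertainment": {"kill": 0.0555, "bomb": 0.1111, "kidnap": 0.1111,
                      "music": 0.3333, "movie": 0.2777, "tv": 0.1111},
}
prior = {"Terrorism": 0.5, "Entertainment": 0.5}
test_doc = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "tv": 1}

def log_posterior(cls):
    # log P(c) + sum over words of freq * log P(w|c)
    score = math.log10(prior[cls])
    for word, freq in test_doc.items():
        score += freq * math.log10(p_w_given_c[cls][word])
    return score

for cls in prior:
    print("%s %.4f" % (cls, log_posterior(cls)))  # Terrorism ~ -4.24, Entertainment ~ -6.63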

The section on the Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word w_t from the vocabulary. Dimension t of the vector for document d_i is written B_it, and is either 0 or 1, indicating whether word w_t occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document. a

a. A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the Multi-variate Bernoulli Model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
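If you want to try this event model directly, scikit-learn ships a BernoulliNB estimator. A minimal sketch (assuming scikit-learn is installed) on the binarized toy matrix above; column order: kill, bomb, kidnap, music, movie, tv:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ["Terrorism"] * 3 + ["Entertainment"] * 3

clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 1, 1, 0, 0, 1]]))  # a kill/bomb/kidnap style test document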

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set $(x_i, y_i)_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ (input feature vector) and $y_i \in \{1, -1\}$ is the corresponding label, whether $(y_i = +1)$ or $(y_i = -1)$. To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function $f(x) = \langle w, x \rangle + b$, where $w \in \mathbb{R}^p$ (weight vector) and $b \in \mathbb{R}$ (bias), such that

$\langle w, x_i \rangle + b > 0$ for $y_i = +1$
$\langle w, x_i \rangle + b < 0$ for $y_i = -1$

$\langle w, x \rangle + b = 0$ is the decision hyperplane. There can be multiple hyperplanes which separate the positive and negative examples, but not all of them are equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10. This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11. http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # boolean (presence/absence) features, as in the multi-variate model
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12. This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                      Actual
                      Positive             Negative
Predicted  Positive   True Positive (TP)   False Positive (FP)
           Negative   False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

$Accuracy = \frac{562+436}{562+77+225+436} = 0.76$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

$Precision = \frac{TP}{TP+FP}$

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

$Recall = \frac{TP}{TP+FN}$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
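These values are straightforward to reproduce in code. A minimal sketch using the confusion-matrix counts from the table above:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)
pos_precision = TP / (TP + FP)
neg_precision = TN / (TN + FN)
pos_recall = TP / (TP + FN)
neg_recall = TN / (TN + FP)

print("accuracy      %.3f" % accuracy)        # 0.768
print("pos precision %.3f" % pos_precision)   # 0.879
print("neg precision %.3f" % neg_precision)   # 0.660
print("pos recall    %.3f" % pos_recall)      # 0.714
print("neg recall    %.3f" % neg_recall)      # 0.850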

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of instances that were classified wrongly.

$ErrorRate = \frac{FP+FN}{TP+FP+FN+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

$ErrorRate = \frac{77+225}{562+77+225+436} = 0.23$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

$Fall\text{-}out = \frac{FP}{FP+TN}$

                      Actual
                      Positive   Negative   Total
Predicted  Positive   562        77         639
           Negative   225        436        661
           Total      787        513        1300

$Fall\text{-}out = \frac{77}{77+436} = 0.15$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

$F1\ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

a. http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = $2 \cdot \frac{0.87 \times 0.71}{0.87 + 0.71} = 0.78$

F1 Score Negative = $2 \cdot \frac{0.65 \times 0.84}{0.65 + 0.84} = 0.73$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

$Positive\ predictive\ value = \frac{TP}{TP+FP}$

$Positive\ predictive\ value = \frac{562}{562+77} = 0.87$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

$NPV = \frac{TN}{TN+FN}$

$NPV = \frac{436}{436+225} = 0.65$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

$Specificity = \frac{TN}{FP+TN}$

$Specificity = \frac{436}{77+436} = 0.84$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. As a classifier metric, FDR is the proportion of predicted positives that are actually false positives.

$FDR = \frac{FP}{FP+TP}$

$FDR = \frac{77}{77+562} = 0.12$

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

$MCC = \frac{562 \times 436 - 77 \times 225}{\sqrt{(562+77)(562+225)(436+77)(436+225)}} = 0.55$
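The same numbers can be checked with scikit-learn's metrics module. A minimal sketch (the label vectors below are rebuilt so that they reproduce the confusion matrix above; this reconstruction is only for illustration):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

# 1 = Positive, 0 = Negative; laid out so that TP=562, FP=77, FN=225, TN=436
y_true = [1] * 562 + [0] * 77 + [1] * 225 + [0] * 436
y_pred = [1] * 562 + [1] * 77 + [0] * 225 + [0] * 436

print(accuracy_score(y_true, y_pred))      # ~0.768
print(precision_score(y_true, y_pred))     # ~0.879
print(recall_score(y_true, y_pred))        # ~0.714
print(f1_score(y_true, y_pred))            # ~0.788
print(matthews_corrcoef(y_true, y_pred))   # ~0.551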

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
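scikit-learn can also compute the ROC curve and the AUC directly from predicted scores. A minimal, self-contained sketch (the labels and scores below are made-up toy values, not taken from the slides):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]                       # toy labels
y_score = [0.9, 0.8, 0.7, 0.65, 0.3, 0.6, 0.5, 0.4, 0.2, 0.1]  # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print("FPR %.2f  TPR %.2f" % (f, t))   # points of the ROC curve (X = FPR, Y = TPR)
print(roc_auc_score(y_true, y_score))      # area under the curve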

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # chunk each POS-tagged sentence; binary=True labels chunks simply as NE
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # on NLTK 3.x use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
        "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
    "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13. http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 59: Elements of Text Mining Part - I

N-Grams

Python code to generate N-Grams from list of words

def ngrams(wordsn=2)

grams = [ join(words[xx+n]) for x in xrange(len(words)-n+1)]

return grams

words = John likes to watch movies Mary likes too

lower()split()

bigrams = ngrams(wordsn=2)

trigrams = ngrams(wordsn=3)

print bigram

print trigrams

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 60: Elements of Text Mining Part - I

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

Mutual Information

Statistical test to measure strength of word association

I(wi wj) = log2P (wiwj)P (wi)P (wj) asymp log2

NC(wiwj)C(wi)C(wj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

I(strong tea) = log2P (strongtea)

P (strong)P (strong) asymp log2NC(strongtea)C(strong)C(tea)

Jaganadh G Elements of Text Mining

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we adopt the multi-variate Bernoulli model for text classification, our document space representation will look like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
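A small sketch (not in the slides) of how the count table could be binarized in Python:

# rows D0..D5, columns kill, bomb, kidnap, music, movie, tv
count_vectors = [[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
                 [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]]
binary_vectors = [[1 if freq > 0 else 0 for freq in doc] for doc in count_vectors]
print(binary_vectors)   # reproduces the presence/absence table above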

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model, such as TF-IDF, in the multinomial Naive Bayes model.
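scikit-learn ships estimators for both event models (MultinomialNB and BernoulliNB); a toy sketch on the running example, where the 0/1 class labels are my own encoding:

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

X = np.array([[2, 1, 3, 0, 0, 1], [1, 1, 1, 0, 0, 0], [1, 1, 2, 0, 1, 0],
              [0, 1, 0, 2, 1, 1], [0, 0, 1, 1, 1, 0], [0, 0, 0, 2, 2, 2]])
y = np.array([0, 0, 0, 1, 1, 1])          # 0 = Terrorism, 1 = Entertainment
test = np.array([[2, 1, 2, 0, 0, 1]])     # the test document Dt

print(MultinomialNB(alpha=1.0).fit(X, y).predict(test))             # multinomial event model
print(BernoulliNB(alpha=1.0, binarize=0.5).fit(X, y).predict(test))  # Bernoulli event model
# both print [0], i.e. Terrorism, for this toy data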

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {+1, -1} is the corresponding label, whether (yi = +1) or (yi = -1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.
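A minimal sketch (mine, with made-up 2-D points) of the decision rule sign(⟨w, x⟩ + b) using scikit-learn's LinearSVC:

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 2.5],         # yi = +1 side
              [-2.0, -1.0], [-1.0, -2.5], [-3.0, -2.0]])  # yi = -1 side
y = np.array([1, 1, 1, -1, -1, -1])

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

x_new = np.array([1.0, 1.0])
print(np.sign(np.dot(w, x_new) + b))   # sign of <w, x> + b gives the class
print(svm.predict([x_new]))            # the same decision through the estimator API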

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naive Bayes Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

# bag of 1- to 3-grams, English stop words removed, lowercased
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)
_ = classifier.fit(training_data.data, training_data.target)
print(training_data.target_names[classifier.predict(['This is a good one'])[0]])
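Neither of the two listings above measures accuracy on unseen data; one possible follow-up (my own sketch, reusing the classifier, dir_data and categories variables defined in the listings above) is to hold out part of the corpus:

from sklearn.datasets import load_files
try:
    from sklearn.model_selection import train_test_split    # newer scikit-learn
except ImportError:
    from sklearn.cross_validation import train_test_split   # older releases

data = load_files(dir_data, categories=categories, shuffle=True)
train_X, test_X, train_y, test_y = train_test_split(data.data, data.target,
                                                    test_size=0.25)
classifier.fit(train_X, train_y)
print(classifier.score(test_X, test_y))   # mean accuracy on the held-out 25%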

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naive Bayes Classifier with Python NLTK.12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # boolean "word present" features (multi-variate Bernoulli style)
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                         Actual
                         Positive              Negative
Predicted   Positive     True Positive (TP)    False Positive (FP)
            Negative     False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) ≈ 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Positive Precision = 562 / (562 + 77) ≈ 0.87

Negative Precision = 436 / (225 + 436) ≈ 0.65

Positive Recall = 562 / (562 + 225) ≈ 0.71

Negative Recall = 436 / (77 + 436) ≈ 0.84
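These numbers are easy to recompute; a short sketch (mine) from the confusion-matrix counts:

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)    # positive precision
recall    = TP / (TP + FN)    # positive recall
print(accuracy, precision, recall)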

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) ≈ 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR):

Fall-out = FP / (FP + TN)

                         Actual
                         Positive   Negative   Total
Predicted   Positive     562        77         639
            Negative     225        436        661
            Total        787        513        1300

Fall-out = 77 / (77 + 436) ≈ 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) ≈ 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) ≈ 0.73
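A one-line check (my sketch) of the F1 values from the precision and recall figures above:

def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(0.87, 0.71), 2))   # positive F1 ~ 0.78
print(round(f1(0.65, 0.84), 2))   # negative F1 ~ 0.73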

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) ≈ 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) ≈ 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) ≈ 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) ≈ 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) ≈ 0.55
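The MCC value can be checked with a few lines of Python (my sketch):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))   # ~ 0.55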

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
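As an illustration only (the labels and scores below are made up, not from the slides), AUC can be computed with scikit-learn's sklearn.metrics.roc_auc_score:

from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]                    # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # predicted score for the positive class
print(roc_auc_score(y_true, y_score))                 # ~ 0.94 for this toy ranking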

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":   # use tree.label() in newer NLTK releases
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
            "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print(entities)
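With the default NLTK models installed, this typically prints entities such as ['Abraham Lincoln', 'Thomas Lincoln', 'Nancy Lincoln'], though the exact output depends on the tagger and chunker versions.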

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]   # keep only the term strings
print(terms)

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept. 2012.

Chris Smith. Machine Learning, Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept. 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept. 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb. 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Page 62: Elements of Text Mining Part - I

Mutual Information

from __future__ import division

import math

def mutual_info(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

minfo =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

minfo[gram] = (mathlog( len(words) gramcount[ gram ]

wordcount[gramsplit()[0]] wordcount[ gramsplit()[1]]))

mathlog( 2 )

return minfo

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                          Actual
                 Positive               Negative
Predicted
  Positive   True Positive (TP)    False Positive (FP)
  Negative   False Negative (FN)   True Negative (TN)
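
As a quick aside (not on the original slide), such a table can be produced from predicted and actual labels with scikit-learn. Note that sklearn.metrics.confusion_matrix puts actual classes on the rows and predicted classes on the columns, i.e. the transpose of the layout above:

# Sketch: building a confusion matrix from label lists (assumes scikit-learn)
from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# labels fixes the row/column order: rows = actual, columns = predicted
print confusion_matrix(actual, predicted, labels=['pos', 'neg'])
# [[2 1]
#  [1 2]]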

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we actually identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Positive Precision = 562 / (562 + 77)  = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall    = 562 / (562 + 225) = 0.71
Negative Recall    = 436 / (77 + 436)  = 0.84
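
The same figures can be recomputed in a few lines of Python; a minimal sketch using the counts from the table above:

# Sketch: recomputing the slide's metrics from the confusion-matrix counts
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy           = (TP + TN) / (TP + FP + FN + TN)
positive_precision = TP / (TP + FP)
negative_precision = TN / (TN + FN)
positive_recall    = TP / (TP + FN)
negative_recall    = TN / (TN + FP)

print accuracy, positive_precision, negative_precision, positive_recall, negative_recall
# -> 0.768 0.879 0.660 0.714 0.850 (the slides show these truncated to two decimals)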

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that the classifier got wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
             Positive   Negative   Total
Predicted
  Positive      562         77       639
  Negative      225        436       661
  Total         787        513      1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure: the proportion of negative test results that are true negatives.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, it is the proportion of predicted positives that are actually false positives.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
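
A quick check of this value in Python (a sketch using the same counts as above):

# Sketch: Matthews Correlation Coefficient for the worked example
import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print round(mcc, 2)  # 0.55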

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
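
With scikit-learn both the ROC points and the AUC can be obtained from a classifier's scores. A minimal sketch (not from the slides; labels and scores are made up for illustration, and a reasonably recent scikit-learn is assumed):

# Sketch: ROC curve points and AUC from true labels and predicted scores
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels (1 = positive)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # classifier scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # x = FPR, y = TPR for the plot
print roc_auc_score(y_true, y_score)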

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    # chunk each sentence and collect the named-entity (NE) subtrees
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":  # on NLTK 3 use tree.label() instead of tree.node
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]  # keep only the term strings, drop the counts
print terms

13http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 63: Elements of Text Mining Part - I

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 64: Elements of Text Mining Part - I

t-score

t-score

Statistical test to measure strength of word association

t(wi wj) =mean(P (wiwj))minusmean(P (wi)mean(P (wj)radic

σ2(P (wiwj)+σ2(P (wi)σ2(P (wj))

asymp C(wiwj)minus 1NC(wi)C(wj)radic

C(wiwj)

where C(wi) and C(wj) respective frequency of wi and wj in the corpusC(wi wj) is the frequency of bigram wi wjN is the total number of words in the corpus

t(strong tea) =C(strongtea)minus 1

NC(strong)C(tea)radic

C(strongtea)

Jaganadh G Elements of Text Mining

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 65: Elements of Text Mining Part - I

t-score

from __future__ import division

import math

def tscore(words)

grams = ngrams(wordsn=2) ngrams function from prev slide

wordcount =

gramcount =

tsc =

[ wordcount__setitem__(word 1 +

wordcountget( word0 )) for word in words ]

[ gramcount__setitem__(gram 1 +

gramcountget( gram0 )) for gram in grams ]

for gram in grams

tsc[gram] = (gramcount[gram] - (1len(words))

wordcount[gramsplit()[0]] wordcount[gramsplit()[1]])

mathsqrt( gramcount[gram])

return tsc

Jaganadh G Elements of Text Mining

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents

semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance

unsupervised document classification is entirely executed without reference to external information

8 http://en.wikipedia.org/wiki/Document_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1, c2, c3, ..., cm) be a set of pre-defined categories.
Let D = (d1, d2, d3, ..., dn) be a set of documents to be classified.
Given a training set T of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ D × C, using a learning algorithm we wish to learn a classifier or a classifier function γ that maps documents to classes: γ : D → C.
A supervised learning algorithm Γ takes a training set T and emits a learned classification function γ: Γ(T) = γ

γ(ci, dj) = 1 if dj belongs to ci, 0 otherwise

Main approaches in document classification are:

Naïve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9 Image taken from http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem or Bayes' rule.
Bayes' rule: P(H|E) = P(E|H) × P(H) / P(E)

The basic idea of Bayes' rule is that the outcome of a hypothesis or an event (H) can be predicted based on some evidence (E) that can be observed.
P(H) is called the prior probability. This is the probability of an event before the evidence is observed.
P(H|E) is called the posterior probability. This is the probability of an event after the evidence is observed.

Jaganadh G Elements of Text Mining

Naïve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud; then we have
P(raining|dark cloud) = P(dark cloud|raining) × P(raining) / P(dark cloud)

For multiple evidences:
P(H|E1, E2, ..., En) = P(E1, E2, ..., En|H) × P(H) / P(E1, E2, ..., En)

With the independence assumption, we can rewrite Bayes' rule as follows:
P(H|E1, E2, ..., En) = P(E1|H) × P(E2|H) × ... × P(En|H) × P(H) / P(E1, E2, ..., En)

Jaganadh G Elements of Text Mining
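A tiny numeric sketch of the rule above, using made-up probabilities for the raining/dark-cloud example (the numbers below are assumptions, not taken from the slides):

# assumed numbers: P(raining) = 0.3, P(dark cloud | raining) = 0.9, P(dark cloud) = 0.4
p_h = 0.3          # prior P(H): probability of raining
p_e_given_h = 0.9  # likelihood P(E|H): dark cloud given raining
p_e = 0.4          # evidence P(E): probability of dark cloud

p_h_given_e = (p_e_given_h * p_h) / p_e
print p_h_given_e  # posterior P(H|E) = 0.675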

Naïve Bayes Applied in Text Classification

Now let's try to train a Naïve Bayes classifier to perform text classification.
C = (terrorism, entertainment)
D = (D0, D1, D2, D3, D4, D5)
BoW = (kill, bomb, kidnap, music, movie, tv) (vocabulary)
The pre-processed documents for training will look like:

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

Naïve Bayes Model for the training set will be like:

|V|  C  P(Ci)  ni  P(kill|Ci)  P(bomb|Ci)  P(kidnap|Ci)  P(music|Ci)  P(movie|Ci)  P(tv|Ci)
 6   T  0.5    15  0.23809524  0.19047619  0.33333333    0.04761905   0.09523810   0.09523810
     E  0.5    12  0.05555556  0.11111111  0.11111111    0.33333333   0.27777778   0.11111111

|V| = the number of vocabulary words = 6
P(Ci) = the prior probability of each class = (number of documents in a class) / (number of all documents)
ni = the total word frequency in each class
n_terrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15
n_entertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12
P(wi|ci) = the conditional probability of keyword occurrence given a class
Example:
P(kill|Terrorism) = (2+1+1)/15 = 4/15
P(kill|Entertainment) = (0+0+0)/12 = 0/12

Jaganadh G Elements of Text Mining

Building Naïve Bayes Model

To avoid the "zero frequency" problem, we apply Laplace estimation by assuming a uniform distribution over all words, as follows:
P(kill|Terrorism) = (2+1+1+1)/(15+|V|) = 5/21 = 0.2380
P(kill|Entertainment) = (0+0+0+1)/(12+|V|) = 1/18 = 0.0555

Jaganadh G Elements of Text Mining
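A minimal sketch in plain Python that rebuilds Laplace-smoothed class-conditional probabilities from the toy training table above; build_model and its variable names are illustrative, not from any library, and the class word total is taken as the sum of all counts in that class.

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']

train = [
    ([2, 1, 3, 0, 0, 1], 'Terrorism'),
    ([1, 1, 1, 0, 0, 0], 'Terrorism'),
    ([1, 1, 2, 0, 1, 0], 'Terrorism'),
    ([0, 1, 0, 2, 1, 1], 'Entertainment'),
    ([0, 0, 1, 1, 1, 0], 'Entertainment'),
    ([0, 0, 0, 2, 2, 2], 'Entertainment'),
]

def build_model(train, vocab):
    model = {}
    classes = set(label for _, label in train)
    for c in classes:
        docs = [counts for counts, label in train if label == c]
        word_totals = [sum(col) for col in zip(*docs)]   # per-word counts in class c
        n_c = sum(word_totals)                           # total word count in class c
        prior = float(len(docs)) / len(train)            # P(c)
        # Laplace smoothing: add 1 to every word count and |V| to the denominator
        cond = dict((w, (word_totals[i] + 1.0) / (n_c + len(vocab)))
                    for i, w in enumerate(vocab))
        model[c] = (prior, cond)
    return model

model = build_model(train, vocab)
prior_t, cond_t = model['Terrorism']
print prior_t          # 0.5
print cond_t['kill']   # 5/21 = 0.238095..., matching the P(kill|Terrorism) example above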

Testing the NB model

Our test document is:

Test Docs  kill  bomb  kidnap  music  movie  tv  C
Dt         2     1     2       0      0      1

To find the posterior probability:

P(ci|W) = P(ci) × ∏_{j=1}^{|V|} P(wj|ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P(Terrorism|W) = P(Terrorism) × P(kill|Terrorism) × P(bomb|Terrorism) × P(kidnap|Terrorism) × P(music|Terrorism) × P(movie|Terrorism) × P(tv|Terrorism)

= 0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1
= 0.5 × 0.0566 × 0.1904 × 0.1110 × 1 × 1 × 0.0952
= 5.7 × 10^-5

P(Entertainment|W) = P(Entertainment) × P(kill|Entertainment) × P(bomb|Entertainment) × P(kidnap|Entertainment) × P(music|Entertainment) × P(movie|Entertainment) × P(tv|Entertainment)

= 0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1
= 0.5 × 0.0030 × 0.1111 × 0.0123 × 1 × 1 × 0.1111
= 2.27 × 10^-7

The document is classified as "Terrorism" because that class got the highest value.

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small. In real-world situations we will train the classifier with thousands of documents. In such cases the conditional probability values will be too low for the CPU to handle. This problem is called underflow. To resolve the problem we can take the logarithm of the probabilities, like:
P(Terrorism|W) = log(0.5 × 0.2380^2 × 0.1904^1 × 0.3333^2 × 0.0476^0 × 0.0952^0 × 0.0952^1)
= log(0.5) + 2 log(0.2380) + 1 log(0.1904) + 2 log(0.3333) + 0 log(0.0476) + 0 log(0.0952) + 1 log(0.0952)
= -0.3010 - 1.2468 - 0.7203 - 0.9543 + 0 + 0 - 1.0213
= -4.2437
P(Entertainment|W) = log(0.5 × 0.0555^2 × 0.1111^1 × 0.1111^2 × 0.3333^0 × 0.2777^0 × 0.1111^1)
= log(0.5) + 2 log(0.0555) + 1 log(0.1111) + 2 log(0.1111) + 0 log(0.3333) + 0 log(0.2777) + 1 log(0.1111)
= -0.3010 - 2.511 - 0.9542 - 1.9085 + 0 + 0 - 0.9542
= -6.6289

After handling the underflow problem, our system classified the test document as "Terrorism". From the final probability scores you can observe that the scores are scaled nicely.

The section on Naïve Bayes Classifier is prepared from the notes "A Tutorial on Naive Bayes Classification" by Choochart Haruechaiyasak: suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf

Jaganadh G Elements of Text Mining
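A minimal sketch of the log-space scoring described above; the dictionaries below simply copy the smoothed probabilities and priors from the earlier slides.

import math

# smoothed conditional probabilities copied from the earlier slides
cond = {
    'Terrorism':     {'kill': 0.2380, 'bomb': 0.1904, 'kidnap': 0.3333,
                      'music': 0.0476, 'movie': 0.0952, 'tv': 0.0952},
    'Entertainment': {'kill': 0.0555, 'bomb': 0.1111, 'kidnap': 0.1111,
                      'music': 0.3333, 'movie': 0.2777, 'tv': 0.1111},
}
prior = {'Terrorism': 0.5, 'Entertainment': 0.5}

# the test document Dt from the previous slide
test_counts = {'kill': 2, 'bomb': 1, 'kidnap': 2, 'music': 0, 'movie': 0, 'tv': 1}

def log_score(c):
    score = math.log10(prior[c])
    for word, freq in test_counts.items():
        score += freq * math.log10(cond[c][word])
    return score

scores = dict((c, log_score(c)) for c in prior)
print scores                        # about -4.24 for Terrorism and -6.63 for Entertainment
print max(scores, key=scores.get)   # 'Terrorism'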

Naïve Bayes Classifier

There are two different ways to set up a Naïve Bayes Classifier:

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.

Jaganadh G Elements of Text Mining
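As a sketch of the same idea with a library (assuming scikit-learn is installed), sklearn's BernoulliNB can be fit directly on the binary document space shown above; the matrix below just copies that table.

from sklearn.naive_bayes import BernoulliNB

# binary document space from the table above (rows D0..D5, columns = vocabulary)
X = [[1, 1, 1, 0, 0, 1],   # D0 Terrorism
     [1, 1, 1, 0, 0, 0],   # D1 Terrorism
     [1, 1, 1, 0, 1, 0],   # D2 Terrorism
     [0, 1, 0, 1, 1, 1],   # D3 Entertainment
     [0, 0, 1, 1, 1, 0],   # D4 Entertainment
     [0, 0, 0, 1, 1, 1]]   # D5 Entertainment
y = ['Terrorism'] * 3 + ['Entertainment'] * 3

clf = BernoulliNB()
clf.fit(X, y)
print clf.predict([[1, 1, 1, 0, 0, 1]])   # binarized test document Dt -> ['Terrorism']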

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model, a document is an ordered sequence of word events drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can use a smoothed bag-of-words model like TF-IDF in the multinomial naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}_{i=1}^{n}, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with, we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but all of them are not equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum margin are called support vectors. If the data is not linearly separable, we have to use kernel tricks to find soft margins.

Jaganadh G Elements of Text Mining
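A minimal sketch of the decision rule ⟨w, x⟩ + b described above; the weight vector, bias and input below are made-up values, only to show how the sign of f(x) yields the label.

def decision(w, x, b):
    # f(x) = <w, x> + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [0.4, -0.7], 0.1   # assumed weight vector and bias
x = [1.0, 2.0]            # an input feature vector
f = decision(w, x, b)
label = +1 if f > 0 else -1
print f, label            # -0.9 -1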

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and the rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margin of the scatter region of the squares and of the rectangles, is called a separating hyperplane. Everything on one side of the separating hyperplane belongs to one category, and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn along the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable, we have to use kernel tricks.11

10 This is just a non-theoretic definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
    stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
    ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
    shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)
print training_data.target_names[classifier.predict(['This is a good one'])]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                       Actual
                       Positive             Negative
Predicted  Positive    True Positive (TP)   False Positive (FP)
           Negative    False Negative (FN)  True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision, which indicates how many of the items that we identified were relevant:

Precision = TP / (TP + FP)

Recall, which indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity":

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87
Negative Precision = 436 / (225 + 436) = 0.65
Positive Recall = 562 / (562 + 225) = 0.71
Negative Recall = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                       Actual
                       Positive   Negative   Total
Predicted  Positive    562        77         639
           Negative    225        436        661
           Total       787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and its worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55

Jaganadh G Elements of Text Mining
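A short sketch that recomputes the metrics from the preceding slides for the confusion matrix TP = 562, FP = 77, FN = 225, TN = 436:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy    = (TP + TN) / (TP + FP + FN + TN)
error_rate  = (FP + FN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
fallout     = FP / (FP + TN)
npv         = TN / (TN + FN)
specificity = TN / (FP + TN)
fdr         = FP / (FP + TP)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print accuracy, precision, recall, f1, mcc
# 0.768..., 0.879..., 0.714..., 0.788..., 0.551... (cf. 0.76, 0.87, 0.71, 0.78, 0.55 above)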

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).

Jaganadh G Elements of Text Mining
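A minimal sketch (assuming scikit-learn) of computing ROC points and the AUC for a toy set of true labels and hypothetical classifier scores; the labels and scores below are made up for illustration only.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                         # actual labels (made up)
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # X axis = FPR, Y axis = TPR
print zip(fpr, tpr)
print roc_auc_score(y_true, y_score)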


Page 66: Elements of Text Mining Part - I

Document Classification

Document classification or document categorization is a problem in library scienceinformation science and computer science The task is to assign a document to one ormore classes or categories 8

Document classification tasks can be divided into three kinds

supervised document classification is performed by an external mechanism usuallyhuman feedback which provides the necessary information for the correctclassification of documents

semi-supervised document classification a mixture between supervised andunsupervised classification some documents or parts of documents are labeled byexternal assistance

unsupervised document classification is entirely executed without reference toexternal information

8httpenwikipediaorgwikiDocument_classification

Jaganadh G Elements of Text Mining

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 67: Elements of Text Mining Part - I

Document Classification

Formal Definition

Let C = (c1 c2 c3 cm) be a set of pre-defined categoriesLet D = (d1 d2 d3 dn) be a set of documents to be classifiedGiven a training set T of labeled documents

langd crang

wherelangd crangisin D times C using a learning

algorithm we wish to learn a classifier or a classifier function γ that maps document toclasses γ D rarr CA supervised learning algorithm Γ takes training set T and emits learned classificationfunction γ Γ(T ) = γ

γ(ci dj) =

1 if dj belongs to ci0 otherwise

Main approaches in document classification are

Naıve Bayes (NB)

Support Vector Machines (SVM)

Jaganadh G Elements of Text Mining

Document Classification

A supervised document classification pipeline 9

9Image taken from httpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Naıve Bayes is a simple probabilistic classifier based on applying Bayesrsquo theorem orBayesrsquo ruleBayesrsquo rule P (H|E) = P (E|H)timesP (H)

P (E)

The basic idea of Bayesrsquos rule is that the outcome of a hypothesis or an event (H) can bepredicted based on some evidences (E) that can be observedP (H) is called as priori probability This is the probability of an event before theevidence is observedP (H|E) is called as posterior probability This is the probability of an event after theevidence is observed

Jaganadh G Elements of Text Mining

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
    'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())

print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                                  Actual
                          Positive                Negative
Predicted   Positive      True Positive (TP)      False Positive (FP)
            Negative      False Negative (FN)     True Negative (TN)
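As a minimal sketch (the label lists below are invented, just to show the call), scikit-learn can build such a table from actual and predicted labels; note that sklearn.metrics.confusion_matrix puts actual classes in rows and predicted classes in columns, i.e. transposed relative to the layout above.

from sklearn.metrics import confusion_matrix

# Hypothetical gold labels and classifier predictions
y_true = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred, labels=['pos', 'neg']))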

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we actually identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
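These figures can be reproduced with a few lines of Python; a minimal sketch using the TP, FP, FN, TN counts from the table above (the slides truncate to two decimals, so the printed values may differ in the last digit):

# Counts from the confusion matrix above
TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

accuracy = (TP + TN) / (TP + FP + FN + TN)
positive_precision = TP / (TP + FP)
negative_precision = TN / (TN + FN)
positive_recall = TP / (TP + FN)
negative_recall = TN / (TN + FP)

print(accuracy)            # about 0.77
print(positive_precision)  # about 0.88
print(negative_precision)  # about 0.66
print(positive_recall)     # about 0.71
print(negative_recall)     # about 0.85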

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of items that were classified incorrectly.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                          Actual
                          Positive   Negative   Total
Predicted   Positive      562        77         639
            Negative      225        436        661
Total                     787        513        1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73
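The same computation as a minimal Python sketch, reusing the precision and recall values from the worked example above:

def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.87, 0.71), 2))  # positive class: 0.78
print(round(f1_score(0.65, 0.84), 2))  # negative class: 0.73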

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. For a classifier, it is the proportion of predicted positives that are actually false positives.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
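A minimal sketch of the same calculation in Python, using the counts from the running example:

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))  # 0.55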

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
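A minimal sketch with scikit-learn's roc_curve and auc; the true labels and scores below are made-up values, only to illustrate the calls:

from sklearn.metrics import roc_curve, auc

# Hypothetical gold labels (1 = positive) and classifier scores for the positive class
y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates per threshold
print(auc(fpr, tpr))  # area under the ROC curve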

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
        binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == 'NE':
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == '__main__':
    sent = ("Abraham Lincoln was born February 12, 1809, the second child "
        "of Thomas Lincoln and Nancy Lincoln")
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
    "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 70: Elements of Text Mining Part - I

Naıve Bayes Classification

Let H be the event of raining and E be the evidence of dark cloud then we haveP (raining|dark cloud) = P (dark cloud|raining)timesP (raining)

P (dark cloud)For multiple evidencesP (H|E1 E2 En) = P (E1E2En|H)timesP (H)

P (E1E2En)With the independence assumption we can rewrite the Bayesrsquos rule as followsP (H|E1 E2 En) = P (E1|H)timesP (E2|H)timesP (En|H)timesP (H)

P (E1E2En)

Jaganadh G Elements of Text Mining

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequencies have been replaced by the presence or absence of the word.
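
A short sketch (assuming the word-count table from the earlier multinomial example) of how the count vectors can be binarized into this Bernoulli representation; the variable names are made up for illustration.

vocab = ['kill', 'bomb', 'kidnap', 'music', 'movie', 'tv']
count_vectors = [
    [2, 1, 3, 0, 0, 1],  # D0
    [1, 1, 1, 0, 0, 0],  # D1
    [1, 1, 2, 0, 1, 0],  # D2
    [0, 1, 0, 2, 1, 1],  # D3
    [0, 0, 1, 1, 1, 0],  # D4
    [0, 0, 0, 2, 2, 2],  # D5
]

# Replace each frequency with 1 if the word occurs at least once, else 0
binary_vectors = [[1 if count > 0 else 0 for count in doc] for doc in count_vectors]
for doc in binary_vectors:
    print(doc)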

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out, we applied the multinomial model. In that model we used a simple bag-of-words representation. We can also use a smoothed bag-of-words model like TF-IDF in the multinomial Naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^p (input feature vector) and y_i ∈ {1, -1} is the corresponding label, whether (y_i = +1) or (y_i = -1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, x_i⟩ + b > 0 for y_i = +1
⟨w, x_i⟩ + b < 0 for y_i = -1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors which are closest to the maximum margin are called the support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
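
A tiny sketch of the linear decision rule sign(⟨w, x⟩ + b); the weight vector and bias below are made-up values, only to illustrate the sign test, not a trained model.

# Linear decision rule for a (hypothetical, already-trained) linear SVM
w = [0.4, -0.2, 0.7]   # weight vector, would normally be learned from the training set
b = -0.1               # bias

def predict(x):
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b  # <w, x> + b
    return +1 if activation > 0 else -1

print(predict([1.0, 0.5, 0.2]))   # falls on the +1 side of the hyperplane
print(predict([0.0, 0.9, 0.0]))   # falls on the -1 side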

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed then it is not possible to separate them in a linear way.

If there is a clear blank space available on the table between the squares and rectangles, we can say that it is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle), at an equal distance from the separating hyperplane, is called the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks.11

10 This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11 http://www.statsoft.com/textbook/support-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

# Count word uni-, bi- and trigram features, dropping English stop words
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews/"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])
categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories,
                           shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK: 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Boolean (presence/absence) features, as in the multi-variate model
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/ Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
                        Positive              Negative
Predicted  Positive     True Positive (TP)    False Positive (FP)
           Negative     False Negative (FN)   True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
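
A brief sketch (not from the slides) that recomputes these figures from the confusion matrix above; the slide values are these numbers truncated to two decimal places.

# Confusion matrix values from the table above
TP, FP, FN, TN = 562, 77, 225, 436

accuracy      = float(TP + TN) / (TP + FP + FN + TN)  # ~0.768
pos_precision = float(TP) / (TP + FP)                 # ~0.879
neg_precision = float(TN) / (TN + FN)                 # ~0.660
pos_recall    = float(TP) / (TP + FN)                 # ~0.714
neg_recall    = float(TN) / (TN + FP)                 # ~0.850

for name, value in [('accuracy', accuracy), ('pos precision', pos_precision),
                    ('neg precision', neg_precision), ('pos recall', pos_recall),
                    ('neg recall', neg_recall)]:
    print("%s: %.3f" % (name, value))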

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that were wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual
                        Positive  Negative  Total
Predicted  Positive       562        77      639
           Negative       225       436      661
Total                     787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. (a)

F1 Score = 2 × (precision × recall) / (precision + recall)

(a) http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 - 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225))

= 0.55
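
A companion sketch that recomputes the remaining rates and the MCC from the same confusion matrix, as a quick check on the arithmetic above.

import math

TP, FP, FN, TN = 562, 77, 225, 436

error_rate  = float(FP + FN) / (TP + FP + FN + TN)  # ~0.23
fall_out    = float(FP) / (FP + TN)                 # ~0.15
specificity = float(TN) / (FP + TN)                 # ~0.85
fdr         = float(FP) / (FP + TP)                 # ~0.12
mcc = (TP * TN - FP * FN) / math.sqrt(
    float(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print("MCC: %.2f" % mcc)  # 0.55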

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
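
A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the labels and scores below are toy values made up for illustration.

from sklearn.metrics import roc_curve, auc

# Toy ground-truth labels and classifier scores (made up for illustration)
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates per threshold
print(auc(fpr, tpr))  # area under the ROC curve, between 0 and 1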

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # Sentence-split, POS-tag and chunk named entities (binary=True gives plain "NE" labels)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
              binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part I, http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012

Chris Smith, Machine Learning Text Feature Extraction (tf-idf) - Part II, http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008

Text Categorization and Classification, http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification, suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce, www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008

Jaganadh G Elements of Text Mining

Page 71: Elements of Text Mining Part - I

Naıve Bayes Applied in Text Classification

Now letrsquos try to train a Naıve Bayes classifier to perform text classificationC = (terrorism entertainment)D = (D0 D1 D2 D3 D4 D5)BoW = (kill bomb kidnapmusicmovie tv) (vocabulary)The pre-processed documents for training will look like

Training Docs kill bomb kidnap music movie tv C

D0 2 1 3 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 2 0 1 0 Terrorism

D3 0 1 0 2 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 2 2 2 Entertainment

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 72: Elements of Text Mining Part - I

Building Naıve Bayes Model

Naıve Bayes Model for the training set will be like [4pt]

|V | C P (Ci) ni P (kill|Ci) P (bomb|Ci) P (kidnap|Ci) P (music|Ci) P (movie|Ci) P (tv|Ci)6 T 05 15 0238095238 019047619 033333333 0047619048 095238095 0095238095

E 05 12 005555566 011111111 011111111 033333333 027777778 011111111

|V | = the number of vocabularies = 6P (Ci) = the priori probability of each class = number of documents in a class

number of all documentsni = the total number of word frequency in each classnterrorism = 2 + 1 + 3 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 = 15nentertainment = 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 = 12P (wi|ci) = the conditional probability of keyword occurrence given a classExampleP (kill|Terrorism) = (2+1+1)

15 = 415

P (kill|Entertainment) = (0+0+0)12 = 0

12

Jaganadh G Elements of Text Mining

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 73: Elements of Text Mining Part - I

Building Naıve Bayes Model

To avoid the ldquozero frequencyrdquo problem we apply Laplace estimation by assuming auniform distribution over all words as followsP (kill|Terrorism) = (2+1+1+1)

(15+|V |) = 521 = 02380

P (kill|Entertainment) = (0+0+0+1)(12+|V |) = 1

18 = 00555

Jaganadh G Elements of Text Mining

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V, each dimension of the space t, t ∈ {1, ..., |V|}, corresponds to word wt from the vocabulary. Dimension t of the vector for document di is written Bit, and is either 0 or 1, indicating whether word wt occurs at least once in the document. With such a document representation we make the naive Bayes assumption that the probability of each word occurring in a document is independent of the occurrence of other words in a document.a

a: A Comparison of Event Models for Naive Bayes Text Classification, Andrew McCallum and Kamal Nigam, http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting the Multi-variate Bernoulli Model for text classification, our document space representation will be like:

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by the presence or absence of the word.
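
The binary table above can be handed directly to a Bernoulli Naive Bayes implementation. A minimal sketch with scikit-learn's BernoulliNB; the test vector at the end is an assumption, chosen to mirror the earlier test document.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: kill, bomb, kidnap, music, movie, tv (the binary table above).
X = np.array([
    [1, 1, 1, 0, 0, 1],   # D0 Terrorism
    [1, 1, 1, 0, 0, 0],   # D1 Terrorism
    [1, 1, 1, 0, 1, 0],   # D2 Terrorism
    [0, 1, 0, 1, 1, 1],   # D3 Entertainment
    [0, 0, 1, 1, 1, 0],   # D4 Entertainment
    [0, 0, 0, 1, 1, 1],   # D5 Entertainment
])
y = ["Terrorism"] * 3 + ["Entertainment"] * 3

model = BernoulliNB()
model.fit(X, y)

# A test document that contains kill, bomb, kidnap and tv (presence/absence only).
print(model.predict([[1, 1, 1, 0, 0, 1]]))   # ['Terrorism']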

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model, the multinomial model captures word frequency information in documents. In the multinomial model a document is an ordered sequence of word events, drawn from the same vocabulary V. We again make a similar naive Bayes assumption: that the probability of each word event in a document is independent of the word's context and position in the document.

In the NB example which we worked out we applied the multinomial model. In that model we used a simple bag of words representation. We can also use a smoothed bag of words model, like TF-IDF, in the multinomial Naive Bayes model.

Jaganadh G Elements of Text Mining

Support Vector Machine

A Support Vector Machine is an abstract learning machine which learns from a training set, attempts to generalize, and makes correct predictions on new data. Consider a training set {(xi, yi)}, i = 1..n, where xi ∈ R^p (input feature vector) and yi ∈ {1, −1} is the corresponding label, whether (yi = +1) or (yi = −1). To start with we assume that our input feature vectors are linearly separable, that is, there exists a function f(x) = ⟨w, x⟩ + b, where w ∈ R^p (weight vector) and b ∈ R (bias), such that

⟨w, xi⟩ + b > 0 for yi = +1
⟨w, xi⟩ + b < 0 for yi = −1

⟨w, x⟩ + b = 0 is the decision hyperplane. There can be multiple hyperplanes which can separate positive and negative examples, but they are not all equal. SVM tries to find the particular hyperplane that maximizes the margin. The vectors closest to the maximum-margin hyperplane are called support vectors. If the data is not linearly separable we have to use kernel tricks to find soft margins.
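
A small sketch of the decision function f(x) = ⟨w, x⟩ + b, fitted with scikit-learn's LinearSVC on a tiny 2-D data set; both the points and the labels are invented for illustration.

import numpy as np
from sklearn.svm import LinearSVC

# Tiny linearly separable toy data (invented).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# The sign of <w, x> + b gives the predicted label.
x_new = np.array([1.0, 2.0])
print(np.dot(w, x_new) + b)              # positive, so class +1
print(svm.decision_function([x_new]))    # the same quantity via the estimator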

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table. If the figures are mixed, then it is not possible to separate them in a linear way. If there is a clear blank space available on the table between the squares and rectangles, we can say that the data is linearly separable. A line drawn in the clear space between the figures, at exactly equal distance from the margins of the square and rectangle scatter regions, is called the separating hyperplane. Everything on one side of the separating hyperplane belongs to one category and everything on the other side belongs to the other category (red squares and blue rectangles). A line drawn at the edge of each scatter (square and rectangle) at an equal distance from the separating hyperplane marks the maximum margin. The figures closest to the separating hyperplane are known as support vectors.10 If the data is not linearly separable we have to use kernel tricks,11 as sketched after the footnotes below.

10: This is just a non-theoretical definition, "just to get an idea only". For more, refer to http://www.statsoft.com/textbook/support-vector-machines and its bibliography.

11: http://www.statsoft.com/textbook/support-vector-machines
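
A minimal sketch of the kernel trick with scikit-learn's SVC, on an XOR-style toy layout that no straight line can separate; the data and the gamma/C values are assumptions chosen only for illustration.

import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: the two classes cannot be split by a single straight line.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

print(linear.score(X, y))  # below 1.0: no linear boundary fits XOR
print(rbf.score(X, y))     # 1.0: the RBF kernel separates the classes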

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print(training_data.target_names[classifier.predict(['This is a good one'])[0]])

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Every word present in the document becomes a boolean feature.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print(classifier.classify(words))

12: This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual
               Positive                Negative
Predicted
  Positive     True Positive (TP)     False Positive (FP)
  Negative     False Negative (FN)    True Negative (TN)
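
For reference, scikit-learn can build this table from lists of predicted and actual labels. A small sketch with invented labels; note that confusion_matrix puts the actual classes on the rows, i.e. the transpose of the layout above.

from sklearn.metrics import confusion_matrix

# Invented toy labels, just to show the call.
actual    = ["pos", "pos", "pos", "neg", "neg", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "pos", "pos", "neg", "neg"]

print(confusion_matrix(actual, predicted, labels=["pos", "neg"]))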

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
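
These numbers are easy to verify from the raw counts; a minimal sketch (the slide's values are truncated rather than rounded, so the exact figures differ slightly in the last digit):

tp, fp, fn, tn = 562, 77, 225, 436

accuracy = (tp + tn) / float(tp + fp + fn + tn)
pos_precision = tp / float(tp + fp)
pos_recall = tp / float(tp + fn)
pos_f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)

print(accuracy)        # ~ 0.768 (reported as 0.76)
print(pos_precision)   # ~ 0.879 (reported as 0.87)
print(pos_recall)      # ~ 0.714 (reported as 0.71)
print(pos_f1)          # ~ 0.788 (reported as 0.78)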

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that are wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                 Actual
            Positive  Negative  Total
Predicted
  Positive     562        77      639
  Negative     225       436      661
  Total        787       513     1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.a

F1 Score = 2 × (precision × recall) / (precision + recall)

a: http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
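
The value above can be checked directly from the counts; a short sketch:

import math

tp, fp, fn, tn = 562, 77, 225, 436
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(mcc)   # ~ 0.55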

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate is placed on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
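
A minimal sketch of computing the ROC points and the AUC with scikit-learn, using invented labels and scores purely to show the API:

from sklearn.metrics import roc_curve, roc_auc_score

# Invented true labels and classifier scores.
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # ROC points: FPR on X, TPR on Y
print(roc_auc_score(y_true, y_score))  # area under that curve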

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)), binary=True)
              for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print(entities)

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]
print(terms)

13: http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 74: Elements of Text Mining Part - I

Testing the NB model

Our test document isTest Docs kill bomb kidnap music movie tv C

Dt 2 1 2 0 0 1 To find the posterior probability

P (ci|W ) = P (ci)timesVprodj=1

P (wj |ci)

Jaganadh G Elements of Text Mining

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 75: Elements of Text Mining Part - I

Testing the NB model

P (Terrorism|W ) = P (Terrorism)times P (kill|Terrorism)times P (bomb|Terrorism)times P (kidnap|Terrorism)timesP (music|Terrorism)xP (movie|Terrorism)times P (tv|Terrorism)

= 05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521

= 05times 00566times 01904times 01110times 1times 1times 00952= 57times 10minus5

P (Entertainment|W ) = P (Entertainment)times P (kill|Entertainment)times P (bomb|Entertainment)timesP (kidnap|Entertainment)times P (music|Entertainment)times P (movie|Entertainment)times P (TV |Terrorism)

= 05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111

= 05times 00030times 01111times 00123times 1times 1times 01111= 227times 10minus7

The document has classified as rdquoTerrorismrdquo because it got the highest value

Jaganadh G Elements of Text Mining

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 76: Elements of Text Mining Part - I

Preventing Underflow

The probability score assigned to the test document is very small In real world situations we willtrain the classifier with thousands of documents In such cases the conditional probability valueswill be too low for the CPU to handle This problem is called as Underflow T resolve theproblem we can take logarithm on the probabilities likeP (Terrorism|W ) = log(05times 023802 times 019041 times 033332 times 004760 times 009520 times 009521)= log(05) + 2log(02380) + 1log(01904) + 2log(03333) + 0log(00476) + 0log(00952) + 1log(00952)= ndash03010ndash12468ndash07203ndash09543 + 0 + 0ndash10213= ndash42437P (Entertainment|W ) = log(05times 005552 times 011111 times 011112 times 033330 times 027770 times 011111)= log(05) + 2log(00555) + 1log(01111) + 2log(01111) + 0log(03333) + 0log(02777) + 1log(01111)== ndash03010ndash2511ndash09542ndash19085 + 0 + 0ndash09542= ndash66289

After handling the underflow problem our system classified the test document as rdquoTerrorismrdquoFrom the final probability score you can observe that the scores are scaled nicely

The section on Naıve Bayes Classifier is prepared from the notes rdquoA Tutorial on Naive BayesClassificationrdquo by rdquoChoochart Haruechaiyasakrdquosuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf

Jaganadh G Elements of Text Mining

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multinomial Naïve Bayes Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', MultinomialNB())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build an SVM Classifier with Python Sklearn

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

dir_data = "/usr/share/nltk_data/corpora/movie_reviews"  # replace with your path

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3),
                             stop_words='english', lowercase=True)
transformer = TfidfTransformer(use_idf=True)
classifier = Pipeline([('vect', vectorizer), ('tfidf', transformer),
                       ('clf', LinearSVC())])

categories = ['pos', 'neg']
training_data = load_files(dir_data, categories=categories, shuffle=True)

_ = classifier.fit(training_data.data, training_data.target)

print training_data.target_names[classifier.predict(['This is a good one'])[0]]

Jaganadh G Elements of Text Mining

Practice Time

Let's try to build a Multi-variate Naïve Bayes Classifier with Python NLTK 12

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # each word becomes a boolean "presence" feature (multi-variate Bernoulli style)
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = NaiveBayesClassifier.train(trainfeats)

sent = "This is really cool. I like it"
words = word_feats(sent.lower().split())
print classifier.classify(words)

12 This code is adapted from http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/. Follow the link for a more detailed discussion.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm.

Consider that we built a classifier with two categories, "Positive" and "Negative". A confusion matrix for the classifier will look like:

                        Actual Positive        Actual Negative
Predicted Positive      True Positive (TP)     False Positive (FP)
Predicted Negative      False Negative (FN)    True Negative (TN)
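The same kind of table can be produced with scikit-learn (a sketch with made-up labels); note that confusion_matrix puts the actual classes on the rows and the predicted classes on the columns, the transpose of the layout above:

from sklearn.metrics import confusion_matrix

actual    = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# row/column order follows the labels argument: here 'pos' first, then 'neg'
print(confusion_matrix(actual, predicted, labels=['pos', 'neg']))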

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = (TP + TN) / (TP + FP + FN + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Accuracy = (562 + 436) / (562 + 77 + 225 + 436) = 0.76

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
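The same figures can be recomputed directly from the counts in the table (a small sketch; the trailing .0 keeps the divisions in floating point under Python 2):

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0

print(TP / (TP + FP))   # positive precision
print(TN / (TN + FN))   # negative precision
print(TP / (TP + FN))   # positive recall
print(TN / (TN + FP))   # negative recall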

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions the classifier got wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive            562                77           639
Predicted Negative            225               436           661
Total                         787               513          1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 × (precision × recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 × (0.87 × 0.71) / (0.87 + 0.71) = 0.78

F1 Score Negative = 2 × (0.65 × 0.84) / (0.65 + 0.84) = 0.73

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 × 436 − 77 × 225) / √((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
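A quick check of the value (a sketch using only the standard library and the counts from the table above):

import math

TP, FP, FN, TN = 562.0, 77.0, 225.0, 436.0
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 2))   # 0.55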

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
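A sketch of both ideas with scikit-learn (the labels and scores below are made up; in practice y_score would come from a classifier's decision_function or predict_proba):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)                             # x coordinates of the ROC curve
print(tpr)                             # y coordinates of the ROC curve
print(roc_auc_score(y_true, y_score))  # area under that curve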

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    # chunk each sentence and keep only the subtrees labelled NE
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = ("Abraham Lincoln was born February 12, 1809, the "
        "second child of Thomas Lincoln and Nancy Lincoln")
terms = extractor(text)
terms = [term[0] for term in terms]  # keep only the term strings
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature. Accessed on 20th Sept 2012.

Chris Smith. Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python. Accessed on 20th Sept 2012.

Pierre M. Nugues. An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf. Accessed on 23rd Sept 2012.

Roger Bilisoly. Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 77: Elements of Text Mining Part - I

Naıve Bayes Classifier

There are two different ways to setup a Naıve Bayes Classifier

Multi-variate Bernoulli Model

Multinomial Model

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 78: Elements of Text Mining Part - I

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

Given a vocabulary V each dimension of the space t t isin 1 |V | corresponds to wordwt from the vocabularyDimension t of the vector for document di is written Bit and iseither 0 or 1 indicating whether word wt occurs at least once in the document Withsuch a document representation we make the naive Bayes assumption that theprobability of each word occurring in a document is independent of the occurrence ofother words in a documenta

aA Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum andKamal Nigam httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf

Jaganadh G Elements of Text Mining

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 79: Elements of Text Mining Part - I

Multi-variate Bernoulli Model

Multi-variate Bernoulli Model

If we are adopting Multi-variate Bernoulli Model for text classification our documentspace representation will be like

Training Docs kill bomb kidnap music movie tv C

D0 1 1 1 0 0 1 Terrorism

D1 1 1 1 0 0 0 Terrorism

D2 1 1 1 0 1 0 Terrorism

D3 0 1 0 1 1 1 Entertainment

D4 0 0 1 1 1 0 Entertainment

D5 0 0 0 1 1 1 Entertainment

Here you can note that the individual word frequency has been replaced by presence or absenceof the word

Jaganadh G Elements of Text Mining

Multinomial Model

Multinomial Model

In contrast to the multi-variate Bernoulli event model the multinomial model captures word frequencyinformation in documentsIn the multinomial model a document is an ordered sequence of word eventsdrawn from the same vocabulary V We again make a similar naive Bayes assumption that theprobability of each word event in a document is independent of the wordrsquos context and position in thedocument

In the NB example which we worked out we applied the multinomial model In the model we used simple bag of

words representation We an use smoothed bag of words mode like TF-IDF in the multinomial naive bayes model

Jaganadh G Elements of Text Mining

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak. A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf. Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf. Accessed on March 2, 2012.

Jimmy Lin. Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.


Page 81: Elements of Text Mining Part - I

Support Vector Machine

Support Vector Machine is an abstract learning machine which learns from a trainingset attempts to generalize and makes correct predictions on new data Consider atraining set (xi yi)ni=1 where xi isin Rp (input feature vector) and yi isin 1minus1 iscorresponding label whether (yi = +1) or (yi = minus1) To start with we assume that ourinput feature vectors are linearly separable that is there exists a functionf(x) = 〈w x〉+ b where w isin Rp (weight vector) and b isin R (bias) such that

〈w xi〉+ b gt 0 for yi = +1〈w xi〉+ b lt 0 for yi = minus1

〈w x〉+ b = 0 is the decision hyperplane There can be multiple hyperplanes which canseparate positive and negative examples But all of them are not equal SVM tries tofind the particular hyperplane that maximizes the margin The vectors which is closer tothe maximum margin is calles as support vectors If the data is not linearly separable wehave to use Kernel tricks to find soft margins

Jaganadh G Elements of Text Mining

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 82: Elements of Text Mining Part - I

Support Vector Machine

Suppose a bunch of red square and blue rectangle figures are scattered over a table If the figures aremixed then it is not possible to separate in a linear way

If there is a clear blank space available in the table in between the square and rectangles we can say that

it is linearly separable A line which drawn in the clear space between the figuresexactly equal length

from the margin of scatter region of square and rectangle is called as separating hyperplane Everything

on the one side of the separating hyper plane belongs to one category and everything in the other side

belongs to the other category (red square and blue rectangles) A line drawn in the edges of each scatter

(square and rectangle) in an equal distance from the separating hyperplane is called maximum margin

The figures closest to the separating hyperplane are known as support vectors10 If the data is not

linearly separable we have to use kernel tricks 11

10This is just a non theoretic definition rdquojust to get an idea onlyrdquo For more referhttpwwwstatsoftcomtextbooksupport-vector-machines and bibiliography

11httpwwwstatsoftcomtextbooksupport-vector-machines

Jaganadh G Elements of Text Mining

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 83: Elements of Text Mining Part - I

Support Vector Machine

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 84: Elements of Text Mining Part - I

Practice Time

Letrsquos try to build a Multinomial Naıve Bayes Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnnaive_bayes import MultinomialNB

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoMultinomialNB())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision indicates how many of the items that we identified were relevant.

Precision = TP / (TP + FP)

Recall indicates how many of the relevant items we identified. It is equivalent to "hit rate" and "sensitivity".

Recall = TP / (TP + FN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

                        Actual Positive   Actual Negative   Total
Predicted Positive      562               77                639
Predicted Negative      225               436               661
Total                   787               513               1300

Positive Precision = 562 / (562 + 77) = 0.87

Negative Precision = 436 / (225 + 436) = 0.65

Positive Recall = 562 / (562 + 225) = 0.71

Negative Recall = 436 / (77 + 436) = 0.84
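The same per-class figures in code; a minimal sketch reusing the counts above. For the Negative class the roles of the two categories are swapped, so TN acts as its "true positive":

TP, FP, FN, TN = 562, 77, 225, 436

pos_precision = float(TP) / (TP + FP)   # Positive Precision
pos_recall = float(TP) / (TP + FN)      # Positive Recall
neg_precision = float(TN) / (TN + FN)   # Negative Precision
neg_recall = float(TN) / (TN + FP)      # Negative Recall

print pos_precision, pos_recall, neg_precision, neg_recall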

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the proportion of predictions that are wrong.

Error Rate = (FP + FN) / (TP + FP + FN + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive      562               77                639
Predicted Negative      225               436               661
Total                   787               513               1300

Error Rate = (77 + 225) / (562 + 77 + 225 + 436) = 0.23

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

Fall-out is the proportion of non-relevant items that were mistakenly selected. It is equivalent to the false positive rate (FPR).

Fall-out = FP / (FP + TN)

                        Actual Positive   Actual Negative   Total
Predicted Positive      562               77                639
Predicted Negative      225               436               661
Total                   787               513               1300

Fall-out = 77 / (77 + 436) = 0.15

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. a

F1 Score = 2 * (precision * recall) / (precision + recall)

a http://en.wikipedia.org/wiki/F1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score (Positive) = 2 * (0.87 * 0.71) / (0.87 + 0.71) = 0.78

F1 Score (Negative) = 2 * (0.65 * 0.84) / (0.65 + 0.84) = 0.73
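A one-line check of the harmonic mean; a minimal sketch using the rounded precision and recall values from the previous slides:

def f1(p, r):
    # harmonic mean of precision and recall
    return 2.0 * p * r / (p + r)

print round(f1(0.87, 0.71), 2)   # 0.78 for the Positive class
print round(f1(0.65, 0.84), 2)   # 0.73 for the Negative class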

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value, or precision rate, is the proportion of positive test results that are true positives.

Positive predictive value = TP / (TP + FP)

Positive predictive value = 562 / (562 + 77) = 0.87

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performance of a diagnostic testing procedure.

NPV = TN / (TN + FN)

NPV = 436 / (436 + 225) = 0.65

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TN / (FP + TN)

Specificity = 436 / (77 + 436) = 0.84

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons.

FDR = FP / (FP + TP)

FDR = 77 / (77 + 562) = 0.12

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications.

MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC = (562 * 436 - 77 * 225) / sqrt((562 + 77)(562 + 225)(436 + 77)(436 + 225)) = 0.55
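All of these quantities come straight from the four cells of the confusion matrix. A minimal sketch collecting them in one place, reusing the counts of the running example:

import math

TP, FP, FN, TN = 562, 77, 225, 436

fall_out = float(FP) / (FP + TN)        # false positive rate
specificity = float(TN) / (FP + TN)     # true negative rate
npv = float(TN) / (TN + FN)             # negative predictive value
fdr = float(FP) / (FP + TP)             # false discovery rate
mcc = (TP * TN - FP * FN) / math.sqrt(
    float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))

print fall_out, specificity, npv, fdr, mcc   # mcc is about 0.55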

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in a classification model. It applies to the binary classification problem. ROC can be plotted as a curve on an X-Y axis: the false positive rate is placed on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with an unbalanced target distribution (one target class dominates the other).
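A minimal sketch of how the curve and its area can be computed with scikit-learn's roc_curve and auc; y_true and y_score below are toy values standing in for actual labels and classifier scores:

from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # actual labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print "AUC:", auc(fpr, tpr)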

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER), also known as entity identification and entity extraction, is a sub-task of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of times, quantities, monetary values, percentages, etc.

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize, ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    sents = sent_tokenize(text)
    chunks = [ne_chunk(pos_tag(word_tokenize(sent)),
                       binary=True) for sent in sents]
    for chunk in chunks:
        for tree in chunk.subtrees():
            if tree.node == "NE":
                entity = ' '.join(leaf[0] for leaf in tree.leaves())
                entities.append(entity)
    return entities

if __name__ == "__main__":
    sent = "Abraham Lincoln was born February 12, 1809, the second child " \
           "of Thomas Lincoln and Nancy Lincoln"
    entities = extract_entities(sent)
    print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the " \
       "second child of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)
terms = [term[0] for term in terms]
print terms

13 http://pypi.python.org/pypi/topia.termextract/

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.

Kenneth W Church and Robert L Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl, Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining

Page 85: Elements of Text Mining Part - I

Practice Time

Letrsquos try to build a SVM Classifier with Python Sklearn

from sklearndatasets import load_files

from sklearnfeature_extractiontext import CountVectorizer

from sklearnfeature_extractiontext import TfidfTransformer

from sklearnpipeline import Pipeline

from sklearnsvm import LinearSVC

dir_data = usrsharenltk_datacorporamovie_reviews replace with your path

vectorizer = CountVectorizer(analyzer = rsquowordrsquongram_range = (13)

stop_words=rsquoenglishrsquolowercase=True)

transformer = TfidfTransformer(use_idf=True)

classifier = Pipeline([(rsquovectrsquovectorizer)(rsquotfidfrsquotransformer)

(rsquoclfrsquoLinearSVC())])

categories = [rsquoposrsquorsquonegrsquo]

training_data = load_files(dir_datacategories=categories

shuffle = True)

_ = classifierfit(training_datadata training_datatarget)

print training_datatarget_names[classifierpredict([rsquoThis is a good onersquo])]

Jaganadh G Elements of Text Mining

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 86: Elements of Text Mining Part - I

Practice Time

Letrsquos try to build a Multi-variate Naıve Bayes Classifier with Python NLTK 12

import nltkclassifyutil

from nltkclassify import NaiveBayesClassifier

from nltkcorpus import movie_reviews

def word_feats(words)

return dict([(word True) for word in words])

negids = movie_reviewsfileids(rsquonegrsquo)

posids = movie_reviewsfileids(rsquoposrsquo)

negfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquonegrsquo) for f in negids]

posfeats = [(word_feats(movie_reviewswords(fileids=[f]))

rsquoposrsquo) for f in posids]

negcutoff = len(negfeats)34

poscutoff = len(posfeats)34

trainfeats = negfeats[negcutoff] + posfeats[poscutoff]

classifier = NaiveBayesClassifiertrain(trainfeats)

sent = This is really cool I like it

words = word_feats(sentlower()split())

print classifierclassify(words)

12This code is adopted from httpstreamhackercom20100510

text-classification-sentiment-analysis-naive-bayes-classifier Follow the link for moredetailed discussion

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 87: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Confusion Matrix

Confusion matrix is a specific table layout that allows visualization of the performance ofan algorithm

Consider that we built a classifier with two categories rdquoPositiverdquo and rdquoNegativerdquo Aconfusion matrix for the classifier will look like

ActualPositive Negative

PredictedPositive True Positive (TP) False Positive (FP)Negative False Negative (FN) True Negative (TN)

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 88: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Accuracy of a Classifier

Accuracy = TP+TNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Accuracy = 562+436562+77+225+436

= 076

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 89: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Precision and Recall

Precision which indicates how many of the items that we identified were relevant

Precision = TPTP+FP

Recall which indicates how many of the relevant items that we identified It isequivalent with rdquohit raterdquo and rdquosensitivityrdquo

Recall = TPTP+FN

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 90: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Positive Precision = 562562+77

= 087

Negative Precision = 436225+436

= 065

Positive Recall = 562562+225

= 071

Negative Recall = 43677+436

= 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Error Rate

Error rate is the percentage of things done wrong

ErrorRate = FP+FNTP+FP+FN+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

ErrorRate = 77+225562+77+225+436 = 023

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topia.termextract import extract

extractor = extract.TermExtractor()
text = "Abraham Lincoln was born February 12, 1809, the second child " \
       "of Thomas Lincoln and Nancy Lincoln"
terms = extractor(text)              # each item is a tuple whose first element is the term
terms = [term[0] for term in terms]  # keep just the term strings
print terms

13 http://pypi.python.org/pypi/topia.termextract

Jaganadh G Elements of Text Mining

References

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part I. http://css.dzone.com/articles/machine-learning-text-feature Accessed on 20th Sept 2012.

Chris Smith, Machine Learning: Text Feature Extraction (tf-idf) - Part II. http://css.dzone.com/articles/machine-learning-text-feature-0?mz=55985-python Accessed on 20th Sept 2012.

Pierre M. Nugues, An Introduction to Language Processing with Perl and Prolog. Springer, 2006.

Kenneth W. Church and Robert L. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora. acl.ldc.upenn.edu/J/J93/J93-1001.pdf Accessed on 23rd Sept 2012.

Roger Bilisoly, Practical Text Mining with Perl. Wiley, 2008.

Text Categorization and Classification. http://www.python-course.eu/text_classification_introduction.php

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.

Choochart Haruechaiyasak, A Tutorial on Naive Bayes Classification. suanpalm3.kmutnb.ac.th/teacher/filedl/choochart82255418560.pdf Accessed on Feb 10, 2012.

Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text Classification. http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf Accessed on March 2, 2012.

Jimmy Lin, Scalable Language Processing Algorithms for the Masses: A Case Study in Computing Word Co-occurrence Matrices with MapReduce. www.aclweb.org/anthology/D08-1044

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Jaganadh G Elements of Text Mining


Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 92: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Fall-out

It is a proportion of non relevant item that were mistakenly selected It is equivalentwith false positive rate (FPR)

Fall minus out = FPFP+TN

ActualPositive Negative Total

PredictedPositive 562 77 639Negative 225 436 661Total 787 513 1300

Fall minus out = 7777+436 = 015

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 93: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

F1 Score

In statistics the F1 score (also F-score or F-measure) is a measure of a testrsquos accuracyIt considers both the precision p and the recall r of the test to compute the score p isthe number of correct results divided by the number of all returned results and r is thenumber of correct results divided by the number of results that should have beenreturned The F1 score can be interpreted as a weighted average of the precision andrecall where an F1 score reaches its best value at 1 and worst score at 0 a

F1 Score = 2 precisionrecallprecision+recall

ahttpenwikipediaorgwikiF1_score

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 94: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

F1 Score Positive = 2 087071087+071 = 078

F1 Score Positive = 2 065084065+084 = 073

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 95: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Positive predictive value

Positive predictive value or precision rate is the proportion of positive test results thatare true positives

Positive predictive value = TPTP+FP

Positive predictive value = 562562+77 = 087

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 96: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Negative predictive value

Negative predictive value (NPV) is a summary statistic used to describe the performanceof a diagnostic testing procedure

NPV = TNTN+FN

NPV = 436436+225 = 065

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 97: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Specificity or True Negative Rate

Specificity = TNFP+TN

Specificity = = 43677+436 = 084

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 98: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

False Discovery Rate

False discovery rate (FDR) control is a statistical method used in multiple hypothesistesting to correct for multiple comparisons

FDR = FPFP+TP

FDR = 7777+562 = 012

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 99: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is used in machine learning as a measureof the quality of binary (two-class) classifications

MCC = TPtimesTNminusFPtimesFNradic(TP+FP )(TP+FN)(TN+FP )(TN+FN)

MCC = 562times436minus77times225radic(562+77)(562+225)(436+77)(436+225)

= 055

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 100: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Receiver Operating Characteristic (ROC)

ROC is another metric for comparing predicted and actual target values in aclassification model It applies to the binary classification problemROC can be plottedas a curve on an X-Y axis The false positive rate is placed on the X axis The truepositive rate is placed on the Y axisThe top left corner is the optimal location on anROC graph indicating a high true positive rate and a low false positive rate

Jaganadh G Elements of Text Mining

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining

Named Entity Recognition

Named Entity Recognition

Named-entity recognition (NER) (also known as entity identification and entityextraction) is a sub-task of information extraction that seeks to locate and classify atomicelements in text into predefined categories such as the names of persons organizationslocations expressions of times quantities monetary values percentages etc

Jaganadh G Elements of Text Mining

Named Entity Recognition

Entity Recognition with Python NLTK

from nltk import sent_tokenize ne_chunk pos_tag word_tokenize

def extract_entities(text)

entities = []

sents = sent_tokenize(text)

chunks = [ ne_chunk(pos_tag(word_tokenize(sent))

binary=True) for sent in sents]

for chunk in chunks

for tree in chunksubtrees()

if treenode == NE

entity = rsquo rsquojoin(leaf[0] for leaf in treeleaves())

entitiesappend(entity)

return entities

if __name__ == __main__

sent = Abraham Lincoln was born February 12 1809 the second child

of Thomas Lincoln and Nancy Lincoln

entities = extract_entities(sent)

print entities

Jaganadh G Elements of Text Mining

Extracting Terms from Text

Extracting Terms with Python Topia Termextract 13

from topiatermextract import extract

extractor = extractTermExtractor()

text = Abraham Lincoln was born February 12 1809 the

second child of Thomas Lincoln and Nancy Lincoln

terms = extractor(text)

terms = [term[0] for term in terms]

print terms

13httppypipythonorgpypitopiatermextract

Jaganadh G Elements of Text Mining

References

Chris SmithMachine Learning Text Feature Extraction (tf-idf) - PartIhttpcssdzonecomarticlesmachine-learning-text-feature Accessed on 20th Sept2012

Chris Smith Machine Learning Text Feature Extraction (tf-idf) - Part IIhttpcssdzonecomarticlesmachine-learning-text-feature-0mz=55985-pythonAccessed on 20th Sept 2012

Pierre M Nugues An Introduction to Language Processing with Perl and Prolog Springer2006

Kenneth W Church and Robert L MercerIntroduction to the Special Issue on ComputationalLinguistics Using Large Corpora aclldcupenneduJJ93J93-1001pdf Accessed on 23rd Sept2012

Roger BilisolyPractical Text Mining with Perl Wiley 2008

Text Categorization and Classificationhttpwwwpython-courseeutext_classification_introductionphp

Jaganadh G Elements of Text Mining

References

Chris Manning and Hinrich Schutze Foundations of Statistical Natural Language Processing MITPress Cambridge MA May 1999

Choochart Haruechaiyasak A Tutorial on Naive Bayes Classificationsuanpalm3kmutnbacthteacherfiledlchoochart82255418560pdf Accessed on Feb 10 2012

Andrew McCallum and Kamal Nigam A Comparison of Event Models for Naive Bayes TextClassification httpwwwcscmuedu~knigampapersmultinomial-aaaiws98pdf Accessed onMarch 2 2012

Jimmy LinScalable Language Processing Algorithms for the Masses A Case Study in ComputingWord Co-occurrence Matrices with MapReduce wwwaclweborganthologyD08-1044

Christopher D Manning Prabhakar Raghavan and Hinrich Schutze Introduction to InformationRetrieval Cambridge University Press 2008

Jaganadh G Elements of Text Mining

Page 101: Elements of Text Mining Part - I

Evaluating Performance of a Classifier

Area Under the Curve

The area under the ROC curve (AUC) measures the discriminating ability of a binaryclassification model The larger the AUC the higher the likelihood that an actualpositive case will be assigned a higher probability of being positive than an actualnegative case The AUC measure is especially useful for data sets with unbalancedtarget distribution (one target class dominates the other)

Jaganadh G Elements of Text Mining
