working with text data

Data sources

How to retrieve the data?

Data preprocessing

Some key concepts

Facebook

R package

Twitter & R

Sentiment analysis (based on Chris Potts' tutorial)

Working with linguistic data

Ekaterina Vylomova

April 14, 2014






Possible data sources

Dictionaries and corpora

Linked Data

Social media








Thesauri & Corpora

WordNets

Roget's Thesaurus

Moby Project








Moby Project

Moby Hyphenator - 185,000 entries fully hyphenated

Moby Language - Word lists in five of the world's great languages

Moby Part-of-Speech - 230,000 entries fully described by part(s) of speech

Moby Pronunciator - 175,000 entries fully International Phonetic Alphabet coded

Moby Thesaurus - 30,000 root words, 2.5 million synonyms and related words

Moby Words - 610,000+ words and phrases








Linked & Structured Data

Using RDF format.

DBPedia is a project aiming to extract structured content from the information created as part of the Wikipedia project

FreeBase is a large collaborative knowledge base consisting of metadata composed mainly by its community members

BabelNet is a multilingual lexicalized semantic network and ontology. Automatically created using Wikipedia.

YAGO is a knowledge base developed at the Max Planck Institute. Also automatically built.








Spoken corpus

TalkBank (multilingual): first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language.

CHILDES (part of TalkBank): Child Language Data Exchange System








Sentiment data

SentiWordNet

Dictionary by Warriner et al.

Dictionary by Hu and Liu







Social media

Rating systems: IMDB, Amazon, TripAdvisor, OpenTable

Sentiment: ExperienceProject, FMyLife, MyLifeIsAverage

Facebook (OpenGraph)

Twitter

Blogs (LiveJournal, Blogger, etc.)






Possible ways to get the data

Corpora: just download it!

Facebook, Twitter and other social media: use the API

Blogs, Internet data: parse HTML or XML (download the webpage using wget/curl)

Linked data: parse RDF
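
As a rough illustration of the "parse HTML" route, the sketch below extracts the visible text from a page that has already been downloaded (for example with wget or curl). It uses only Python's standard `html.parser`; the `TextExtractor` class, the `html_to_text` helper and the sample page are made up for this example, and real pages usually need more careful handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; in practice this would be the downloaded file.
page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>My blog</h1><p>I really <b>love</b> this film.</p></body></html>")
print(html_to_text(page))
```

The extracted string can then go through the preprocessing steps described in the next section.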






Don't forget this step!

Tokenization

Remove punctuation, possibly numbers and stop words; lower-case everything

Lemmatization or stemming (Porter, Snowball)

For a bag-of-words model you may create a term x document or term x term matrix (using TF, TF-IDF, or RIDF for weighting)
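
The preprocessing steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the stop-word list is a toy one, the regular expression keeps only ASCII letters, stemming is left out, and the TF-IDF variant used (relative term frequency times log(N/df)) is one of several common definitions. The names `preprocess` and `tfidf_matrix` are invented for the example.

```python
import math
import re
from collections import Counter

# Toy stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lower-case, drop punctuation and digits, tokenize, remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (sorted vocabulary, term x document matrix of TF-IDF weights)."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted(set(tok for toks in tokenized for tok in toks))
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = []
    for term in vocab:
        idf = math.log(n_docs / doc_freq[term])
        row = []
        for toks in tokenized:
            tf = toks.count(term) / len(toks) if toks else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


docs = ["The cat sat in the hat.", "A cat is a cat!", "Dogs and cats: 2 of them."]
vocab, m = tfidf_matrix(docs)
for term, row in zip(vocab, m):
    print(term, [round(w, 3) for w in row])
```

Each row of the result corresponds to a term and each column to a document, which matches the term x document layout mentioned above.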






A few key terms from data mining

Compute set similarity: Jaccard, Dice, F-scores

Transform words to vectors: LSA, MDS

Get topics of documents: LDA

For classification you may use: SVM, neural networks, discriminant analysis, Bayesian networks, decision trees, random forest, AdaBoost

For clustering you may use: k-means, knn, SOM, SVM

For regression you may use: SVM, neural networks, GLM, NLS
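
To make the set-similarity measures from the list above concrete, here is a small sketch of the Jaccard and Dice coefficients over Python sets. The function names and the example word sets are chosen for illustration only.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B| (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


x = {"good", "nice", "great"}
y = {"good", "great", "bad", "poor"}
print(jaccard(x, y))  # 2 shared words out of 5 distinct words
print(dice(x, y))     # 2 * 2 shared words over 3 + 4 words
```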






Connect to Facebook OpenGraph

Get an access token

Go to https://developers.facebook.com/tools/access_token/

Check that it works: https://developers.facebook.com/tools/explorer?method=GET&path=me%3Ffields%3Did%2Cname (example query: me?fields=id,name,gender)

Use the tutorial: https://developers.facebook.com/docs/graph-api/common-scenarios/





Facebook & Python

Download the package: https://github.com/pythonforfacebook/facebook-sdk

Install it: python setup.py install





Facebook & Python

Get the names and genders of your friends. Possible project: predicting gender from names

import facebook

# replace with the access token obtained above
token = 'your_token'
graph = facebook.GraphAPI(token)
profile = graph.get_object("me")
friends = graph.get_connections("me", "friends")
friend_list = [friend['id'] for friend in friends['data']]
for friend_id in friend_list:
    data = graph.get_object(friend_id)
    # not every profile exposes the gender field
    if 'gender' in data:
        print(data['name'], data['gender'])






Using R

Packages you may need

tm - text mining; tm.plugin.webmining for web corpora, HTML parsers, plain-text extraction

topicmodels - topic models (e.g. LDA)

wordcloud - create a cloud of words

qdap - sentiment analysis

RCurl - curl (get the contents of a webpage)

twitteR - to use data from Twitter

wordnet - WordNet usage (dictionary needed)

e1071 - machine learning (clustering, SVM, naive Bayes, LSA)






Packages usage

Installation: install.packages("name")

Usage: library(name)






Twitter with R

Load packages:

library(twitteR)
library(tm)
library(RCurl)
library(qdap)
library(wordcloud)






Twitter with R

Get a token:

library(ROAuth)  # provides OAuthFactory
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "key"
consumerSecret <- "secret"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
# handshake() prints a link where you get a PIN code; enter the code when prompted
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                                        package = "RCurl"))
registerTwitterOAuth(twitCred)






Twitter with R

Get the data and convert it to a corpus:

# search by hashtag (you may also search by plain words); get n=1000 entries
gglTweets <- searchTwitter('#sochi2014', n=1000)
n <- length(gglTweets)
# show the first 3 entries
gglTweets[1:3]
# put the tweets in a data frame
df <- do.call("rbind", lapply(gglTweets, as.data.frame))
# get the dimensions
dim(df)
# create a corpus
myCorpus <- Corpus(VectorSource(df$text))






Twitter with R

Do normalization:

# lower-case everything (with tm >= 0.6, use content_transformer(tolower) instead)
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stopwords (very frequent words, e.g. articles, prepositions)
myStopwords <- c(stopwords('english'), "sochi", "amp", "get")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)






Twitter with R

Stem the documents:

# keep an unstemmed copy to use as the dictionary for stem completion
dictCorpus <- myCorpus
# apply stemming for normalization; you may use lemmatization instead
myCorpus <- tm_map(myCorpus, stemDocument)
inspect(myCorpus[1:3])
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
inspect(myCorpus[1:3])






Twitter with R

Create a TDM:

# create a term-document matrix; you may use TF or TF-IDF weighting
myDtm <- TermDocumentMatrix(myCorpus, control =
                            list(minWordLength = 1,
                                 weighting = weightTfIdf))
inspect(myDtm[66:70, 11:20])
# frequent terms and associations
findFreqTerms(myDtm, lowfreq=10)






Twitter with R

Create a wordcloud:

# convert the TDM to a plain matrix
m <- as.matrix(myDtm)
# sort by decreasing frequencies
v <- sort(rowSums(m), decreasing=TRUE)
# show the first 14 entries
head(v, 14)
# get the terms (the names of the row sums)
words <- names(v)
# create a data frame of words with their frequencies
dat <- data.frame(word=words, freq=v)
# create a wordcloud from words that appeared at least 5 times
wordcloud(dat$word, dat$freq, min.freq=5)






Experience Project

Experience Project is a free social networking website consisting of various online communities. Members submit "experiences": personal stories, confessions, blogs, groups, photos, and videos. The users assign categories to the stories.

Example: I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted.

Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0

Author age: 21

Author gender: female

Text group: friends







Data

Let's load the data:

# read the .csv file with the data
ep = read.csv('ep3-context.csv')

Here:

Count is the number of Category reactions received by confessions containing Word in Group with an author of Gender and Age.

Total is the number of Category reactions received by confessions containing any Word in Group with an author of Gender and Age.







Data

Look at different parameters:

# show examples of words

levels(ep$Word)







Words and categories

Word-Category Correlation

Check if there is any correlation between words and categories

# include the source file
source('ep.R')
# create a subset for the word "funny"
funny = epCollapsedFrame(ep, 'funny')
# plot the counts of the word for each category
plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')









"Funny" appears to correspond to the "understand" category. This doesn't look realistic.









We need normalization!

# apply normalization (divide each count by the total for its category)
funny$Count / funny$Total
# get a subset for "funny", taking frequencies into account
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
# create a plot
plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')









Much better!






Probability theory

Get category from word

Freq corresponds to the conditional probability P(word|category), i.e. the probability that a speaker used 'word' in a given 'category'. Let's apply Bayes' rule and compute P(category|word), i.e. the probability of the category given that a speaker used 'word'. Dividing each Freq by the sum of Freq over all categories amounts to Bayes' rule with a uniform prior over categories.

funny$Freq / sum(funny$Freq)
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')

Question: are there any other words specific to a category?
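
The normalization and the Bayes step can be traced with a few invented numbers. The counts and totals below are hypothetical (they are not taken from the Experience Project data); the sketch only shows why raw counts can point to the wrong category and how P(category|word) is obtained under a uniform prior.

```python
# Hypothetical counts of a word per reaction category, and hypothetical
# category totals; neither comes from the real data set.
counts = {"hugs": 10, "rock": 30, "teehee": 120, "understand": 200, "wow": 40}
totals = {"hugs": 100000, "rock": 150000, "teehee": 80000,
          "understand": 1000000, "wow": 200000}

# P(word | category): relative frequency of the word inside each category.
freq = {c: counts[c] / totals[c] for c in counts}

# P(category | word) with a uniform prior: renormalize over categories.
norm = sum(freq.values())
posterior = {c: freq[c] / norm for c in freq}

print(max(counts, key=counts.get))        # category with the largest raw count
print(max(posterior, key=posterior.get))  # category with the largest P(category | word)
```

With these toy numbers the raw counts favour the largest category, while the normalized probabilities favour the category in which the word is relatively most frequent.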







Compare with the expected value

Estimate the expected value

funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)

Expected value: $\mathrm{Exp} = \sum_{i=1}^{N} x_i \, p(x_i)$, where $p(x_i)$ is the probability of $x_i$.

category.probs = (funny$Total/sum(funny$Total))
funny.count = sum(funny$Count)
funny.expected = funny.count * category.probs
funny.expected

Compare expected and observed values:

# the observed values are the counts in funny$Count
(funny$Count / funny.expected) - 1

A value less than 0 means that the word is underrepresented in the category.
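
The observed/expected comparison can also be written out step by step. This is a sketch with invented counts and totals, mirroring the R computation above; the function name `observed_over_expected` is hypothetical.

```python
def observed_over_expected(counts, totals):
    """Return observed/expected - 1 for every category.

    expected = (total count of the word) * (share of the category in all totals)
    """
    grand_total = sum(totals.values())
    word_count = sum(counts.values())
    result = {}
    for category in counts:
        expected = word_count * totals[category] / grand_total
        result[category] = counts[category] / expected - 1
    return result


# Hypothetical numbers, not taken from the real data set.
counts = {"hugs": 10, "rock": 30, "teehee": 120, "understand": 200, "wow": 40}
totals = {"hugs": 100000, "rock": 150000, "teehee": 80000,
          "understand": 1000000, "wow": 200000}

oe = observed_over_expected(counts, totals)
for category, value in oe.items():
    label = "over" if value > 0 else "under"
    print(category, round(value, 3), label + "represented")
```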







Adding context: 'awesome' by gender

Usage of 'awesome' by male/female/unknown

eptok = read.csv('ep3-context-tokencounts.csv')
par(mfrow=c(1,3))
epPlot(ep, eptok, 'awesome', genders='male', probs=T)
epPlot(ep, eptok, 'awesome', genders='female', probs=T)
epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)







Adding context: 'awesome' by age

Usage of 'awesome' by people of different ages

par(mfrow=c(2,3))
for (i in 1:5) { epPlot(ep, eptok, 'awesome', ages=i, probs=T) }







Adding context: 'awesome' comparing gender with the category

'Awesome': gender+category

Changing the parameter for each category separately:

epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')







Adding context: 'drunk' comparing age with the category

'Drunk': age+category

Stories with "drunk" depend on the age:

epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')







Creating a logistic regression model

Regression modelling

Let's create a regression model: predict the frequency of 'drunk' using age and category

drunk = epFullFrame(ep, 'drunk', age=c(1,2,3,4,5))
drunk$Age = as.numeric(drunk$Age)
fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age,
              data=drunk, family=binomial)
summary(fit.glm)









Define a function that predicts the frequency of the word from the category and the age of the author

FittedGlmFunc = function(fit, category, age) {
  coefs = fit$coef
  # coefficient names look like 'Categorywow', 'Categoryhugs', ...
  cat.coef = coefs[[paste('Category', category, sep='')]]
  # inverse logit of the linear predictor
  prediction = plogis(cat.coef + coefs[['Age']]*age)
  return(prediction)
}

Calling the function:

FittedGlmFunc(fit.glm, 'wow', 1)
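
For readers less familiar with R, the prediction step is just the inverse logit applied to a sum of coefficients. The sketch below reproduces that arithmetic in Python; the coefficient values are invented for illustration and are not the output of the fitted model.

```python
import math


def plogis(x):
    """Inverse logit (logistic function), the counterpart of R's plogis."""
    return 1.0 / (1.0 + math.exp(-x))


def fitted_glm(coefs, category, age):
    """Predicted probability for a category/age pair.

    `coefs` maps names such as 'Categorywow' and 'Age' to coefficients,
    following the naming used by the R model above.
    """
    return plogis(coefs["Category" + category] + coefs["Age"] * age)


# Hypothetical coefficients, chosen only to make the example runnable.
coefs = {"Categorywow": -7.0, "Categoryhugs": -8.0, "Age": -0.2}

for age in range(1, 6):
    print(age, fitted_glm(coefs, "wow", age))
```

With a negative age coefficient, as assumed here, the predicted probability decreases as the age group increases.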









Visualize the data and compare the empirical (black) values with the fitted (red) values.

par(mfrow=c(2,3))
cats = levels(ep$Category)
# one panel per age group
for (i in 1:5) {
  epPlot(ep, eptok, 'drunk', ages=i)
  # overlay the fitted value for each category
  for (j in 1:5) {
    val = FittedGlmFunc(fit.glm, cats[j], i)
    points(j, val, col='red', pch=19)
  }
}







IMDB data

Analysis of "ADV-ADJ" collocations







Data from rating systems

Data

We will use the data from rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:

d = read.csv('ratings-advadj.csv')
head(d)







Extract subsets

'Horrid' categories

horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
nrow(horrid)
head(horrid)







Extract subsets

'Absolutely'+'Horrid'

With a modifier:

horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
nrow(horrid)
head(horrid)







Tonality evaluation for adjectives

Probabilities of categories for 'horrid'

horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
horrid







Tonality

Probabilities vs frequencies

par(mfrow=c(1,2))
ratingPlot(d, 'horrid', probs=FALSE)
ratingPlot(d, 'horrid', probs=TRUE)

Question: give an example of an adjective which maximizes the median point of the plot.







Evaluating expectation

Predict category using adjective

Predict a category based on an adjective. Expectation:

sum(horrid$Category * horrid$Pr)

The ExpectedCategory function does the same:

ExpectedCategory(horrid)

Adding the value to the plot:

ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
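
The expected category is a probability-weighted average of the rating values. A minimal sketch of that computation, with an invented distribution over a 1-5 rating scale (the numbers are hypothetical, not read from the ratings file):

```python
def expected_category(probs):
    """Expected rating: sum of rating * P(rating | word) over all ratings."""
    return sum(rating * p for rating, p in probs.items())


# Hypothetical P(rating | word) for a negative adjective on a 1-5 scale.
probs = {1: 0.5, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.1}

print(expected_category(probs))
```

A value close to the low end of the scale suggests the adjective is mostly used in negative reviews.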







Regression model

A model for predicting

Let's create a model to predict the probability that a word will be in a particular category

fit.horrid = glm(cbind(horrid$Count, horrid$Total-horrid$Count) ~ Category,
                 family=quasibinomial, data=horrid)
fit.horrid







Regression model


Improve the model by adding a quadratic term

GlmWordQuadratic <- function(pf) {
  # add the squared rating as an extra predictor
  pf$Category2 = pf$Category^2
  fit = glm(cbind(Count, Total-Count) ~ Category + Category2,
            family=quasibinomial, data=pf)
  return(fit)
}
par(mfrow=c(2,2))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))






Vector space models

Vector space models

How to transform words to vectors:

LSA (latent semantic analysis)

MDS (multidimensional scaling)







Basics about vectors

Euclidean distance:

$\mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Vector length:

$\mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2}$

Vector normalization: each component is divided by the vector's length.

Cosine distance between vectors:

$\mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i \, y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)}$
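
The vector operations on this slide translate directly into code. The sketch below uses plain Python lists; the function names follow the slide's notation in snake case and are otherwise arbitrary.

```python
import math


def euclidean_dist(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def vector_length(x):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(a * a for a in x))


def length_norm(x):
    """Divide every component by the vector's length."""
    length = vector_length(x)
    return [a / length for a in x]


def cosine_dist(x, y):
    """1 minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (vector_length(x) * vector_length(y))


a = [1000, 2000, 3000]
b = [1, 2, 3]
print(euclidean_dist(a, b))  # large: the vectors differ a lot in magnitude
print(cosine_dist(a, b))     # about 0: they point in the same direction
```

Euclidean distance is sensitive to overall frequency, while cosine distance only compares direction, which is why the two can disagree strongly on the same pair of vectors.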






Vector space models

Data from IMDB

Initial data: a term x term matrix whose element x_ij is the co-occurrence frequency of term_i and term_j in a context (document, sentence, etc.)

source('vsm.R')
# co-occurrence matrix (words appearing in the same context: phrase, sentence, paragraph)
imdb = Csv2Matrix('imdb-wordword.csv')
imdb[100:110, 100:110]







Semantically related words

Extract semantically related words

df = Neighbors(imdb , 'happy')

head(df)







Semantically related words

Problem

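# two vectors with the same proportions but very different magnitudes
a = c(1000, 2000, 3000)
b = c(1, 2, 3)
a/sum(a)
# 0.1666667 0.3333333 0.5000000
b/sum(b)
# 0.1666667 0.3333333 0.5000000
LengthNorm(a)
# 0.2672612 0.5345225 0.8017837
LengthNorm(b)
# 0.2672612 0.5345225 0.8017837
# after either normalization a and b are indistinguishable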







PMI - Pointwise mutual information

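How to deal with it? PMI!

$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$

PMI normalization (discounting):

$\mathrm{NPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right) + 1}$

where $p(i, j) = M_{ij} / \sum M$ and M is the term x term matrix.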
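
As a small worked example, the sketch below computes positive PMI (negative values clipped to zero) for a toy co-occurrence matrix. It is not the `PMI` function from vsm.R: the matrix is invented, the discounting factor is left out for brevity, and zero counts are simply mapped to 0.

```python
import math


def ppmi(matrix):
    """Positive PMI of a count matrix given as a list of rows."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    result = []
    for i, row in enumerate(matrix):
        out = []
        for j, count in enumerate(row):
            if count == 0:
                # log(0) is undefined; treat unseen pairs as 0
                out.append(0.0)
                continue
            p_ij = count / total
            p_i = row_sums[i] / total
            p_j = col_sums[j] / total
            out.append(max(0.0, math.log(p_ij / (p_i * p_j))))
        result.append(out)
    return result


# Toy symmetric term x term co-occurrence counts (hypothetical).
M = [[10, 0, 2],
     [0, 8, 2],
     [2, 2, 4]]

for row in ppmi(M):
    print([round(v, 3) for v in row])
```

Pairs that co-occur more often than their marginal frequencies predict get a positive score; all other pairs are set to zero.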







PMI - Pointwise mutual information

PMI

# positive PMI with discounting
imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
head(df)







Semantic orientation method

Semantic orientation

Define two sets of words, S1 and S2 (as vector representations)

Choose the distance measure

For a word w: calculate the sum of distances to the vectors of S1 and of S2

The tonality is the difference between the two sums
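
The four steps above can be sketched with toy two-dimensional vectors. Everything in this example is hypothetical: the vectors are made up, and `semantic_orientation` is a simplified stand-in rather than the `SemanticOrientation` function from vsm.R. The score is the summed distance to the first seed set minus the summed distance to the second, so it is positive when the word is closer to the second set.

```python
import math


def cosine_dist(x, y):
    """1 minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    len_x = math.sqrt(sum(a * a for a in x))
    len_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (len_x * len_y)


def semantic_orientation(vectors, word, seeds1, seeds2):
    """Sum of distances to seeds1 minus sum of distances to seeds2."""
    w = vectors[word]
    dist1 = sum(cosine_dist(w, vectors[s]) for s in seeds1)
    dist2 = sum(cosine_dist(w, vectors[s]) for s in seeds2)
    return dist1 - dist2


# Made-up vectors: first axis ~ "positive contexts", second ~ "negative contexts".
vectors = {
    "good": [1.0, 0.1], "nice": [0.9, 0.2],
    "bad": [0.1, 1.0], "poor": [0.2, 0.9],
    "great": [0.8, 0.1], "awful": [0.1, 0.8],
}
neg = ["bad", "poor"]
pos = ["good", "nice"]

print(semantic_orientation(vectors, "great", seeds1=neg, seeds2=pos))
print(semantic_orientation(vectors, "awful", seeds1=neg, seeds2=pos))
```

With the negative seeds passed first, a positive score indicates positive tonality and a negative score indicates negative tonality.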







Semantic orientation method

Example of semantic orientation method

neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# 0.8923544
SemanticOrientation(imdb.ppcd, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# -0.04741898







More information

Data & examples

For more detailed examples and tutorials about sentiment analysis, see Chris Potts' tutorials:

http://nasslli2012.christopherpotts.net

http://sentiment.christopherpotts.net

Email me if you need any help!

