language of politics on twitter - 03 analysis

Language of Politics on Twitter Summer School in AI American University Beirut June 16, 2015 Yelena Mejova @yelenamm Social Computing Group Qatar Computing Research Institute, HBKU


Post on 06-Aug-2015



political

twitter

analysis

Roadmap

• let’s talk politics (sampling)
• political leaning
– human classification
– text-based classification
– network-based classification
• look who’s talking (users)
• predicting elections!

US politics

• Most research done so far
• Clear left/right distinction
• Popular political figures
• High(ish) Twitter engagement

REPUBLICAN (right)

DEMOCRAT (left)

• Sampling Twitter for political speech
– general keywords: #current
– event keywords: #debate08, #tweetdebate
– people: obama, romney, merkel
– parties: democrat, republican, pirate
– accounts: wefollow, twellow
– news stories, known URL retweets

• Caveats
– requires expert knowledge
– known best after the event
– selection bias (who do you want to ignore?)

topical sampling

bootstrapping

1. start with a few key words
2. find tweets that have these words
3. get more words out of these tweets

• seed sample with known political hashtags
– #p2 – Progressives 2.0
– #tcot – Top Conservatives on Twitter

• find hashtags which co-occurred with them, using Jaccard similarity

bootstrapping

$$\text{Jaccard similarity} = \frac{\text{tweets mentioning both}}{\text{tweets mentioning either}}$$
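A minimal sketch of the co-occurrence step, assuming each tweet has already been reduced to a set of hashtags. The function name `hashtag_jaccard` and the toy tweets are made up for illustration; they are not taken from the cited paper.

```python
from collections import defaultdict

def hashtag_jaccard(tweets, seed):
    """Score hashtags by Jaccard similarity with a seed hashtag.

    tweets: list of sets of hashtags, one set per tweet.
    Returns {hashtag: |tweets with both| / |tweets with either|}.
    """
    # Map each hashtag to the indices of the tweets that mention it.
    tweets_with = defaultdict(set)
    for i, tags in enumerate(tweets):
        for tag in tags:
            tweets_with[tag].add(i)

    seed_tweets = tweets_with[seed]
    scores = {}
    for tag, ids in tweets_with.items():
        if tag == seed:
            continue
        both = len(seed_tweets & ids)
        either = len(seed_tweets | ids)
        if both:  # keep only hashtags that co-occur with the seed
            scores[tag] = both / either
    return scores

# Toy, made-up sample: each set holds the hashtags of one tweet.
toy_tweets = [
    {"#tcot", "#teaparty"},
    {"#tcot", "#teaparty", "#gop"},
    {"#tcot"},
    {"#p2", "#dems"},
    {"#gop"},
]
scores = hashtag_jaccard(toy_tweets, "#tcot")
```

In this toy sample `#teaparty` scores higher than `#gop` because it rarely appears without the seed. Hashtags above some chosen threshold would be added to the keyword set for the next sampling round.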

bootstrapping

Predicting the political alignment of twitter users @vagabondjack Conover et al. @ SocialCom (2011)

got your #tag!

• Given set of users with known leaning:

[equation not recovered; its labels: hashtag h, week w, party; aggregated user volume for (h,w); aggregated user volume for (*,w)]

Political hashtag hijacking in the US Hadgu, Garimella, Weber @ WWW (2013)

[some figures from authors’ original slides]

Crimean conflict

Крым (Crimea)

comparing tweets by users with Ukrainian or Russian as profile language

most distinguishing hashtags

Language Plurality in Twitter Political Speech Mejova, Boynton @ ICCSS (2015)

1. Crowdsourcing
2. Text (text classification)
3. Network (label propagation)

political leaning classification

human classification

crowdsourcing

mechanical turk

crowdsourcing

• break the task into micro-tasks (yes/no question)
• have many people answer for a bit of money
• wisdom of crowds will give the right answer
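One simple way to combine the answers to a single micro-task is a majority vote. The sketch below assumes that aggregation rule; the slides do not say which rule was used, and the worker answers are invented.

```python
from collections import Counter

def majority_label(answers):
    """Return the most common answer and the share of workers who gave it."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

# Hypothetical answers from five workers to "Is this tweet about politics?"
worker_answers = ["Y", "Y", "N", "Y", "N"]
label, agreement = majority_label(worker_answers)
```

Asking an odd number of workers avoids ties on a yes/no question. The agreement share can be used to flag items that need more answers.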

crowdsourcing

text classification

Representing Text

• “Bag of words”, i.e. Vector Space Model

break the document into its constituent words and put them in a table

Representing Text

• Preprocessing
– Clean-up
  • remove formatting, tables, HTML…
– Remove stopwords
  • the, of, to, a, in, and, that, for, is
– Stem words
  • get to a “stem” of a word
  • cats -> cat, running -> run, uncomfortable -> uncomfort?

Representing Text

• Vector Space Model:

those lazy cats sleep and sleep everywhere

$$D = (t_1, w_{d1};\ t_2, w_{d2};\ \dots;\ t_v, w_{dv})$$

w: binary, count, TFIDF

lazy  cat  sleep  everywhere  …
1     1    2      1           …
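The example sentence can be turned into its count vector with a few lines of code. This is a sketch under two stated assumptions: "those" is added to the stopword list so the result matches the counts shown, and `toy_stem` is a crude stand-in for a real stemmer such as Porter's.

```python
from collections import Counter

# Stopword list from the preprocessing slide, plus "those" (an assumption,
# added so the output matches the counts in the example).
STOPWORDS = {"the", "of", "to", "a", "in", "and", "that", "for", "is", "those"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def bag_of_words(text):
    """Lowercase, drop stopwords, stem, and count the remaining terms."""
    tokens = text.lower().split()
    terms = [toy_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)

doc_vector = bag_of_words("those lazy cats sleep and sleep everywhere")
# doc_vector holds: lazy 1, cat 1, sleep 2, everywhere 1
```

Word order is discarded, which is exactly what "bag" means here. Only the counts per term survive.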

TFIDF

term frequency – inverse document frequency
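The slides give no formula for the weight, so the sketch below assumes one common variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy documents are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf * log(N / df) for every term of every tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

toy_docs = [
    ["lazy", "cat", "sleep", "sleep"],
    ["cat", "vote"],
    ["vote", "election", "vote"],
]
weights = tfidf(toy_docs)
```

In the first toy document "sleep" gets a higher weight than "cat": it occurs twice and appears in no other document, while "cat" is shared with a second document.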

Problems

• Synonymy
– multiple words that have similar meanings
• Polysemy
– words that have more than one meaning

EYE DROPS OFF SHELF
PROSTITUTES APPEAL TO POPE
KIDS MAKE NUTRITIOUS SNACKS
STOLEN PAINTING FOUND BY TREE
LUNG CANCER IN WOMEN MUSHROOMS
QUEEN MARY HAVING BOTTOM SCRAPED
DEALERS WILL HEAR CAR TALK AT NOON
MINERS REFUSE TO WORK AFTER DEATH
MILK DRINKERS ARE TURNING TO POWDER
DRUNK GETS NINE MONTHS IN VIOLIN CASE
GRANDMOTHER OF EIGHT MAKES HOLE IN ONE
HOSPITALS ARE SUED BY 7 FOOT DOCTORS
LAWMEN FROM MEXICO BARBECUE GUESTS
TWO SOVIET SHIPS COLLIDE, ONE DIES
ENRAGED COW INJURES FARMER WITH AX
LACK OF BRAINS HINDERS RESEARCH
RED TAPE HOLDS UP NEW BRIDGE
SQUAD HELPS DOG BITE VICTIM
IRAQI HEAD SEEKS ARMS
HERSHEY BARS PROTEST

text classification

document → classifier → label

• is it spam?
• is it important?
• is it happy?
• is it true?
• is it a flight ticket?

document → classifier → label

• is it written well?
• is it about politics?
• is it a bully?
• is it fake?
• is it a joke?

classifiers

naïve bayes
decision trees
support vector machines
logistic regression
perceptron
neural networks
k-nearest neighbor

naïve bayes classifier

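• We want to know the probability of a class given an instance represented by a feature vector. By Bayes’ Theorem:

$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$$

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

– denominator $p(F_1, \dots, F_n)$: constant no matter C
– numerator $p(C)\, p(F_1, \dots, F_n \mid C) = p(C, F_1, \dots, F_n)$: joint probability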

naïve bayes classifier

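• Expand the joint probability using the chain rule

$$p(C, F_1, \dots, F_n) = p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})$$

• But to simplify, we use a naïve assumption of conditional independence for each feature

$$p(F_i \mid C, F_j) = p(F_i \mid C) \quad \text{for } i \neq j$$

https://en.wikipedia.org/wiki/Naive_Bayes_classifier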

naïve bayes classifier

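• Finally, the conditional distribution over class C

$$p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$

– $1/Z$: scaling factor
– $p(C \mid F_1, \dots, F_n)$: probability of class C given a document with some features
– $p(C)$: prior of the class
– $p(F_i \mid C)$: frequency based probability of features in that class C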
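The pieces named above (class prior, per-class feature frequencies, scaling factor) can be sketched as a small word-count classifier. Two choices here are assumptions not stated on the slides: add-one (Laplace) smoothing, and summing logarithms instead of multiplying probabilities. The training tweets are invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Return the class maximizing log p(C) + sum_i log p(F_i | C)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # prior of the class
        total = sum(word_counts[label].values())
        for t in tokens:
            # Add-one smoothing so an unseen word does not zero the product.
            score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training tweets, already tokenized.
training = [
    (["lower", "taxes", "now"], "right"),
    (["cut", "taxes", "freedom"], "right"),
    (["healthcare", "for", "all"], "left"),
    (["expand", "healthcare", "now"], "left"),
]
model = train_nb(training)
prediction = predict_nb(["taxes", "freedom"], model)
```

The scaling factor is never computed: it is the same for every class, so it does not change which class scores highest.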

support vector machine

• Finds a hyperplane in high-dimensional space that maximizes the distance to the nearest training point of any class

https://en.wikipedia.org/wiki/Support_vector_machine
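The quantity being maximized, the distance from the hyperplane to the nearest training point, can be computed directly. The sketch below does not train an SVM; it only compares two hand-picked hyperplanes on invented 2-D data to show what "larger margin" means.

```python
import math

def geometric_margin(w, b, points):
    """Smallest signed distance from any labeled point to w.x + b = 0.

    points: list of (x, y) with x a feature tuple and y in {-1, +1}.
    A negative result means at least one point is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in points
    )

# Toy 2-D data: two points per class.
toy_points = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
              ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two hypothetical separating hyperplanes.
margin_a = geometric_margin((1.0, 1.0), -2.0, toy_points)  # x1 + x2 = 2
margin_b = geometric_margin((1.0, 0.0), -1.5, toy_points)  # x1 = 1.5
```

Both hyperplanes separate the classes, but the first leaves more room to the nearest point, so a maximum-margin classifier would prefer it over the second.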

political leaning classification

Predicting the political alignment of twitter users @vagabondjack Conover, Gonçalves, Ratkiewicz, Flammini, Menczer @ SocialCom (2011)

Is a user politically left or right?

confusion matrix: actual class (A, B) vs. predicted class (A, B)

Classifier: Support Vector Machine
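A confusion matrix for a two-class problem such as left vs. right can be tallied as follows. The labels and predictions are hypothetical, not results from the cited paper, and putting actual classes in rows is a convention chosen for this sketch.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]

# Hypothetical labels for six users.
classes = ["left", "right"]
actual = ["left", "left", "left", "right", "right", "right"]
predicted = ["left", "left", "right", "right", "right", "left"]

matrix = confusion_matrix(actual, predicted, classes)
# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(actual)
```

Off-diagonal cells show which class is mistaken for which, something a single accuracy number hides.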

network-based classification

network

adjacency matrix

0 1
0 4
1 2
1 3
1 4
2 3
3 4

adjacency list
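Both representations can be built from the same list of node pairs. The sketch assumes the number pairs on the slide are undirected edges between nodes 0 to 4.

```python
# Edges as read from the slide's number pairs (assumed undirected).
edges = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
n = 5

adj_matrix = [[0] * n for _ in range(n)]
adj_list = {node: [] for node in range(n)}
for u, v in edges:
    # Undirected: record the edge in both directions.
    adj_matrix[u][v] = adj_matrix[v][u] = 1
    adj_list[u].append(v)
    adj_list[v].append(u)
```

The matrix takes n × n cells no matter how many edges exist, while the list grows with the number of edges. For sparse graphs such as follower or retweet networks the list is usually the more economical choice.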

network label propagation

at each step update each node’s label based on its neighbors

• Label propagation
– Initialize cluster membership arbitrarily
– Iteratively update each node’s label according to the majority of its neighbors
– Ties are broken randomly
• Cluster assignment by majority cluster label (using manually labeled data)
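A minimal sketch of the two steps, assuming an undirected graph stored as an adjacency list. One detail is an assumption added here: a node keeps its current label when that label is already among the most common ones of its neighbors, which lets the loop stop. The toy graph and the two manually labeled users are invented.

```python
import random
from collections import Counter

def label_propagation(adj, max_sweeps=100, seed=0):
    """Cluster the nodes of an undirected graph given as {node: [neighbors]}."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # arbitrary start: one cluster per node
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            counts = Counter(labels[nb] for nb in adj[node])
            if not counts:
                continue  # an isolated node keeps its label
            top = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == top]
            if labels[node] in candidates:
                continue  # already agrees with a majority label
            labels[node] = rng.choice(candidates)  # ties are broken randomly
            changed = True
        if not changed:
            break
    return labels

def name_clusters(labels, known):
    """Give each cluster the majority leaning of its manually labeled members."""
    votes = {}
    for node, leaning in known.items():
        votes.setdefault(labels[node], Counter())[leaning] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in votes.items()}

# Toy graph: two separate triangles of users.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1],
             3: [4, 5], 4: [3, 5], 5: [3, 4]}
clusters = label_propagation(toy_graph)
# Every cluster in this toy graph contains one manually labeled user.
leaning_of_cluster = name_clusters(clusters, {0: "left", 5: "right"})
user_leaning = {n: leaning_of_cluster[c] for n, c in clusters.items()}
```

On real retweet networks the clusters are connected by some edges, so the outcome can vary between runs. Clusters without any manually labeled member would need separate handling.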

political leaning classification

retweet network

Twitter polarity classification with label propagation over lexical links and the follower graph

@speriosu Speriosu, Sudan, Upadhyay, Baldridge @ EMNLP (2011)

political leaning classification

[figure labels: known, known, automatically labeled]

bonus: news are users too!

news polarization

Visualizing media bias through Twitter @JisunAn An, Cha, Gummadi, Crowcroft, Quercia @ AAAI (2012)

Jaccard similarity of their audience (co-subscribers)

distance between two media

overlap in common audience (followers on Twitter)

look who’s talking

look who’s talking

Vocal Minority versus Silent Majority: Discovering the Opinions of the Long Tail @enimust Mustafaraj, Finn, Whitlock, Metaxas @ SocialCom (2011)

number of tweets per user

look who’s talking

Vocal Minority versus Silent Majority: Discovering the Opinions of the Long Tail @enimust Mustafaraj, Finn, Whitlock, Metaxas @ SocialCom (2011)

GOP primary season on twitter: popular political sentiment in social media @yelenamm Mejova, Srinivasan, Boynton @ WSDM (2013)

look who’s talking

• Truthiness is a quality characterizing a "truth" that a person making an argument or assertion claims to know intuitively "from the gut" or because it "feels right" without regard to evidence, logic, intellectual examination, or facts.

Detecting and Tracking Political Abuse in Social Media Ratkiewicz, Conover, Meiss, Goncalves, Flammini, Menczer @ ICWSM (2011)

look who’s talking

Classifying memes (hashtags) for astroturf (fake grass roots movements)

Detecting and Tracking Political Abuse in Social Media Ratkiewicz, Conover, Meiss, Goncalves, Flammini, Menczer @ ICWSM (2011)

look who’s talking

most useful: network features

Truthy project by Indiana University http://truthy.indiana.edu/

look who’s talking

look who’s talking

#ampat @PeaceKaren_25 & @HopeMarie_25

gopleader.gov Chris Coons

#Truthy @senjohnmccain on.cnn.com/aVMu5y “Obama said…”

TRUTHY

LEGITIMATE

elections

Science vol 338

sentiment classification

tweet on a topic → classifier → positive vs negative

Trained Classifiers
• can “tune” for specific topic and data
• but expensive

Sentiment Lexicons
• can use “out of the box”
• but may not work for every topic
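The lexicon approach can be sketched in a few lines: look up each word's score and classify by the sign of the sum. The six-word lexicon below is made up for illustration, and the "neutral" outcome for a zero score is an addition not shown on the slide.

```python
# Tiny made-up lexicon; real lexicons hold thousands of scored words.
LEXICON = {"great": 1, "win": 1, "love": 1, "bad": -1, "fail": -1, "lies": -1}

def lexicon_sentiment(tweet):
    """Sum the word scores and classify by the sign of the total."""
    score = sum(
        LEXICON.get(word.strip(".,!?#").lower(), 0) for word in tweet.split()
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

toy_tweets = [
    "Great debate, love it!",
    "More lies and a bad plan",
    "Debate starts at nine",
]
labels = [lexicon_sentiment(t) for t in toy_tweets]
```

No training data is needed, which is the "out of the box" advantage. The weakness also shows: any word missing from the lexicon, or used sarcastically, is scored wrongly or not at all.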

political discussions: debates

• Mean valence:
– Obama: -2.09
– McCain: -5.64

Characterizing Debate Performance via Aggregated Twitter Sentiment @ndiakopoulos Diakopoulos, Shamma @ CHI (2010)

an emotional story

volume positive - negative

• 2009 German federal elections

elections

Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment Tumasjan, Sprenger, Sandner, Welpe @ AAAI (2010)

“The mere number of tweets reflects voter preferences and comes close to traditional election polls”
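A volume-based prediction of this kind treats each party's share of tweets as its predicted share of votes and is scored by mean absolute error (MAE) against the result. The sketch below uses invented counts for three hypothetical parties; none of the numbers come from the paper.

```python
def shares(counts):
    """Convert raw counts per party into percentage shares."""
    total = sum(counts.values())
    return {party: 100 * c / total for party, c in counts.items()}

def mean_absolute_error(predicted, actual):
    """Average absolute gap, in percentage points, over the parties in `actual`."""
    return sum(abs(predicted[p] - actual[p]) for p in actual) / len(actual)

# Invented numbers for three hypothetical parties.
tweet_counts = {"A": 500, "B": 300, "C": 200}
vote_share = {"A": 45.0, "B": 35.0, "C": 20.0}

predicted_share = shares(tweet_counts)
mae = mean_absolute_error(predicted_share, vote_share)
```

Both the shares and the error depend on which parties are included in `tweet_counts` and on the time window in which the tweets were collected.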

CONTROVERSY!

elections

Why the Pirate Party won the German election of 2009 or the trouble with predictions: A response to Tumasjan, Sprenger, Sander, & Welpe, "Predicting elections with twitter: What 140 characters reveal about political sentiment" @ajungherr Jungherr, Jürgens, Schoen @ SSCR V30/N2 (2012)

“arbitrary choices”

Choice of Parties

If results of polls played a role in deciding upon the inclusion of particular parties, the TSSW method is dependent on public opinion surveys

Choice of Dates

prediction analysis […] between [13.9] and [27.9], the day of the election, produces a MAE of 2.13, significantly higher than the MAE for TSSW

• 2012 US Republican Primary Debates
• Predicting poll swings around televised debates:
– 104 predictions overall

elections

GOP primary season on twitter: popular political sentiment in social media @yelenamm Mejova, Srinivasan, Boynton @ WSDM (2013)

Both volume and sentiment classification perform no better than random

elections

single variable logistic regression models

multi-variable logistic regression models

strong baselines!

having followers (in your own party?)

focusing on centrist issues

graph structure and content significantly improve accuracy

The Party Is Over Here: Structure and Content in the 2010 Election Livne, Simmons, Adar, Adamic @ ICWSM (2011)

• Non-US elections:
– Irish: On using twitter to monitor political sentiment and predict election results, Bermingham, Smeaton (2011)
  • “Our approach however has demonstrated an error which is not competitive with the traditional polling methods.”
– Dutch: Predicting the 2011 Dutch senate election results with twitter, Sang, Bos (2012)
  • Uses polls for demographic imbalances, yet performance still below traditional polls
– Singapore: Tweets and votes: A study of the 2011 singapore general election, Skoric, Poor, Achananuparp, Lim, Jiang (2012)
  • Not as accurate as traditional polls, performance at local government levels
– many more coming out each day!

elections

Metaxas et al. @ SocialCom (2011)

• Data from social media are fundamentally different than data from natural phenomena
– people change their behavior next time around
– spammers & activists will try to take advantage
• Form a testable theory on why and when it predicts (avoid self-deception!)
• (maybe) Learn from professional pollsters
– tweet ≠ user
– user ≠ eligible voter
– eligible voter ≠ voter
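The first inequality, tweet ≠ user, can be made concrete by counting the same stream of mentions two ways. The sketch uses an invented stream with one very vocal user; counting each user once for the candidate they mention most is just one possible rule, chosen here for illustration.

```python
from collections import Counter, defaultdict

def share_by_tweets(mentions):
    """Share of tweets per candidate; mentions = [(user, candidate), ...]."""
    counts = Counter(candidate for _, candidate in mentions)
    return {c: n / len(mentions) for c, n in counts.items()}

def share_by_users(mentions):
    """Share of users, each counted once for the candidate they mention most."""
    per_user = defaultdict(Counter)
    for user, candidate in mentions:
        per_user[user][candidate] += 1
    favorites = Counter(c.most_common(1)[0][0] for c in per_user.values())
    return {c: n / len(per_user) for c, n in favorites.items()}

# Invented stream: one very vocal user and three quiet ones.
toy_mentions = [("u1", "X")] * 7 + [("u2", "Y"), ("u3", "Y"), ("u4", "Y")]
by_tweets = share_by_tweets(toy_mentions)
by_users = share_by_users(toy_mentions)
```

Counted by tweets, candidate X leads with 70%. Counted by users, Y leads with 75%. The remaining two inequalities cannot be resolved from tweets alone and would need outside information about who is eligible and who actually votes.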

How (Not) To Predict Elections @takis_metaxas Metaxas et al. @ SocialCom (2011)

elections

but what can we do?

help campaigners reach more people

predict people’s political leaning

help understand reasons for affiliation

recommend politicians, news, friends

detect sudden strong sentiment about a topic

detect polarization (users & news)

views of issues from around the world

light summer reading

• M. D. Conover, B. Gonçalves, J. Ratkiewicz, A. Flammini, and F. Menczer, “Predicting the political alignment of twitter users,” in Privacy, security, risk and trust (passat), 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011, pp. 192–199.

• M. D. Conover, J. Ratkiewicz, M. Francisco, B. Goncalves, F. Menczer, and A. Flammini, “Political Polarization on Twitter,” International Conference on Weblogs and Social Media (ICWSM), 2011.

• M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge, “Twitter polarity classification with label propagation over lexical links and the follower graph,” in Proceedings of the First workshop on Unsupervised Learning in NLP, 2011, pp. 53–63.

• I. Weber, V. R. K. Garimella, and A. Teka, “Political hashtag trends,” in Advances in Information Retrieval, Springer, 2013, pp. 857–860.

• A. T. Hadgu, K. Garimella, and I. Weber, “Political hashtag hijacking in the US,” in Proceedings of the 22nd international conference on World Wide Web companion, 2013, pp. 55–56.

• M. Pennacchiotti and A.-M. Popescu, “Democrats, republicans and starbucks afficionados: user classification in twitter,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 430–438.

• N. A. Diakopoulos and D. A. Shamma, “Characterizing Debate Performance via Aggregated Twitter Sentiment,” Conference on Human Factors in Computing Systems (CHI), 2010.

• L. Chen, W. Wang, and A. P. Sheth, “Are twitter users equal in predicting elections? a study of user groups in predicting 2012 US republican presidential primaries,” in Social Informatics, Springer, 2012, pp. 379–392.

• J. An, M. Cha, K. P. Gummadi, J. Crowcroft, and D. Quercia, “Visualizing media bias through Twitter,” Association for the Advancement of Artificial Intelligence (AAAI), Technical WS-12-11, 2012.

• E. Mustafaraj, S. Finn, C. Whitlock, and P. T. Metaxas, “Vocal Minority versus Silent Majority: Discovering the Opinions of the Long Tail,” in International Conference on Social Computing, 2011, pp. 103–110.

• J. Ratkiewicz, M. D. Conover, M. Meiss, B. Goncalves, A. Flammini, and F. M. Menczer, “Detecting and Tracking Political Abuse in Social Media,” International Conference on Weblogs and Social Media (ICWSM), 2011.

• A. Livne, M. Simmons, E. Adar, and L. Adamic, “The Party Is Over Here: Structure and Content in the 2010 Election,” International Conference on Weblogs and Social Media (ICWSM), 2011.

• A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, “Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment,” Association for the Advancement of Artificial Intelligence Conference (AAAI), 2010.

• P. Metaxas, E. Mustafaraj, and D. Gayo-Avello, “How (Not) To Predict Elections,” International Conference on Social Computing, 2011.

• A. Jungherr, P. Jürgens, and H. Schoen, “Why the pirate party won the german election of 2009 or the trouble with predictions: A response to Tumasjan, A., Sprenger, T. O., Sander, P. G., & Welpe, I. M. ‘Predicting elections with twitter: What 140 characters reveal about political sentiment’,” Social Science Computer Review, vol. 30, no. 2, pp. 229–234, 2012.

• I. Weber, V. R. K. Garimella, and A. Batayneh, “Secular vs. Islamist polarization in Egypt on Twitter.” ASONAM, 2013.

Surveys

• D. Gayo-Avello, “‘I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper’--A Balanced Survey on Election Prediction using Twitter Data,” arXiv preprint arXiv:1204.6441, 2012.

• D. Gayo-Avello, “A meta-analysis of state-of-the-art electoral prediction from Twitter data,” Social Science Computer Review, 2013.