mining and analyzing social media part 1 - hicss47 tutorial - dave king

86
Mining and Analyzing Social Media: Part 1 Dave King HICSS 47 - January 2014

Upload: dave-king

Post on 12-May-2015

606 views

Category:

Social Media


3 download

DESCRIPTION

Part 1 of a Tutorial on Mining and Analyzing Social Media at HICSS 47

TRANSCRIPT

Page 1: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mining and Analyzing Social Media: Part 1Dave King

HICSS 47 - January 2014

Page 2: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

• Introduction– Bio– Some definitions– Growing Interest– Resources

• Social Homophily as an Example– Meaning and Import– Political Cyberbalkanization

• Social Network Analysis• Text Mining

Agenda: Part 1

2

Page 3: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

• Introduction to Social Network Analysis Metrics– Degrees of Separation– Hooray for Bollywood

• Standard Measures– Centrality– Cohesion

• Levels of Social Network Analysis – Facebook & Egocentric Analysis– Searching for Cohesive Subgroups

Agenda: Part 2

3

Page 4: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

4

BiographyDave King

UniversityProfessor

Dir R&DExecucom

SVP & CTOComshare

EVP R&DJDA Software

App StatsMath Modeling

DSSOptimization

App Stats

Nat Lang forDB Query

DSS InterfaceHICSS (83)

‘70 ‘80 ‘90 ‘00 ‘10

DSSMulti-Dim DB

EIS & BI

Merch PlanningSCP PlanningOptimization

EComHICSS

Track Intern& Dig Econ

HICSSTutorial

Web Mining &Analysis

SentimentAnalysis

Page 5: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Social Media Defined

5

Page 6: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

View from the PinwheelSocial Media

Page 7: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

7

View from the Word CloudsSocial Media

Page 8: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Defined

8

is the media we use to be social. That’s it.

Page 9: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Two ApproachesMining and Analyzing Social Media

9

Connections

Content

Page 10: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

ContentSocial Media

Blogs

Tags

Emails

Reviews

Recommendations

Tweets SMS

IM Chat

Ratings

Profiles

Bookmarks Photos

Articles

Presentations

Wiki Entries

News Stories

Conversations

Discussion

Lifestreams

Music

RSS Feeds

FAQs Video

Location

Page 11: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

11

Social ConnectionsSocial Media

Friends

Followers

Authors

Readers

Family

Collaborators

Participants

Senders

Recipients

Groups

Contacts

Subscribers Collaborators

Answerers

Page 12: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

ConnectionsSocial Media

ObjectsPeople

Places

Page 13: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Types of Mining and Analysis

13

Page 14: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

The ContentMining and Analyzing Social Media

Web MiningData Mining focused on the

analysis of Web Usage, Structure & Content

Data MiningDiscovering meaningful patterns

from large data sets using pattern recognition technologies

Text MiningUsing natural language processing & data mining to discover patterns

in a collection of “documents

Page 15: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

15

The ConnectionsMining and Analyzing Social Media

Social Network Analysis [SNA]

The mapping and measuring of relationships and flows between people, groups, organizations, computers,

URLs, and other connected information/knowledge

entities.

Network science Study of the theoretical foundations of network

structure and behavior and the application of networks to

many subfields including SNA, collaboration networks,

emergent systems, and physical and life science

systems

Page 16: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

InterestYou are not alone or maybe you are?

16

Page 17: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

17

Search - Google TrendsInterest

Data Mining

Big Data

Web MiningText Mining

Web Mining

Social Network Analysis

Network Science

Sentiment Analysis

Social Media Mining

Data Analytics Mathematical Statistics

Page 18: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Book Titles – Google Ngram ViewerInterest

18

Mathematical Statistics

Social Network Analysis

Network Science

Data Mining

Data Analytics

Big Data

Text Mining

Web Mining

Sentiment Analysis

Social Media**

Page 19: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

ResourcesCustomers who spent a lot of money on these items also…

19

Page 20: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

General Coverage and ProgrammingWeb & Text Mining Resources

20

Page 21: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Computational Models and ProgrammingText Mining & Social Network Analysis

21

Page 22: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

General Coverage of SNASocial Network Analysis

Page 23: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mathematical Models and SNA ToolsSocial Network Analysis

UCI

UCI UCIPajek ORAUCI,Pajek

& ORA NodeXL

Page 24: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Online Tutorials and ClassesSocial Network Analysis

Page 25: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

An ExampleSocial Homophily from Two Perspectives

25

Page 26: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

26

aka Assortative MixingSocial Homophily

Sound bitesLove of the sameBirds of a feather flock togetherLikes associate with likesLikes copy likesLikes think alike

Page 27: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

27

Some things that are homogenousSocial Homophily

Page 28: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Key ElementsSocial Network Analysis

A

B

C

D

Graph[ V,E, f ]

Vertex(Node)

Edge(Link)

Vertices or Nodes The “things”

Edges or LinksThe “relationships”

Graph or NetworkThe set of vertices/nodes, edges/links and the relationship/function connecting them.

Page 29: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

29

Types of Vertices

• Social Entities– People or social structures such as

workgroups, teams, organizations, institutions, states, or even countries.

• Content– Web pages, keyword tags, or videos.

• Locations– Physical or virtual locations or events.

• Primary building blocks of social media– Friends in social networking sites,

posts or authors in blogs, or pages in wikis.

Social Network Analysis

A

B

C

D

What does a Vertex

represent?

Page 30: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

30

Types of Relations

• Similarities– Location (same spatial and temporal space)– Participation (same club, same event, …)– Attributes (Age, gender, same attitudes, …)

• Relational Roles– Kinship (mother of, sibling of, …)– Other Roles (friend of, boss of, …)

• Relational Cognition– Affective (Likes, Hates…)– Perceptual (Knows, Knows of, …)

• Relational Events– Interactions (Sold to, talked to, helped, …)– Flows (Information, beliefs, money, …)

Social Network Analysis

A

B

C

D

What does an Edge

represent?

Page 31: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Types of Edges or LinksSocial Network Analysis

31

Undirected,Unweighted

A B

C

Facebook Friends

Undirected,Weighted

A B

C

CollaborationNetwork

A B

C

TwitterFollowers

Directed,Unweighted

A B

C

EmailNetwork

Directed, Weighted100

6070

20

5

10Width oflines

Page 32: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Alternative RepresentationsSocial Network Analysis

32

Graph Edge List Adjacency Matrix

Adjacency List XML

A

B

CD

Page 33: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

33

Multimodal networksSocial Network Analysis

I1 I2 I3 I4

E1 E2 E3 E4

Bipartite or Bimodal NetworkLinking individuals to events(participation, membership, topic, tag, post, …)

Examples:Authors‐to‐papers (they authored)Actors‐to‐Movies (they appeared in)Users‐to‐Movies (they rated) “Folded” networks:Author collaboration networksMovie co‐rating networks

I1 I2

I3 I4

2

1

2

M * MT

12

1

2

E1

E4

MT * M

E2

E3

11

Page 34: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

One of the founding fathersSocial Network Analysis

34

Page 35: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

35

Any guesses what this is and what it shows?Social Homophily

Page 36: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

36

Any guesses what this is and what it shows?Social Homophily

Page 37: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

37

Gender homophily in adolescent school groupSocial Homophily

Density or Distinction? The Roles of Data Structure and Group Detection Methods in Describing Adolescent Peer GroupsGest, S. et al. Journal of Social Structure: Vol. 6, http://www.cmu.edu/joss/content/articles/volume8/GestMoody/

Page 38: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

38

Role of Attributes

• Add insights to visualizations and analysis

• May include:– Demographic characteristics

(age, gender, race)– Other characteristics (income,

education, location)– Use of the system (number of

logins, blogs posted, edits made)

– Network Metrics (centralization)

• Often designated by color, shape, size and/or opacity of the vertices

Social Networks

B

CD

A

Page 39: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

39

Any guesses what this is and what it shows?Social Homophily

Page 40: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

40

The beauty of social networksSocial Homophily

ted.com/talks/nicholas_christakis_the_hidden_influence_of_social_networks.html

Page 41: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

41

Regardless of cause the resulting pattern isSocial Homophily

Clusters of connections emerge throughout the social space such that there is dense intra-connectivity within clusters

and sparse inter-connectivity between clusters

Page 42: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

42

What produces a pattern of homophily?Social Homophily

Cause: Similarity breeds connectionEffect: Connection breeds similaritySpurious: Some other factor is @ work

Page 43: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

43

In the world of U.S. politics?Social Homophily

DiscourseViewingVotingPurchasing…

Antagonistic Polarization?

Page 44: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

BalkanizationSocial Homophily

44

Process of fragmentation or division of a region or state into smaller regions or states that are often hostile or non-cooperative with each other

Page 45: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

45

Balkanization of the US Congress (Data Sets)Social Homophily

senate.gov/reference/

Index/Votes.htm

clerk.house.gov/legislative/

legvotes.aspx

govtrack.us/developers

Before Threshold

AfterThreshold

"Senate Voting Patterns," Morris, F.analyticalvisions.blogspot.com/2006/04/senate-voting-patterns-part-2.html

Page 46: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

46

Balkanization of US Senate (1989-2013) - VotesSocial Homophily

U.S. Senate More Divided Than Ever Data Shows,Swallow,E. ,Forbes, Oct. 17, 2013, forbes.com/sites/ericaswallow/2013/11/17/senate-voting-relationships-data/Partisan voting patterns – 1989 to presentInteractive data visualization static.davidchouinard.com/congress/

Page 47: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

47

Balkanization of US Senate (1975-2012) - VotesSocial Homophily

Portrait of Political Party Polarization, Moody J. and P. Mucha, April 2013, Network Science, journals.cambridge.org/action/displayAbstract?fromPage=online&aid=8888768

Page 48: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Are political books of “a feather” purchased together?Social Homophily

Aug 2008 Oct 2008

The Social Life of Books,Krebs V., orgnet.com/divided.html

Page 49: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Cyberbalkanization of the Political Blogsphere

• Single day snapshot of a Snowball Sample of Political Blogs (N=1490)

• Manually assigned as Liberal or Conservative

• Focus on blogrolls and front page citations

• Are we witnessing an era of “Cyberbalkinization?”

Social Homophily

49

Page 50: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

50

An aside – what’s a (We)blog?

• Information or discussion site – Focused on a particular subject area– Discrete entries posted in reverse

chronological order– Primarily text– Trend toward multiple author entries

• Some stats? – Wordpress ~ 60M blogs with 100M

pageviews/day– Tumblr ~ 150M blogs with 90M posts/day– Blogger ~ 46M unique visitors/month

Social Homophily

Page 51: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

51

Snowball SampleSocial Homophily

Political Blog Directories

eTalkingHead BlogCatalog Blogarama

Closed for business

CampaignLine

Individual Political Blogrolls

Page 52: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

52

Social Network PackagesSocial Homophily

UCIUCI UCIPajek NodeXLUCI,Pajek

& ORA ORA

Page 53: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

53

Social Network Analysis of Blogs with PajekSocial Homophily

Page 54: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

54

Social Network Analysis of Blogs with PajekSocial Homophily

• Colors and bold simply for illustration• Liberal Blogs (nos. 1-758)• Conservative Blogs (nos. 759 – 1490)

.CLU Partition File(0-Lib; 1 – ConsOne entry per vertex)

*vertices 1490000…0111…1

.NET Network File(Vertex descriptions followed By Edge descriptions)

*vertices 14901 100monkeystyping 0.0 0.02 12thharmonic 0.0 0.0...757 yoder 0.0 0.0758 younglibs 0.0 0.0...759 84rules 0.0 0.0760 a100wwe 0.0 0.0...1489 zeke01 0.0 0.01490 zeph1z 0.0 0.0*edges1 23 1.01 55 1.0...757 65 1.0760 854 1.0760 855 1.0...1489 1431 1.01490 802 1.0

Page 55: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

55

Social Network Analysis of Blogs with PajekSocial Homophily

Draw Network

Circular Layout w. labels

Fructerman-Reingold2D Layout with labels

F-R 2D Layout w/o labels

Draw Partition

F-R 2D Layout w/o labels partitioned by pol. position

Page 56: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Cyberbalkanization of the Political Blogsphere?Social Homophily

56

Proliferation of specialized online political news sources allows people with different political leanings to be exposed only to information in agreement with their previously held views?

Citation Network – link implies that a political blog has a URL on it’s front page referencing another blog

Page 57: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Cyberbalkanization of the Political Blogsphere?Social Homophily

57

N=1490Edges = 16715

N=758Edges = 7301

Liberals

N=732Edges = 7839

Conservatives

Page 58: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

58

Cyberbalkanization of the Political BlogsphereSocial Homophily

Page Citation

L

C

N=Top 202 MnthPosts = 12470Citations =1398

N=Top 202 MnthPosts = 10414Citations =2422

N = 312(13%)

N = 247(18%)

N = 2110(87%)

N = 1151(82%)

Blogrolls

L

C

PolBlog

N=732Avg. Links = 15.1

N=758Avg. Links=13.2

82%

84%

67%

74%

N=1490

92%

8% 10%

90%

Page 59: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

59

Polarization on Twitter

• Twitter communications six weeks prior to 2010 mid-term elections (9/14-11/1)

• Examined 250,000 “politically relevant” tweets (specific hashtags)

• Focused on two networks – retweets and mentions using “streaming API”

Social Homophily

Page 60: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

60

An Aside – What is Twitter?

• Each tweet <= 140 characters (avg. 10-15 words/message)

• Heavy presence of non-alpha symbols, abbrevs, misspellings and slang

• Tweets often include– Retweets (original tweet

repeated)– Mentions (user mentions

another user in a tweet)– Hashtags “#” (metadata

annotation denoting topic or audience)

Social Homophily

http://www.forbes.com/sites/marketshare/2013/01/23/the-number-one-mistake-retail-brands-make-when-it-comes-to-twitter/

Page 61: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

61

Polarization on Twitter – Social Networks (in study)Social Homophily

A Bretweets

Network 1:

Amentions

B

Network 2:

Both representing potential pathways for information to flow from A to B

N=23,766

N=10,142

Page 62: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

62

Polarization on Twitter – Social Networks

<?xml version="1.0" encoding="UTF-8"?><graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"><!-- Created by igraph --> <key id="cluster" for="node" attr.name="cluster" attr.type="string"/> <key id="tags" for="edge" attr.name="tags" attr.type="string"/> <key id="type" for="edge" attr.name="type" attr.type="string"/> <key id="urls" for="edge" attr.name="urls" attr.type="double"/> <key id="time" for="edge" attr.name="time" attr.type="string"/> <graph id="G" edgedefault="directed"> <node id="n0"> <data key="cluster">right</data> </node> <node id="n2"> <data key="cluster">left</data> </node> ... <edge source="n12464" target="n7349"> <data key="tags">[&apos;#tcot&apos;, &apos;#tlot&apos;]</data> <data key="type">retweet</data> <data key="urls">1</data> <data key="time">1286901355</data> </edge>... </graph></graphml>

Social Homophily

Gephi RepresentationGraphicXMLRetweet Network

Source:http://carl.cs.indiana.edu/data/icwsm/icwsm_polarization.zip

Page 63: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

63

Polarization on Twitter – Gephi Analysis Social Homophily

Forced Directed

Layout (Force Atlas 2)

Nodes coloredby Cluster(Political Ideology)

Page 64: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

64

Contributions of Twitter Study

• Cluster analysis of networks derived from corpus shows that the networks of retweets exhibits clear segregation while the mentioned network is dominated by single cluster.

• Manual classification of Twitter user by political alignment, demonstrating that the retweet network clusters corresponding to political left and right. They also show the mention network is political.

• Interpretation of the observed community structures based on injection of partisan content into ideologically opposed hashtag information streams

Social Homophily

Retweets

Mentions

Page 65: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

65

Polarization – The other side of the coin

• In addition to citations, Adamic and Glance also considered the similarity in the textual content of the blogs

• Identified 498 “bigrams” across the Top 40 blogs

• Computed cosine similarity measure between TD-IDF scores for all pairs of blogs

• Average similarity between opposites was .10, among conservatives was .54 and liberals was .57.

Social Homophily

Oh, buy the way…

Page 66: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

66

Mining Text Content

• “Scholars have long recognized that much of politics is expressed in words.”

• “To understand what politics is about we need to know what political actors are saying and writing.”

• “…The assumption is that ideology dominates the language used in the text.”

• Ergo, another way to study the polarization of political blogs is to compare what they say.

Social Homophily

Page 67: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

67

Mining Text Content - Political Blogs ExampleSocial Homophily

Republican nightmare begins: Obamacare is 'a godsend' for people getting coverage

U.S. President Barack Obama delivers remarks alongside Human Services Secretary Kathleen Sebelius (R) and other Americans the White House says will benefit from the opening of health insurance marketplaces under the Affordable Care Act, in the Rose Garden.

The health care exchanges—federal and state—are now functioning and not sucking up all the oxygen around the implementation of Obamacare. Finally, the "good news" news stories are finally being told, like this one in The New York Times.…That's the kind of story that makes Republicans quake in their shoes. For more of them, go below the fold.

Conservative or Liberal?

Page 68: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

68

Mining Text Content – Political Blogs Example

The Colorado #obamacare exchange is a disaster.

I’m sorry, but there’s no other way to describe these enrollment numbers. Which are, by the way, from a state exchange – meaning that the problems of healthcare.gov should theoretically be irrelevant to the conversation:

Enrollment have grown to 15,074, marketplace officials announced Monday. Marketplace officials set a goal of 136,000 people covered on exchanged-based plans by the end of 2014, but so far the exchange has failed to reach even worst-case enrollment projections.

The Democratic-controlled Coloradan state government is blaming this mess on people supposedly not having to replace their insurance now and bad press about the national exchange. Which is easier than blaming the Democrats now running the state exchange, particularly since the head of the exchange wanted a merit raise for all her work on it. It’s certainly easier than simply admitting that they, maybe there wasn’t all that much of a burning need for Obamacare in the first place.

But surely that’s crazy talk.

Yes, I am sorry. Bad numbers = people not being insured next year. What I am not is responsible for any of this.

Social Homophily

Conservative or Liberal?

Page 69: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

69

Mining Text Content – Political Blogs ExampleSocial Homophily

Sample of 50 blog entries about “Obamacare” from popular Liberal &

Conservative Political Blogs

Are they separated into 2 distinct clusters based on their content?

Page 70: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

70

Mining Text Content vs. Data MiningSocial Homophily

70

Data Consolidation

Data Transformation

Data Cleaning

Data Reduction

Well-FormedData

Real-WorldData

Cross-Industry Standard Process for Data Mining

BusinessUnderstanding

DataUnderstanding

DataPreparation

Modeling

Evaluation

Deployment

Page 71: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Social Homophily

71

Some Basic DM Tasks

• Anomaly Detection • Association Rule

Learning • Clustering • Classification • Regression • Summarization

Mining Text Content vs. Data Mining

Page 72: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

72

Mining Text Content vs. Data MiningSocial Homophily

DM Data is usually

• Structured• Transformed• Well-formed

Page 73: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

73

Unstructured Data• No specified format• Variable length• Variable spelling• Punctuation and

non-alphanumeric characters

• Contents are not predefined and no predefined set of values

Mining Text Content – The ProcessSocial Homophily

Page 74: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

74

Mining Text Content – The ProcessSocial Homophily

The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

NLP

Page 75: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

75

Mining Text Content – The ProcessSocial Homophily

“The most consequential, and shocking, step we will take is to discard the order in which words occur in documents. We will assume documents are a bag of words, where order does not inform our analyses… If this assumption is unpalatable, we can retain some word order by including bigrams (word pairs) or trigrams..”

NLP

Page 76: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mining Text Content – The ProcessSocial Homophily

76

BusinessUnderstanding

DocumentUnderstanding

DocumentPreparation

Modeling

Evaluation

Deployment

DocumentConsolidation

Corpus Refinement(Token, Stem, Stop…)

Establish theCorpus

Feature Selection & Weighting

Doc-Term Matrix*

Documents

CRISP-Like Processes

Real-WorldText Data

Page 77: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mining Text Content – The Process

• Tokenization —Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.

• Normalize — Convert them to lowercase.

• Eliminate stop words — Eliminate terms that appear very often (e.g. the, and, …).

• Stemming — Convert the terms into their stemmed form—remove plurals and different word forms (e.g. achieve, achieves, achieved – achiev) [note: word about synonyms – WordNet Synset]

Social Homophily

77

Tokenization NormalizeEliminate

Stop WordsStemming

Common representation of tokens within and between documents

Page 78: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mining Text Content – The ProcessSocial Homophily

78

Doc/Token Matrix:Vectors of Words, Terms or Tokens by Doc

“Bag of Words, Terms or Tokens”

“Bag of Words” (BOW) or Vector Space Model (VSM): Words or Tokens are attributes and documents are examples

FeatureExtraction

Documents Token Sets

nation – 5civil – 1war – 2men – 2died – 4people – 5liberty – 1god – 1…

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all mean are created equal.Now we are engaged in a great civil war, testing whether that nation, or…

Feature Extraction & Weighting

Page 79: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

79

Mining Text Content – Political Blog ExampleSocial Homophily

Tokens: ['The', 'Colorado', '#', 'obamacare', 'exchange', 'is', 'a', 'disaster', '.']Remove Stopwords: ['Colorado', '#', 'obamacare', 'exchange', 'disaster', '.']Lower case: ['colorado', '#', 'obamacare', 'exchange', 'disaster', '.']Alpha only: ['colorado', 'obamacare', 'exchange', 'disaster']Stems: ['colorado', 'obamacar', 'exchang', 'disast']

For each sentence

{'one-year': 1, 'offici': 2, 'merit': 1, 'particularli': 1, 'coloradan': 1, 'explain': 2, 'theoret': 1, 'certainli': 1, 'easier': 2, 'mayb': 1, 'far': 1, 'nation': 1, 'disast': 1, 'govern': 1, 'press': 1, 'democrat': 1, 'bad': 2, 'mid-rang': 1, 'mean': 1, 'set': 1, 'replac': 1, 'respons': 1, 'fail': 1, 'even': 1, 'decre': 2, 'state': 3, 'estim': 1, 'democratic-control': 1, 'run': 1, 'obamacar': 2, 'burn': 1, 'blame': 2, 'let': 1, 'enrol': 6, 'worst-cas': 2, 'could': 1, 'unenforc': 1, 'admit': 1, 'place': 1, 'presid': 1, 'opinion': 1, 'first': 1, 'simpli': 1, 'colorado': 2, 'one-tim': 1, 'exchang': 5, 'number': 2, 'cancel': 1, 'announc': 1, 'supposedli': 1, 'describ': 1, 'would': 1, 'pictur': 1, 'hey': 1, 'next': 1, 'much': 1, 'way': 2, 'head': 1, 'offer': 1, 'peopl': 4, 'compani': 1, 'exchanged-bas': 1, 'loom': 1, 'whether': 1, 'work': 2, 'project': 2, 'sinc': 1, 'problem': 1, 'aw': 1, 'site': 1, 'crazi': 1, 'want': 1, 'need': 1, 'grown': 1, 'irrelev': 1, 'goal': 1, 'renew': 1, 'marketplac': 2, 'sure': 1, 'insur': 3, 'monday': 1, 'mess': 1, 'reach': 1, 'rais': 1, 'plan': 1, 'date': 1, 'end': 1, 'mid-novemb': 1, 'scenario': 1, 'cover': 1, 'talk': 1}

Vector of stems for all sentences in blog posting

Analysis from

PythonNLTK

Program

Page 80: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

80

Mining Text Content – Political Blog ExampleSocial Homophily

Stem Distribution from Obamacare Postings50 Most Frequently Used out of 792 Stems by Political Ideology

Doc-StemMatrix(from PythonNLTK

Program)

Page 81: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

81

Mining Text Content – Political Blog ExampleSocial Homophily

Differences in Stem Frequencies by Political Ideology

Stem Distribution from Obamacare PostingsFor Stems whose proportions are between .1 and .7 by Political Ideology

Page 82: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mining Text Content – The Process

Transforming Frequencies and Weights• Binary Frequencies: tf =1 for tf>0; otherwise 0• Term Frequencies: tf(i,j)/Sum of tf(i,j) in Doc K• Log Frequencies: 1 + log(tf) for tf>0; otherwise 0• Normalized Frequencies: Divide each frequency by SQRT

of Sum of Squares of the frequencies within the vector (column)

• Term Frequency–Inverse Document Frequency– TF: Freq of term for given doc/ sum of words for given doc– Inverse Document Frequency: log(N/(1+D)) where N is total

number of docs and D is number with term– TF * IDF

Social Homophily

82

Page 83: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

83

Mining Text Content – The ProcessSocial Homophily

Page 84: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

Mining Text Content – Political Blog ExampleSocial Homophily

84

HierarchicalClusteringBased on TD-IDFOf Stems fromObamacare PostingsBy Political Position

Average Cosine Similarities

Cons = .25Lib = .27Cons-Lib = .26

Page 85: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

85

Mining Text Content – Retweets and Mentions

Content Homogenity• An interesting question, therefore,

is whether it has any significance in terms of the actual content of the discussions involved.

• To address this issue we associate each user with a profile vector containing all the hashtags in her tweets, weighted by their frequencies.

• We can then compute the cosine similarities between each pair of user profiles, separately for users in the same cluster and users in different clusters.

Social Homophily

Page 86: Mining and analyzing social media   part 1 - hicss47 tutorial - dave king

86

Mining Text Content – Retweets and MentionsSocial Homophily

Cosine Similarities

Average Similarities