Mining and Analyzing Social Media: Part 1
Dave King
HICSS 47 - January 2014
Part 1 of a Tutorial on Mining and Analyzing Social Media at HICSS 47
Agenda: Part 1
• Introduction
  – Bio
  – Some definitions
  – Growing Interest
  – Resources
• Social Homophily as an Example
  – Meaning and Import
  – Political Cyberbalkanization
• Social Network Analysis
• Text Mining
Agenda: Part 2
• Introduction to Social Network Analysis Metrics
  – Degrees of Separation
  – Hooray for Bollywood
• Standard Measures
  – Centrality
  – Cohesion
• Levels of Social Network Analysis
  – Facebook & Egocentric Analysis
  – Searching for Cohesive Subgroups
Biography: Dave King
[Career timeline figure, ‘70 to ‘10]
Roles: University Professor; Dir R&D, Execucom; SVP & CTO, Comshare; EVP R&D, JDA Software
Work areas: App Stats, Math Modeling; DSS, Optimization; App Stats; Nat Lang for DB Query; DSS Interface; DSS, Multi-Dim DB; EIS & BI; Merch Planning, SCP Planning, Optimization; ECom; Web Mining & Analysis; Sentiment Analysis
HICSS: HICSS (83); HICSS Track, Intern & Dig Econ; HICSS Tutorial
Social Media Defined
Social Media: View from the Pinwheel
Social Media Defined: View from the Word Clouds
Social media is the media we use to be social. That’s it.
Mining and Analyzing Social Media: Two Approaches
• Connections
• Content
Social Media: Content
Blogs
Tags
Emails
Reviews
Recommendations
Tweets SMS
IM Chat
Ratings
Profiles
Bookmarks Photos
Articles
Presentations
Wiki Entries
News Stories
Conversations
Discussion
Lifestreams
Music
RSS Feeds
FAQs Video
Location
Social Media: Social Connections
Friends
Followers
Authors
Readers
Family
Collaborators
Participants
Senders
Recipients
Groups
Contacts
Subscribers Collaborators
Answerers
Social Media: Connections
Objects
People
Places
Types of Mining and Analysis
Mining and Analyzing Social Media: The Content
Web Mining: Data mining focused on the analysis of Web usage, structure & content
Data Mining: Discovering meaningful patterns from large data sets using pattern recognition technologies
Text Mining: Using natural language processing & data mining to discover patterns in a collection of “documents”
Mining and Analyzing Social Media: The Connections
Social Network Analysis [SNA]: The mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities.
Network Science: Study of the theoretical foundations of network structure and behavior and the application of networks to many subfields including SNA, collaboration networks, emergent systems, and physical and life science systems.
Interest: You are not alone, or maybe you are?
Interest: Search - Google Trends
[Chart of search interest for: Data Mining, Big Data, Web Mining, Text Mining, Social Network Analysis, Network Science, Sentiment Analysis, Social Media Mining, Data Analytics, Mathematical Statistics]
Interest: Book Titles – Google Ngram Viewer
[Chart of term usage in books for: Mathematical Statistics, Social Network Analysis, Network Science, Data Mining, Data Analytics, Big Data, Text Mining, Web Mining, Sentiment Analysis, Social Media**]
Resources: Customers who spent a lot of money on these items also…
Web & Text Mining Resources: General Coverage and Programming
Text Mining & Social Network Analysis: Computational Models and Programming
Social Network Analysis: General Coverage of SNA
Social Network Analysis: Mathematical Models and SNA Tools
[Tool labels on the book covers: UCI; UCI; UCI, Pajek; ORA; UCI, Pajek & ORA; NodeXL]
Social Network Analysis: Online Tutorials and Classes
An Example: Social Homophily from Two Perspectives
Social Homophily: aka Assortative Mixing
Sound bites:
• Love of the same
• Birds of a feather flock together
• Likes associate with likes
• Likes copy likes
• Likes think alike
Social Homophily: Some things that are homogeneous
Social Network Analysis: Key Elements
[Diagram: vertices A, B, C, D joined by edges; labels: Graph [V, E, f], Vertex (Node), Edge (Link)]
Vertices or Nodes: The “things”
Edges or Links: The “relationships”
Graph or Network: The set of vertices/nodes, edges/links and the relationship/function connecting them.
Social Network Analysis: Types of Vertices
What does a Vertex represent?
• Social Entities
  – People or social structures such as workgroups, teams, organizations, institutions, states, or even countries.
• Content
  – Web pages, keyword tags, or videos.
• Locations
  – Physical or virtual locations or events.
• Primary building blocks of social media
  – Friends in social networking sites, posts or authors in blogs, or pages in wikis.
Social Network Analysis: Types of Relations
What does an Edge represent?
• Similarities
  – Location (same spatial and temporal space)
  – Participation (same club, same event, …)
  – Attributes (Age, gender, same attitudes, …)
• Relational Roles
  – Kinship (mother of, sibling of, …)
  – Other Roles (friend of, boss of, …)
• Relational Cognition
  – Affective (Likes, Hates…)
  – Perceptual (Knows, Knows of, …)
• Relational Events
  – Interactions (Sold to, talked to, helped, …)
  – Flows (Information, beliefs, money, …)
Social Network Analysis: Types of Edges or Links
• Undirected, Unweighted: Facebook Friends
• Undirected, Weighted: Collaboration Network
• Directed, Unweighted: Twitter Followers
• Directed, Weighted: Email Network
[Each panel shows vertices A, B, C. Edge weights shown in the figure: 100, 60, 70, 20, 5, 10, drawn as the width of lines]
Social Network Analysis: Alternative Representations
Graph, Edge List, Adjacency Matrix, Adjacency List, XML
[Figure: one graph on vertices A, B, C, D shown in each representation]
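The alternative representations listed on this slide can be sketched in a few lines of Python. This is only an illustration: the edges among A, B, C and D are hypothetical (the slide's figure is not recoverable), and the helper names `to_adjacency_list` and `to_adjacency_matrix` are made up for this example.

```python
# Hypothetical undirected, unweighted graph on the slide's vertices A-D.
# The actual edges in the slide's figure are unknown; these are made up.
vertices = ["A", "B", "C", "D"]
edge_list = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def to_adjacency_list(vertices, edges, directed=False):
    """Map each vertex to the sorted list of its neighbours."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)  # an undirected edge is stored in both directions
    return {v: sorted(neighbours) for v, neighbours in adj.items()}


def to_adjacency_matrix(vertices, edges, directed=False):
    """Build a 0/1 matrix whose rows and columns follow the order of `vertices`."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        if not directed:
            matrix[index[v]][index[u]] = 1
    return matrix


adjacency_list = to_adjacency_list(vertices, edge_list)
adjacency_matrix = to_adjacency_matrix(vertices, edge_list)
```

Passing `directed=True` keeps only the stated direction of each edge, which is the difference between a Facebook-style and a Twitter-style network in the previous slide.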
Social Network Analysis: Multimodal networks
Bipartite or Bimodal Network: Linking individuals to events (participation, membership, topic, tag, post, …)
Examples:
• Authors-to-papers (they authored)
• Actors-to-Movies (they appeared in)
• Users-to-Movies (they rated)
“Folded” networks:
• Author collaboration networks
• Movie co-rating networks
[Figure: bipartite graph linking individuals I1 to I4 with events E1 to E4; M * MT (M times its transpose) gives a weighted network among the individuals, and MT * M gives a weighted network among the events]
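The folding step (M * MT and MT * M) can be illustrated with plain Python lists. The incidence matrix below is hypothetical, since the figure's actual links are not recoverable from the transcript; `transpose` and `matmul` are small helper functions written for this sketch.

```python
# Hypothetical incidence matrix M: rows are individuals I1..I4,
# columns are events E1..E4; a 1 means the individual took part in the event.
M = [
    [1, 1, 0, 0],  # I1: E1, E2
    [1, 1, 1, 0],  # I2: E1, E2, E3
    [0, 1, 1, 1],  # I3: E2, E3, E4
    [0, 0, 0, 1],  # I4: E4
]


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# M * MT: entry (i, j) counts the events shared by individuals i and j.
people_network = matmul(M, transpose(M))

# MT * M: entry (i, j) counts the individuals shared by events i and j.
event_network = matmul(transpose(M), M)
```

In both folded matrices the diagonal holds each row's own total (events per individual, or individuals per event), so it is usually ignored when the result is drawn as a network.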
Social Network Analysis: One of the founding fathers
Social Homophily: Any guesses what this is and what it shows?
Social Homophily: Gender homophily in adolescent school group
Gest, S. et al., “Density or Distinction? The Roles of Data Structure and Group Detection Methods in Describing Adolescent Peer Groups,” Journal of Social Structure, Vol. 6, http://www.cmu.edu/joss/content/articles/volume8/GestMoody/
Social Networks: Role of Attributes
• Add insights to visualizations and analysis
• May include:
  – Demographic characteristics (age, gender, race)
  – Other characteristics (income, education, location)
  – Use of the system (number of logins, blogs posted, edits made)
  – Network Metrics (centralization)
• Often designated by color, shape, size and/or opacity of the vertices
Social Homophily: Any guesses what this is and what it shows?
Social Homophily: The beauty of social networks
ted.com/talks/nicholas_christakis_the_hidden_influence_of_social_networks.html
Social Homophily: Regardless of cause the resulting pattern is
Clusters of connections emerge throughout the social space such that there is dense intra-connectivity within clusters and sparse inter-connectivity between clusters
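One simple way to check for this pattern is to count how many edges stay inside a cluster and how many cross between clusters. The sketch below assumes a toy network with made-up vertex names and group labels; `edge_mix` is a hypothetical helper, not a function from any SNA package.

```python
def edge_mix(edges, group):
    """Count edges whose endpoints share a group (within) or not (between)."""
    within = sum(1 for u, v in edges if group[u] == group[v])
    between = len(edges) - within
    return within, between


# Toy data: two tightly knit groups of three, joined by a single bridge edge.
group = {"a": "L", "b": "L", "c": "L", "x": "C", "y": "C", "z": "C"}
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),  # inside group L
    ("x", "y"), ("y", "z"), ("x", "z"),  # inside group C
    ("c", "x"),                          # the only edge between groups
]

within, between = edge_mix(edges, group)
share_within = within / len(edges)
```

A share of within-group edges close to 1 is what "dense intra-connectivity, sparse inter-connectivity" looks like in numbers; a full analysis would compare it against the share expected if edges were placed at random.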
Social Homophily: What produces a pattern of homophily?
• Cause: Similarity breeds connection
• Effect: Connection breeds similarity
• Spurious: Some other factor is at work
Social Homophily: In the world of U.S. politics?
Discourse, Viewing, Voting, Purchasing, …
Antagonistic Polarization?
Social Homophily: Balkanization
Process of fragmentation or division of a region or state into smaller regions or states that are often hostile or non-cooperative with each other
Social Homophily: Balkanization of the US Congress (Data Sets)
• senate.gov/reference/Index/Votes.htm
• clerk.house.gov/legislative/legvotes.aspx
• govtrack.us/developers
[Figures: Before Threshold; After Threshold]
“Senate Voting Patterns,” Morris, F., analyticalvisions.blogspot.com/2006/04/senate-voting-patterns-part-2.html
Social Homophily: Balkanization of US Senate (1989-2013) - Votes
“U.S. Senate More Divided Than Ever Data Shows,” Swallow, E., Forbes, Oct. 17, 2013, forbes.com/sites/ericaswallow/2013/11/17/senate-voting-relationships-data/
Partisan voting patterns – 1989 to present
Interactive data visualization: static.davidchouinard.com/congress/
Social Homophily: Balkanization of US Senate (1975-2012) - Votes
“Portrait of Political Party Polarization,” Moody, J. and P. Mucha, April 2013, Network Science, journals.cambridge.org/action/displayAbstract?fromPage=online&aid=8888768
Social Homophily: Are political books of “a feather” purchased together?
[Book co-purchase networks: Aug 2008; Oct 2008]
“The Social Life of Books,” Krebs, V., orgnet.com/divided.html
Social Homophily: Cyberbalkanization of the Political Blogosphere
• Single-day snapshot of a snowball sample of political blogs (N=1490)
• Manually assigned as Liberal or Conservative
• Focus on blogrolls and front page citations
• Are we witnessing an era of “Cyberbalkanization?”
Social Homophily: An aside – what’s a (We)blog?
• Information or discussion site
  – Focused on a particular subject area
  – Discrete entries posted in reverse chronological order
  – Primarily text
  – Trend toward multiple author entries
• Some stats?
  – Wordpress ~ 60M blogs with 100M pageviews/day
  – Tumblr ~ 150M blogs with 90M posts/day
  – Blogger ~ 46M unique visitors/month
Social Homophily: Snowball Sample
Political Blog Directories: eTalkingHead, BlogCatalog, Blogarama, CampaignLine
[Figure annotation: Closed for business]
Individual Political Blogrolls
Social Homophily: Social Network Packages
[Tool labels on the book covers: UCI; UCI; UCI, Pajek; NodeXL; UCI, Pajek & ORA; ORA]
Social Homophily: Social Network Analysis of Blogs with Pajek
• Colors and bold simply for illustration
• Liberal Blogs (nos. 1-758)
• Conservative Blogs (nos. 759-1490)

.CLU Partition File (0 = Lib; 1 = Cons; one entry per vertex)
*vertices 1490
0
0
0
…
0
1
1
1
…
1

.NET Network File (vertex descriptions followed by edge descriptions)
*vertices 1490
1 100monkeystyping 0.0 0.0
2 12thharmonic 0.0 0.0
...
757 yoder 0.0 0.0
758 younglibs 0.0 0.0
...
759 84rules 0.0 0.0
760 a100wwe 0.0 0.0
...
1489 zeke01 0.0 0.0
1490 zeph1z 0.0 0.0
*edges
1 23 1.0
1 55 1.0
...
757 65 1.0
760 854 1.0
760 855 1.0
...
1489 1431 1.0
1490 802 1.0
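To show how the two Pajek files fit together, here is a minimal reader for files laid out like the excerpts on this slide. It is a sketch under assumptions: it handles only unquoted one-word vertex labels and the `*vertices` / `*edges` sections, and the four-vertex sample data and the function names are hypothetical, not the real blog data set.

```python
def parse_pajek_net(text):
    """Read a minimal Pajek .NET file into ({vertex id: label}, [(source, target, weight)])."""
    labels, edges, section = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            section = line.split()[0].lower()  # "*vertices" or "*edges"
            continue
        parts = line.split()
        if section == "*vertices":
            labels[int(parts[0])] = parts[1]
        elif section == "*edges":
            weight = float(parts[2]) if len(parts) > 2 else 1.0
            edges.append((int(parts[0]), int(parts[1]), weight))
    return labels, edges


def parse_pajek_clu(text):
    """Read a .CLU partition file: after the header, line k holds the class of vertex k."""
    values = [line.strip() for line in text.splitlines()
              if line.strip() and not line.strip().startswith("*")]
    return {i: int(v) for i, v in enumerate(values, start=1)}


# Hypothetical four-vertex files in the same layout as the slide's excerpts.
sample_net = """*vertices 4
1 blog_a 0.0 0.0
2 blog_b 0.0 0.0
3 blog_c 0.0 0.0
4 blog_d 0.0 0.0
*edges
1 2 1.0
3 4 1.0
2 3 1.0
"""
sample_clu = "*vertices 4\n0\n0\n1\n1\n"

labels, edges = parse_pajek_net(sample_net)
partition = parse_pajek_clu(sample_clu)

# Edges whose endpoints fall in different partition classes (0 = Lib, 1 = Cons).
cross_edges = [(u, v) for u, v, _ in edges if partition[u] != partition[v]]
```

Joining the edge list with the partition this way is what lets a tool color vertices by political position and count links within and across the two groups.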
Social Homophily: Social Network Analysis of Blogs with Pajek
Draw Network:
• Circular layout with labels
• Fruchterman-Reingold 2D layout with labels
• F-R 2D layout w/o labels
Draw Partition:
• F-R 2D layout w/o labels, partitioned by pol. position
Social Homophily: Cyberbalkanization of the Political Blogosphere?
Does the proliferation of specialized online political news sources allow people with different political leanings to be exposed only to information in agreement with their previously held views?
Citation Network – a link implies that a political blog has a URL on its front page referencing another blog
Social Homophily: Cyberbalkanization of the Political Blogosphere?
All blogs: N=1490, Edges = 16715
Liberals: N=758, Edges = 7301
Conservatives: N=732, Edges = 7839
Social Homophily: Cyberbalkanization of the Political Blogosphere
Page Citation (L, C)
• N = Top 20, 2 Mnth Posts = 12470, Citations = 1398: N = 1151 (82%) and N = 247 (18%)
• N = Top 20, 2 Mnth Posts = 10414, Citations = 2422: N = 2110 (87%) and N = 312 (13%)
Blogrolls (L, C)
• PolBlog: N=1490
• N=732, Avg. Links = 15.1
• N=758, Avg. Links = 13.2
• Percentages shown in the figure: 82%, 84%, 67%, 74%, 92%, 8%, 10%, 90%
Social Homophily: Polarization on Twitter
• Twitter communications six weeks prior to 2010 mid-term elections (9/14-11/1)
• Examined 250,000 “politically relevant” tweets (specific hashtags)
• Focused on two networks, retweets and mentions, using the “streaming API”
Social Homophily: An Aside – What is Twitter?
• Each tweet <= 140 characters (avg. 10-15 words/message)
• Heavy presence of non-alpha symbols, abbrevs, misspellings and slang
• Tweets often include
  – Retweets (original tweet repeated)
  – Mentions (user mentions another user in a tweet)
  – Hashtags “#” (metadata annotation denoting topic or audience)
http://www.forbes.com/sites/marketshare/2013/01/23/the-number-one-mistake-retail-brands-make-when-it-comes-to-twitter/
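The three tweet elements named above can be pulled out of raw text with regular expressions. This is a rough sketch: it assumes the older convention of marking a retweet with a leading "RT @user", the patterns are deliberately simple, and the example tweet and the function name `parse_tweet` are made up for illustration.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT @(\w+)")  # assumption: retweets start with "RT @user"


def parse_tweet(text):
    """Return the retweeted user (or None), other mentioned users, and hashtags."""
    match = RETWEET.match(text)
    retweeted = match.group(1) if match else None
    # Every @user except the one being retweeted counts as a mention.
    mentions = [user for user in MENTION.findall(text) if user != retweeted]
    hashtags = [tag.lower() for tag in HASHTAG.findall(text)]
    return {"retweet_of": retweeted, "mentions": mentions, "hashtags": hashtags}


# Hypothetical tweet, not taken from the study's data.
example = "RT @alice: Vote on Tuesday! #tcot #tlot cc @bob"
parsed = parse_tweet(example)
```

Each parsed tweet then yields candidate edges: one between the author and `retweet_of` for a retweet network, and one between the author and each entry in `mentions` for a mention network.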
Social Homophily: Polarization on Twitter – Social Networks (in study)
Network 1: A and B linked by retweets (N=23,766)
Network 2: A and B linked by mentions (N=10,142)
Both representing potential pathways for information to flow from A to B
Social Homophily: Polarization on Twitter – Social Networks
Retweet Network: XML (GraphML) file and its graphic representation in Gephi
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<!-- Created by igraph -->
  <key id="cluster" for="node" attr.name="cluster" attr.type="string"/>
  <key id="tags" for="edge" attr.name="tags" attr.type="string"/>
  <key id="type" for="edge" attr.name="type" attr.type="string"/>
  <key id="urls" for="edge" attr.name="urls" attr.type="double"/>
  <key id="time" for="edge" attr.name="time" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n0">
      <data key="cluster">right</data>
    </node>
    <node id="n2">
      <data key="cluster">left</data>
    </node>
    ...
    <edge source="n12464" target="n7349">
      <data key="tags">['#tcot', '#tlot']</data>
      <data key="type">retweet</data>
      <data key="urls">1</data>
      <data key="time">1286901355</data>
    </edge>
    ...
  </graph>
</graphml>
Source: http://carl.cs.indiana.edu/data/icwsm/icwsm_polarization.zip
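A GraphML file of this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below uses a small hypothetical document that mirrors the structure shown on the slide (node ids, clusters and edges are made up); `load_graphml` is a helper written for this example, and only the `cluster` node attribute is extracted.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace, so lookups need a prefix mapping.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical miniature file with the same structure as the slide's excerpt.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="cluster" for="node" attr.name="cluster" attr.type="string"/>
  <key id="type" for="edge" attr.name="type" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n0"><data key="cluster">right</data></node>
    <node id="n1"><data key="cluster">left</data></node>
    <node id="n2"><data key="cluster">left</data></node>
    <edge source="n1" target="n2"><data key="type">retweet</data></edge>
    <edge source="n0" target="n2"><data key="type">retweet</data></edge>
  </graph>
</graphml>"""


def load_graphml(xml_text):
    """Return ({node id: cluster}, [(source, target)]) from a GraphML string."""
    root = ET.fromstring(xml_text)
    cluster = {}
    for node in root.iterfind(".//g:node", NS):
        data = node.find("g:data[@key='cluster']", NS)
        cluster[node.get("id")] = data.text if data is not None else None
    edges = [(e.get("source"), e.get("target")) for e in root.iterfind(".//g:edge", NS)]
    return cluster, edges


cluster, edges = load_graphml(SAMPLE)
same_cluster_edges = [(s, t) for s, t in edges if cluster[s] == cluster[t]]
```

For the real data set a graph library or Gephi itself would normally do the loading; the point here is only that the `cluster` attribute on each node is what allows edges to be classified as within or across political groups.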
Social Homophily: Polarization on Twitter – Gephi Analysis
Force-directed layout (Force Atlas 2)
Nodes colored by cluster (political ideology)
Social Homophily: Contributions of Twitter Study
• Cluster analysis of networks derived from the corpus shows that the network of retweets exhibits clear segregation, while the mention network is dominated by a single cluster.
• Manual classification of Twitter users by political alignment demonstrates that the retweet network clusters correspond to the political left and right. It also shows the mention network is political.
• Interpretation of the observed community structures based on injection of partisan content into ideologically opposed hashtag information streams
[Figures: Retweets; Mentions]
Social Homophily: Polarization – The other side of the coin
Oh, by the way…
• In addition to citations, Adamic and Glance also considered the similarity in the textual content of the blogs
• Identified 498 “bigrams” across the Top 40 blogs
• Computed cosine similarity measure between TF-IDF scores for all pairs of blogs
• Average similarity between opposites was .10, among conservatives was .54 and among liberals was .57.
Social Homophily: Mining Text Content
• “Scholars have long recognized that much of politics is expressed in words.”
• “To understand what politics is about we need to know what political actors are saying and writing.”
• “…The assumption is that ideology dominates the language used in the text.”
• Ergo, another way to study the polarization of political blogs is to compare what they say.
Social Homophily: Mining Text Content - Political Blogs Example
Republican nightmare begins: Obamacare is 'a godsend' for people getting coverage
U.S. President Barack Obama delivers remarks alongside Human Services Secretary Kathleen Sebelius (R) and other Americans the White House says will benefit from the opening of health insurance marketplaces under the Affordable Care Act, in the Rose Garden.
The health care exchanges—federal and state—are now functioning and not sucking up all the oxygen around the implementation of Obamacare. Finally, the "good news" news stories are finally being told, like this one in The New York Times.…That's the kind of story that makes Republicans quake in their shoes. For more of them, go below the fold.
Conservative or Liberal?
Social Homophily: Mining Text Content – Political Blogs Example
The Colorado #obamacare exchange is a disaster.
I’m sorry, but there’s no other way to describe these enrollment numbers. Which are, by the way, from a state exchange – meaning that the problems of healthcare.gov should theoretically be irrelevant to the conversation:
Enrollment have grown to 15,074, marketplace officials announced Monday. Marketplace officials set a goal of 136,000 people covered on exchanged-based plans by the end of 2014, but so far the exchange has failed to reach even worst-case enrollment projections.
The Democratic-controlled Coloradan state government is blaming this mess on people supposedly not having to replace their insurance now and bad press about the national exchange. Which is easier than blaming the Democrats now running the state exchange, particularly since the head of the exchange wanted a merit raise for all her work on it. It’s certainly easier than simply admitting that they, maybe there wasn’t all that much of a burning need for Obamacare in the first place.
But surely that’s crazy talk.
Yes, I am sorry. Bad numbers = people not being insured next year. What I am not is responsible for any of this.
Conservative or Liberal?
Social Homophily: Mining Text Content – Political Blogs Example
Sample of 50 blog entries about “Obamacare” from popular Liberal & Conservative Political Blogs
Are they separated into 2 distinct clusters based on their content?
Social Homophily: Mining Text Content vs. Data Mining
Cross-Industry Standard Process for Data Mining: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Real-World Data → Data Consolidation, Data Transformation, Data Cleaning, Data Reduction → Well-Formed Data
Mining Text Content vs. Data Mining: Some Basic DM Tasks
• Anomaly Detection
• Association Rule Learning
• Clustering
• Classification
• Regression
• Summarization
Social Homophily: Mining Text Content vs. Data Mining
DM Data is usually
• Structured
• Transformed
• Well-formed
Unstructured Data
• No specified format
• Variable length
• Variable spelling
• Punctuation and non-alphanumeric characters
• Contents are not predefined and no predefined set of values
Social Homophily: Mining Text Content – The Process
The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
Social Homophily: Mining Text Content – The Process
“The most consequential, and shocking, step we will take is to discard the order in which words occur in documents. We will assume documents are a bag of words, where order does not inform our analyses… If this assumption is unpalatable, we can retain some word order by including bigrams (word pairs) or trigrams..”
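The trade-off described in this quote can be made concrete with a short sketch. The sentences are hypothetical and `ngrams` is a helper written for this example; it shows that two texts with opposite meanings have identical bags of words, while their bigrams still differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


first = "dog bites man".split()
second = "man bites dog".split()

# Bag of words: only the counts survive, so the two sentences look identical.
same_bag = Counter(first) == Counter(second)

# Bigrams keep some local word order, so the two sentences differ again.
same_bigrams = Counter(ngrams(first, 2)) == Counter(ngrams(second, 2))

tokens = "the colorado obamacare exchange is a disaster".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

The price of keeping bigrams or trigrams is a much larger feature set, which is one reason the bag-of-words assumption is so widely used despite being "shocking".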
Social Homophily: Mining Text Content – The Process
CRISP-Like Processes: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment
[Figure labels: Real-World Text Data; Documents; Establish the Corpus; Document Consolidation; Corpus Refinement (Token, Stem, Stop…); Feature Selection & Weighting; Doc-Term Matrix*]
Social Homophily: Mining Text Content – The Process
• Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
• Normalize: Convert the terms to lowercase.
• Eliminate stop words: Eliminate terms that appear very often (e.g. the, and, …).
• Stemming: Convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved → achiev) [note: word about synonyms – WordNet Synset]
Tokenization → Normalize → Eliminate Stop Words → Stemming
Result: a common representation of tokens within and between documents
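The four steps can be sketched end to end in plain Python. This is a stand-in, not the tutorial's program: the stop-word set is a tiny illustrative list, and `crude_stem` is a hypothetical suffix stripper that only mimics a real stemmer (the Porter stemmer used in NLTK, for instance, would also reduce "disaster" to "disast").

```python
import re

# Tiny illustrative stop list; real stop lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation and symbols."""
    return re.findall(r"[A-Za-z]+", text)


def crude_stem(token):
    """Strip one common suffix; a rough stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = tokenize(text)                               # 1. tokenization
    tokens = [t.lower() for t in tokens]                  # 2. normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. eliminate stop words
    return [crude_stem(t) for t in tokens]                # 4. stemming


stems = preprocess("The Colorado #obamacare exchange is a disaster.")
word_forms = [crude_stem(w) for w in ["achieve", "achieves", "achieved"]]
```

After these steps, different surface forms of a word collapse to one token, which is what makes counts comparable within and between documents.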
Social Homophily: Mining Text Content – The Process
Feature Extraction & Weighting: Documents → Token Sets
Doc/Token Matrix: Vectors of Words, Terms or Tokens by Doc (“Bag of Words, Terms or Tokens”)
“Bag of Words” (BOW) or Vector Space Model (VSM): Words or Tokens are attributes and documents are examples
Document: Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or…
Token set: nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, liberty – 1, god – 1, …
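A doc/token matrix of this kind can be built with `collections.Counter`. The two token lists below are hypothetical and already preprocessed; `doc_term_matrix` is a helper written for this sketch, with documents as rows (examples) and tokens as columns (attributes).

```python
from collections import Counter

# Hypothetical, already tokenized documents.
docs = {
    "d1": ["nation", "war", "nation", "liberty"],
    "d2": ["war", "people", "people"],
}


def doc_term_matrix(docs):
    """Return (vocabulary, {doc name: row of token counts in vocabulary order})."""
    vocabulary = sorted({token for tokens in docs.values() for token in tokens})
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # A Counter returns 0 for tokens a document never uses.
    matrix = {name: [counts[name][token] for token in vocabulary] for name in docs}
    return vocabulary, matrix


vocabulary, matrix = doc_term_matrix(docs)
```

Each row is the document's vector in the vector space model; with real text most entries are zero, so sparse storage is the usual choice.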
Social Homophily: Mining Text Content – Political Blog Example
For each sentence:
Tokens: ['The', 'Colorado', '#', 'obamacare', 'exchange', 'is', 'a', 'disaster', '.']
Remove Stopwords: ['Colorado', '#', 'obamacare', 'exchange', 'disaster', '.']
Lower case: ['colorado', '#', 'obamacare', 'exchange', 'disaster', '.']
Alpha only: ['colorado', 'obamacare', 'exchange', 'disaster']
Stems: ['colorado', 'obamacar', 'exchang', 'disast']
…
{'one-year': 1, 'offici': 2, 'merit': 1, 'particularli': 1, 'coloradan': 1, 'explain': 2, 'theoret': 1, 'certainli': 1, 'easier': 2, 'mayb': 1, 'far': 1, 'nation': 1, 'disast': 1, 'govern': 1, 'press': 1, 'democrat': 1, 'bad': 2, 'mid-rang': 1, 'mean': 1, 'set': 1, 'replac': 1, 'respons': 1, 'fail': 1, 'even': 1, 'decre': 2, 'state': 3, 'estim': 1, 'democratic-control': 1, 'run': 1, 'obamacar': 2, 'burn': 1, 'blame': 2, 'let': 1, 'enrol': 6, 'worst-cas': 2, 'could': 1, 'unenforc': 1, 'admit': 1, 'place': 1, 'presid': 1, 'opinion': 1, 'first': 1, 'simpli': 1, 'colorado': 2, 'one-tim': 1, 'exchang': 5, 'number': 2, 'cancel': 1, 'announc': 1, 'supposedli': 1, 'describ': 1, 'would': 1, 'pictur': 1, 'hey': 1, 'next': 1, 'much': 1, 'way': 2, 'head': 1, 'offer': 1, 'peopl': 4, 'compani': 1, 'exchanged-bas': 1, 'loom': 1, 'whether': 1, 'work': 2, 'project': 2, 'sinc': 1, 'problem': 1, 'aw': 1, 'site': 1, 'crazi': 1, 'want': 1, 'need': 1, 'grown': 1, 'irrelev': 1, 'goal': 1, 'renew': 1, 'marketplac': 2, 'sure': 1, 'insur': 3, 'monday': 1, 'mess': 1, 'reach': 1, 'rais': 1, 'plan': 1, 'date': 1, 'end': 1, 'mid-novemb': 1, 'scenario': 1, 'cover': 1, 'talk': 1}
Vector of stems for all sentences in blog posting (analysis from Python NLTK program)
Social Homophily: Mining Text Content – Political Blog Example
Stem Distribution from Obamacare Postings: 50 Most Frequently Used out of 792 Stems, by Political Ideology
Doc-Stem Matrix (from Python NLTK program)
Social Homophily: Mining Text Content – Political Blog Example
Differences in Stem Frequencies by Political Ideology
Stem Distribution from Obamacare Postings: for stems whose proportions are between .1 and .7, by Political Ideology
Social Homophily: Mining Text Content – The Process
Transforming Frequencies and Weights
• Binary Frequencies: tf = 1 for tf > 0; otherwise 0
• Term Frequencies: tf(i,j) / sum of all term frequencies in the same doc
• Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0
• Normalized Frequencies: Divide each frequency by SQRT of sum of squares of the frequencies within the vector (column)
• Term Frequency–Inverse Document Frequency
  – TF: Freq of term for given doc / sum of words for given doc
  – Inverse Document Frequency: log(N/(1+D)) where N is total number of docs and D is number with term
  – TF * IDF
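The weighting schemes on this slide translate directly into code. The sketch below assumes natural logarithms (the slide does not state a base) and uses hypothetical stem counts for four documents; the function names are made up for this example.

```python
import math


def binary(tf):
    """Binary frequency: 1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_freq(tf):
    """Log frequency: 1 + log(tf) for tf > 0, otherwise 0 (natural log assumed)."""
    return 1 + math.log(tf) if tf > 0 else 0


def l2_normalize(vector):
    """Divide each frequency by the square root of the sum of squares."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def tf_idf(doc_counts):
    """TF * IDF with TF = count / words in doc and IDF = log(N / (1 + D))."""
    n_docs = len(doc_counts)
    doc_freq = {}
    for counts in doc_counts:
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for counts in doc_counts:
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights


# Hypothetical stem counts for four short documents.
doc_counts = [
    {"exchang": 2, "enrol": 2},
    {"exchang": 1, "godsend": 1},
    {"exchang": 3},
    {"exchang": 1, "enrol": 1},
]
weights = tf_idf(doc_counts)
```

With the 1 + D smoothing in the denominator, a term found in every document gets a negative IDF and a term found in all but one gets exactly zero, so very common stems are pushed out of the picture.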
Social Homophily: Mining Text Content – The Process
Social Homophily: Mining Text Content – Political Blog Example
Hierarchical Clustering Based on TF-IDF of Stems from Obamacare Postings by Political Position
Average Cosine Similarities: Cons = .25, Lib = .27, Cons-Lib = .26
Social Homophily: Mining Text Content – Retweets and Mentions
Content Homogeneity
• An interesting question, therefore, is whether it has any significance in terms of the actual content of the discussions involved.
• To address this issue we associate each user with a profile vector containing all the hashtags in her tweets, weighted by their frequencies.
• We can then compute the cosine similarities between each pair of user profiles, separately for users in the same cluster and users in different clusters.
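The comparison described in these bullets can be sketched as follows. The user profiles and cluster labels are hypothetical, and `cosine` and `average_similarities` are helpers written for this example rather than code from the study.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(p[key] * q[key] for key in p if key in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Hypothetical users: (cluster label, hashtag profile weighted by frequency).
users = {
    "u1": ("left", Counter({"p2": 3, "dems": 1})),
    "u2": ("left", Counter({"p2": 1, "dems": 1})),
    "u3": ("right", Counter({"tcot": 2, "teaparty": 2})),
    "u4": ("right", Counter({"tcot": 1, "p2": 1})),
}


def average_similarities(users):
    """Average cosine similarity for same-cluster pairs and for cross-cluster pairs."""
    same, different = [], []
    for (_, (cluster_a, profile_a)), (_, (cluster_b, profile_b)) in combinations(users.items(), 2):
        bucket = same if cluster_a == cluster_b else different
        bucket.append(cosine(profile_a, profile_b))
    return sum(same) / len(same), sum(different) / len(different)


avg_same, avg_different = average_similarities(users)
```

If content follows the network structure, the same-cluster average should sit clearly above the cross-cluster average, as it does in this toy data.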
Social Homophily: Mining Text Content – Retweets and Mentions
Cosine Similarities
Average Similarities