Studying Blogspace
Ravi Kumar
IBM Almaden Research Center
ravi@almaden.ibm.com

Etymology
From the OED new ed. (draft entry, Mar 2003) …
blog intr. To write or maintain a weblog. Also: to read or browse through weblogs, esp. habitually.
web·log n. 2. A frequently updated web site consisting of personal observations, excerpts from other sources, etc., typically run by a single person, and usually with hyperlinks to other sites; an online journal or diary.
From WWW 2003 (Kumar, Novak, Raghavan, Tomkins) …
blog·space n. The collection of weblogs; = blogosphere, blogsphere, blogistan, …

Blogs 101
• Characteristics
– Pages with reverse chronological sequences of dated entries
– Usually contain a persistent sidebar containing profile (and other blogs read by the author – “blogroll”)
– Usually maintained and published by one of the common variants of public-domain blog software
• From Slashdot, 1999
“… a new, personal, and determinedly non-hostile evolution of the electric community. They are also the freshest example of how people use the Net to make their own, radically different new media”

Look and feel
• Quirky
• Highly personal
• Consumed by a small number of regular repeat visitors
• Often updated multiple times each day
• Highly interwoven into a network of small but active micro-communities
• E.g.: LiveJournal, Xanga, DeadJournal, Blogger, Memepool, …

The blog era
• Blogs began in 1996, but exploded in popularity in 1999
– Proliferation of authoring tools
• Newsweek 2002 estimates ~500K
• LiveJournal 2005 estimates ~3.5M
• Annual Blogathon for charity
– Bloggers update their blogs every 30m for 24h
– Sponsors pay …
• Impact of blogs
– “Miserable failure” on Google

Structural study
(Kumar, Novak, Raghavan, Tomkins, CACM 2004)

LiveJournal blogspace
• LiveJournal.com: popular blog site
• 1.3M bloggers (Feb 2004)
• 3.5M bloggers (Apr 2005)
• Each blogger has a profile
– Name, age, …
– Geographic information (city, state, zip, …)
– Friends and friend of
– Interests/communities

E.g., LiveJournal user “bill”

LJ bloggers in US
[Map legend: < 1K, < 5K, < 10K, < 25K, < 50K, ~ 100K]

LJ bloggers world-wide
[Map legend: < 1K, < 2K, < 5K, ~ 25K, ~ 50K, ~ 75K]

Who are they?
[Table columns: Age, %, Representative interests]

Friendship graph
• Directed
• 80% mutual
• Average degree ~ 14
• Power law degrees
• Clustering coeff. ~ 0.2
• Most friendships explained by age, location, interest
[Venn diagram over Age, Location, and Interest; region labels: 1%, 20%, 16%, 5%, 16%, 22%]
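
The quantities listed for the friendship graph (fraction of mutual edges, average degree, clustering coefficient) can all be computed from a directed edge list. The following is a minimal sketch, not the study's code: the function name `friendship_stats` is hypothetical, "average degree" is taken to mean average out-degree, and clustering is measured on the undirected version of the graph, both of which are assumptions.

```python
from collections import defaultdict

def friendship_stats(edges):
    """Return (fraction of mutual edges, average out-degree, mean clustering coefficient)."""
    edge_set = set(edges)
    nodes = {n for edge in edge_set for n in edge}
    # An edge (u, v) is mutual when the reverse edge (v, u) is also present.
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    frac_mutual = mutual / len(edge_set)
    avg_degree = len(edge_set) / len(nodes)
    # Clustering is computed on the undirected version of the graph (assumption).
    nbrs = defaultdict(set)
    for u, v in edge_set:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    coeffs = []
    for u in nodes:
        ns = sorted(nbrs[u])
        k = len(ns)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count pairs of neighbours of u that are themselves connected.
        links = sum(1 for i in range(k) for j in range(i + 1, k) if ns[j] in nbrs[ns[i]])
        coeffs.append(2 * links / (k * (k - 1)))
    return frac_mutual, avg_degree, sum(coeffs) / len(coeffs)
```

On a toy graph with edges a→b, b→a, b→c, c→a, half of the edges are mutual and the undirected graph is a single triangle.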

Evolutionary study
(Kumar, Novak, Raghavan, Tomkins, WWW 2003)

Blogs and evolution
• Every blog contains a dated record of
– Every word ever written to the blog
– Every link ever added in the blog
• Blogs are an increasingly important medium, but
– Few systematic studies have been performed
– Such study should take an evolutionary perspective [Brewington et al] [Bharat et al] [Fetterly et al] [Cho et al]
– Tools for understanding evolution not fully understood

Time graphs
[Figure: the underlying graph on nodes v1, v2, v3, v4 and the corresponding time graph, drawn against a time axis running from Jan to Aug]

Community evolution in blogs
• What are the communities within the time graph?
– Community definition, extraction
– Graph-based methods (trawling) [Kumar Raghavan Rajagopalan Tomkins, WWW 99]
• How active are these communities, and over what timeframe?
– Burst analysis [Kleinberg, KDD 02]

Community extraction
• Community analysis based on graph structure
• Idea: there are many subgraphs that would never occur in a random graph – if we find such subgraphs, there must be some reason
• In blogspace, we enumerate dense subgraphs using a greedy heuristic

Dense subgraph enumeration (heuristic)
• Scan edges, find triangles
• For each triangle, greedily grow its neighbor set
• Growth is allowable based on a measure of connectivity to the current dense subgraph
• Extracted “communities” are not unique

Current size (N):   2    <=6    <=9    <=20    >20
Must connect to:    2    N-1    N-2    0.7N    0.6N
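
The growth rule in the table can be sketched as follows. This is an illustrative reading of the heuristic, not the authors' implementation: the names `required_links` and `grow_community` are hypothetical, fractional thresholds are rounded up, the graph is treated as undirected, and ties between candidates are broken by sorted order, all of which are assumptions.

```python
def required_links(n):
    """Minimum number of links into a community of size n, following the table."""
    if n <= 2:
        return 2
    if n <= 6:
        return n - 1
    if n <= 9:
        return n - 2
    if n <= 20:
        return (7 * n + 9) // 10  # ceil(0.7 * n) in integer arithmetic
    return (6 * n + 9) // 10      # ceil(0.6 * n)

def grow_community(adj, seed_edge):
    """Greedily grow a node set from a seed edge; adj maps node -> set of neighbours."""
    community = set(seed_edge)
    grew = True
    while grew:
        grew = False
        need = required_links(len(community))
        # Candidates are outside nodes adjacent to at least one member.
        candidates = set().union(*(adj[u] for u in community)) - community
        # Try the candidate with the most links into the community first.
        best = max(sorted(candidates), key=lambda c: len(adj[c] & community), default=None)
        if best is not None and len(adj[best] & community) >= need:
            community.add(best)
            grew = True
    return community
```

Starting from an edge, the first node admitted must connect to both endpoints, which is exactly the triangle-finding step; after that the threshold loosens as the community grows. Different seed edges can yield overlapping results, consistent with the note that extracted communities are not unique.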

Bursts: Static to dynamic communities
• Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity”
• Model source as multi-state
• Each state has certain emission properties
• Traversal between states is controlled by a Markov model
• Determine most likely underlying state sequence over time, given observable output
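
The "most likely underlying state sequence" can be found with Viterbi decoding. Below is a minimal sketch of a generic two-state hidden Markov decoder over inter-arrival gaps; it is not the exact cost formulation of [Kleinberg, KDD 02]. The function name, the exponential gap model, and the default `rates` and `p_switch` values are all assumptions chosen for illustration.

```python
import math

def most_likely_states(gaps, rates=(0.5, 5.0), p_switch=0.1):
    """Viterbi decoding; each gap is modelled as exponential with a state-specific rate."""
    n_states = len(rates)

    def emit(state, gap):
        # Log density of an exponential distribution with the state's rate.
        return math.log(rates[state]) - rates[state] * gap

    def trans(a, b):
        if a == b:
            return math.log(1 - p_switch)
        return math.log(p_switch / (n_states - 1))

    # Uniform prior over the starting state (the constant term is dropped).
    score = [emit(s, gaps[0]) for s in range(n_states)]
    back = []
    for gap in gaps[1:]:
        prev = score
        # For each target state b, remember the best predecessor state.
        choices = [max(range(n_states), key=lambda a: prev[a] + trans(a, b))
                   for b in range(n_states)]
        score = [prev[choices[b]] + trans(choices[b], b) + emit(b, gap)
                 for b in range(n_states)]
        back.append(choices)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for choices in reversed(back):
        state = choices[state]
        path.append(state)
    return [s + 1 for s in reversed(path)]  # report states as 1 (low rate) and 2 (high rate)
```

With four long gaps followed by three short ones, the decoder stays in the low-rate state and then switches to the high-rate state for the burst.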

An example
[Figure: a timeline of utterances arriving at an increasing rate: “I’ve been thinking about your idea with the asparagus…”, “Uh huh”, “I think I see…”, “Uh huh”, “Yeah, that’s what I’m saying…”, “So then I said ‘Hey, let’s give it a try’”, “And anyway she said maybe, okay?”]
[Two-state model: State 1, output rate very low; State 2, output rate very high; transitions labeled 0.005 and 0.01]
Most likely “hidden” sequence: 1 1 1 1 2 2 2

Some experiments
• Crawled 24,109 blogs from popular sites (2003)
• Extract archive links from blogs
• Extract all dates on blog pages, and tag each word and link with a date
– Simple heuristics to automatically extract time-stamps from entries (regular expressions, training, …)
– Obtained dates for ~90% of edges
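
As an illustration of the regular-expression side of time-stamp extraction, here is a small sketch. The two patterns and the name `extract_dates` are hypothetical; the heuristics actually used in the experiments are not described on the slide.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}

# Two assumed formats: "2003-02-14" and "March 3, 2003" (or "Mar. 3, 2003").
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_DATE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.? (\d{1,2}), (\d{4})\b")

def extract_dates(text):
    """Return the dates found in text, ISO-style matches first, then long-form matches."""
    found = []
    for y, m, d in ISO_DATE.findall(text):
        found.append((int(y), int(m), int(d)))
    for mon, d, y in LONG_DATE.findall(text):
        if mon.lower() in MONTHS:
            found.append((int(y), MONTHS[mon.lower()], int(d)))
    dates = []
    for y, m, d in found:
        try:
            dates.append(date(y, m, d))
        except ValueError:
            pass  # skip impossible dates such as month 13
    return dates
```

Each extracted date would then be attached to the words and links of the entry it heads.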

Experiments (contd.)
• The time graph
– 22,299 nodes, 70,472 unique edges
– 0.77M multiedges (average edge multiplicity = 11)
• Consider graphs formed by prefixes from Jan 1, 1999 to some later month – generate 47 “prefix graphs” for analysis
• Enumerate communities and analyze their burstiness
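
A prefix graph keeps only the edges that have arrived by a cutoff date. A minimal sketch, assuming the time graph is stored as a list of `(source, destination, date)` triples with one triple per multiedge (the representation and function names are assumptions):

```python
from datetime import date

def prefix_graph(timed_edges, cutoff):
    """Unique (source, destination) edges whose timestamp is on or before cutoff."""
    return {(u, v) for (u, v, t) in timed_edges if t <= cutoff}

def average_multiplicity(timed_edges):
    """Number of multiedges divided by the number of unique edges."""
    unique = {(u, v) for (u, v, _) in timed_edges}
    return len(timed_edges) / len(unique)
```

For the graph reported above, 0.77M multiedges over 70,472 unique edges gives a multiplicity of about 11, matching the slide.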

SCC growth
[Plots: largest SCC as fraction of all nodes; 2nd and 3rd largest SCCs as fraction of all nodes]
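
The plotted quantity, the largest strongly connected component (SCC) as a fraction of all nodes, can be computed for each prefix graph. The sketch below uses Kosaraju's two-pass algorithm, written iteratively; the choice of algorithm and the function name are assumptions, since the slides do not say how SCCs were computed.

```python
from collections import defaultdict

def largest_scc_fraction(edges):
    """Size of the largest SCC divided by the number of nodes."""
    fwd, rev = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, it = stack[-1]
            nxt = next((w for w in it if w not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(fwd[nxt])))
    # Pass 2: explore the reversed graph in reverse completion order;
    # each exploration collects exactly one SCC.
    best, assigned = 0, set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        size, todo = 0, [root]
        while todo:
            node = todo.pop()
            size += 1
            for w in rev[node]:
                if w not in assigned:
                    assigned.add(w)
                    todo.append(w)
        best = max(best, size)
    return best / len(nodes) if nodes else 0.0
```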

Connectivity in Blogspace
[Plots: fraction of nodes participating in some community; number of communities; number of nodes participating in a community]

Burstiness of communities
[Plot: number of communities in “high state” during each time period]

Are these results a fluke?
• “Randomized Blogspace”: A distribution over time graphs that looks very much like the time graph of Blogspace, but removes some of the locality of the true graph
• Vertices and edges arrive at the same times; each edge has the same source, but a randomly chosen destination
• If randomized blogspace behaves like blogspace, then the observed community structure is spurious
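
The randomization can be sketched as a rewiring of edge destinations. This is a simplified illustration: it draws each destination uniformly from all nodes, ignoring when the destination node arrived and allowing self-loops, which are assumptions that may differ from the distribution used in the study. The function name is hypothetical.

```python
import random

def randomize_destinations(timed_edges, seed=0):
    """Keep each edge's source and timestamp; replace its destination at random."""
    rng = random.Random(seed)
    nodes = sorted({n for (u, v, _) in timed_edges for n in (u, v)})
    return [(u, rng.choice(nodes), t) for (u, _, t) in timed_edges]
```

Running the same SCC and community analyses on the rewired graph gives the baseline against which the real graph is compared.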

SCC evolution
[Plots: Blogspace; Randomized Blogspace]
Randomized Blogspace forms an SCC much earlier

Community evolution
[Plots: Blogspace; Randomized Blogspace]
Blogspace has many more communities

Exogenous events
[Plot series: number of communities identified automatically as exhibiting “bursty” behavior (a measure of cohesiveness of the blogspace); number of blog pages that belong to a community; number of blog communities]
[Annotation: Wired magazine publishes an article on weblogs that impacts the tech community]
[Annotation: Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption]

Some questions …
• Modeling
– Edge arrivals
– “Interesting” events
• Algorithms
– Prediction
– Information percolation
– Search
– o(t · T(n))
• Studies
– Sociological
– Effect on search and ranking

Prediction via blogs
(Gruhl, Guha, Kumar, Novak, Tomkins, 2005)

Blogs as trend indicators
• Can blogs be used to predict trends?
• Data
– Amazon sales rank of some books
– Blog chatter in an index
• Questions
– How well do they correlate?
– Can sales rank be predicted using blogs?
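
One way to ask "how well do they correlate?" is to compute the correlation between the two series at different time lags. A minimal sketch, assuming two equally spaced daily series; the function name and the toy series are hypothetical, and no real sales or blog data is used.

```python
import math

def cross_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag]."""
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / math.sqrt(var_a * var_b)

# Toy series: the second one repeats the first with a one-day delay.
mentions = [0, 1, 5, 1, 0, 0, 0]
sales = [0, 0, 1, 5, 1, 0, 0]
best_lag = max(range(-3, 4), key=lambda k: cross_correlation(mentions, sales, k))
```

A peak at a positive lag means the first series tends to lead the second, which is the situation in which blog chatter could serve as a predictor.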

The Lance Armstrong Performance Program
Query: Lance Armstrong OR Tour de France

Vanity Fair

Cross-correlation for Lance Armstrong

Simple inferences
• How to formulate queries automatically
– Depends on the object (book, movie, …)
– Simple heuristics work well
• Predicting sales motion is hard
• Predicting spikes appears relatively easier
• More to be done …

Blogs and social networks
(Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)

Social networks
• Blog friendship graph is a social network
• Is there a simple model to describe this network?
• Desiderata
– Fit experimental observations
– Exhibit “six degrees of separation”
– Theoretically tractable

RBF: Rank-Based Friendship
• Population network model
• Each person has a geographic location
• d(·, ·) measures geographic distance
• rank_A(B) = #{ C : d(A, C) < d(A, B) }
• Pr[A “befriends” B] ∝ 1/rank_A(B)
– Independent of distance
– Works with arbitrary population densities
• Plus local links to neighbors
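
The rank definition and the resulting friendship probabilities can be sketched directly. In this illustration, locations are planar coordinates with Euclidean distance, C ranges over everyone including A (so A's nearest neighbor has rank 1), and a rank of 0 for a co-located person is clamped to 1; these are assumptions, and the function names are hypothetical.

```python
import math

def rank(people, a, b):
    """rank_A(B): number of people C strictly closer to A than B is."""
    d_ab = math.dist(people[a], people[b])
    return sum(1 for c in people if math.dist(people[a], people[c]) < d_ab)

def friend_probabilities(people, a):
    """Normalized probabilities with weight 1 / rank_A(B) for every B other than A."""
    weights = {b: 1.0 / max(rank(people, a, b), 1) for b in people if b != a}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}
```

Because only the ordering of distances matters, the probabilities are unchanged if every distance is scaled, which is what lets the model work with arbitrary population densities.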

RBF: Preliminary results
• Fits LiveJournal friendship experimental graph data (using geo data in the profile)
• Greedy routing: Is able to route messages from source to destination most of the time, just using geographic information
• Theoretical analysis: Can show that this model guarantees geographic routing to work
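
Greedy geographic routing can be sketched as repeatedly forwarding to the friend closest to the destination. The version below is a simple illustration that gives up when no friend is strictly closer than the current holder; the stopping rule, the `max_hops` guard, and the function name are assumptions rather than the procedure used in the experiments.

```python
import math

def greedy_route(positions, friends, source, target, max_hops=100):
    """Return (path, reached) for greedy forwarding using only geographic positions."""
    path = [source]
    current = source
    while current != target and len(path) <= max_hops:
        here = math.dist(positions[current], positions[target])
        # Pick the friend geographically closest to the target.
        nxt = min(friends[current],
                  key=lambda f: math.dist(positions[f], positions[target]),
                  default=None)
        if nxt is None or math.dist(positions[nxt], positions[target]) >= here:
            return path, False  # stuck: no friend makes progress
        path.append(nxt)
        current = nxt
    return path, current == target
```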