P83-1
People search, TwitterRank and Trendsetters finding in Twitter
Beijing, September, 2012
MSRA NLC Study Group
Yi Lu, Jie Liu
2
• Input: a query specifying an expertise topic, such as "database systems" or "software engineering"
• Output: a list of people ranked by topic relevance
Background: People Search
3
• Input:
• Output:
An Illustrative Example
4
• A student looks for a machine learning supervisor
• A patient looks for doctors who have many successful cases on his disease
• A historian looks for people who have expertise on Maya culture
• A CTO looks for engineers who have related skills
• …
Scenarios of People Search
5
• Identify opinion leaders and experts
• Advertisement
• Turn to somebody for help
• Select a team for a specific task
• A lot of challenges remain.
Motivation
6
• Wisdom of the Crowd
  – Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
  – Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi
• Tweets and Link Relation
  – TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
  – Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He
• #Hashtag and Link Relation
  – Finding Trendsetters in Information Networks (SIGKDD 2012)
  – Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto
Outline
P83-7
Cognos: Crowdsourcing Search for Topic Experts in Microblogs
Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto,
Niloy Ganguly, Krishna P. Gummadi
8
• Twitter is now an important source of current news
  – 500 million users post 400 million tweets daily
• The quality of tweets posted by different users varies widely
  – News, pointless babble, conversational tweets, spam, …
• Challenge: to find topic experts
  – Sources of authoritative information on specific topics
Topic experts in Twitter
9
• Existing approaches
  – Research studies: Pal [WSDM 11], Weng [WSDM 10]
  – Application systems: Twitter Who-To-Follow, Wefollow, …
• Existing approaches primarily rely on information provided by the user herself
  – Bio, contents of tweets, network features, e.g. #followers
• We rely on the "wisdom of the Twitter crowd"
  – How do others describe a user?
Identifying topic experts in Twitter
10
• Challenges in designing a search system for topic experts in Twitter
  – How to infer the topics of expertise of an individual Twitter user?
  – How to rank the relative expertise of users identified as experts on a topic?
Challenges
11
HOW TO INFER TOPICS OF EXPERTISE OF TWITTER USERS?
Challenge #1
12
13
• A feature to organize tweets received from the people whom a user is following
• Create a List, add name & description, add Twitter users to the list
• Tweets from all listed users will be available as a separate List stream
Twitter Lists
14
• Collect Lists containing a given user U
• Identify U's topics from List meta-data
  – Basic NLP techniques
  – Extract nouns and adjectives
• Extracted words are collected to obtain a topic document for the user
  [movies tv hollywood stars entertainment celebrity hollywood …]
Mining Lists to infer expertise
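A minimal sketch of this List-mining step. The List strings, stopword set, and `topic_document` helper are illustrative; the paper's noun/adjective extraction via basic NLP is approximated here by a stopword filter:

```python
import re
from collections import Counter

# Small illustrative stopword set (the real pipeline keeps nouns/adjectives
# via POS tagging instead of stopword filtering).
STOPWORDS = {"the", "a", "an", "of", "and", "my", "i", "follow", "to", "in"}

def topic_document(list_metadata):
    """Aggregate names/descriptions of Lists containing a user
    into a single topic document (word -> frequency)."""
    words = []
    for text in list_metadata:
        for w in re.findall(r"[a-z]+", text.lower()):
            if len(w) >= 3 and w not in STOPWORDS:
                words.append(w)
    return Counter(words)

# Hypothetical Lists that include one user
lists = [
    "Hollywood stars",
    "entertainment celebrity news",
    "movies and tv I follow",
]
doc = topic_document(lists)
```

The resulting counter plays the role of the slide's topic document (e.g. [movies tv hollywood stars entertainment celebrity …]).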
15
• Collected Lists of 55 million Twitter users who joined before or in 2009
• All analyses consider 1.3 million users who are included in 10 or more Lists
Dataset
16
linux, tech, open, software, libre, gnu, computer, developer, ubuntu, unix
politics, senator, congress, government, republicans, Iowa, gop, conservative
politics, senate, government, congress, democrats, Missouri, progressive, women
Topics extracted from Lists
17
love, daily, people, time, GUI, movie, video, life, happy, game, cool
Most common words from tweets
celeb, actor, famous, movie, stars, comedy, music, Hollywood, pop culture
Most common words from Lists
Profile bio
Lists vs. other features
18
Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono
Most common words from tweets
celeb, funny, humor, music, movies, laugh, comics, television, entertainers
Most common words from Lists
Profile bio
Lists vs. other features
19
• Top 20 WTF results for 200 queries: 3,495 users
• Do the results returned by Cognos cover the results returned by Twitter WTF?
• For 83.4% of users, yes
• For the remaining 16.6%, manual inspection of a random sample shows two major cases
Cognos vs. Twitter Who-To-Follow
20
Case 1 – topics inferred from Lists include semantically similar words, but not the exact query-word
• We can find Twitterer dineLA in Twitter if the query is "dining"
  – Topics from Lists: food, restaurant, recipes, Los Angeles
• We can find space explorer HubbleHugger77 in Twitter if the query is "hubble"
  – Topics from Lists: science, tech, space, cosmology, NASA
More than one way to express an idea
21
Case 2 – results returned by Twitter are unrelated to the query
• We can find Comedian jimmyfallon in Twitter if the query is "astrophysicist"
  – Topics from Lists: celebs, comedy, humor, actor
22
• List-based method provides accurate & comprehensive inference of topics of expertise of Twitter users
• In many cases, more accurate than existing approaches that utilize profile information or tweets
Inferring expertise: Summary
23
HOW TO RANK EXPERTS ON A TOPIC?
Challenge #2
24
• A ranking scheme based solely on Lists
• Two components for ranking user U w.r.t. query Q
  – Relevance of the user to the query: cover density ranking between the topic document TU of the user and Q
  – Popularity of the user: number of Lists including the user

score(U, Q) = topic relevance(TU, Q) × log(#Lists including U)
Ranking experts
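The scoring rule above can be sketched as follows. The cover density relevance is replaced by a simple term-overlap stand-in, and the users, topic documents, and List counts are made up for illustration:

```python
import math

def relevance(topic_doc, query_terms):
    """Stand-in for the paper's cover density ranking: the fraction of
    query terms that appear in the user's topic document."""
    hits = sum(1 for t in query_terms if t in topic_doc)
    return hits / len(query_terms)

def expert_score(topic_doc, n_lists, query_terms):
    # Slide's rule: topic relevance(TU, Q) x log(#Lists including U)
    return relevance(topic_doc, query_terms) * math.log(n_lists)

# Hypothetical users: (topic document from Lists, #Lists including the user)
u1 = ({"linux", "tech", "software", "developer"}, 500)
u2 = ({"linux", "music"}, 5000)
query = ["linux", "software"]
s1 = expert_score(u1[0], u1[1], query)
s2 = expert_score(u2[0], u2[1], query)
```

Note how the log dampens raw popularity: the far more listed but less relevant user still ranks below the on-topic one.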
25
• Search system for topic experts in Twitter
• Given a query (topic)
  – Identify experts on the topic using Lists
  – Rank identified experts
Cognos
26
Cognos results for “politics”
27
Cognos results for “stem cell”
28
• System deployed and evaluated 'in-the-wild'
• Evaluators were students & researchers from the three home institutes of the authors
Evaluation of Cognos
29
User-evaluation of Cognos
30
Sample queries for evaluation
31
• Overall 2,136 relevance judgments
  – 1,680 said relevant (78.7%)
• Large amount of subjectivity in evaluations
  – The same result for the same query received both relevant and non-relevant judgments
  – E.g., for the query "cloud computing", Werner Vogels (chief technology officer and vice president of Amazon.com in Seattle) got 4 relevant and 6 non-relevant judgments
Evaluation results
32
• Considered only the results evaluated at least twice
• A result is said to be relevant if it was voted relevant in the majority of evaluations
• Mean Average Precision considering top-10 results: 93.9%
Evaluation results
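The metric above can be sketched as follows; the per-query judgment lists are hypothetical, not the paper's data:

```python
def average_precision(relevant_flags):
    """AP over one ranked result list; relevant_flags[i] is True if the
    (i+1)-th result was judged relevant by majority vote."""
    hits, total = 0, 0.0
    for i, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / i   # precision at each relevant rank
    return total / hits if hits else 0.0

def mean_average_precision(per_query_flags):
    """MAP: mean of per-query average precisions."""
    return sum(average_precision(f) for f in per_query_flags) / len(per_query_flags)

# Hypothetical majority-vote judgments for the top-10 results of two queries
q1 = [True] * 10                       # all ten results relevant
q2 = [True, False] + [True] * 8        # one miss at rank 2
map10 = mean_average_precision([q1, q2])
```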
33
Cognos vs. Twitter Who-To-Follow
34
• Considering 27 distinct queries asked at least twice
• Judgment by majority voting
• Cognos judged better on 12 queries
  – Computer science, Linux, Mac, Apple, iPad, Internet, Windows Phone, photography, political journalist, …
• Twitter Who-To-Follow judged better on 11 queries
  – Music, Sachin Tendulkar, Angelina Jolie, Harry Potter, Metallica, cloud computing, IIT Kharagpur, …
Cognos vs. Twitter Who-To-Follow
35
Results for query music
P83-36
Questions
37
• Wisdom of the Crowd
  – Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
  – Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi
• Tweets and Link Relation
  – TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
  – Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He
• #Hashtag and Link Relation
  – Finding Trendsetters in Information Networks (SIGKDD 2012)
  – Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto
Outline
P83-38
TwitterRank: Finding Topic-sensitive Influential Twitterers
Jianshu Weng, Ee-Peng Lim, Jing Jiang (Singapore Management University)
Qi He (Pennsylvania State University)
39
• Introduction
• Dataset
• Topic Modeling
• TwitterRank
Outline
40
• Given a set of twitterers, find the influential ones
  – for different topics
• Challenges:
  – Topics unknown
Introduction
41
• Introduction
• Dataset
• Topic Modeling
• TwitterRank
Outline
42
• Crawled S = a set of Singapore-based twitterers from twitterholic.com with the highest number of followers
• For each twitterer in S, crawled its followers and friends to obtain the full set S*
• For each twitterer in S*, get its published tweets; denote the set of all tweets as T
Data preparation
43
|S| : 996
|S*| : 6,748 (4,050 with more than 10 tweets)
|T| (all tweets) : 1,021,039
# following relationships : 49,872
Min / Max / Avg #tweets per twitterer : 1 / 3,200 / 179.57
Data preparation
44
Reciprocity in the Following Relationships
• Friends count = # twitterers being followed
• Followers count = # twitterers following
• Correlation between friends count and followers count
45
Reciprocity in the Following Relationships
• 72.4% of the users follow more than 80% of their followers
• 80.5% of the users have 80% of their friends follow them back
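A sketch of how such reciprocity ratios can be computed from a following graph. The toy graph and the `reciprocity_stats` helper are illustrative, not from the paper:

```python
def reciprocity_stats(following):
    """following[u] = set of users u follows (u's 'friends').
    Returns, per user, the fraction of u's followers that u follows back."""
    # Invert the following relation to get each user's followers
    followers = {u: set() for u in following}
    for u, friends in following.items():
        for v in friends:
            followers.setdefault(v, set()).add(u)
    ratios = {}
    for u in following:
        if followers.get(u):
            back = len(followers[u] & following[u])
            ratios[u] = back / len(followers[u])
    return ratios

# Toy network: a<->b reciprocal, b follows c but c follows nobody
toy = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
r = reciprocity_stats(toy)
```

On the full dataset, the slide's statistics correspond to the share of users whose ratio exceeds 0.8.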
46
• Homophily
• Twitterers with "following" relationships are more similar than those without, according to the topics they are interested in.
Explanations
47
• Introduction
• Dataset
• Topic Modeling
• TwitterRank
Outline
48
• Apply LDA to distill topics automatically
• Find topics in the twitterer's content to represent his interests
  – Twitterer's content = aggregated tweets
• Pre-processing
  – Use only words without non-English characters
  – Minimum word length = 3
  – Remove @userid, URLs, all-digit words, stopwords
• Apply the analysis to twitterers with more than 10 tweets (#twitterers = 4,050)
Topic Distillation
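The pre-processing bullets above can be sketched directly; the stopword list and sample tweet are illustrative:

```python
import re

# Small illustrative stopword set
STOPWORDS = {"the", "and", "for", "with", "this", "that", "you", "are"}

def preprocess(tweets):
    """Pre-processing from the slide: drop @userid mentions, URLs,
    all-digit tokens, tokens with non-English characters, words shorter
    than 3 characters, and stopwords. Returns the twitterer's
    aggregated 'content' as a token list."""
    tokens = []
    for tweet in tweets:
        tweet = re.sub(r"@\w+", " ", tweet)          # remove @userid
        tweet = re.sub(r"https?://\S+", " ", tweet)  # remove URLs
        for tok in tweet.split():
            # keep only purely alphabetic tokens (drops digits and
            # tokens containing non-English characters)
            if not re.fullmatch(r"[A-Za-z]+", tok):
                continue
            tok = tok.lower()
            if len(tok) >= 3 and tok not in STOPWORDS:
                tokens.append(tok)
    return tokens

tweets = ["@bob check http://t.co/x great LDA tutorial for topic models 2010"]
content = preprocess(tweets)
```

The resulting token lists per twitterer would then be fed to an LDA implementation to obtain the DT, WT, and Z structures described next.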
49
• Three matrices:
  – DT, a D × T matrix, where D is the number of twitterers and T is the number of topics. DT_{dt} contains the number of times a word in the tweets of twitterer d has been assigned to topic t
  – WT, a W × T matrix, where W is the number of unique words used in the tweets and T is the number of topics. WT_{wt} captures the number of times unique word w has been assigned to topic t
  – Z, a 1 × N vector, where N is the total number of words in the tweets. Z_n is the topic assignment for word n
Results of Topic Distillation
50
• Introduction
• Dataset
• Topic Modeling
• TwitterRank
Outline
51
• A topic-specific random walk model is applied to calculate each twitterer's influence score
• The transition matrix for topic t is denoted P_t; the transition probability of the random surfer from follower s_i to friend s_j is:

  P_t(i, j) = ( |τ_j| / Σ_{a: s_i follows s_a} |τ_a| ) × sim_t(i, j)

  – where the sum runs over the set of s_i's friends and |τ_a| is the number of tweets published by s_a
  – sim_t(i, j) = 1 − |DT'_{it} − DT'_{jt}|, where DT' is the row-normalized form of matrix DT
Topic-specific TwitterRank
52
• This captures two notions:
  – The more s_j publishes, the higher the portion of tweets s_i reads that come from s_j. Generally, this leads to a higher influence of s_j on s_i
  – s_j's influence on s_i is also related to the topical similarity between the two, as suggested by the homophily phenomenon
Topic-specific TwitterRank
53
• Topic-specific teleportation
• The influence scores of twitterers are calculated iteratively, as the stationary distribution of the topic-specific random walk with teleportation:

  TR_t = γ P_t^T × TR_t + (1 − γ) E_t

  – E_t is the t-th column of matrix DT'', which is the column-normalized form of matrix DT
Topic-specific TwitterRank
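A sketch of the iterative calculation for one topic, assuming the transition matrix P_t and teleportation vector E_t have already been built; the toy values below are made up for illustration:

```python
def twitterrank(P, E, gamma=0.85, iters=100):
    """Power iteration: TR = gamma * P^T TR + (1 - gamma) * E.
    P[i][j] is the surfer's transition probability from follower i to
    friend j (rows sum to 1); E is the topic's teleportation vector
    (a column of the column-normalized topic matrix)."""
    n = len(E)
    TR = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - gamma) * E[j] for j in range(n)]
        for i in range(n):
            for j in range(n):
                new[j] += gamma * P[i][j] * TR[i]
        TR = new
    return TR

# Toy network of 3 twitterers; P and E are made-up for illustration
P = [[0.0, 0.7, 0.3],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
E = [0.5, 0.3, 0.2]
scores = twitterrank(P, E)
```

Since each row of P sums to 1 and E is a probability vector, the scores remain a probability distribution across iterations.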
54
• The topic-specific scores can be aggregated as a weighted sum over topics, TR = Σ_t r_t · TR_t
• General influence: the weights r_t can be set as the probabilities of the different topics' presence
• Perceived general influence: r_t can also be set as the probabilities that a particular twitterer is interested in the different topics
Aggregation of Topic-specific TwitterRank
P83-55
Questions
56
Outline
• Wisdom of the Crowd
  – Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
  – Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi
• Tweets and Link Relation
  – TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
  – Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He
• #Hashtag and Link Relation
  – Finding Trendsetters in Information Networks (SIGKDD 2012)
  – Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto
P83-57
Finding Trendsetters in InformationNetworks
P83-58
What is a Trendsetter?
P83-59
What is a Trendsetter?
Trendsetters are people who:
• adopt and spread new trends before these trends become popular;
• propagate these trends over the network.
P83-60
Finding trendsetters in a graph
P83-61
Who are the trendsetters?
P83-62
Key Point
P83-63
Time
P83-64
How to find Trendsetters?
P83-65
Weight edges and run PageRank
P83-66
Topics and Influence Model
P83-67
Topics
Topic: a collection of trends (URLs, memes, #hashtags, quotes, etc.)
For each node we store the timestamp at which it adopts a trend.
P83-68
Graph
• We denote G_k(V_k, E_k) as the induced graph of G(N, E) over the topic k
• The set V_k is obtained by considering all nodes of N that used at least one trend of k
• The set E_k represents all edges (u, v) such that (u, v) ∈ E and u, v ∈ V_k
P83-69
Weight Edges
Let t_i(v) be the time when node v ∈ V_k adopts trend i of topic k (t_i(v) = 0 if v does not adopt i).

We define two vectors, s1(v) (for all v ∈ V_k) and s2(u, v) (for all (u, v) ∈ E_k), each one with components given respectively by:

  s1(v)_i = 1 if t_i(v) > 0, and 0 otherwise

and

  s2(u, v)_i = e^(−Δ_i · α) if t_i(v) > 0 and t_i(v) < t_i(u), and 0 otherwise

for i = 1, …, N_k, where N_k is the number of trends of topic k, Δ_i = t_i(u) − t_i(v), and α > 0.
P83-70
Weight Edges
Vector s1(v) records whether node v adopted (or not) each trend of k, while s2(u, v) records whether u adopted these trends after v, weighting the relation as a function of the period of time between t_i(v) and t_i(u).

For a fixed α, if Δ → 0+ then e^(−Δα) → 1, and if Δ → +∞ then e^(−Δα) → 0. These limits mean that if node u adopts a trend just after v, then s2(u, v)_i is very close to s1(v)_i.
P83-71
Weight Edges
Let G_k(V_k, E_k) be the induced graph of a network G(N, E) over a topic k with N_k trends. For each (u, v) ∈ E_k we define the influence of v over u by:

  I(u, v) = ( L(s2(u, v)) / N_k ) × ( s1(v) · s2(u, v) ) / ( ||s1(v)|| × ||s2(u, v)|| )

where the operator · refers to the scalar product, ||x|| to the Euclidean norm of any vector x, and L(s2(u, v)) to the number of components of s2(u, v) that are different from 0. If ||s2(u, v)|| = 0, we define I(u, v) = 0. It is important to notice that, by definition, ||s1(v)|| ≠ 0 for all v ∈ V_k.
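A sketch of this edge-weight computation; the adoption times are hypothetical and `edge_weight` is an illustrative helper:

```python
import math

def edge_weight(t_v, t_u, alpha=1.0):
    """Influence I(u, v) of v over u for one topic, per the slide:
    s1(v)_i = 1 if v adopted trend i; s2(u, v)_i = exp(-alpha * delta_i)
    if u adopted trend i after v; the weight is the cosine between s1
    and s2, scaled by the fraction of trends on which v preceded u.
    t_v[i] / t_u[i]: adoption time of trend i (0 = never adopted)."""
    n_k = len(t_v)
    s1 = [1.0 if t_v[i] > 0 else 0.0 for i in range(n_k)]
    s2 = [math.exp(-alpha * (t_u[i] - t_v[i]))
          if t_v[i] > 0 and t_v[i] < t_u[i] else 0.0
          for i in range(n_k)]
    norm2 = math.sqrt(sum(x * x for x in s2))
    if norm2 == 0.0:
        return 0.0                      # v never preceded u on any trend
    norm1 = math.sqrt(sum(x * x for x in s1))
    dot = sum(a * b for a, b in zip(s1, s2))
    L = sum(1 for x in s2 if x != 0.0)  # nonzero components of s2
    return (L / n_k) * dot / (norm1 * norm2)

# v adopts trends 0 and 1 early; u adopts trend 0 shortly after v
w = edge_weight(t_v=[1.0, 2.0, 0.0], t_u=[1.5, 0.0, 3.0])
```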
P83-72
One important fact is that u can be influenced to adopt a trend of k by several nodes in V_k. So, we normalize I(u, v) as follows:

Definition: Î(u, v) = I(u, v) / Σ_{z: (u, z) ∈ E_k} I(u, z)

Normalize
P83-73
TS Ranking
Definition
The trendsetters (TS) rank of node v in a network G_k, denoted TS(v), is given by:

  TS(v) = d × Σ_{u: (u, v) ∈ E_k} Î(u, v) · TS(u) + (1 − d) · p(v)

where 0 ≤ d ≤ 1 is the damping factor and p is a probability distribution over all nodes of G_k. In this paper, a uniform p(v) = 1/|V_k| for all v ∈ V_k is considered, but this distribution could be topic dependent.
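A sketch of the TS iteration, assuming the raw influence values I(u, v) have already been computed; the in-code normalization over each node's possible influencers and the toy graph are illustrative:

```python
def ts_rank(influence, n_nodes, d=0.85, iters=100):
    """Trendsetter rank sketch:
    TS(v) = d * sum_u w(u, v) * TS(u) + (1 - d) * p(v),
    where w(u, v) is the influence of v over u normalized over all
    nodes that could have influenced u, and p is uniform."""
    # influence[(u, v)] = I(u, v): raw influence of v over u
    total_in = {}
    for (u, v), w in influence.items():
        total_in[u] = total_in.get(u, 0.0) + w
    norm = {(u, v): w / total_in[u]
            for (u, v), w in influence.items() if total_in[u] > 0}
    ts = [1.0 / n_nodes] * n_nodes
    for _ in range(iters):
        new = [(1.0 - d) / n_nodes] * n_nodes
        for (u, v), w in norm.items():
            new[v] += d * w * ts[u]     # v gains rank from influencing u
        ts = new
    return ts

# Toy graph: nodes 1 and 2 were both influenced (only) by node 0
scores = ts_rank({(1, 0): 0.8, (2, 0): 0.5}, n_nodes=3)
```

The node that spread the trends to others ends up ranked highest, which is the intended trendsetter behaviour.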
P83-74
Evaluation
P83-75
Baseline
• In-degree ranking
• PageRank
P83-76
Dataset
Twitter data until August 2009:
• Over 50 million users with all their followers and followees
• 1.6 billion tweets
We use #hashtags as trends.
P83-77
Example:Iran Elections on Twitter
P83-78
Example
Iran Elections: {#iran, #iranelections, #tehran}
TS: @Lara ("Reporting from the Middle East")
PR: @cnnbr ("CNN Breaking News")
P83-79
Category | #Topics | Example Hashtags | #Tweets
Celebrity | 16 | #michaeljackson, #niley | 1,036,101
Games | 13 | #mafiawars, #ps3 | 2,556,437
Idioms | 35 | #musicmonday, #followfriday | 7,882,209
Movies | 29 | #heroes, #tv | 1,769,945
Music | 33 | #lastfm, #musicmonday | 2,785,522
None | 153 | #quotes, #sale | 2,227,971
Political | 39 | #honduras, #iranelection | 8,156,786
Sports | 27 | #soccer, #rugby | 1,914,061
Technology | 41 | #twitter, #android | 7,459,471

We use the #hashtag classification made by Romero et al.
P83-80
Trendsetters: early adopters?
P83-81
Experiments I
[Figure: % of Top-100 users adopting before the peak (y-axis, 0–100), per category (Idioms, Games, Political, None, Movies, Technology, Sports, Celebrity, Music); bars compare InDegree, PageRank, and TrendSetters]
P83-82
In-degree vs adoption time
P83-83
Experiments II
[Figure: node in-degree (y-axis, ×10^4, 0–2.5) vs. time relative to the peak (x-axis, −100 to 20); curves compare ID, PR, and TS]
P83-84
Influenced Followers Ratio
P83-85
Influenced Followers Ratio
IF_k(v) is the fraction of followers of v that adopted at least one trend of the topic k after v.
Category | ID (%) | PR (%) | TS (%)
POLITICAL | 0.013 | 0.084 | 0.174
CELEBRITY | 0.015 | 0.089 | 0.148
MUSIC | 0.013 | 0.096 | 0.160
GAMES | 0.022 | 0.058 | 0.115
SPORTS | 0.004 | 0.054 | 0.098
IDIOMS | 0.001 | 0.034 | 0.088
NONE | 0.011 | 0.001 | 0.085
TECHNOLOGY | 0.006 | 0.054 | 0.078
MOVIES | 0.006 | 0.043 | 0.067
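A sketch of this ratio for a single node; the follower IDs and adoption times are made up, and `influenced_followers_ratio` is an illustrative helper:

```python
def influenced_followers_ratio(followers, t_v, t_followers):
    """IF_k(v): fraction of v's followers that adopted at least one
    trend of topic k after v did.
    t_v[i]: time v adopted trend i (0 = never adopted);
    t_followers[f][i]: the same for follower f."""
    if not followers:
        return 0.0
    influenced = 0
    for f in followers:
        if any(t_v[i] > 0 and t_v[i] < t_followers[f][i]
               for i in range(len(t_v))):
            influenced += 1
    return influenced / len(followers)

# Toy: v adopted trend 0 at time 1.0; follower "a" adopted it later,
# follower "b" adopted it earlier, so only "a" counts as influenced.
ratio = influenced_followers_ratio(
    followers=["a", "b"],
    t_v=[1.0, 0.0],
    t_followers={"a": [2.0, 5.0], "b": [0.5, 0.0]},
)
```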
P83-86
Ranking with Partial Information
[Figure: number of Top-100 users found (y-axis, 0–100) vs. ratio of users considered, sorted by time (x-axis, 0.1–1.0); TS and PR curves for #musicmonday, #iranelection, #swineflu, #followfriday, #mw2, #fb, #f1, and #michaeljackson]
P83-87
Final Remarks
Usually, follower hubs (celebrities) are late adopters.
Trendsetters have lower in-degree, but they spread new ideas.
P83-88
Questions