people search @ study group msra nlc

People search, TwitterRank and Trendsetters finding in Twitter Beijing, September, 2012 MSRA NLC Study Group Yi Lu, Jie Liu 1

Upload: yi-lu

Post on 23-Dec-2014

TRANSCRIPT

Page 1: People Search @ Study Group MSRA NLC

P83-1

People search, TwitterRank and Trendsetters finding in Twitter

Beijing, September, 2012

MSRA NLC Study Group

Yi Lu, Jie Liu

Page 2: People Search @ Study Group MSRA NLC

2

• Input: A query naming an expertise topic, such as database systems or software engineering.

• Output: A list of people ranked by relevance to the topic.

Query → List of people

Background: People Search

Page 3: People Search @ Study Group MSRA NLC

3

• Input:

• Output:

An Illustrative Example

Page 4: People Search @ Study Group MSRA NLC

4

• A student looks for a machine learning supervisor

• A patient looks for doctors who have treated many cases of his disease successfully

• A historian looks for people who have expertise in Maya culture

• A CTO looks for engineers who have related skills

• …

Scenarios of People Search

Page 5: People Search @ Study Group MSRA NLC

5

• Identify opinion leaders, experts
• Advertisement
• Turn to somebody for help
• Select a team to do a specific task
• Many challenges remain.

Motivation

Page 6: People Search @ Study Group MSRA NLC

6

• Wisdom of the Crowd
  – Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
  – Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi
• Tweets and Link Relation
  – TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
  – Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He
• #Hashtag and Link Relation
  – Finding Trendsetters in Information Networks (SIGKDD 2012)
  – Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Outline

Page 7: People Search @ Study Group MSRA NLC

P83-7

Cognos: Crowdsourcing Search for Topic Experts in Microblogs

Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto,

Niloy Ganguly, Krishna P. Gummadi

Page 8: People Search @ Study Group MSRA NLC

8

• Twitter is now an important source of current news
  – 500 million users post 400 million tweets daily
• Quality of tweets posted by different users varies widely
  – News, pointless babble, conversational tweets, spam, …
• Challenge: to find topic experts
  – Sources of authoritative information on specific topics

Topic experts in Twitter

Page 9: People Search @ Study Group MSRA NLC

9

• Existing approaches
  – Research studies: Pal [WSDM 11], Weng [WSDM 10]
  – Application systems: Twitter Who-To-Follow, Wefollow, …
• Existing approaches primarily rely on information provided by the user herself
  – Bio, contents of tweets, network features e.g. #followers
• We rely on “wisdom of the Twitter crowd”
  – How do others describe a user?

Identifying topic experts in Twitter

Page 10: People Search @ Study Group MSRA NLC

10

• Challenges in designing a search system for topic experts in Twitter
  – How to infer topics of expertise of an individual Twitter user?
  – How to rank the relative expertise of users identified as experts on a topic?

Challenges

Page 11: People Search @ Study Group MSRA NLC

11

HOW TO INFER TOPICS OF EXPERTISE OF TWITTER USERS?

Challenge 1

Page 12: People Search @ Study Group MSRA NLC

12

Page 13: People Search @ Study Group MSRA NLC

13

• A feature to organize tweets received from the people whom a user is following

• Create a List, add name & description, add Twitter users to the list

• Tweets from all listed users will be available as a separate List stream

Twitter Lists

Page 14: People Search @ Study Group MSRA NLC

14

• Collect Lists containing a given user U

• Identify U’s topics from List meta-data
  – Basic NLP techniques
  – Extract nouns and adjectives

• The extracted words are collected to obtain a topic document for the user

[movies tv hollywood stars entertainment celebrity hollywood …]

Mining Lists to infer expertise
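
The List-mining step can be sketched roughly as follows. This is a minimal illustration, not the Cognos implementation: the function name `topic_document`, the sample Lists and the stopword set are hypothetical, and plain stopword filtering stands in for the noun/adjective extraction that the slide mentions.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stopword list. A part-of-speech tagger that keeps
# only nouns and adjectives is replaced here by simple stopword filtering.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "my", "who", "with"}

def topic_document(list_metadata):
    """Build a bag-of-words topic document from (name, description) pairs
    of all Lists that contain a given user U."""
    words = Counter()
    for name, description in list_metadata:
        text = f"{name} {description}".lower()
        for token in re.findall(r"[a-z]+", text):
            if token not in STOPWORDS and len(token) > 1:
                words[token] += 1
    return words

# Two made-up Lists that both contain user U.
lists_containing_u = [
    ("Hollywood stars", "Celebrity and entertainment news"),
    ("movies-tv", "Hollywood movies and TV"),
]
doc = topic_document(lists_containing_u)
```

Words that many List creators use for the same user (here `hollywood`, `movies`, `tv`) accumulate higher counts, which is what makes the topic document a crowd-sourced description of the user.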

Page 15: People Search @ Study Group MSRA NLC

15

• Collected Lists of 55 million Twitter users who joined before or in 2009

• All analyses consider 1.3 million users who are included in 10 or more Lists

Dataset

Page 16: People Search @ Study Group MSRA NLC

16

linux, tech, open, software, libre, gnu, computer, developer, ubuntu, unix

politics, senator, congress, government, republicans, Iowa, gop, conservative

politics, senate, government, congress, democrats, Missouri, progressive, women

Topics extracted from Lists

Page 17: People Search @ Study Group MSRA NLC

17

love, daily, people, time, GUI, movie, video, life, happy, game, cool

Most common words from tweets

celeb, actor, famous, movie, stars, comedy, music, Hollywood, pop culture

Most common words from Lists

Profile bio

Lists vs. other features

Page 18: People Search @ Study Group MSRA NLC

18

Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono

Most common words from tweets

celeb, funny, humor, music, movies, laugh, comics, television, entertainers

Most common words from Lists

Profile bio

Lists vs. other features

Page 19: People Search @ Study Group MSRA NLC

19

• Top 20 Who-To-Follow (WTF) results for 200 queries: 3495 users

• Do the results returned by Cognos cover the results returned by Twitter WTF?

• For 83.4%, yes

• Among the remaining 16.6%, manual inspection of a random sample shows two major cases

Cognos vs. Twitter Who-To-Follow

Page 20: People Search @ Study Group MSRA NLC

20

Case 1 – topics inferred from Lists include semantically similar words, but not the exact query word

• Twitter finds the Twitterer dineLA for the query “dining”
  – Topics from Lists: food, restaurant, recipes, Los Angeles
• Twitter finds the Twitterer HubbleHugger77 (space explorer) for the query “hubble”
  – Topics from Lists: science, tech, space, cosmology, NASA

More than one way to express an idea

Page 21: People Search @ Study Group MSRA NLC

21

Case 2 – results by Twitter unrelated to query

• Twitter finds the comedian jimmyfallon for the query “astrophysicist”
  – Topics from Lists: celebs, comedy, humor, actor

Results returned by Twitter are unrelated

Page 22: People Search @ Study Group MSRA NLC

22

• List-based method provides accurate & comprehensive inference of topics of expertise of Twitter users

• In many cases, more accurate than existing approaches that utilize profile information or tweets

Inferring expertise: Summary

Page 23: People Search @ Study Group MSRA NLC

23

HOW TO RANK EXPERTS ON A TOPIC?

Challenge 2

Page 24: People Search @ Study Group MSRA NLC

24

• Used a ranking scheme based solely on Lists
• Two components of ranking user U w.r.t. query Q
  – Relevance of user to query: cover density ranking between topic document TU of user and Q
  – Popularity of user: number of Lists including the user

Score of U for Q = Topic relevance(TU, Q) × log(#Lists including U)

Ranking experts
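
A minimal sketch of this scoring rule follows. The relevance function is an assumption: Cognos uses cover density ranking, which is replaced here by the fraction of query terms found in the topic document. The natural logarithm, the function names and the sample users are likewise illustrative.

```python
import math

def topic_relevance(topic_doc, query):
    """Stand-in relevance (an assumption, not cover density ranking):
    fraction of query terms that occur in the user's topic document."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in topic_doc) / len(terms)

def expert_score(topic_doc, n_lists, query):
    # score = Topic relevance(T_U, Q) x log(#Lists including U)
    return topic_relevance(topic_doc, query) * math.log(n_lists)

# Hypothetical users: (topic document as a set of words, number of Lists).
users = {
    "alice": ({"linux", "software", "ubuntu"}, 500),
    "bob": ({"linux", "politics"}, 20),
    "carol": ({"movies", "celebrity"}, 10000),
}
ranked = sorted(users, key=lambda u: expert_score(*users[u], "linux"), reverse=True)
```

The logarithm damps raw popularity, so a very widely listed but off-topic account (`carol`) still ranks below on-topic accounts.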

Page 25: People Search @ Study Group MSRA NLC

25

• Search system for topic experts in Twitter

• Given a query (topic)
  – Identify experts on the topic using Lists
  – Rank identified experts

Cognos

Page 26: People Search @ Study Group MSRA NLC

26

Cognos results for “politics”

Page 27: People Search @ Study Group MSRA NLC

27

Cognos results for “stem cell”

Page 28: People Search @ Study Group MSRA NLC

28

• System deployed and evaluated ‘in-the-wild’
• Evaluators were students & researchers from the three home institutes of the authors

Evaluation of Cognos

Page 29: People Search @ Study Group MSRA NLC

29

User-evaluation of Cognos

Page 30: People Search @ Study Group MSRA NLC

30

Sample queries for evaluation

Page 31: People Search @ Study Group MSRA NLC

31

• Overall 2136 relevance judgments
  – 1680 said relevant (78.7%)
• Large amount of subjectivity in evaluations
  – Same result for same query received both relevant and non-relevant judgments
  – E.g., for query “cloud computing”, Werner Vogels (chief technology officer and vice president of Amazon.com, in Seattle) got 4 relevant judgments and 6 non-relevant judgments

Evaluation results

Page 32: People Search @ Study Group MSRA NLC

32

• Considered only the results evaluated at least twice

• Result said to be relevant if voted relevant in the majority of evaluations

• Mean Average Precision considering top 10 results: 93.9 %

Evaluation results

Page 33: People Search @ Study Group MSRA NLC

33

Cognos vs. Twitter Who-To-Follow

Page 34: People Search @ Study Group MSRA NLC

34

• Considering 27 distinct queries asked at least twice
• Judgment by majority voting
• Cognos judged better on 12 queries
  – Computer science, Linux, Mac, Apple, Ipad, Internet, Windows phone, photography, political journalist, …
• Twitter Who-To-Follow judged better on 11 queries
  – Music, Sachin Tendulkar, Anjelina Jolie, Harry Potter, metallica, cloud computing, IIT Kharagpur, …

Cognos vs. Twitter Who-To-Follow

Page 35: People Search @ Study Group MSRA NLC

35

Results for query music

Page 36: People Search @ Study Group MSRA NLC

P83-36

Questions

Page 37: People Search @ Study Group MSRA NLC

37

• Wisdom of the Crowd
  – Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
  – Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi
• Tweets and Link Relation
  – TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
  – Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He
• #Hashtag and Link Relation
  – Finding Trendsetters in Information Networks (SIGKDD 2012)
  – Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Outline

Page 38: People Search @ Study Group MSRA NLC

P83-38

TwitterRank: Finding Topic-sensitive Influential Twitterers

Jianshu Weng, Ee-Peng Lim, Jing Jiang
Singapore Management University

Qi He
Pennsylvania State University

Page 39: People Search @ Study Group MSRA NLC

39

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

Page 40: People Search @ Study Group MSRA NLC

40

• Given a set of twitterers, find the influential ones
  – for different topics
• Challenges:
  – Topics unknown

Introduction

Page 41: People Search @ Study Group MSRA NLC

41

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

Page 42: People Search @ Study Group MSRA NLC

42

• Crawled S = a set of Singapore-based twitterers from twitterholic.com with the highest numbers of followers.

• For each s ∈ S, crawled its followers and friends, giving the extended set S*

• For each s ∈ S*, get its published tweets. Denote the set of all tweets as T

Data preparation

Page 43: People Search @ Study Group MSRA NLC

43

|S| 996

|S*| 6748 (4050 with more than 10 tweets)

|T| 1,021,039

# following relationships 49,872

Min/Max/Avg #tweets/twitterer 1/3200/179.57

Data preparation

Page 44: People Search @ Study Group MSRA NLC

44

Reciprocity in the Following Relationships

• Friend count = # twitterers being followed
• Follower count = # twitterers following
• Correlation between friend count and follower count

Page 45: People Search @ Study Group MSRA NLC

45

Reciprocity in the Following Relationships

• 72.4% of the users follow more than 80% of their followers

• 80.5% of the users have 80% of their friends follow them back

Page 46: People Search @ Study Group MSRA NLC

46

• Homophily
• Twitterers with “following” relationships are more similar than those without, according to the topics they are interested in.

Explanations

Page 47: People Search @ Study Group MSRA NLC

47

• Introduction
• Dataset
• Topic Modeling
• TwitterRank

Outline

Page 48: People Search @ Study Group MSRA NLC

48

• Apply LDA to distill topics automatically
• Find topics in the twitterer’s content to represent his interests
  – Twitterer’s content = aggregated tweets
• Pre-processing
  – Use only those words without non-English characters
  – Min word length = 3
  – Remove @userid, URLs, all-digit words, stopwords
• Apply analysis on twitterers with more than 10 tweets (#twitterers = 4050)

Topic Distillation
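
The pre-processing rules above could be sketched as below. This is an illustration under stated assumptions: "non-English characters" is read as anything outside a small ASCII set, the stopword list is a hypothetical abbreviation, and the function names are not from the paper.

```python
import re

# Hypothetical, abbreviated stopword list.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}

def preprocess(tweet):
    """Apply the slide's filters to a single tweet and return its tokens."""
    tokens = []
    for tok in tweet.split():
        # Remove @userid mentions and URLs.
        if tok.startswith("@") or tok.lower().startswith(("http://", "https://", "www.")):
            continue
        tok = tok.strip(".,!?;:\"'()").lower()
        # Minimum word length 3; drop all-digit words and stopwords.
        if len(tok) < 3 or tok.isdigit() or tok in STOPWORDS:
            continue
        # Keep only words made of ASCII letters, digits and a few symbols
        # (assumed reading of "without non-English characters").
        if not re.fullmatch(r"[a-z0-9#_'-]+", tok):
            continue
        tokens.append(tok)
    return tokens

def twitterer_document(tweets):
    """Twitterer's content = aggregated tweets (one token list per twitterer)."""
    return [tok for tweet in tweets for tok in preprocess(tweet)]

tokens = preprocess("@bob Check the new Ubuntu release 2012 http://t.co/x café now!")
```

The resulting per-twitterer token lists are what a topic model such as LDA would take as its documents.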

Page 49: People Search @ Study Group MSRA NLC

49

• Three matrices:
  – DT, a D × T matrix, where D is the number of twitterers and T is the number of topics. $DT_{ij}$ contains the number of times a word in the tweets of twitterer $s_i$ has been assigned to topic $t_j$
  – WT, a W × T matrix, where W is the number of unique words used in the tweets and T is the number of topics. $WT_{ij}$ captures the number of times unique word $w_i$ has been assigned to topic $t_j$
  – Z, a 1 × N vector, where N is the total number of words in the tweets. $Z_i$ is the topic assignment for word $w_i$

Results of Topic Distillation
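
To make the relation between the three outputs concrete, the sketch below rebuilds DT and WT from a topic-assignment vector Z. It assumes each of the N word tokens comes with the index of its unique word and of its twitterer; the function name and the toy data are hypothetical.

```python
def count_matrices(word_ids, doc_ids, z, n_docs, n_words, n_topics):
    """Accumulate DT (twitterer x topic) and WT (word x topic) counts
    from per-token topic assignments z."""
    DT = [[0] * n_topics for _ in range(n_docs)]
    WT = [[0] * n_topics for _ in range(n_words)]
    for w, d, t in zip(word_ids, doc_ids, z):
        DT[d][t] += 1  # a word of twitterer d was assigned to topic t
        WT[w][t] += 1  # unique word w was assigned to topic t
    return DT, WT

# Toy corpus: 2 twitterers, 3 unique words, 2 topics, N = 5 word tokens.
word_ids = [0, 1, 0, 2, 2]   # unique-word index of each token
doc_ids = [0, 0, 1, 1, 1]    # twitterer index of each token
z = [0, 0, 0, 1, 1]          # topic assignment of each token
DT, WT = count_matrices(word_ids, doc_ids, z, n_docs=2, n_words=3, n_topics=2)
```

Each row of DT sums to the number of word tokens of that twitterer, so normalizing a row gives that twitterer's topic distribution.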

Page 50: People Search @ Study Group MSRA NLC

50

• Introduction• Dataset• Topic Modeling• TwitterRank

Outline

Page 51: People Search @ Study Group MSRA NLC

51

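• A topic-specific random walk model is applied to calculate each twitterer’s influence score.

• The transition matrix for topic t is denoted $P_t$. The transition probability of the random surfer from follower $s_i$ to friend $s_j$ is

  $P_t(i, j) = \frac{|T_j|}{\sum_{s_a \in S} |T_a|} \times sim_t(i, j)$

  – where $S$ is the set of $s_i$’s friends and $|T_j|$ is the number of tweets published by $s_j$
  – $sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}|$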

Topic-specific TwitterRank

DT’ is row-normalized form of matrix DT
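
The topic-specific transition probability can be sketched as follows, using the definition from the TwitterRank paper: the friend's share of tweets among all of the follower's friends, multiplied by the topical similarity 1 − |DT′(i,t) − DT′(j,t)|. The function names and the toy data are illustrative assumptions.

```python
def row_normalize(M):
    """Return DT', the row-normalized form of a count matrix."""
    out = []
    for row in M:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

def transition_matrix(DT, n_tweets, friends, t):
    """P[i][j]: probability of moving from follower i to friend j on topic t."""
    DTn = row_normalize(DT)
    n = len(DT)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        total = sum(n_tweets[a] for a in friends[i])  # tweets by all friends of i
        if total == 0:
            continue
        for j in friends[i]:
            sim = 1.0 - abs(DTn[i][t] - DTn[j][t])  # topical similarity
            P[i][j] = n_tweets[j] / total * sim
    return P

# Toy data: 3 twitterers, 2 topics.
DT = [[3, 1], [1, 1], [0, 4]]
n_tweets = [10, 30, 10]
friends = {0: [1, 2], 1: [0], 2: []}   # friends[i] = twitterers that i follows
P = transition_matrix(DT, n_tweets, friends, t=0)
```

Because of the similarity factor, a row of P generally sums to less than 1, which is one reason a teleportation term is needed in the ranking step.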

Page 52: People Search @ Study Group MSRA NLC

52

• This captures two notions:
  – The more $s_j$ publishes, the higher the portion of the tweets $s_i$ reads is from $s_j$. Generally, this leads to a higher influence on $s_i$
  – $s_j$’s influence on $s_i$ is also related to the topical similarity between the two, as suggested by the homophily phenomenon.

Topic-specific TwitterRank

Page 53: People Search @ Study Group MSRA NLC

53

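• Topic-specific teleportation

• The influence scores of twitterers are calculated iteratively:

  $TR_t = \gamma P_t \times TR_t + (1 - \gamma) E_t$

  – $E_t$ is the t-th column of matrix DT’’, which is the column-normalized form of matrix DT
  – $\gamma$ is a parameter between 0 and 1 that controls the probability of teleportation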

Topic-specific TwitterRank
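
A minimal power-iteration sketch of the topic-specific score is given below. It assumes a transition matrix `P` with `P[i][j]` the probability of moving from follower i to friend j, a teleportation vector `E` for the topic, and an illustrative γ of 0.85 with a fixed iteration count; influence is taken to flow from followers to the friends they read.

```python
def twitter_rank(P, E, gamma=0.85, iters=100):
    """Iterate TR = gamma * (flow along P) + (1 - gamma) * E."""
    n = len(P)
    tr = list(E)
    for _ in range(iters):
        tr = [
            gamma * sum(P[i][j] * tr[i] for i in range(n)) + (1 - gamma) * E[j]
            for j in range(n)
        ]
    return tr

# Toy graph: 0 follows 1; 1 and 2 both follow 0.
P = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
E = [1 / 3, 1 / 3, 1 / 3]   # uniform teleportation for this illustration
tr = twitter_rank(P, E)
```

Twitterer 2 has no followers, so its score stays at the teleportation floor, while twitterer 0, who is followed by both others, ends up on top.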

Page 54: People Search @ Study Group MSRA NLC

54

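• Overall influence: $TR = \sum_t r_t \cdot TR_t$, where $r_t$ is the weight assigned to topic t

• General influence: $r_t$ can be set as the probabilities of the different topics’ presence

• Perceived general influence: $r_t$ can also be set as the probabilities that a particular twitterer is interested in the different topics.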

Aggregation of Topic-specific TwitterRank
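
The aggregation is a weighted sum of the topic-specific score vectors. The sketch below assumes one score list per topic and one weight per topic; the names and numbers are hypothetical.

```python
def aggregate(tr_by_topic, weights):
    """Combine topic-specific scores: TR[i] = sum over topics t of r_t * TR_t[i]."""
    n = len(tr_by_topic[0])
    return [
        sum(r * tr[i] for r, tr in zip(weights, tr_by_topic))
        for i in range(n)
    ]

# Two topics, two twitterers.
tr_by_topic = [[0.6, 0.4], [0.2, 0.8]]
general = aggregate(tr_by_topic, weights=[0.5, 0.5])      # topic presence as weights
perceived = aggregate(tr_by_topic, weights=[0.9, 0.1])    # one reader's topic interests
```

Changing only the weights switches between a global ranking and a ranking as perceived by one particular twitterer.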

Page 55: People Search @ Study Group MSRA NLC

P83-55

Questions

Page 56: People Search @ Study Group MSRA NLC

56

Outline

• Wisdom of the Crowd
  – Cognos: Crowdsourcing Search for Topic Experts in Microblogs (SIGIR 2012)
  – Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna P. Gummadi
• Tweets and Link Relation
  – TwitterRank: Finding Topic-sensitive Influential Twitterers (WSDM 2010)
  – Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He
• #Hashtag and Link Relation
  – Finding Trendsetters in Information Networks (SIGKDD 2012)
  – Diego Saez-Trumper, Giovanni Comarela, Virgílio Almeida, Ricardo Baeza-Yates, Fabrício Benevenuto

Page 57: People Search @ Study Group MSRA NLC

P83-57

Finding Trendsetters in Information Networks

Page 58: People Search @ Study Group MSRA NLC

P83-58

What is a Trendsetter?

Page 59: People Search @ Study Group MSRA NLC

P83-59

What is a Trendsetter?

Trendsetters are people who:

• Adopt and spread new trends before these trends become popular.

• Propagate these trends over the network.

Page 60: People Search @ Study Group MSRA NLC

P83-60

Finding trendsetters in a graph

Page 61: People Search @ Study Group MSRA NLC

P83-61

Who are the trendsetters?

Page 62: People Search @ Study Group MSRA NLC

P83-62

Key Point

Page 63: People Search @ Study Group MSRA NLC

P83-63

Time

Page 64: People Search @ Study Group MSRA NLC

P83-64

How to find Trendsetters?

Page 65: People Search @ Study Group MSRA NLC

P83-65

Weight edges and run PageRank

Page 66: People Search @ Study Group MSRA NLC

P83-66

Topics and Influence Model

Page 67: People Search @ Study Group MSRA NLC

P83-67

Topics

Topic: collection of trends (URLs, memes, #hashtags, quotes, etc.)

For each node we store the timestamp at which it adopts a trend $h_i$

Page 68: People Search @ Study Group MSRA NLC

P83-68

Graph

• We denote by $G_k$ the induced graph of $G(N, E)$ over the topic $k$.

• The node set of $G_k$ is obtained by considering all nodes of $N$ that used at least one trend of $k$

• The edge set $E_k$ of $G_k$ contains every edge $(u, v) \in E$ whose endpoints $u$ and $v$ both belong to $G_k$

Page 69: People Search @ Study Group MSRA NLC

P83-69

Weight Edges

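Let $t_i(v)$ be the time when node $v$ of $G_k$ adopts the trend $h_i \in k$ ($t_i(v) = 0$ if $v$ does not adopt $h_i$).

We define two vectors, $s_1(v)$ (for all nodes $v$ of $G_k$) and $s_2(u, v)$ (for all $(u, v) \in E_k$), each one with $N_k$ components given respectively by:

$s_1(v)_i = \begin{cases} 1, & \text{if } t_i(v) > 0 \\ 0, & \text{otherwise} \end{cases}$

and

$s_2(u, v)_i = \begin{cases} e^{-\Delta/\alpha}, & \text{if } t_i(v) > 0 \text{ and } t_i(v) < t_i(u) \\ 0, & \text{otherwise} \end{cases}$

for $i = 1, \ldots, N_k$, where $N_k$ is the number of trends in $k$, $\Delta = t_i(u) - t_i(v)$ and $\alpha > 0$.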
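
The two vectors can be computed directly from adoption timestamps. In this sketch each node is represented by a list with one timestamp per trend of the topic (0 meaning "not adopted"); the function names and the toy timestamps are hypothetical.

```python
import math

def s1(t_v):
    """Component i is 1 if v adopted trend i (timestamp > 0), else 0."""
    return [1.0 if t > 0 else 0.0 for t in t_v]

def s2(t_u, t_v, alpha):
    """Component i is exp(-delta / alpha) if u adopted trend i after v,
    with delta = t_i(u) - t_i(v); otherwise 0."""
    out = []
    for tu, tv in zip(t_u, t_v):
        if tu > tv > 0:
            out.append(math.exp(-(tu - tv) / alpha))
        else:
            out.append(0.0)
    return out

# Three trends; u follows v.
t_v = [10, 0, 5]    # v adopted trends 1 and 3
t_u = [12, 7, 3]    # u adopted trend 1 after v, trend 3 before v
s1_v = s1(t_v)
s2_uv = s2(t_u, t_v, alpha=2.0)
```

Only the first trend contributes to `s2_uv`: v never adopted the second trend, and u adopted the third one before v did.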

Page 70: People Search @ Study Group MSRA NLC

P83-70

Weight Edges

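Vector $s_1(v)$ indicates whether node $v$ adopted each trend of $k$, while $s_2(u, v)$ shows whether $u$ adopted these trends after $v$ and weights the relation as a function of the period of time between $t_i(v)$ and $t_i(u)$.

For a fixed $\alpha$, if $\Delta \to 0^+$ then $e^{-\Delta/\alpha} \to 1$, and if $\Delta \to +\infty$ then $e^{-\Delta/\alpha} \to 0$.

These limits mean that if the node $u$ adopts a trend $h_i$ just after $v$ then $s_2(u, v)_i$ is very close to 1.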

Page 71: People Search @ Study Group MSRA NLC

P83-71

Weight Edges

Let $G_k$ be an induced graph of a network $G(N, E)$ over a topic $k$ with $N_k$ trends. For each $(u, v) \in E_k$ we define the influence of $v$ over $u$ by:

$I_k(u, v) = \frac{L(s_2(u, v))}{N_k} \times \frac{s_1(v) \cdot s_2(u, v)}{\|s_1(v)\| \times \|s_2(u, v)\|}$

where the operator · refers to the scalar product, $\|x\|$ to the Euclidean norm of any vector $x$, and $L(s_2(u, v))$ to the number of components of $s_2(u, v)$ that are different from 0. If $\|s_2(u, v)\| = 0$, we define $I_k(u, v) = 0$. It is important to notice that, by definition, $\|s_1(v)\| \neq 0$ for all nodes $v$ of $G_k$.

Page 72: People Search @ Study Group MSRA NLC

P83-72

One important fact is that $u$ can be influenced to adopt a trend of $k$ by several nodes in $G_k$. So, we normalize $I_k(u, v)$ as follows:

Definition: $\hat{I}_k(u, v) = \frac{I_k(u, v)}{\sum_{(u, w) \in E_k} I_k(u, w)}$

Normalize
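
The edge weight and its normalization might be sketched as follows. The influence is the share of non-zero components of s2 multiplied by the cosine between s1(v) and s2(u, v); the normalization is assumed to divide by the total influence that all followed nodes exert on the same u. Function names and toy values are hypothetical.

```python
import math

def influence(s1_v, s2_uv):
    """Influence of v over u from the vectors s1(v) and s2(u, v)."""
    norm2 = math.sqrt(sum(x * x for x in s2_uv))
    if norm2 == 0:
        return 0.0  # defined as 0 when s2(u, v) is the zero vector
    norm1 = math.sqrt(sum(x * x for x in s1_v))
    dot = sum(a * b for a, b in zip(s1_v, s2_uv))
    nonzero = sum(1 for x in s2_uv if x != 0)  # L(s2(u, v))
    return nonzero / len(s2_uv) * dot / (norm1 * norm2)

def normalize(raw):
    """raw maps (u, v) to the influence of v over u; after normalization
    the influences on each u sum to 1 (assumed normalization)."""
    totals = {}
    for (u, _), w in raw.items():
        totals[u] = totals.get(u, 0.0) + w
    return {
        (u, v): (w / totals[u] if totals[u] else 0.0)
        for (u, v), w in raw.items()
    }

i_uv = influence([1.0, 0.0, 1.0], [math.exp(-1), 0.0, 0.0])
norm = normalize({("u", "v"): 0.2, ("u", "w"): 0.6, ("x", "v"): 0.5})
```

In the toy case v adopted two of three trends and u followed on one of them, giving (1/3) × (1/√2); after normalization, u's influences from v and w split 0.25 / 0.75.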

Page 73: People Search @ Study Group MSRA NLC

P83-73

TS Ranking

Definition

The trendsetters (TS) rank of node $v$ in a network $G_k$, denoted by $TS_k(v)$, is given by:

= d *

where 0 ≤ d ≤ 1 is the damping factor and $D_k$ is a probability distribution over all nodes of $G_k$. In this paper, we consider a uniform $D_k(v) = 1/|G_k|$ for all nodes $v$ of $G_k$, where $|G_k|$ is the number of nodes, but this distribution could be topic dependent.
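
A PageRank-style iteration over the influence-weighted edges could look like the sketch below. The exact placement of d is an assumption here: the code uses the common form in which d weights the link term and (1 − d) the uniform teleportation term. The value d = 0.85, the iteration count and all names are illustrative.

```python
def ts_rank(nodes, weights, d=0.85, iters=100):
    """weights maps (u, v) to the normalized influence of v over u
    (u follows v), so rank flows from u to v."""
    uniform = 1.0 / len(nodes)
    ts = {v: uniform for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) * uniform for v in nodes}   # teleportation term
        for (u, v), w in weights.items():
            new[v] += d * w * ts[u]                    # influence-weighted link term
        ts = new
    return ts

# Toy topic graph: b follows a; c follows a and b.
nodes = ["a", "b", "c"]
weights = {("b", "a"): 1.0, ("c", "a"): 0.5, ("c", "b"): 0.5}
ts = ts_rank(nodes, weights)
```

Node a, whose adoptions precede those of both followers, collects the most rank; node c, which influences nobody, keeps only the teleportation share.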

Page 74: People Search @ Study Group MSRA NLC

P83-74

Evaluation

Page 75: People Search @ Study Group MSRA NLC

P83-75

Baseline

• In-degree ranking
• PageRank

Page 76: People Search @ Study Group MSRA NLC

P83-76

Dataset

Twitter until August 2009.

Over 50 million users with all their followers and followees.

1.6 billion tweets

We use #tags as trends.

Page 77: People Search @ Study Group MSRA NLC

P83-77

Example: Iran Elections on Twitter

Page 78: People Search @ Study Group MSRA NLC

P83-78

Example

Iran Elections: {#iran, #iranelections,#tehran}

TS: @Lara (“Reporting from the Middle East”)

PR: @cnnbr (“CNN Breaking News”)

Page 79: People Search @ Study Group MSRA NLC

P83-79

Category     #Topics   Example of Hashtags            #Tweets
Celebrity    16        #michaeljackson, #niley        1,036,101
Games        13        #mafiawars, #ps3               2,556,437
Idioms       35        #musicmonday, #followfriday    7,882,209
Movies       29        #heroes, #tv                   1,769,945
Music        33        #lastfm, #musicmonday          2,785,522
None         153       #quotes, #sale                 2,227,971
Political    39        #honduras, #Iranelection       8,156,786
Sports       27        #soccer, #rugby                1,914,061
Technology   41        #twitter, #android             7,459,471

We use the #tag classification made by Romero et al.

Baseline

Page 80: People Search @ Study Group MSRA NLC

P83-80

Trendsetters: early adopters?

Page 81: People Search @ Study Group MSRA NLC

P83-81

Experiments I

[Figure: percentage of top-100 users before the peak (y-axis, 0 to 100) by category (x-axis categories: IDIOMS, GAMES, POLITICAL, NONE, MOVIES, TECHNO., SPORTS, CELEBRITY, MUSIC), compared for InDegree, PageRank and TrendSetters.]

Page 82: People Search @ Study Group MSRA NLC

P83-82

In-degree vs adoption time

Page 83: People Search @ Study Group MSRA NLC

P83-83

Experiments II

[Figure: node in-degree (y-axis, ×10^4, 0 to 2.5) against time (x-axis, −100 to 20) for ID, PR and TS.]

Page 84: People Search @ Study Group MSRA NLC

P83-84

Influenced Followers Ratio

Page 85: People Search @ Study Group MSRA NLC

P83-85

Influenced Followers Ratio

IFk(v) is the fraction of followers of v that adopted at least one trend of the topic k after v.

Category     ID (%)   PR (%)   TS (%)
POLITICAL    0.013    0.084    0.174
CELEBRITY    0.015    0.089    0.148
MUSIC        0.013    0.096    0.160
GAMES        0.022    0.058    0.115
SPORTS       0.004    0.054    0.098
IDIOMS       0.001    0.034    0.088
NONE         0.011    0.001    0.085
TECHNOLOGY   0.006    0.054    0.078
MOVIES       0.006    0.043    0.067
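
The influenced followers ratio for a single node can be sketched as below. The data layout is an assumption: `followers` maps a node to its followers, and `adoption` maps a node to a dictionary of trend timestamps for the topic, with a missing entry or 0 meaning "not adopted".

```python
def influenced_followers_ratio(v, followers, adoption):
    """Fraction of v's followers that adopted at least one trend after v did."""
    fol = followers.get(v, [])
    if not fol:
        return 0.0
    t_v = adoption.get(v, {})
    influenced = 0
    for u in fol:
        t_u = adoption.get(u, {})
        # u counts if, for some trend h adopted by v, u adopted h later than v.
        if any(t_v[h] > 0 and t_u.get(h, 0) > t_v[h] for h in t_v):
            influenced += 1
    return influenced / len(fol)

followers = {"v": ["a", "b", "c", "d"]}
adoption = {
    "v": {"#iran": 10},
    "a": {"#iran": 12},   # adopted after v
    "b": {"#iran": 5},    # adopted before v
    "c": {},              # never adopted
}
ratio = influenced_followers_ratio("v", followers, adoption)
```

Only follower `a` adopted the trend after `v`, so one of the four followers counts as influenced.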

Page 86: People Search @ Study Group MSRA NLC

P83-86

Ranking with Partial Information

[Figure: number of top-100 users found (y-axis, 0 to 100) against the ratio of users considered, sorted by time (x-axis, 0.1 to 1), with TS and PR curves for musicmonday, iranelection, swineflue, followfriday, mw2, fb, f1 and michaeljackson.]

Page 87: People Search @ Study Group MSRA NLC

P83-87

Final Remarks

Usually, follower hubs (celebrities) are late adopters.

Trendsetters have lower in-degree, but they spread new ideas.

Page 88: People Search @ Study Group MSRA NLC

P83-88

Questions