BDTC2015: Singapore Management University, Feida Zhu

Big Data and Financial Innovation: From Research to Practice. Feida Zhu (朱飞达), Assistant Professor, DS Lee Foundation Fellow, School of Information Systems, Singapore Management University; Founding Director, Pinnacle Lab for Analytics and DBS-SMU Lab for Life Analytics. Dec. 11, 2015.


Page 1: BDTC2015, Singapore Management University, Feida Zhu

Big Data and Financial Innovation: From Research to Practice

Feida Zhu (朱飞达)
Assistant Professor, DS Lee Foundation Fellow
School of Information Systems, Singapore Management University
Founding Director, Pinnacle Lab for Analytics and DBS-SMU Lab for Life Analytics

Dec. 11, 2015

Page 2

Enterprise pain points

1. With the economy slowing and market competition intensifying, the financial industry is no longer content with traditional reactive service. It needs a comprehensive understanding of its users, weaving efficient and convenient financial services into everyday scenarios (healthcare, food, housing, transport, leisure, education) to improve user experience and embed products and services into every facet of users' lives.

2. Internal enterprise data gives only a limited understanding of users; external user data sources are scarce, and the legality, sustainability, and privacy protection of data acquisition are all causes for concern.

3. Channels covering the "last mile" to users are lacking, and traditional marketing tactics (e.g., unsolicited sales calls) are losing effectiveness, making it hard to close the loop from data collection through analysis to marketing.

One angle on financial innovation: everyday life is finance (生活即金融)

Page 3

The three core values of big data

- Insight from scale (VOLUME): What can big data tell us that small data cannot?
- Knowledge from enrichment (VARIETY): What important knowledge can we learn from enriching small data with big data?
- Agility from real-time responsiveness (VELOCITY): What is the value of being real-time?

Page 4

What value can external big data actually offer an enterprise?

Internal enterprise data:
- Transaction-based: usually built on transaction records alone
- Limited coverage in volume and reach
- Fragmented, partial perspective: reflects only slices of a user's life
- Static, low frequency
- Isolated view of the individual user: sees the person but not the network

External social media big data:
- Context-based: reveals the context and scenarios behind transaction behavior
- Societal-scale coverage
- Multi-facet insight: a panoramic, multi-angle view of the user
- Dynamic and real-time, high frequency
- Network-embedded user view: incorporates rich, real social relationships

Page 5

The three "people-centric" integrations

1. Cross-platform user identity resolution: proprietary algorithms identify the accounts of the same user across different external data platforms (even under different usernames) and consolidate each platform's data around the natural person. We currently hold data on more than 150 million Chinese users across multiple core platforms.

2. Internal-external user matching: establish identity matches between an enterprise's internal customers and users on external data platforms.

3. Big-data-based 360-degree dynamic customer view: provide a panoramic profile of each internal customer, dynamically track the customer's life and latent needs, and capture the best moment and channel for sales and service.

[Diagram: user interest profiles, real-life social networks, and product propensity models linked with internal enterprise data.]

Page 6

Application case: precision marketing

- Massive data: a huge candidate space of 200 million potential customers
- Scenario: insurance agents must contact large numbers of users every day to sell various policies. How do we find the right target customers, and the most suitable insurance product for each, in a timely way?
- Precise targeting: accurate customer profiles built on big data mining, natural language processing, and network structure analysis
- Timely push: dynamically monitor customers, identify the best marketing moment, and respond to latent needs in real time

Page 7

Application case: precision marketing

From a massive candidate space of over 150 million potential customers, rank in real time and select the top 50.

(1) Precise targeting: using natural language processing and text mining, the system automatically tags users from their text, e.g., with the tag "children".

(2) Relationship network analysis: mining off-line interpersonal relationships shows that the people around her, including close friends, also care a lot about children.

(3) Timely push: dynamically monitor customers, detect rising interest trends, and respond to latent needs at the best marketing moment.

Agent: "I need to contact 50 prospects today to sell children's insurance. Tell me whom to call."
System: "These are the people you should call today!"
Agent: "Why these people?"
System: "Because they care most about children, their interest has been trending up, and now is the best time!"

Page 8

Application case: relationship marketing and risk control

- Scenario: banks constantly watch two groups, high-risk customers and high-net-worth customers. How can customer-to-customer relationships be followed, link by link, to find other potentially relevant customers?
- Massive off-line relationship network: a giant graph of 300 million people and 6 billion interpersonal relationship edges
- Off-line relationships: follow close off-line ties to reach other relevant target customers
- Precise customer profiles: accurate profiles built on big data mining and natural language processing
- AI-driven mining: automatically extract a user's real off-line relationship network from external big data

[Diagram tags: luxury cars, golf, yachts, gambling, high value.]

Page 9

Research topic: off-line relationship mining

[Figure 1. Mutual Reachability: percentage of friends satisfying mutual reachability.]

[Figure 2. Friendship Retainability.]

[Figure 3. Community Affinity: distribution of AUC values.]

Principle 3 is useful for identifying off-line community members who do not have direct two-way follow links with the target user yet do have strong connections with other off-line community members. The following experiment further illustrates the principle. For each of our 65 Twitter target users u, we examine each user v in u's neighborhood and count how many off-line friends of u have direct two-way follow links with v. We rank the users by this count and compute the AUC (Area Under the ROC Curve) of the ranked list against the ground-truth off-line friends of u. Figure 3 shows that for most users (52 out of 65), the AUC value is greater than 0.8. This means off-line friends indeed share more direct two-way follow links with other off-line friends, exhibiting much stronger community affinity than online-only friends. Here we use direct two-way follow links as an indication of greater connection strength.
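The AUC evaluation described above can be reproduced with a small rank-based sketch (the Mann-Whitney formulation of AUC); the scores and labels below are made up for illustration.

```python
def auc_from_scores(scores, labels):
    """AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive (ground-truth off-line friend) outranks
    a randomly chosen negative; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: score = number of u's known off-line friends having two-way
# follow links with the candidate; label = ground-truth off-line friend.
scores = [5, 3, 0, 4, 1, 0]
labels = [1, 1, 0, 1, 0, 0]
print(auc_from_scores(scores, labels))  # → 1.0 (all positives outrank all negatives)
```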

Principle 3. Community Affinity. Given a target user $u$, for a user $v \in N^k_u$, let $S = \{w \mid w \in C_u \cap N^k_v\}$; the larger the cardinality of $S$, the more likely we have $v \in C_u$ with respect to $N^k_u$.

ALGORITHM

Incorporating the three principles, we propose an algorithm based on the idea of random walk with restart (RWR), defined in [6] by the following equation:

$r_i = (1 - c)\, W^{\top} r_i + c\, e_i \qquad (1)$

In our problem setting, given the Twitter network $G = (V, E)$, a target user $u \in V$ and a number $k$, we focus on $G$'s subgraph $G^k_u$ induced by $N^k_u \cup \{u\}$, simplified as $G_u$ when $k$ is fixed. A probability transition matrix $W$ is defined over $V(G_u)$ such that, for two nodes $v, w \in V(G_u)$, the entry $W(v, w)$ denotes the probability of transitioning from $v$ to $w$ at any step. In accordance with Principle (II), we define $W(v, w)$ as

$$W(v, w) = \begin{cases} 1/|F^1_{v\to}| & \text{if } w \in F^1_{v\to} \\ 0 & \text{if } w \notin F^1_{v\to} \end{cases} \qquad (2)$$

In Equation (1), $W^{\top}$ is the transpose of the probability transition matrix $W$ defined above. $e_i$ is the starting indicator vector, with $e_{i,i} = 1$ and $e_{i,j} = 0$ for $i \neq j$. $r_i$ is the probability vector for node $i$, where $r_{i,j}$ is the probability of reaching node $j$ from $i$, and $c$ is the restart probability. It has been shown that $r_i$ can be computed iteratively and finally converges [6]. At convergence, the steady-state probability vector $r_i$ reflects the bandwidth of the information flow from user $i$ to user $j$ for every $j \in V(G_u)$. We use this steady-state probability to define the closeness score $c_{i,j}$ for two users $i$ and $j$:

$c_{i,j} = r_{i,j} \cdot r_{j,i} \qquad (3)$
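Equations (1)-(3) can be sketched with a straightforward power iteration. NumPy and the values of `c` and `eps` are assumptions, and the tiny follow graph is made up; `out_links[v]` is the set of neighbors the walker can move to from v (whether that is v's followees or followers depends on the paper's convention, which the extraction garbles).

```python
import numpy as np

def rwr(out_links, i, c=0.15, eps=1e-10):
    """Random walk with restart (Eq. 1): r_i = (1-c) * W^T r_i + c * e_i,
    with W(v, w) = 1/|out_links[v]| if w in out_links[v], else 0 (Eq. 2).
    Nodes are integers 0..n-1."""
    n = len(out_links)
    W = np.zeros((n, n))
    for v, outs in out_links.items():
        for w in outs:
            W[v, w] = 1.0 / len(outs)
    e = np.zeros(n)
    e[i] = 1.0
    r = e.copy()
    while True:
        r_next = (1 - c) * W.T @ r + c * e
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

def closeness(out_links, i, j, c=0.15):
    """Closeness score of Eq. 3: c_ij = r_i[j] * r_j[i] (symmetric)."""
    return rwr(out_links, i, c)[j] * rwr(out_links, j, c)[i]

# Toy 3-node follow graph; every node has at least one out-link,
# so W is row-stochastic and the iteration converges.
g = {0: [1, 2], 1: [0], 2: [0, 1]}
print(closeness(g, 0, 1) > 0, closeness(g, 0, 1) == closeness(g, 1, 0))  # → True True
```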

The closeness score thus defined satisfies Principle (I). We next explore how to take advantage of the off-line community to identify other unknown members, implementing Principle (III). The idea is to discover the off-line community iteratively, adding new members to the known set in each round. For that purpose, we introduce an auxiliary dummy node $v$ to provide a threshold that cuts the new off-line community boundary in each round. $v$ is constructed as a virtual node such that (I) $v$ and the target user $u$ follow each other, i.e., $v \in F^1_{u\leftarrow} \cap F^1_{u\rightarrow}$; (II) $v$ associates only with $u$, i.e., for each $w \in (N^k_u \setminus \{u\})$, $v \notin (F^1_{w\leftarrow} \cup F^1_{w\rightarrow})$; and (III) the number of followers of $v$ is set to the median number of followers over all users in $u$'s $k$-hop network with the hub users excluded, i.e., $|F^1_{v\rightarrow}| = \mathrm{median}_{w \in (N^k_u \setminus H)}\{|F^1_{w\rightarrow}|\}$. Hub users, denoted $H$, are accounts with more than 2000 followers, which typically belong to celebrities, news media, etc. The dummy node is defined so as to set the lower-bound case for an off-line friend. It simulates the scenario in which the target user $u$ finds by chance this random user $v$, who has no connections with $u$'s off-line community. Finding him/her interesting, $u$ follows $v$, who then also follows back. As such, $v$ represents a connection to $u$ almost as weak as any off-line real-life friend's connection should be.
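Condition (III) for the dummy node, the median follower count over non-hub users, is a one-line sketch; `statistics` is the Python standard-library module, and the counts below are made up.

```python
import statistics

HUB_THRESHOLD = 2000  # accounts above this follower count are hubs (celebrities, media)

def dummy_follower_count(follower_counts):
    """Median follower count over non-hub users in the target's k-hop
    network: the value assigned to the auxiliary dummy node."""
    non_hub = [f for f in follower_counts if f <= HUB_THRESHOLD]
    return statistics.median(non_hub)

# 9000 is a hub and is excluded; median of [45, 80, 120, 300] is 100.0.
print(dummy_follower_count([120, 45, 9000, 300, 80]))  # → 100.0
```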

On a high level, the algorithm works in iterations as follows. Given a target user u, compute the closeness score between u and every other user, as well as the dummy node. This yields a ranking of all the users, together with the dummy node, in decreasing order of closeness score. All the users ranked ahead of the dummy node are identified as off-line community members, which ends the current iteration.

Problem: Given the Twitter follow network of a target user, identify the user's off-line community by examining the follow linkage alone.

Principle I: Mutual Reachability. Information should be able to flow in both directions, within a small distance, between real-life friends.

Principle II: Friendship Retainability



The size of a user's off-line community has an upper-bound threshold σ related to Dunbar's number.

Principle III: Community Affinity

Figure 6: Case study of a user's follow network.

5. EXPERIMENTAL STUDY

An implementation of our algorithm as a demo system, TwiCube, is publicly available at http://twitterbud2011.appspot.com/.

5.1 Case Study

We now present a case study on a real user X who participated in our evaluation. X has 107 followers and follows 385 other users. Figure 6 illustrates the discovery of his core community in a total of 4 iterations, each indicated by a different color. In summary, 34 users are identified in Iteration 1, 19 in Iteration 2, 3 in Iteration 3, and only one user in the last iteration. The precision and recall for this result on X's core community are 0.8947 and 0.9807 respectively. It can be observed from Figure 6 that there is a dense cluster of core community members heavily linked among one another (lower left of X) and another such cluster of non-core-community users similarly linked (upper right of X). This shows that approaches based on dense subgraph mining or structural clustering would have a hard time distinguishing between these two similarly structured communities and, consequently, identifying the true core community. In fact, this cluster of non-core-community users consists of media, business, and active Twitter users sharing similar interests and topics, which is a good indicator of X's own.

In Figure 6, we pick out two particular users, magnify their follow links with X, and present them in two cases (a) and (b) (marked by arrows in the figure). In (a), we show the follow network between X and a non-core-community user, "tuniu", which is a travel business. Note that although X and this business node directly follow each other, satisfying our Principle 1, this node is still correctly excluded from the core community by our algorithm. This is mainly because it connects mostly with other non-core-community users by follow links, exhibiting weak core community affinity with X. This case would defeat the naive approach of identifying core community members by two-way follow links. In (b), we show the follow network between X and a core community member Y, who is discovered in Iteration 3. In this case, X follows Y but Y does not follow X. Moreover, it is not until more core community members have been identified in Iterations 1 and 2 that Y's sophisticated connections with the core community are revealed. In this tricky case, by unleashing the power of iterated core community identification, our algorithm is still able to correctly identify Y.

5.2 Effectiveness

One naive method to identify the core community of a target user u is to find the set of users who have direct two-way follow links with u, i.e., they and u follow each other. Do direct two-way follow links provide a good indication of off-line real-world friendship? Our experiments suggest that these links are not sufficient. In Figure 7 we show the comparison of the distributions (among the 65 user evaluations) of precision, recall, and F score between our algorithm, CCD, and the naive algorithm. In general our solution outperforms the naive solution by a large margin. For a more detailed comparison between the two methods, let us take a closer look at each user. We compute the difference in precision and recall between the two solutions for each user. In Figure 8, each point represents one user, with coordinates (P_CCD − P_naive, R_CCD − R_naive), where P_CCD and R_CCD are the precision and recall of our algorithm, and P_naive and R_naive are those of the naive approach. The result shows that for most users, our solution outperforms the naive solution in both precision and recall. In particular, in two cases the difference is close to 1. There is only a single case in which the naive approach prevails in both precision and recall.

A user's off-line friends usually group into clusters within which the members know each other.

Page 10

Research topic: off-line relationship mining


Figure 5: Core Community Discovery

The idea is to discover the core community iteratively, adding new members to the known set in each round. For that purpose, we introduce an auxiliary dummy node $v$ to provide a threshold that cuts the new core community boundary in each round. $v$ is constructed as a virtual node such that (I) $v$ and the target user $u$ follow each other, i.e., $v \in F_{u\leftarrow} \cap F_{u\rightarrow}$; (II) $v$ associates only with $u$, i.e., for each $w \in (N^k_u \setminus \{u\})$, $v \notin (F_{w\leftarrow} \cup F_{w\rightarrow})$; and (III) the number of followers of $v$ is set to the median number of followers over all users in $u$'s $k$-hop network with the hub users excluded, i.e., $|F_{v\rightarrow}| = \mathrm{median}_{w \in (N^k_u \setminus H)}\{|F_{w\rightarrow}|\}$. This dummy node is defined so as to set the lower-bound case for an off-line friend. It simulates the scenario in which the target user $u$ finds by chance this random user $v$, who has no connections with $u$'s core community. Finding him/her interesting, $u$ follows $v$, who then also follows back. As such, $v$ represents a connection to $u$ almost as weak as any off-line real-life friend's connection should be. Therefore, if the closeness score between $u$ and any user $w$ is even lower than that between $u$ and $v$, $w$ is highly unlikely to be in $u$'s core community. In Section 5, we show that our algorithm is in fact fairly robust with respect to the choice of $v$'s follower number.

On a high level, the algorithm works in iterations as follows. Given a target user $u$, compute the closeness score between $u$ and every other user, as well as $v$. This yields a ranking of all the users, together with $v$, in decreasing order of closeness score. All the users ranked ahead of $v$ are identified as core community members, which ends the current iteration. In the next iteration, the key point is that we now treat the whole core community identified so far as one virtual user node $\bar{u}$. Instead of computing the closeness score between $u$ and all the remaining users, this time we compute the closeness score between $\bar{u}$ and every other user. From the ranking list thus generated, any user that jumps ahead of $v$ in this iteration is added to the core community of $u$, which ends this iteration; and so on. Figure 5 illustrates the process. The target user $u$ is shown in red in the center and the auxiliary dummy node $v$ is shown in purple. In Iteration 1, the core community is just $u$ itself, indicated by the shaded circle covering $u$. The highlighted blue nodes and follow links represent $F_{u\leftarrow} \cap F_{u\rightarrow}$. After computing the closeness score $c_{u,w}$ for every node $w$, three users are found ahead of $v$ in the resulting ranking list. They are therefore added to the core community, indicated by their color changing from blue to orange. In Iteration 2, we use the new core community $\bar{u}$, now consisting of 4 users, to compute the closeness scores $c_{\bar{u},w}$ for all remaining nodes $w$. Those ranked ahead of $v$ are added to the core community. The iterations continue until no new user can be added to the core community, at which point the algorithm ends. As the virtual user node $\bar{u}$ is actually a set, we now define RWR and the closeness score between a user node $i$ and a set $S$ as follows.

RWR and closeness score between a user node i and a set Sas follows.

ri,S =#

j∈S

ri,j (4)

rS,i =#

j∈S

rj,i (5)

ci,S = cS,i = ri,S ∗ rS,i (6)
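Equations (4)-(6) translate directly into code, assuming the pairwise steady-state RWR probabilities r[a][b] have been precomputed; the numbers below are illustrative.

```python
def set_closeness(r, i, S):
    """Closeness between node i and a node set S (Eqs. 4-6), where
    r[a][b] is the precomputed steady-state RWR probability a -> b."""
    r_iS = sum(r[i][j] for j in S)  # Eq. (4)
    r_Si = sum(r[j][i] for j in S)  # Eq. (5)
    return r_iS * r_Si              # Eq. (6)

# Toy precomputed probabilities (made-up values).
r = {0: {1: 0.2, 2: 0.1}, 1: {0: 0.3}, 2: {0: 0.4}}
print(set_closeness(r, 0, [1, 2]))  # (0.2+0.1) * (0.3+0.4) ≈ 0.21
```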

Given a user node i, the probability transition matrix W, the restart probability c, and a tolerance threshold ϵ, the algorithm for computing r_i is given in Algorithm 1.

Algorithm 1 NodeRWR
Input: node i, probability transition matrix W, restart probability c, tolerance threshold ϵ
Output: r_i
1: Initialize r_i ← e_i
2: Do
3:   r'_i ← (1 − c) W r_i + c e_i
4:   Δr_i ← r'_i − r_i
5:   r_i ← r'_i
6: While |Δr_i| > ϵ
7: Return r_i

Algorithm 2 CoreCommunityDiscovery (CCD)
Input: target node u, network N_u, restart probability c, tolerance threshold ϵ
Output: core community C_u, iteration register vector ir, closeness score c_u
1: add auxiliary dummy node v̂ into network N_u
2: construct W from network N_u by Equation 2
3: For each v ∈ N_u
4:   r_v ← NodeRWR(v, W, c, ϵ)
5: For each v ∈ N_u \ {u}
6:   c_{u,v} ← r_{u,v} · r_{v,u}
7: t ← 0; C_u ← {u}; ir ← 0
8: Do
9:   t ← t + 1
10:  T ← ∅
11:  For each v ∈ N_u
12:    If v ∉ C_u and c_{C_u,v} > c_{C_u,v̂}
13:      ir_v ← t
14:      T ← T ∪ {v}
15:  C_u ← C_u ∪ T
16: While |T| > 0
17: Return C_u, ir, c_u

Algorithm 2 iteratively finds the core community of a target user u. At Line 1, we add an auxiliary dummy node into the network to help set the cut-off threshold for each iteration. Line 2 constructs the probability transition matrix for RWR. From Line 3 to Line 6, we compute the closeness score c_u between u and the rest of the nodes in N_u and generate a ranked list. From Line 7 to Line 16, we compute the core community C_u and, for each core community member, record in ir the iteration in which it was identified.
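The iteration structure of Algorithm 2 (Lines 7-16) can be sketched as follows, assuming the pairwise steady-state probabilities r[a][b] are precomputed (Lines 1-6) and using the set closeness of Eqs. (4)-(6). The probability values are made up so that node 2 is discovered only in the second iteration, once node 1 has joined the community.

```python
def core_community_discovery(r, u, nodes, dummy):
    """Skeleton of the CCD loop: grow u's core community until no
    candidate's closeness to the community beats the dummy node's.
    r[a][b] is the precomputed steady-state RWR probability a -> b."""
    def c(S, v):  # set-to-node closeness, Eqs. (4)-(6)
        return (sum(r[j].get(v, 0.0) for j in S)
                * sum(r[v].get(j, 0.0) for j in S))

    C, found_at, t = {u}, {}, 0
    while True:
        t += 1
        cutoff = c(C, dummy)  # threshold set by the auxiliary dummy node
        new = {v for v in nodes if v not in C and c(C, v) > cutoff}
        if not new:
            return C, found_at
        for v in new:
            found_at[v] = t
        C |= new

# Made-up probabilities: node 1 is mutually close to u = 0; node 2 is
# close only to node 1, so it joins in iteration 2.
r = {
    0: {1: 0.30, 2: 0.005, "dummy": 0.01},
    1: {0: 0.30, 2: 0.20, "dummy": 0.01},
    2: {0: 0.01, 1: 0.20, "dummy": 0.01},
    "dummy": {0: 0.01, 1: 0.01, 2: 0.01},
}
C, found_at = core_community_discovery(r, 0, [1, 2], "dummy")
print(C, found_at)  # → {0, 1, 2} {1: 1, 2: 2}
```

Treating the community found so far as one virtual node (via the set closeness) is what lets node 2 rise above the dummy-node cutoff only after node 1 is in.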

Wei Xie, Cheng Li, Feida Zhu, Ee-Peng Lim

Case Study

When a Friend in Twitter is a Friend in Life

Twitter Off-line Community

Approach

Three Principles

Model Accuracy

Principle I: Mutual Reachability. Information should be able to flow in both directions between real-life friends.

Principle II: Friendship Retainability. In general, the number of real-life close friends of any user should have a reasonable upper bound.

Principle III: Community Affinity. A user's real-life friends usually group into clusters within each of which the members also know each other personally.

- The Twitter follow network is formed in a unique way.
- How much does a user's Twitter follow network reflect his/her off-line real-life social network?
- We call the portion of a user's follow network that maps onto the user's off-line social network the Twitter Off-line Community.
- The ability to identify a user's Twitter off-line community is important for understanding online social behavior, building accurate and robust user interest profiles, and improving content recommendation.

We define a hub user as a user with more than 2000 followers. The set of all hub users in N_u is denoted as H.

3. CORE COMMUNITY CHARACTERIZATION

In order to identify the core community of a user $u$, we need to understand the difference between a user $v \in C_u$ and a user $v' \notin C_u$. Three principles play important roles in characterizing a user in the core community. The first principle is Mutual Reachability.

Principle 1. Mutual Reachability. Given a target user $u$, for any user $v \in C_u$ with respect to $N^k_u$, we should have $v \in N^k_{u\rightarrow} \cap N^k_{u\leftarrow}$.

Principle 1 is based on the simple observation that information should be able to flow in both directions between two real friends. A follow link between two users on Twitter indicates only a one-way information flow from the followee to the follower; i.e., if $u \leftarrow v$, all of $v$'s tweets are delivered to $u$, but $u$'s are not automatically visible to $v$. Principle 1 in this case translates into requiring that $u$ and $v$ be in each other's $k$-hop followee network and $k$-hop follower network simultaneously.
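Principle 1's mutual-reachability check amounts to a pair of bounded breadth-first searches over the directed follow graph; the adjacency dict and the choice of k below are illustrative.

```python
from collections import deque

def within_k_hops(adj, src, dst, k):
    """Bounded BFS: is dst reachable from src in at most k directed
    follow links?"""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, depth = frontier.popleft()
        if node == dst:
            return True
        if depth == k:
            continue  # do not expand beyond k hops
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

def mutually_reachable(adj, u, v, k=2):
    """Principle 1: u and v must reach each other within k hops, so
    information can flow in both directions."""
    return within_k_hops(adj, u, v, k) and within_k_hops(adj, v, u, k)

# u reaches v through a (2 hops); v follows u directly (1 hop).
g = {"u": ["a"], "a": ["v"], "v": ["u"]}
print(mutually_reachable(g, "u", "v", k=2), mutually_reachable(g, "u", "v", k=1))  # → True False
```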

The second principle is Friendship Exclusivity.

Principle 2. Friendship Exclusivity. Given a target user $u$, for any user $v \in C_u$ with respect to $N^k_u$ such that $k$ is a small number, e.g., $k = 1$ or $k = 2$, we should have $|N^k_{v\rightarrow} \cap N^k_{v\leftarrow}| \leq \sigma$, where $\sigma$ is an upper-bound threshold measuring friendship exclusivity.

Principle 2 says that, in general, the number of real-life close friends of any user should have a reasonable upper bound. Exceeding the bound indicates a violation of exclusivity, which casts serious doubt on the strength of the friendship between the two parties. Note that we impose a small value for $k$ in this case so that exclusivity is checked on the set of users enjoying mutual reachability within the target user's immediate follow network. We consider these users, most of whom are connected to the target user by two-way follow links, reasonable candidates for real-life off-line friends.

The third principle is Community Affinity.

Principle 3. Community Affinity. Given a target user $u$, for a user $v \in N^k_u$, let $S = \{w \mid w \in C_u \cap N^k_{v\leftarrow} \cap N^k_{v\rightarrow}\}$; the larger the cardinality of $S$, the more likely we have $v \in C_u$ with respect to $N^k_u$.

Principle 3 recognizes the importance of using a user's already-identified partial core community to judge whether a given user belongs to the core community as well. This principle is based on the common observation that a user's off-line friends usually group into clusters within each of which the members also know each other personally. Principle 3 is useful for identifying those core community members who do not have direct two-way follow links with the target user yet do have strong connections with other core community members, and who would otherwise be missed. Such cases are illustrated shortly.

Figure 1: Three Types of Core Community Members.

We now show how these three principles help us identify core community members of different kinds. Based on our study, we categorize a user's follow network along three attributes, each reflecting one of the above-mentioned principles. Note that these attributes and their corresponding parameters are proposed for the categorization only; none of them is actually computed in our algorithm. Suppose the target user is u and the user under consideration is v.

(I) Mutual Following. The first attribute is whether $u$ and $v$ directly follow each other. There are two cases: (I) $u$ and $v$ follow each other, i.e., $v \in N^1_{u\leftarrow} \cap N^1_{u\rightarrow}$; we call this a two-way follow case. (II) Either $u$ follows $v$ or $v$ follows $u$, but not both, i.e., $v \in (N^1_{u\leftarrow} \cup N^1_{u\rightarrow}) \setminus (N^1_{u\leftarrow} \cap N^1_{u\rightarrow})$; we call this a one-way follow case. Principle 1 is immediately satisfied in a two-way follow case, as tweets of both $u$ and $v$ are delivered directly to each other, while in a one-way follow case, computation over the $k$-hop neighborhood of $u$ is necessary to determine whether Principle 1 is satisfied.

(II) Friendship Exclusivity. The second attribute is the larger of $|F_{u\leftarrow}|$ and $|F_{u\rightarrow}|$. For simplicity, we use $|F_{u\leftarrow}|$ for illustration; the analysis with $|F_{u\rightarrow}|$ is similar. This attribute indicates the number of other users whom $u$ is interested in hearing about. In general, it reflects either curiosity to know more about a particular followee or eagerness to receive updates on that person's daily life, both of which are good signs of friendship. Assuming two parameters $\sigma_1$ and $\sigma_2$ can be estimated empirically, there are three cases, ordered from high to low friendship exclusivity: (I) when $|F_{u\leftarrow}| < \sigma_1$, a highly exclusive case; (II) when $\sigma_1 \leq |F_{u\leftarrow}| \leq \sigma_2$, a medium exclusive case; (III) when $|F_{u\leftarrow}| > \sigma_2$, a barely exclusive case.

(III) Community Affinity. The third attribute is whether v has strong connections with other core community members of u. We mainly distinguish two cases: (I) strong affinity; (II) weak affinity.

A categorization of different types of users in a target user’sfollow network is shown in Table 1. We use “highly likely”,“maybe” and “unlikely” to indicate the chance of such a userbeing a off-line real-life friend of the target user being high,medium and low respectively. The symbols “

√” and “×”

means a particular principle is satisfied or not respectively.Symbol “?” means the satisfiability has to be judged case

!  Random Walk with Restart

!  Closeness Score

!  Iterative Off-line Community Discovery !  Off-line community is discovered by iterations. !  A virtual user node is used as the threshold to cut

for each iteration.

                    strong community affinity                        weak community affinity
                    highly excl.   medium excl.  barely excl.   highly excl.   medium excl.  barely excl.
  two-way follow    highly likely  maybe         maybe          highly likely  maybe         unlikely
                    P1√ P2√ P3√    P1√ P2? P3√   P1√ P2× P3√    P1√ P2√ P3×    P1√ P2? P3×   P1√ P2× P3×
  one-way follow    highly likely  maybe         unlikely       unlikely       unlikely      unlikely
                    P1√ P2√ P3√    P1√ P2? P3√   P1√ P2× P3√    P1× P2√ P3×    P1× P2? P3×   P1× P2× P3×

Table 1: Case Study of Core Community Members

by case. In general, core community members belong to one of the following types, each corresponding to a "highly likely" cell in Table 1. Figure 1 illustrates these three types, in which u is the target user, the shaded area represents u's core community, and w1, w2 and w3 are already-identified core community members. The size of a node is in proportion to the user's friendship exclusivity: the smaller the size, the higher the exclusivity.

1. Active online, socially discriminating and mutually following. As illustrated by type "A" in Figure 1, this type of core community member displays the strongest online social connection with the target user. They directly follow the target user, and vice versa, representing a two-way follow case. In the meantime, they demonstrate a reasonable degree of discrimination by not having a huge number of other users in direct two-way follow cases. They are also active online, having close connections with other users in the target user's core community. Therefore, these users satisfy all three principles, Principles 1, 2 and 3.

2. Inactive online, socially discriminating and mutually following. Not all people are heavy Twitter users. In fact, many people register a Twitter account out of curiosity, log in to Twitter only occasionally ever since, and respond passively to follow links. Most of these users have only a small number of close friends in their follow network, and have two-way follow links with almost all of them. As illustrated by type "B" in Figure 1, these users satisfy Principles 1 and 2 but not 3.

3. Active online, socially discriminating and indirectly following. As illustrated by type "C" in Figure 1, this is a type of core community member that is trickier to identify. The fact that there is at most a one-way follow link between the target user and the core community member easily disguises the off-line friendship from an unmindful examination. It is only by noticing the strong follow connections between this member and other core community members of the target user that the highly likely off-line friendship is revealed. These users satisfy both Principles 2 and 3, but not 1.

4. ALGORITHM
The analysis in Section 3 leads to the conclusion that any effective algorithm for core community identification should incorporate the three principles we proposed. In particular, it should be able to (I) tell whether information originating from either user could reach the other party by flowing along the follow links, (II) give priority to users with higher friendship exclusivity, and (III) make better use of the connections with and among other core community members to more intelligently measure a user's likelihood of being a core community member of the target user.

We propose our algorithm based on the idea of random walk with restart (RWR). RWR has been successfully used to measure the relevance score between two nodes in a weighted graph [13, 9, 2, 12]. It is defined in [9] with the following equation:

r_i = (1 − c) W̃ r_i + c e_i    (1)

In this setting, given a weighted graph, a particle starts from node i and conducts a random movement. It transmits to the neighborhood of its current node with a probability proportional to the edge weights. At each step, the particle also returns to the start node i with some probability c. The relevance score of node j with respect to i is defined as the steady-state probability r_{i,j} that the particle finally stays at node j.

In our problem setting, given the Twitter network G = (V, E), a target user u ∈ V and a number k, we focus on G's subgraph G^k_u induced by N^k_u, which is simplified as G_u when k is fixed. A probability transition matrix W is defined for G_u(V) such that, for two nodes v, w ∈ G_u(V), the entry W(v, w) denotes the probability of v transmitting to w at any step. In accordance with Principle (II), we define W(v, w) as

W(v, w) = { 1/|F_{v→}|   if w ∈ F_{v→}
          { 0            if w ∉ F_{v→}    (2)
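Equation (2) yields a row-stochastic matrix. A minimal NumPy sketch of its construction (not the authors' code), assuming follows[v] holds the set F_{v→} of nodes that v transmits to and that nodes are numbered 0..n-1:

```python
import numpy as np

def transition_matrix(follows, n):
    """Build W of Equation (2): W[v, w] = 1/|F_v->| if w in F_v->, else 0."""
    W = np.zeros((n, n))
    for v in range(n):
        outs = follows.get(v, set())
        for w in outs:
            W[v, w] = 1.0 / len(outs)   # uniform over F_v->
    return W

# Tiny 3-node example: node 0 transmits to 1 and 2; nodes 1 and 2 to 0.
W = transition_matrix({0: {1, 2}, 1: {0}, 2: {0}}, n=3)
```

Each non-empty row of W sums to 1, which is what makes the random-walk interpretation of Equation (1) valid.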

In Equation 1, W̃ is the transpose of the probability transition matrix W as defined above. e_i is the starting indicator vector such that e_{i,i} = 1 and e_{i,j} = 0 for j ≠ i. r_i is the probability vector for node i such that r_{i,j} is the probability of transmitting to node j from i. It has been shown that r_i can be computed iteratively and finally converges to c(I − (1 − c)W̃)^{−1} e_i [9]. Upon convergence, the steady-state probability vector r_i reflects the bandwidth of the information flow originating from user i to user j for every j ∈ G_u(V). We use this steady-state probability to define the closeness score c_{i,j} for two users i and j:

c_{i,j} = r_{i,j} · r_{j,i}    (3)

The closeness score thus defined satisfies Principle (I). It has the following desirable properties, the proofs of which are omitted due to the space limit.

Property 1. Given a Twitter follow network G(V, E) and two users i, j ∈ V, c_{i,j} is symmetric, i.e., c_{i,j} = c_{j,i}.


Property 2. Given a Twitter follow network G(V, E), two users i, j ∈ V and k, c_{i,j} > 0 if and only if i and j satisfy Principle 1: i ∈ N^k_{j→} ∩ N^k_{j←} and j ∈ N^k_{i→} ∩ N^k_{i←}, i.e., tweets originated from either user i or j should be able to reach the other one in k hops.

Property 3. Given a Twitter follow network G(V, E), two users i, j ∈ V and k, obtain a node j′ by removing a set S of users from j's immediate neighborhood such that for each v ∈ S, either v ∈ F_{j→} \ N^k_{i←} or v ∈ F_{j←} \ N^k_{i→}. We have c_{i,j} ≤ c_{i,j′}.

Figure 2: Core Community Discovery

Properties 2 and 3 show how our closeness score definition incorporates the first two points raised at the beginning of this section. We next explore how to take advantage of the core community to identify other unknown members, implementing Principle (III). The idea is to discover the core community iteratively, adding new members into the known set in each round. For that purpose, we introduce an auxiliary dummy node, v, to provide a threshold that cuts the new core community boundary in each round. v is constructed as a virtual node such that (I) v and the target user u follow each other, i.e., v ∈ F_{u←} ∩ F_{u→}; (II) v only associates with u, i.e., for each w ∈ (N^k_u \ {u}), w ∉ (F_{v←} ∪ F_{v→}); and (III) the number of followers of v is set to be the median of the number of followers of all users in u's k-hop network with the hub users excluded, i.e., |F_{v→}| = median_{w ∈ (N^k_u \ H)} {|F_{w→}|}. This dummy node is defined in such a way as to set the lower-bound case for an off-line friend. It simulates the scenario in which the target user u finds by chance this random user v who has no connections with u's core community. Finding him/her interesting, u follows v, who then also follows back somehow. As such, v represents a connection to u almost as weak as that of any off-line real-life friend should be. Therefore, if the closeness score between u and any user w is even lower than that between u and v, w is highly unlikely to be in u's core community. In Section 5, we show that our algorithm is in fact fairly robust with respect to the choice of v's follower number.
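The dummy node's follower count can be sketched as follows (illustrative names, not the paper's code): follower_counts maps each user in u's k-hop network to |F_{v→}|, and hubs stands for the hub set H.

```python
import statistics

def dummy_follower_count(follower_counts, hubs):
    """Median follower count over the k-hop network, hub users excluded."""
    counts = [c for v, c in follower_counts.items() if v not in hubs]
    return statistics.median(counts)

# Excluding hubs keeps one celebrity-scale account from skewing the median:
m = dummy_follower_count({"a": 10, "b": 1_000_000, "c": 20, "d": 30}, hubs={"b"})
```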

On a high level, the algorithm works in iterations as follows. Given a target user u, compute the closeness score between u and all the other users as well as v. A ranking list of all the users together with v, in decreasing order of the closeness score, is thus generated. All the users ranked before v are identified as core community members, which ends the current iteration. In the next iteration, the key point is that we now treat the whole core community identified so far as one virtual user node. Instead of computing the closeness score between u alone and all the remaining users, this time we compute the closeness score between the core community as a whole and every other user. From the ranking list thus generated, any user that jumps ahead of v in this iteration is added to the core community of u, which ends this iteration, and so forth. Figure 2 illustrates the process. The target user u is shown in red in the center and the auxiliary dummy node v is shown in purple. In iteration 1, the core community is just u itself, which is indicated by the shaded circle covering u. The highlighted blue nodes and follow links represent F_{u←} ∪ F_{u→}. After computing the closeness score c_{u,w} for every other node w, three users are found to be ahead of v in the resulting ranking list. They are therefore added to the core community, indicated by their color changing from blue to orange. In iteration 2, we use the new core community, consisting now of 4 users, to compute the closeness scores for all remaining nodes; those ranked ahead of v are added to the core community. The iterations continue until no new user can be added to the core community, ending the algorithm. As the virtual user node is actually a set, we now define RWR and the closeness score between a user node i and a set S as follows.

r_{i,S} = Σ_{j∈S} r_{i,j}    (4)

r_{S,i} = Σ_{j∈S} r_{j,i}    (5)

c_{i,S} = c_{S,i} = r_{i,S} · r_{S,i}    (6)
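Given the full matrix R of steady-state probabilities with R[i, j] = r_{i,j}, Equations (4)-(6) reduce to simple sums, as in this sketch (a hypothetical helper, not from the paper):

```python
import numpy as np

def closeness_to_set(R, i, S):
    """Equations (4)-(6): closeness between node i and a node set S."""
    S = list(S)
    r_iS = R[i, S].sum()   # Eq. (4): sum of r_{i,j} over j in S
    r_Si = R[S, i].sum()   # Eq. (5): sum of r_{j,i} over j in S
    return r_iS * r_Si     # Eq. (6)
```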

Given a user node i, the probability transition matrix W, the restart probability c and a tolerance threshold ϵ, the algorithm for computing r_i is given in Algorithm 1.

Algorithm 1 NodeRWR
Input: node i, probability transition matrix W, restart probability c, tolerance threshold ϵ
Output: r_i
1: Initialize r_i ← e_i;
2: Do
3:   r′_i ← (1 − c) W̃ r_i + c e_i;
4:   Δr_i ← r′_i − r_i;
5:   r_i ← r′_i;
6: While |Δr_i| > ϵ
7: Return r_i;
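A runnable NumPy rendering of Algorithm 1 follows; this is a sketch under the stated setting (it takes the transpose of the row-stochastic W, per Equation (1)), and the 3-node example graph and the value c = 0.15 are illustrative assumptions, not from the paper:

```python
import numpy as np

def node_rwr(W, i, c=0.15, eps=1e-10):
    """NodeRWR: iterate r <- (1-c) W^T r + c e_i until the L1 change <= eps."""
    n = W.shape[0]
    e = np.zeros(n)
    e[i] = 1.0
    r = e.copy()                                # Line 1: initialize r_i <- e_i
    while True:
        r_new = (1.0 - c) * (W.T @ r) + c * e   # Line 3: one RWR step
        if np.abs(r_new - r).sum() <= eps:      # Lines 4-6: stop on convergence
            return r_new
        r = r_new

# Example: node 0 transmits to 1 and 2, who both transmit back to 0.
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
r0 = node_rwr(W, 0)
```

Because W is row-stochastic, each step preserves the total probability mass, so r0 remains a probability vector; the restart term keeps the mass concentrated near the start node.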

Algorithm 2 iteratively finds the core community for a target user u. At Line 1, we add an auxiliary dummy node v into the network to help set the cut-off threshold for each iteration. Line 2 constructs the probability transition matrix for RWR. From Line 3 to Line 6, we compute the closeness score c_u between u and the rest of the nodes in N_u, and generate a ranked list. From Line 7 to Line 16, we compute the core community C_u and, for each core community member, maintain in r the iteration in which it is identified.

5. EXPERIMENTAL STUDY

5.1 Data
To provide ground-truth evaluation for our algorithm, we hired 65 real Twitter users from different countries to participate in our user assessment test. Figure 3 shows the

Figure 4: Case study of a user’s follow network.

Figure 6: The relative result of two solutions.

of precision and recall between the two solutions for each user. In Figure 6, each point represents one user and its coordinates are defined as (PCCD − Pnaive, RCCD − Rnaive), where PCCD and RCCD are the precision and recall of our algorithm respectively, and Pnaive and Rnaive are the precision and recall of the naive approach respectively. The result shows that for most users, our solution outperforms the naive solution in both precision and recall. In particular, in two cases, the difference is even close to 1. There is only one single case in which our algorithm is outperformed in both precision and recall.

5.4 On Ranking

Besides identifying a core community through iterations, our algorithm also generates a closeness ranking of all users in the follow network of the target user. Compared against the core community found by a clear-cut threshold, this ranking could in many cases be just as useful. For example, when recommending users one has not yet followed, recommending those ranked high in this ranking could be safe. The ranking is based on the closeness score computation in Algorithm 2. For a target user u, we can use the following function to compare two users:

compare1(v1, v2) = {  1,   if c_{u,v1} − c_{u,v2} > 0
                   {  0,   if c_{u,v1} − c_{u,v2} = 0
                   { −1,   if c_{u,v1} − c_{u,v2} < 0    (7)

Alternatively, iteration information, e.g., the iteration in which the user is identified, could be incorporated into the comparison as follows:

compare2(v1, v2) = {  1,                  if r_{v1} − r_{v2} < 0
                   {  compare1(v1, v2),   if r_{v1} − r_{v2} = 0
                   { −1,                  if r_{v1} − r_{v2} > 0    (8)
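The two comparison functions can be turned into sort keys with functools.cmp_to_key, as in this sketch; the closeness and iteration values below are hypothetical, not the paper's data:

```python
from functools import cmp_to_key

# Hypothetical data: closeness scores c_{u,v} and the iteration r_v in which
# each user was identified (smaller = earlier iteration).
closeness = {"v1": 0.9, "v2": 0.4, "v3": 0.2}
iteration = {"v1": 2, "v2": 1, "v3": 2}

def compare1(a, b):                     # Equation (7): compare closeness scores
    d = closeness[a] - closeness[b]
    return (d > 0) - (d < 0)

def compare2(a, b):                     # Equation (8): earlier iteration wins,
    d = iteration[a] - iteration[b]     # closeness breaks ties
    if d == 0:
        return compare1(a, b)
    return 1 if d < 0 else -1

# Sort so that "greater" (closer to the target user) users come first:
ranking = sorted(closeness, key=cmp_to_key(compare2), reverse=True)
```

Here v2 ranks first despite its lower closeness score, because it was identified in an earlier iteration.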

Which one is better? We evaluate these two rankings by computing their AUC value for each user. The distributions of the AUC values are shown in Figure 7. The results show that for both rankings, more than 60% of the users' AUC values are greater than 0.9 and more than 80% of the users' AUC

(Case-study user: a real Twitter user, following 385; followers 107.)

Figure 5: Comparison on distribution of precision, recall and F score.

Figure 7: AUC comparison for rankings with and without incorporating iteration information.

values are greater than 0.8. The right graph in Figure 7 shows that in most cases, the ranking with iteration information incorporated is superior to the ranking based solely on the closeness score. This demonstrates that core community information helps the ranking.

5.5 On Iteration
It has been observed in our experiments that the core community discovery process ends after a few iterations. One interesting question is whether core community members identified in later iterations are as good as those found in earlier iterations. If we set a maximum number of iterations allowed in the algorithm to force termination, will the result give better precision and recall? Our experiments suggest a negative answer. Figure 8 shows the average precision, recall and F-score for a varied maximum number of iterations allowed, from 1 to 10 as well as unlimited. As the maximum number of iterations allowed increases, although average precision drops slightly, recall improves significantly, and so does the F-score. Intuitively, earlier iterations tend to capture the members closest to the target user, which results in higher precision yet at the cost of missing many other core community members with more sophisticated social connections to the target user. By setting no maximum number of iterations and allowing the core community itself to take shape, a much greater gain in recall can be achieved, offering a better result overall. In most cases, core communities stabilize after 5 or 6 iterations, as shown in Figure 9, which presents the distribution of the number of iterations over all our evaluation participants.

5.6 Modeling User Interests
How to model user interests is of critical importance in content recommendation and linkage prediction on Twitter data. Furthermore, our study reveals that core community discovery could significantly enhance user interest modeling in the following two aspects. (I) For a target user u, the core community members themselves are less informative in characterizing u's interests than the rest of the user nodes in the follow network; u follows them mostly because they are off-line real-life friends anyway. On the other hand, it is similar interests or topics that drive u to follow other non-core-community users. As such, when investigating u's interests, the first step is to distinguish u's core community from the rest of the follow network. (II) Although the core community members themselves may not necessarily reflect u's interests, the users followed by these core community members nevertheless could help understand u's interests; e.g., close friends could follow media/celebrity/business users of similar kinds. In our experiments, we identify and hire three real Twitter users, A, B and C, to help us evaluate. The ground truth is that A and B share a much more similar profile in terms of interests, background and life-style than A and C. However, if we check the common non-core-community users followed by A and B, they have 15 such users in common (shown in Figure 11), while A and C have 18 in common (shown in Figure 12). This means that, without the help of the core community, C could be considered more similar to A than



Figure 11: Interest profile comparison for A and B
Figure 12: Interest profile comparison for A and C

bi-directional way and relies on no other attribute information.

7. CONCLUSION
In this paper, we proposed the problem of identifying a user's Twitter core community. We put forward three principles to characterize core community members. Based on these principles, we developed an algorithm to iteratively discover the core community by random walk with restart. Along with the core community, our algorithm also generates a list of all users ranked by their closeness score. We presented a case study of a real Twitter user to demonstrate the effectiveness of our algorithm in correctly identifying core community members in a number of scenarios. Results manually evaluated by real Twitter users are shown to illustrate both the effectiveness and the robustness of our algorithm. With real user data, we also discussed using the core community to enhance user interest profiling.

8. REFERENCES
[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.
[2] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 635–644. ACM, 2011.
[3] E. Bakshy, J. Hofman, W. Mason, and D. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 65–74. ACM, 2011.
[4] S. Catanese, P. De Meo, E. Ferrara, and G. Fiumara. Analyzing the Facebook friendship graph. arXiv preprint arXiv:1011.5168, 2010.
[5] B. Foucault Welles, A. Van Devender, and N. Contractor. Is a friend a friend?: investigating the structure of friendship networks in virtual worlds. In Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems, pages 4027–4032. ACM, 2010.
[6] E. Gilbert and K. Karahalios. Predicting tie strength with social media. In Proceedings of the 27th International Conference on Human Factors in Computing Systems, pages 211–220. ACM, 2009.
[7] I. Kahanda and J. Neville. Using transactional information to predict link strength in online social networks. In Proceedings of the Third International Conference on Weblogs and Social Media (ICWSM), 2009.
[8] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.
[9] J. Pan, H. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modal correlation discovery. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 653–658. ACM, 2004.
[10] O. Phelan, K. McCarthy, M. Bennett, and B. Smyth. Terms of a feather: content-based news recommendation and discovery using Twitter. In Advances in Information Retrieval, pages 448–459, 2011.
[11] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. 1983.
[12] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In Proceedings of the 5th IEEE International Conference on Data Mining, pages 418–425, Houston, Texas, USA, November 27–30, 2005.
[13] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, pages 613–622, 2006.
[14] B. Tu, H. Wu, C. Hsieh, and P. Chen. Establishing new friendships: from face-to-face to Facebook: a case study of college students. In System Sciences (HICSS), 2011 44th Hawaii International Conference on, pages 1–10. IEEE, 2011.
[15] J. Weng, E. Lim, J. Jiang, and Q. He. TwitterRank: finding topic-sensitive influential twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 261–270. ACM, 2010.
[16] S. Wu, J. Hofman, W. Mason, and D. Watts. Who says what to whom on Twitter. In Proceedings of the 20th International Conference on World Wide Web, pages 705–714. ACM, 2011.
[17] R. Xiang, J. Neville, and M. Rogati. Modeling relationship strength in online social networks. In Proceedings of the 19th International Conference on World Wide Web, pages 981–990. ACM, 2010.
[18] W. Zhao, J. Jiang, J. Weng, J. He, E. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval, pages 338–349, 2011.


Figure 8: The result for limiting the max # of iterations allowed.

Figure 9: The distribution of # of iterations.

Figure 10: Robustness

B, contradicting the truth. In fact, we can use the core community to remedy the situation. Similar to the idea of TF-IDF [11], for a target user u, we use the following formula to compute the weight for each non-core-community user v:

w_u(v) = \frac{|F_{v\to} \cap C_u|}{|C_u|} \log |F_{v\to}|    (9)

As such, for a target user u, we obtain a vector x_u where each dimension is one non-core-community member. For two target users u_1 and u_2, we compute the similarity between their interest profiles as Sim(u_1, u_2) = \frac{x_{u_1} \cdot x_{u_2}}{|x_{u_1}||x_{u_2}|}. In Figure 11 and Figure 12, we show the relative ratio between users A and B, where the percentage for user A on dimension v is computed by \frac{w_A(v)}{w_A(v)+w_B(v)}, and \frac{w_B(v)}{w_A(v)+w_B(v)} for user B. Now if we compare A, B and C again using the core-community-enhanced interest profile, we have Sim(A,B) = x_A \cdot x_B = 0.3058 and Sim(A,C) = x_A \cdot x_C = 0.0907, indicating B is much more similar to A than C, which is consistent with the ground truth.
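The Equation 9 weighting and the cosine similarity above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the set-based inputs and function names are assumptions:

```python
import math

def interest_weight(followers_of_v, core_community):
    """w_u(v) = |F_v ∩ C_u| / |C_u| * log|F_v|  (Equation 9 sketch)."""
    if not followers_of_v:
        return 0.0
    overlap = len(followers_of_v & core_community) / len(core_community)
    return overlap * math.log(len(followers_of_v))

def cosine(x, y):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(x[k] * y.get(k, 0.0) for k in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0
```

Each target user's profile vector x_u would then hold one `interest_weight` entry per non-core-community member, and profiles are compared with `cosine`.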

5.7 Robustness

In our algorithm, we set the number of followers of the auxiliary dummy node to the median of all the others in the follow network. It is certainly not the only way to set this value, and we have observed that different settings work better for different user cases. However, as we show in Figure 10, our algorithm exhibits a certain robustness when we perturb the number of followers of the dummy node. We perturbed the original number of followers of the dummy node, i.e., the median, by −20%, −10%, 10% and 20% respectively. Figure 10 shows that these perturbations result in fairly little change in the precision, recall and F-score values.

6. RELATED WORK

The recent boom of online social network services (SNS), e.g., Facebook, LinkedIn, Twitter and so on, has invigorated much research interest. One direction is to analyze the similarity or difference between an SNS and the real-life social network. In particular, [5, 4] have tried to understand the underlying similarities between the development of SNS and real-life social networks. [14] looked at how Facebook has influenced the establishment of new friendship relationships. Another related direction is to use SNS to infer real-life friendship or relationship strength. [1] is an early work using hyperlinks and text information on homepages to predict relationships between individuals. [6, 7] considered further information, including network topology and interactions, to predict relationship strength. [17] approached the same problem with a link-based latent variable model.

While the relationship between a user's online and off-line social network has been investigated in standard SNS like Facebook, few studies have so far posed the same questions on the Twitter network. More importantly, compared against Facebook, Twitter has two important different characteristics: (I) as shown in [8], Twitter functions as a mixture of news media and social network, combining features of both; (II) follow links on Twitter are established without mutual consent. These unique characteristics make people wonder how much the Twitter network reflects one's real-life social network. Our work aims to address these questions. Due to its unrivaled popularity, Twitter has already attracted a huge amount of research interest from the data mining and web communities [3, 8, 16, 18, 10]. However, the existing body of work has largely focused on exploring its textual content aspect based on the tweets, e.g., the categorization of tweets and their traits based on their content [10], the topics of interest [15, 18], and the quantification of influence based on user attributes and tweet content [3]. While these works have lent valuable insight into the Twitter data, it is our observation that little attention has as yet been given to the follow network studied by itself.

Random walk with restart (RWR) has been successfully applied in many applications. [9] used it to find correlations across different media. [12] used it to find neighbor nodes in bipartite graphs. [13] developed methods to accelerate the computation of RWR for large graphs. [2] used supervised random walk, combining network information and the attributes of nodes and edges, to predict links in social networks. The intuition behind [2] is that the "closer" the users are in the network, the more likely they will interact in the future. Although we use RWR to measure user closeness similarly, our closeness definition incorporates RWR in a


Figure 5: Core Community Discovery

set in each round. For that purpose, we introduce an auxiliary dummy node, \hat{v}, to provide a threshold that cuts the new core community boundary in each round. \hat{v} is constructed as a virtual node such that: (I) \hat{v} and the target user u follow each other, i.e., \hat{v} \in F_{u\leftarrow} \cap F_{u\to}; (II) \hat{v} associates only with u, i.e., for each v \in (N^k_u \setminus \{u\}), \hat{v} \notin (F_{v\leftarrow} \cup F_{v\to}); and (III) the number of followers of \hat{v} is set to the median of the number of followers of all users in u's k-hop network with the hub users excluded, i.e., |F_{\hat{v}\to}| = median_{v \in (N^k_u \setminus H)} \{|F_{v\to}|\}. This dummy node is defined in such a way as to set the lower-bound case for an off-line friend. It simulates the scenario in which the target user u finds by chance this random user \hat{v}, who has no connections with u's core community. Finding him/her interesting, u follows \hat{v}, who then also follows back somehow. As such, \hat{v} represents a connection to u almost as weak as any off-line real-life friend should be. Therefore, if the closeness score between u and any user w is even lower than that between u and \hat{v}, w is highly unlikely to be in u's core community. In Section 5, we show that our algorithm is in fact fairly robust with respect to the choice of \hat{v}'s follower number.

On a high level, the algorithm works in iterations as follows. Given a target user u, compute the closeness score between u and all the other users as well as \hat{v}. A ranking list of all the users together with \hat{v}, in decreasing order of the closeness score, is thus generated. All the users ranked before \hat{v} are identified as core community members, which ends the current iteration. In the next iteration, the key point is that we now treat the whole core community identified so far as one virtual user node \bar{u}. Instead of computing the closeness score between u and all the remaining users, this time we compute the closeness score between \bar{u} and every other user. From the ranking list thus generated, any user that jumps ahead of \hat{v} in this iteration is added to the core community of u, which ends this iteration; and so on. Figure 5 illustrates the process. The target user u is shown in red in the center and the auxiliary dummy node \hat{v} is shown in purple. In iteration 1, the core community is just u itself, which is indicated by the shaded circle covering u. The highlighted blue nodes and follow links represent F_{u\leftarrow} \cup F_{u\to}. After computing the closeness scores c_{u,v} for all v, three users are found to be ahead of \hat{v} in the resulting ranking list. They are therefore added to the core community, indicated by their color changing from blue to orange. In iteration 2, we use the new core community \bar{u}, consisting now of 4 users, to compute the closeness scores c_{\bar{u},v} for all the remaining nodes v. Those ranked ahead of \hat{v} are added to the core community. The iterations continue until no new user can be added to the core community, ending the algorithm.

As the virtual user node \bar{u} is actually a set, we now define RWR and the closeness score between a user node i and a set S as follows:

r_{i,S} = \sum_{j \in S} r_{i,j}    (4)

r_{S,i} = \sum_{j \in S} r_{j,i}    (5)

c_{i,S} = c_{S,i} = r_{i,S} \cdot r_{S,i}    (6)
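Assuming the pairwise RWR scores r_{i,j} have already been computed, Equations 4–6 reduce to a few lines. This is an illustrative sketch with an assumed nested-dict representation, not the paper's code:

```python
def set_closeness(i, S, r):
    """Closeness between node i and node set S (Equations 4-6).

    r[a][b] is the precomputed RWR score r_{a,b} from node a to node b.
    """
    r_iS = sum(r[i].get(j, 0.0) for j in S)   # Equation 4
    r_Si = sum(r[j].get(i, 0.0) for j in S)   # Equation 5
    return r_iS * r_Si                        # Equation 6
```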

Given a user node i, the probability transition matrix W, the restart probability c and a tolerance threshold ϵ, the algorithm for computing r_i is given in Algorithm 1.

Algorithm 1 NodeRWR
Input: node i, probability transition matrix W, restart probability c, tolerance threshold ϵ
Output: r_i
1: Initialize r_i ← e_i;
2: Do
3:   r'_i ← (1 − c) W r_i + c e_i;
4:   Δr_i ← r'_i − r_i;
5:   r_i ← r'_i;
6: While |Δr_i| > ϵ
7: Return r_i;
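Algorithm 1 is the standard RWR power iteration; a minimal NumPy sketch under the assumption that W is column-stochastic (this mirrors the pseudocode rather than the authors' implementation):

```python
import numpy as np

def node_rwr(i, W, c=0.15, eps=1e-8):
    """Random walk with restart from node i (Algorithm 1, NodeRWR).

    W is a column-stochastic transition matrix; c is the restart
    probability; iteration stops when the L1 change drops below eps.
    """
    n = W.shape[0]
    e = np.zeros(n)
    e[i] = 1.0                      # restart distribution: all mass on node i
    r = e.copy()
    while True:
        r_new = (1 - c) * W @ r + c * e
        if np.abs(r_new - r).sum() <= eps:
            return r_new
        r = r_new
```

For a two-node cycle with c = 0.5, the fixed point is r = (2/3, 1/3), which the loop reaches quickly since each step contracts the error by (1 − c).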

Algorithm 2 CoreCommunityDiscovery (CCD)
Input: target node u, network N_u, restart probability c, tolerance threshold ϵ
Output: core community C_u, iteration register vector ir, closeness scores c_u
1: add the auxiliary dummy node \hat{v} into network N_u;
2: construct W from network N_u by Equation 2;
3: For each v ∈ N_u
4:   r_v ← NodeRWR(v, W, c, ϵ);
5: For each v ∈ N_u \ {u}
6:   c_{u,v} ← r_{u,v} · r_{v,u};
7: t ← 0; C_u ← {u}; ir ← 0;
8: Do
9:   t ← t + 1;
10:  T ← ∅;
11:  For each v ∈ N_u
12:    If v ∉ C_u and c_{C_u,v} > c_{C_u,\hat{v}}
13:      ir_v ← t;
14:      T ← T ∪ {v};
15:  C_u ← C_u ∪ T;
16: While |T| > 0
17: Return C_u, ir, c_u;

Algorithm 2 iteratively finds the core community for a target user u. At Line 1, we add the auxiliary dummy node \hat{v} into the network to help set the cut-off threshold for each iteration. Line 2 constructs the probability transition matrix for RWR. From Line 3 to Line 6, we compute the closeness score c_u between u and the rest of the nodes in N_u, and generate a ranked list. From Line 7 to Line 16, we compute the core community C_u and, for each core community member, maintain in ir the iteration in which it was identified.

Figure 6: Case study of a user’s follow network.

5. EXPERIMENTAL STUDY

An implementation of our algorithm as a demo system, TwiCube (http://twitterbud2011.appspot.com/), is publicly available.

5.1 Case Study

We now present a case study of a real user X who participated in our evaluation. X has 107 followers and follows 385 other users. Figure 6 illustrates the discovery of his core community in a total of 4 iterations, each indicated by a different color. In summary, 34 users are identified in Iteration 1, 19 in Iteration 2, 3 in Iteration 3, and only one user in the last iteration. The precision and recall for this result of X's core community are 0.8947 and 0.9807 respectively. It can be observed from Figure 6 that there is a dense cluster of core community members heavily linked among one another (lower left of X) and another such cluster of non-core-community users similarly linked (upper right of X). This shows that approaches based on dense subgraph mining or structural clustering would have a hard time distinguishing between these two similarly structured communities and, consequently, identifying the true core community. In fact, this cluster of non-core-community users consists of media, business and active Twitter users sharing similar interests and topics, which is a good indicator of X's own.

In Figure 6, we pick out two particular users, magnify their follow links with X, and present them in two cases, (a) and (b) (marked by arrows in the figure). In (a), we show the follow network between X and a non-core-community user "tuniu", which is a travel business. Note that although X and this business node directly follow each other, satisfying our Principle 1, this node is still correctly excluded from the core community by our algorithm. This is mainly because it connects mostly with other non-core-community users by follow links, exhibiting weak core community affinity with X. This case would defeat the naive approach of identifying core community members by two-way follow links. In (b), we show the follow network between X and a core community member Y, who is discovered in Iteration 3. In this case, X follows Y but Y does not follow X. Moreover, it is not until more core community members have been identified in Iterations 1 and 2 that Y's sophisticated connections with the core community are revealed. In this tricky case, by unleashing the power of iterated core community identification, our algorithm is still able to correctly identify Y.

5.2 Effectiveness

One naive method to identify the core community of a target user u is to find the set of users who have direct two-way follow links with u, i.e., they and u follow each other. Do direct two-way follow links provide a good indication of off-line real-world friendship? Our experiments suggest that these links are not sufficient. In Figure 7 we show the comparison of the distributions (among the 65 user evaluations) of precision, recall and F-score between our algorithm CCD and the naive algorithm. In general, our solution outperforms the naive solution by a large margin. To conduct a more detailed comparison between the two methods, let us take a closer look at each user. We compute the difference in precision and recall between the two solutions for each user. In Figure 8, each point represents one user and the coordinate is defined as (P_CCD − P_naive, R_CCD − R_naive), where P_CCD and R_CCD are the precision and recall of our algorithm respectively, and P_naive and R_naive are the precision and recall of the naive approach respectively. The result shows that for most users, our solution outperforms the naive solution in both precision and recall. In particular, in two cases, the difference is even close to 1. There is only one single case in which our algorithm is outperformed on both precision and recall.
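The naive two-way-follow baseline and the per-user precision/recall used in this comparison are straightforward; a sketch with assumed set-based data structures (not the paper's implementation):

```python
def naive_core_community(u, followers, followees):
    """Baseline: users with a direct two-way follow link to u.

    followers[x] / followees[x] are the sets of accounts following x
    and followed by x, respectively.
    """
    return followers[u] & followees[u]

def precision_recall(predicted, truth):
    """Precision and recall of a predicted core community vs. ground truth."""
    tp = len(predicted & truth)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(truth) if truth else 0.0
    return p, r
```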

A real Twitter user:
§ Following 385 users
§ Followed by 107 users

Page 11: BDTC2015 - Singapore Management University - Feida Zhu

Big Data and Financial Innovation: From Research to Practice

Research topic: off-line intimate relationship mining
Problem: Given a user's tweets, identify all interpersonal relationships that involve physical or emotional intimacy, such as family members, husband and wife, romantic relationships, etc.

Example:
§ Intimate expressions
  § "honey", "baby", "dear", "my dear wife", ...
§ Occasions/Events
  § Valentine's day, anniversary, father's day, birthday, ...
§ Intimacy-related named entities
  § Resort hotels, kids, home-improvement, ...
§ Screen-name correlation
  § Substring swaps
  § Similar patterns with keywords
  § Patterns with domain knowledge

Design Ideas I: Intimacy-related Entities

Use Dempster–Shafer theory to model the degree of association between entities and a certain type of relationship. The final intimate relationship scores are obtained through an iterative algorithm.
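The slide does not spell out the exact Dempster–Shafer formulation used; as a sketch of the machinery, Dempster's rule of combination fuses two evidence sources over a frame of discernment (the relationship-type names below are purely illustrative):

```python
from itertools import product

def combine_masses(m1, m2):
    """Dempster's rule of combination over frozenset focal elements.

    m1, m2: dicts mapping frozenset -> mass, each summing to 1.
    Returns the combined, conflict-renormalized mass function.
    """
    combined, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            conflict += mb * mc          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict between evidence sources")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}
```

Two weak cues (e.g., a "honey" expression and a Valentine's day mention) can each contribute a mass function, and repeated combination sharpens the belief in a specific relationship type.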

Design Ideas II: Exclusivity of “@” to identify relationship candidates

Page 12: BDTC2015 - Singapore Management University - Feida Zhu

Cross-platform user identity linkage for external data

[Figure 3 diagram: for each platform, linkage information is collected from profiles, usernames, photos, tweets/retweets, trajectories, etc., over unlinked identities; Step 1: Heterogeneous Behavior Modeling; Step 2: Structure Information Modeling; Step 3: Multi-objective Optimization min_W [F_1(w), F_2(w), ..., F_M(w)], yielding the linkage function f_W.]

Figure 3: HYDRA framework.

Step 2. Structure Information Modeling. We construct the structure consistency graph on user pairs by considering both the core network structure of the users and their behavior similarities. Details are discussed in Section 5.

Step 3. Multi-objective Optimization with Missing Information. Based on the previous two steps, we convert the SIL problem into a two-class classification problem and construct a multi-objective optimization which jointly optimizes the prediction accuracy on the labeled user pairs and multiple structure consistency measurements across the different platforms. Details are discussed in Section 6.

5. HETEROGENEOUS BEHAVIOR MODEL

The key challenges in modeling user behavior across different social media platforms are (I) the heterogeneity of user social data and (II) the temporal misalignment of user behavior across platforms. The high heterogeneity of user social data can be appreciated through the following categorization of all the data about a user available on a typical social platform.

1. User Attributes. Included here are all the traditional structured data about a user, e.g., demographic information, contact, etc. (Subsection 5.1)

2. User Generated Content (UGC). Included here are the unstructured data generated by users, such as text (reviews, micro-blogs, etc.), images, videos and so on. Modeling is primarily targeted at topic (Subsection 5.2) and style (Subsection 5.3).

3. User Behavior Trajectory. User behavior trajectory refers to all the social behavior of a user as exhibited on the platforms along the time-line, e.g., befriend, follow/unfollow, retweet, thumb-up/thumb-down, etc. (Subsection 5.4)

4. User Core Social Network. A user's core social network is the social network formed among those who are the closest to the user. (Subsection 5.5)

To address these two challenges, we propose a behavior modeling framework which computes similarity between users from a variety of aspects to effectively capture their heterogeneous behavior as well as the characteristics of its temporal evolution.

5.1 User Attribute Modeling

Textual Attributes. Profile information is informative in distinguishing different users. Common textual attributes in a user profile include name, gender, age, nationality, company, education, email account, etc. A simple matching strategy can be built on such a set of information. However, the relative importance of these attributes is not identical, because attributes such as gender and common names like "John" are not as discriminative as others such as email address in identifying user linkage. Yet, the weights of the attributes used in the matching can be learned from a large training set by probabilistic modeling.

Figure 4: The workflow of face recognition for identity linkage. A face detector is employed to extract the face from a pair of profile images. Then a pre-trained face classifier outputs a confidence score in [0, 1] indicating how likely the two faces belong to one person.

Specifically, given a set of N labeled training user pairs from different platforms, the relative importance of the attributes can be estimated by data counting. For a specific attribute a_k, k = 1, ..., M_A, we estimate the relative importance score by the following equation:

m_t(k) = \frac{P_D(k)}{P_D(k) + N_D(k)}, \qquad \hat{m}_t(k) = \frac{m_t(k) + \varepsilon}{\sum_{k'=1}^{M_A} m_t(k') + M_A \varepsilon}    (3)

where P_D(k) represents the number of user pairs matched on a_k in the positively labeled set P_D, and N_D(k) represents the number of pairs matched on a_k in the negatively labeled set N_D. ε denotes a small real number that avoids over-fitting.

Given a user pair, an M_A-dimensional attribute matching feature can be calculated. For example, if the user pair (i, i') is matched on the 1st, 2nd, and 5th attributes, where the corresponding weights are 0.1, 0.3, and 0.2 respectively, then the attribute feature of the user pair is [0.1, 0.3, 0, 0, 0.2, ...]. If the k-th attribute of user i or i' is absent, we denote the k-th feature as missing.
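The counting scheme of Equation 3 can be sketched as follows. This is a simplified reading in which the per-pair attribute match-vectors are assumed to be precomputed booleans, and `eps` plays the role of the smoothing constant ε:

```python
def attribute_weights(pos_pairs, neg_pairs, num_attrs, eps=1e-3):
    """Estimate per-attribute importance from labeled pairs (Equation 3 sketch).

    pos_pairs / neg_pairs: lists of boolean match-vectors (one flag per
    attribute) for positively / negatively labeled user pairs.
    Returns smoothed weights that sum to 1.
    """
    m = []
    for k in range(num_attrs):
        pd = sum(p[k] for p in pos_pairs)   # matches among positive pairs
        nd = sum(n[k] for n in neg_pairs)   # matches among negative pairs
        m.append(pd / (pd + nd) if pd + nd else 0.0)
    z = sum(m) + num_attrs * eps
    return [(mk + eps) / z for mk in m]
```

An attribute that matches mostly within positive pairs (e.g., email address) ends up with a much larger weight than one that matches indiscriminately (e.g., gender).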

Visual Attributes. Besides textual attributes, visual attributes such as the face images used in the profile can also be used to link users. However, as many users may not use their true face images, or use images with poor illumination and severe occlusion, such information can be very noisy. We designed a matching scheme as shown in Figure 4 to safely compare two user profile images. In particular, if faces have been detected in both images, the pre-trained classifier is used to determine whether the two faces correspond to the same person. We use the face detector, facial feature extraction and face classifier provided by [14].

5.2 User Topic Modeling

An important feature of a social media platform is that, in general, over a sufficiently long period of time, the UGC of a user collectively gives a faithful reflection of the user's topical interest. Faking one's interests all the time defeats the purpose of using a social network service. Therefore, we propose to model a user's topical interest by a long-term user topic model. We first construct a latent topic model using Latent Dirichlet Allocation on each textual message, the output of which is a probability distribution in the topic space. We then calculate the multi-scale temporal topic distribution within a given temporal range for a user, using a multi-scale temporal division similar to [19]. Specifically, as shown in Figure 5, the time axis is first divided into multiple time buckets at different scales (we use 1, 2, 4, 8, 16 and 32 days in this paper, which guarantees the optimal performance); then all the topic distribution vectors within each bucket are accumulated into a single distribution, which represents the topic distribution pattern within this time bucket. In Figure 5, C_t denotes the number of time buckets when the scale is selected to be 16. Correspondingly, the number of time
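The multi-scale bucketing step described above can be sketched as follows. The per-message topic vectors are assumed to come from the LDA step, and the day indices are illustrative:

```python
from collections import defaultdict

def multiscale_topic_profile(messages, scales=(1, 2, 4, 8, 16, 32)):
    """Accumulate per-message topic vectors into multi-scale time buckets.

    messages: iterable of (day, topic_vector) pairs, where topic_vector
    is a list of per-topic probabilities for one message.
    Returns {scale: {bucket_index: accumulated_topic_vector}}.
    """
    profile = {s: defaultdict(lambda: None) for s in scales}
    for day, vec in messages:
        for s in scales:
            bucket = day // s                # bucket index at this scale
            acc = profile[s][bucket]
            if acc is None:
                profile[s][bucket] = list(vec)
            else:
                profile[s][bucket] = [a + b for a, b in zip(acc, vec)]
    return {s: dict(b) for s, b in profile.items()}
```

Comparing two users' profiles bucket-by-bucket at matching scales then gives a temporally aligned topical similarity.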

• Nodal attributes (numeric, categorical)
  – Demographics, location, personal interest, etc.
• User Generated Content (topics, sentiments)
  – Reviews, tweets, ratings, multimedia, etc.
• Social network (snapshot/static view)
  – Friend network, followers/followees network, communities/interest groups, etc.
• Behavior trajectory (dynamic, evolutionary)
  – Content sharing history, social interaction patterns, network formation, etc.

Page 13: BDTC2015 - Singapore Management University - Feida Zhu

Cross-platform user identity linkage for external data

• People's closest friends are similar across different social platforms.
• Aggregating the behavior similarity of a user's most frequently interacting friends provides insights into user identity linkage.
• Supervised Learning
• Structure Consistency Modeling
• Multi-objective Optimization: a two-class classification problem; construct a multi-objective optimization which jointly optimizes the prediction accuracy on the labeled user pairs and multiple structure consistency measurements across different platforms.

Page 14: BDTC2015 - Singapore Management University - Feida Zhu

The core of social media big data: the five "C"s

• Content
  – Personal profiles, topic distributions, sentiment models, interest portraits.
• Context
  – Location, temporal analysis, behavior trajectories, community analysis.
• Connection
  – Off-line relationship mining, core network analysis.
• Crowd
  – Harnessing the collective intelligence of the crowd: crowdsourcing, crowdfunding.
• Cloud
  – Developing a multi-source mode of thinking.

[Figure: the five C's - Content, Context, Connection, Crowd and Cloud - surrounding social media big data.]

Page 15: BDTC2015 - Singapore Management University - Feida Zhu

Personal credit scoring applications of social media big data

• Compensating for the sparsity of personal credit data
  – In China, official personal credit data is scarce, especially for low- and middle-income applicants, and precisely this group is the main target customer base of Internet finance.
  – Cold start
• Countering malicious fraud
  – The weak correlation between social data and the financial domain
  – Detecting fraud committed from other localities
• Mining the forward-looking side of risk
  – Temporal reasoning over life scenarios
  – Deep mining of how credit risk propagates through social relationships

Page 16: BDTC2015 - Singapore Management University - Feida Zhu

Personal credit scoring applications of social media big data

Application modes:

• Extract social-dimension credit features and add them to existing traditional credit models
• Use generative models to mine the latent user prototypes of different credit classes
• A risk-propagation query and exploration engine based on the social relationship network
• A real-time anti-fraud detection and early-warning system

Page 17: BDTC2015 - Singapore Management University - Feida Zhu

Personal credit scoring applications of social media big data

• Upstart: Launched in May 2014, Upstart facilitated more than 8,700 loans totaling US$102.5 million in 2014; this strong operating record made it a standout among newcomers to the P2P lending industry. The platform focuses its lending on millennials (born 1984-1995), i.e., the young post-80s and early-90s generation.

Page 18: BDTC2015 - Singapore Management University - Feida Zhu

Extracting social-dimension credit features and adding them to existing traditional credit models

• Combine the better-performing independent features into composite features and add them to the traditional model.
• Use decision-tree combinations of geographic features (place of residence, check-in locations): based on the differences between the Good and Bad distributions on each feature, select features to put into the decision tree.
• The decision tree generated from the data is shown in the table below. Different colors distinguish the levels of the tree: yellow for the first level, green for the second, and blue for the third. The values in the table indicate the risk index that the group satisfying the given condition is bad. Based on this decision-tree model, the classification accuracy reaches 0.83.

Page 19: BDTC2015 - Singapore Management University - Feida Zhu

Extracting social-dimension credit features and adding them to existing traditional credit models

We choose to define features in a simpler way by aggregating the degree features of u_i's followees and followers respectively. The aggregated features can reflect the connection values of u_i, and in this way represent the social status of u_i very well.

• Degree features
The numbers of, and the ratios between, a user's social connections are all good for representing the user's social status. In light of this, we define features with respect to the network degree of users as follows: #followers, the number of followers; #followees, the number of followees; #friends, the number of friends; #friends/#followers, the fraction of followers that are also followees; #friends/#followees, the fraction of followees that are also followers; #followers/#followees, the ratio between the number of followers and followees; and #followers+#followees, the sum of the numbers of followers and followees.

• Subnetwork aggregated features
Based on the degree features, we define the aggregated subnetwork features of u_i by first computing the degree features of the users in u_i's neighborhood V_i, and then computing the mean and variance of these neighbors' degree feature values. It is worth noting that we can obtain 3 sets of aggregated subnetwork features if we consider u_i's followers, u_i's followees, and their combination separately. Similarly, we can also obtain aggregated features by computing the mean and variance of the number of microblogs of the users in V_i directly. Based on the subnetwork structure, we also introduce features like betweenness centrality and PageRank.

The proposed network features overcome the shortcoming of lacking the complete network of users, and capture the overall social status of users to a large extent. One with good social status usually has excellent aptitude in certain areas to attract followers, and follows those with good knowledge of certain areas too. Generally speaking, the overall social status information revealed by these network features also correlates with economic level, supporting its use in assessing credit.
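The degree features listed above can be computed directly from follower/followee sets; a sketch in which the dict keys mirror the feature names in the text (the function itself is illustrative, not the paper's code):

```python
def degree_features(followers, followees):
    """Degree features for one user from follower/followee sets.

    Ratio features are None when the denominator is zero.
    """
    friends = followers & followees          # mutual follow links
    nfr, nfe, nf = len(followers), len(followees), len(friends)

    def ratio(a, b):
        return a / b if b else None

    return {
        "#followers": nfr,
        "#followees": nfe,
        "#friends": nf,
        "#friends/#followers": ratio(nf, nfr),
        "#friends/#followees": ratio(nf, nfe),
        "#followers/#followees": ratio(nfr, nfe),
        "#followers+#followees": nfr + nfe,
    }
```

The subnetwork aggregated features would then be the mean and variance of these values over a user's neighborhood V_i.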

3.5 Effectiveness of Proposed Features

In this section, we empirically evaluate the effectiveness of the different types of features described in Section 3.4. Different from the experiment part, we focus on the effectiveness analysis of each feature separately. To this end, we compute the Pearson correlation coefficient and the χ2 statistic between each feature and the credit label. Aside from these statistics, we also demonstrate each feature's relative prediction performance by comparing the feature importance values obtained from GBDT models. The feature importance comparison is done within each type of features for simplicity. For each feature name, we only report the results of the most representative feature under it, since features under the same feature name are often correlated with each other, and some feature names contain too many features to list, like posting time distribution and topic distribution.
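Both statistics are standard; a dependency-free sketch of the Pearson correlation and of the χ2 statistic over a contingency table (in practice one would use a statistics library, and the χ2 here is the plain test statistic, not its p-value):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def chi2_statistic(table):
    """Chi-square statistic for a 2D contingency table (list of rows)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / total    # expected count under independence
            if exp:
                stat += (obs - exp) ** 2 / exp
    return stat
```

For the χ2 test, each feature would be binned against the Good/Bad credit label to build the contingency table before computing the statistic.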

3.5.1 Demographic Features

Fid  Feature Name       Pearson Correlation  χ2 Statistics
1    Gender             4.45 × 10−2          14.27*
2    Age                1.92 × 10−2          16.28*
3    Verified           5.128 × 10−2         17.02*
4    Education          4.18 × 10−3          0
5    Location           4.81 × 10−2          16.68*
6    Occupation         2.244 × 10−2         0.137
7    Registration time  6.944 × 10−2         39.44*
* Passes the significance test at the confidence level of 95%.
Table 5: Pearson correlation and χ2 statistics evaluation for demographic features

In Table 5, we show the comparison between the different demographic features w.r.t. Pearson correlation coefficients and χ2 statistics. The low Pearson correlation values show that there are only very weak linear dependencies between a user's demographic features and credit labels, which is consistent with reality and our intuition. On the other hand, most of the features pass the significance test using the χ2 statistic as the measure. Features that fail to pass, like

Figure 5: Feature importance comparison between the groups of features listed in Tables 5, 6, 7, and 8, using box plots. (a) Demographic features; (b) Microblog features; (c) Behavior features; (d) Network features.

education and occupation, are due to extreme data sparsity: their values are as much as 99.5% missing in our dataset. In Figure 5 (a), we can see that (1) the feature importance distribution is not always consistent with the statistical analysis; (2) the Age and Registration time features are much more predictive than the rest.

3.5.2 Microblog Features

Similar to the demographic features, Table 6 shows the Pearson correlation and χ2 statistics for the microblog features. The χ2 statistics of microblogs containing a URL, microblogs containing only mentions, the Naive Bayes based class probabilities, and the topic distributions are very large. In particular, the value for the Naive Bayes based class probabilities is as large as 467.5, indicating that Feature 9 is highly informative for credit prediction. Figure 5 (b) presents a more intuitive comparison among the 10 kinds of microblog features w.r.t. feature importance in the learned GBDT model. Feature 9 is again the most prominent in feature importance. Because of its strong predictive power, the importance values of all the remaining features lie between 0 and 5, although their predictive power is still high compared with the features in the other groups.
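Feature 9 treats the class probability output by a text classifier as a single feature. One plausible way such a feature could be produced (the texts, labels, and pipeline below are illustrative assumptions, not the paper's actual classifier):

```python
# Sketch: turning a Naive Bayes text classifier's class probability
# into a per-user feature. Texts and labels are toy stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win free cash loan now", "lovely dinner with family",
               "get rich quick casino", "morning run in the park"]
train_labels = [0, 1, 0, 1]        # 0 = bad credit, 1 = good credit

vec = CountVectorizer()
Xtr = vec.fit_transform(train_texts)
nb = MultinomialNB().fit(Xtr, train_labels)

# For a given user, score their aggregated microblog text and take
# P(good credit | text) as a single numeric feature
user_text = "family dinner and a run in the park"
p_good = nb.predict_proba(vec.transform([user_text]))[0, 1]
print(f"NB-based probability feature: {p_good:.3f}")
```

Collapsing a high-dimensional bag-of-words signal into one probability is a common way to make text usable inside a tree ensemble without exploding the feature space.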

Fid  Feature Name               Pearson Correlation  χ2 Statistics
1    Length                     5.546×10^−2          48.04*
2    Containing images          4.149×10^−2          3.650
3    Containing URL             1.827×10^−2          58.02*
4    Containing HashTag         3.422×10^−2          2.376
5    Containing only mentions   6.114×10^−2          21.63*
6    Containing only emoticons  5.504×10^−2          9.475*
7    Grant of "badges"          2.212×10^−2          6.449*
8    Commercial purpose         1.134×10^−2          2.026
9    N. B. based prob.          7.716×10^−2          25.76*
10   Topic distributions        5.370×10^−2          39.44*
* Passes the significance test at the confidence level of 95%.
Table 6: Pearson correlation and χ2 statistics evaluation for microblog features

Fid  Feature Name           Pearson Correlation  χ2 Statistics
1    Near Duplicate         2.740×10^−2          2.642
2    Retweet Chain          9.200×10^−2          53.05*
3    Plain Retweet          3.374×10^−2          34.61*
4    Emoticon behavior      8.637×10^−2          25.68*
5    Mention behavior       6.236×10^−2          28.10*
6    Posting time           5.162×10^−2          61.06*
7    Metaphysical power     4.370×10^−2          0.660
8    Active level           4.770×10^−2          31.77*
9    Sentiment word(+)      4.240×10^−2          0.380
10   Sentiment word(−)      5.063×10^−2          0.092
11   Sentiment polarity(+)  2.602×10^−2          4.851
12   Sentiment polarity(−)  9.272×10^−3          2.268
* Passes the significance test at the confidence level of 95%.
Table 7: Pearson correlation and χ2 statistics evaluation for behavior features

3.5.3 Behavior Features

For the behavior features, we list the statistical information for each feature name in Table 7. Half of the features, including the features related to retweet chains, plain retweets, emoticons and mentions, the posting time distribution, and the active level, are statistically significant for the credit label according to the χ2 test. The posting-time features are especially important, since their χ2 statistics are all considerably high and there are 24 different features of this kind. Figure 5 (c) shows the feature importance when the behavior features are used as input to the GBDT model. Their importance values are all comparable with one another, and the low importance values also support the intuition that behavior information reflects a user's credit risk only indirectly and to a limited extent. Although no single behavior feature has very high importance, the combination of so many predictive behavior features still achieves very high performance, as will be shown in the experiment section.
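The 24 posting-time features can be read as a normalized histogram over the hour of day. A minimal sketch (timestamps are toy examples, and the hour-of-day binning is an assumption about how the 24 features are derived):

```python
# Sketch: a 24-dimensional posting-time distribution feature,
# i.e., the fraction of a user's posts falling in each hour of the day.
from collections import Counter
from datetime import datetime

timestamps = ["2015-03-01 02:14:00", "2015-03-02 02:50:00",
              "2015-03-02 23:05:00", "2015-03-03 09:30:00"]
hours = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in timestamps]

counts = Counter(hours)
dist = [counts.get(h, 0) / len(hours) for h in range(24)]
print(dist[2], dist[23])   # 0.5 0.25 for the toy timestamps above
```

A user who posts heavily in the small hours would show mass in the late-night bins, which is the kind of lifestyle signal the behavior features are meant to capture.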

3.5.4 Network Features

Fid  Feature Name            Pearson Correlation  χ2 Statistics
1    #followees              4.651×10^−2          23.62*
2    #followers              1.922×10^−3          8.084*
3    #friends                1.446×10^−2          0.136
4    #friends/#followees     1.701×10^−2          0.136
5    #followers+#followees   3.476×10^−2          0.059
6    Aggregated feature 1    2.961×10^−2          3.844
7    Aggregated feature 4    2.831×10^−2          2.000
8    Betweenness Centrality  2.237×10^−2          3.658
* Passes the significance test at the confidence level of 95%.
Table 8: Pearson correlation and χ2 statistics evaluation for network features

Table 8 and Figure 5 (d) present the effectiveness analysis of the network features proposed in Section 3.4. Since some network features' correlation values and χ2 statistics are very low, we do not list all of them in the table. Among the network features, only #followees and #followers pass the significance test under the χ2 test at the 95% confidence level. However, the feature importance comparison in Figure 5 (d) shows that the aggregated features are much more important for credit prediction than the degree features. This indicates that both the degree features and the aggregated degree features are informative for credit evaluation, and that they contribute to the prediction in different ways.
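Degree features (#followers, #followees, #friends) and aggregated features (statistics over neighbors' degrees) could be computed as below. The toy graph, the representation as a followee map, and the choice of the mean as the aggregate are illustrative assumptions:

```python
# Sketch: degree features and an aggregated (neighborhood) degree
# feature from a toy follow graph, given as user -> set of followees.
import statistics

followees = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a", "b"},
    "d": set(),
}
# Invert the map to obtain each user's followers
followers = {u: {v for v, fs in followees.items() if u in fs} for u in followees}

def features(u):
    deg_out = len(followees[u])                  # #followees
    deg_in = len(followers[u])                   # #followers
    friends = len(followees[u] & followers[u])   # mutual follows (#friends)
    # Aggregated feature: mean #followees over u's followees
    nbr = [len(followees[v]) for v in followees[u]]
    agg = statistics.mean(nbr) if nbr else 0.0
    return deg_out, deg_in, friends, agg

print(features("a"))   # (3, 2, 2, 1.0)
```

Aggregated features of this kind describe the neighborhood rather than the user, which is one plausible reason they can separate credit classes even when the user's own degrees do not.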

4. EXPERIMENTS

4.1 Experiment Setup

Data Sets.

Description                       Value
#users of good credit             63,500
#users of bad credit              3,699
Total number of microblogs        7,216,087
#Microblogs by good credit users  6,914,389
#Microblogs by bad credit users   301,698
Total number of words             12,301,485
Size of vocabulary                241,197
Table 9: Statistics of the Sina Weibo dataset used for performance evaluation

Method.
Baseline 1
Baseline 2
Baseline 3

Evaluation Metrics.
Accuracy, F1, Precision, Recall, NDCG@K.
ROC curve on the imbalanced dataset.
PR curve on the imbalanced dataset, to avoid repeating big tables.
Comparison of supervised learning algorithms? PR curve.

System Configuration.
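Given the class imbalance in Table 9 (roughly 5.5% bad-credit users), the PR curve is usually more revealing than the ROC curve. A sketch on synthetic scores illustrating the gap between the two summaries (the score distributions are arbitrary choices, not the paper's classifier output):

```python
# Sketch: ROC AUC vs. average precision (PR-curve area) on an
# imbalanced toy problem with ~5% positives, as in Table 9.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.05).astype(int)      # ~5% positive class
scores = y * rng.normal(1.0, 1.0, 10000) + (1 - y) * rng.normal(0.0, 1.0, 10000)

print(f"ROC AUC:           {roc_auc_score(y, scores):.3f}")
print(f"Average precision: {average_precision_score(y, scores):.3f}")
# The ROC AUC looks healthy, while the much lower average precision
# exposes how many false alarms remain among the top-scored users.
```

This is why the draft notes above call for a PR curve on the imbalanced dataset in addition to the ROC curve.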

4.2 Evaluation on Balanced Dataset

4.3 Evaluation on Realistic Imbalanced Dataset


•  Posting time distribution
•  Mobile device type
•  Check-in location distribution
•  Time span of check-in locations

[Plot: prediction accuracy (ranging from 0.52 to 0.62) vs. number of features (1 to 21).]

Page 20: BDTC2015-新加坡管理大学-朱飞达

Big Data and Financial Innovation: From Research to Practice

Using generative models to mine latent user prototypes for different credit classes

Page 21: BDTC2015-新加坡管理大学-朱飞达


A risk-propagation query and exploration engine based on social networks

Seed generation → Network expansion → Business application

Internal list of bad-credit customers

External big-data platforms:
•  Lists of users reported on social media
•  Bad-record lists from Internet-finance websites
•  Bad-record lists from government public information platforms
•  Lists triggered by news events

Multi-source data mining dimensions:
•  User content analysis (topics, opinions, sentiment, etc.)
•  Contextual analysis (spatio-temporal sequences, geographic location, etc.)
•  Social network analysis (family, colleagues, friends, communities, etc.)

Automatic credit scoring for massive numbers of customers

Interactive detection and investigation system
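The seed-generation and network-expansion steps of this pipeline can be sketched as a bounded breadth-first search from a seed blacklist over the social graph; the graph, user names, and hop limit below are illustrative assumptions:

```python
# Sketch: expanding a seed blacklist over a social graph via
# breadth-first search, up to a fixed number of hops.
from collections import deque

graph = {                       # toy undirected social graph
    "seed1": ["u1", "u2"],
    "u1": ["seed1", "u3"],
    "u2": ["seed1"],
    "u3": ["u1", "u4"],
    "u4": ["u3"],
}

def expand(seeds, graph, max_hops=2):
    """Return each reached user with its distance to the nearest seed."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue            # do not expand beyond the hop limit
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

print(expand({"seed1"}, graph))
# {'seed1': 0, 'u1': 1, 'u2': 1, 'u3': 2} with max_hops=2
```

The hop distances returned here could then feed the scoring step, e.g. by discounting risk with distance from the nearest seed.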

Page 22: BDTC2015-新加坡管理大学-朱飞达


Real-time fraud detection and early-warning system

Page 23: BDTC2015-新加坡管理大学-朱飞达


Real-time fraud detection and early-warning system

Page 24: BDTC2015-新加坡管理大学-朱飞达


Advantages of social media big data for personal credit evaluation

•  A personal credit score can be built from a user's individual data
   •  Personal data: massive, all-round, dynamic and real-time, with contextual understanding
   •  Analysis methods:
      •  Content analysis: interests and hobbies (gambling, pornography, heavy spending on luxury goods, etc.), personal character (vulgar language, lying, self-contradiction), personality traits (irritable, extreme, reckless, impulsive)
      •  Contextual analysis: movement patterns (no fixed residence, visits to disreputable venues, presence in fraud-prone areas), living habits (nightlife, posting times), devices used (phone model and configuration)
•  A comprehensive credit assessment can be built from the user's social network, uncovering latent credit risk
   •  Social network: the user's core network (family, friends, business partners)
   •  Analysis methods:
      •  Network-based credit inference (e.g., whether the user is closely connected to people with bad credit)

Page 25: BDTC2015-新加坡管理大学-朱飞达


Challenges and research topics for social big data in financial innovation

•  The "CANNOTs (or SHOULD-NOTs)": the boundaries and frontiers
   –  Privacy
      •  How to provide non-intrusive yet personalized customer service?
      •  Where is the boundary between public and private data?
   –  Ownership
      •  Who should own the data shared on various platforms?
      •  How to split profit from the data?
   –  Valuation
      •  How to assess the value of different data sets?
      •  How to promote and regulate data exchange among parties?

Page 26: BDTC2015-新加坡管理大学-朱飞达


Questions

[email protected]