
Social Network Fusion and Mining: A Survey

Jiawei Zhang, IFM Lab

Florida State University, Tallahassee, FL 32311, USA

[email protected]

ABSTRACT

Looking from a global perspective, the landscape of online social networks is highly fragmented. A large number of online social networks have appeared, which can provide users with various types of services. Generally, the information available in these online social networks is of diverse categories, and these networks can formally be represented as heterogeneous social networks (HSNs). Meanwhile, in such an age of online social media, users usually participate in multiple online social networks simultaneously to enjoy more social network services, and these users can act as bridges connecting different networks together. So multiple HSNs not only represent information in a single network, but also fuse information from multiple networks.

Formally, the online social networks sharing common users are named aligned social networks, and the shared users who act like anchors aligning the networks are called anchor users. The heterogeneous information generated by users' social activities in the multiple aligned social networks provides social network practitioners and researchers with the opportunity to study individual users' social behaviors across multiple social platforms simultaneously. This paper presents a comprehensive survey of the latest research works on multiple aligned HSNs under the broad learning setting, covering 5 major research tasks, i.e., network alignment, link prediction, community detection, information diffusion and network embedding.

Keywords

Broad Learning; Heterogeneous Social Networks; Network Alignment; Link Prediction; Community Detection; Information Diffusion; Network Embedding; Data Mining

1. INTRODUCTION

In the real world, a large amount of information can be collected about the same information entities, e.g., products, movies, POIs (points-of-interest) and even human beings, from various sources. These sources are usually of different varieties, like Walmart vs Amazon for commercial products; IMDB vs Rotten Tomatoes for movies; Yelp vs Foursquare for POIs; and various online social media websites vs diverse offline shopping, traveling and living service providers for human beings. Each information source provides a specific signature of the same entity from a unique underlying aspect. However, in many cases, these information sources are separated in different places, and an effective fusion of these different information sources provides an opportunity for researchers and practitioners to understand the entities more comprehensively, which renders broad learning [135; 127; 151] an extremely important learning task.

Broad learning, introduced in [135; 127; 151], is a new type of learning task which focuses on fusing multiple large-scale information sources of diverse varieties together and carrying out synergistic data mining tasks across these fused sources in one unified analysis. Fusing and mining multiple information sources of large volumes and diverse varieties are also fundamental problems in big data studies. Broad learning investigates the principles, methodologies and algorithms for synergistic knowledge discovery across multiple aligned information sources, and evaluates the corresponding benefits. Great challenges exist in broad learning, for the effective fusion of relevant knowledge across different aligned information sources depends not only on the relatedness of these information sources, but also on the target application problems. Broad learning aims at developing general methodologies, which will be shown to work for a diverse set of applications, while the specific parameter settings can be learned for each application from the training data.

Broad learning is a challenging problem. We categorize its main challenges into the following two groups:

• How to Fuse: The data fusion strategy is highly dependent on the data types, and different categories of data may require different fusion methods. For instance, for the fusion of image sources about the same entities, a necessary entity recognition step is required; to combine multiple online social networks, inference of the potential anchor links mapping the shared users across networks will be the key task; meanwhile, to fuse diverse textual data, concept entity extraction and topic modeling can both be potential options. In many cases, the fusion strategy is also correlated with the specific applications to be studied, which may pose extra constraints or requirements on the fusion results. More information about related data fusion strategies for online social networks will be introduced later in Section 4.

• How to Mine: To mine the fused data sources, there also exist many great challenges. In many cases, not all the data sources will be helpful for certain application tasks. For instance, in social community detection, the fused information about users' credit card transactions will have little correlation with the social communities formed by the users. On the other hand, the information diffusion among users is regarded as irrelevant to the information sources depicting the daily commute routes of people in the real world. Among all these fused data sources, picking the useful ones is not an easy task. Several strategies, like feature selection [146], meta path weighting [145; 139], network sampling [128] and information source embedding [125; 135], will be described in the application tasks to be introduced in Sections 5-8 respectively.

In this paper, we will focus on introducing the broad learning research works done based on online social media data. Nowadays, to enjoy more social network services, people are usually involved in multiple online social networks simultaneously, such as Facebook, Twitter and Foursquare [146; 51]. Individuals usually have multiple separate accounts in different social networks, and discovering the correspondence between accounts of the same user (i.e., network alignment or user anchoring) [140; 141; 51; 133; 138; 126] is an interesting problem. What's more, network alignment is also the crucial prerequisite step for many interesting inter-network synergistic knowledge discovery applications, like (1) inter-network link prediction/recommendation [136; 146; 128; 129; 138; 126; 39; 142; 130], (2) mutual community detection [137; 40; 139; 87; 127; 143], (3) cross-platform information diffusion [121; 120; 145], and (4) multiple network synergistic embedding [125; 135]. These application tasks are fundamental problems in social network studies, which together with the network alignment problem form the backbone of the multiple social network broad learning ecosystem.

This paper will cover five strongly correlated research directions in the study of broad learning on multiple online social networks:

• Network Alignment: Users nowadays are usually involved in multiple online social networks simultaneously. Identifying the common users shared by different online social networks can effectively combine these networks together, which will also provide the opportunity to study users' social behaviors from a more comprehensive perspective. Many research works have proposed to align online social networks together by inferring the mappings of the shared users between different networks, which will be introduced in great detail in this paper.

• Link Prediction: Users' friendship connections in different networks have strong correlations. With the social activity data across multiple aligned social networks, we can acquire more comprehensive knowledge about users and their personal social preferences and habits. We will introduce the existing research works on the social link prediction problem across multiple aligned social sites simultaneously.

• Community Detection: Information available across multiple aligned social networks provides more complete signals revealing the social community structures formed by people in the real world. We will introduce the existing research works on community detection with knowledge fused from multiple aligned heterogeneous social networks as the third task.

• Information Diffusion: The formulation of multiple aligned heterogeneous social networks provides researchers with the opportunity to study the information diffusion process across different social sites. The latest research papers on the information diffusion problem across multiple aligned networks will be illustrated as well.

• Network Embedding: Information from other aligned networks can provide complementary information for refining the feature representations of users effectively. In recent years, some research papers have introduced synergistic network embedding across aligned social networks, where knowledge from external networks can effectively be utilized in the representation learning process mutually.

The remainder of this paper is organized as follows. We will first provide the basic terminology definitions in Section 2. Via the anchor links, we will introduce the inter-network meta path concept in Section 3, which will be extensively used in the following sections. The network alignment research papers will be introduced in Section 4. Inter-network link prediction and friend recommendation will be discussed in Section 5. A detailed review of cross-network community detection will be provided in Section 6. Broad learning based information diffusion is introduced in Section 7, and network embedding works are covered in Section 8. Finally, we will illustrate several potential future development directions for broad learning and conclude this paper in Section 9.

2. TERMINOLOGY DEFINITION

Online social networks (OSNs) denote the online platforms which allow people to build social connections with other people who share similar personal or career interests, backgrounds, and real-life connections. Online social networking sites vary greatly, and each category of online social networks can provide a specific type of featured services. For instance, Facebook (https://www.facebook.com) allows users to socialize with each other via making friends, posting text, and sharing photos/videos; Twitter (https://twitter.com) focuses on providing micro-blogging services for users to write/read the latest news and messages; Foursquare (https://foursquare.com) is a location-based social network offering location-oriented services; and Instagram (http://instagram.com) is a photo and video sharing social site among friends or to the public. To enjoy different kinds of social network services simultaneously, users nowadays are usually involved in many of the aforementioned online social networks at the same time, in each of which they will form separate social connections and generate a large amount of social information.

Generally, online social networks can be represented as graphs in mathematics. Besides the users, there usually exist many other types of information entities, like posts, photos, videos and comments, generated by users' online social activities. Information entities in online social networks are extensively connected, and the connections among different types of nodes usually have different physical meanings. The diverse nodes and connections render online social networks a very complex graph structure. Meanwhile, depending on the categories of information entities and connections involved, online social networks can be divided into different types, like homogeneous networks, bipartite networks and heterogeneous networks. To model the phenomenon that users are involved in multiple networks, a new concept called "multiple aligned heterogeneous social networks" [146; 51] has been proposed in recent years.

For networks with simple structures, like homogeneous networks merely involving users and friendship links, the social patterns are usually easy to study. However, for networks with complex structures, like heterogeneous networks, the nodes can be connected by different types of links, which have totally different physical meanings. One general technique for heterogeneous network studies is the "meta path" [98; 146], which depicts certain link-sequence structures connecting nodes, defined based on the network schema. The meta path concept can also be extended to the multiple aligned social network scenario, where it can connect nodes across different social networks.

Given a network $G = (\mathcal{V}, \mathcal{E})$, we can represent the sets of node and link types involved in the network as $\mathcal{N}$ and $\mathcal{R}$ respectively. Based on such information, the social network concept can be formally defined based on the graph concept by adding the mappings indicating the node and link type information.

Definition 1. (Social Networks): Formally, a heterogeneous social network can be represented as $G = (\mathcal{V}, \mathcal{E}, \phi, \psi)$, where $\mathcal{V}$, $\mathcal{E}$ are the sets of nodes and links in the network, and the mappings $\phi : \mathcal{V} \to \mathcal{N}$, $\psi : \mathcal{E} \to \mathcal{R}$ project the nodes and links to their specific types respectively. In many cases, the mappings $\phi$, $\psi$ are omitted, assuming that the node and link types are known by default.

In the following parts of this paper, depending on the categories of information involved in the online social networks, we propose to categorize online social networks into three groups: homogeneous social networks, heterogeneous social networks and aligned heterogeneous social networks. Several important concepts about social networks that will be used throughout this paper are introduced as follows.

2.1 Homogeneous Social Network

Definition 2. (Homogeneous Social Network): For an online social network $G$, if there exists one single type of nodes and links in the network (i.e., $|\mathcal{N}| = |\mathcal{R}| = 1$), then the network is called a homogeneous social network.

Besides the online social networks involving users and friendship links only, many other types of network structures can also be represented as homogeneous networks. Several representative examples include company internal organizational networks involving employees and management relationships, and computer networks involving PCs and their networking connections. Homogeneous networks are one of the simplest network structures, and their analysis provides much basic knowledge for studying networks with more complex structures.

Given a homogeneous social network $G = (\mathcal{V}, \mathcal{E})$ with user set $\mathcal{V}$ and social relationship set $\mathcal{E}$, depending on whether the links in $G$ are directed or undirected, the social links can denote either follow links or friendship links among individuals. Given an individual user $u \in \mathcal{V}$ in an undirected friendship social network, the set of users connected to $u$ can be represented as the friends of user $u$ in the network $G$, denoted as $\Gamma(u) = \{v \,|\, v \in \mathcal{V} \wedge (u, v) \in \mathcal{E}\} \subset \mathcal{V}$. The number of friends that user $u$ has in the network is also called the degree of node $u$, i.e., $|\Gamma(u)|$. Meanwhile, in a directed network $G$, the set of individuals followed by $u$ (i.e., $\Gamma_{out}(u) = \{v \,|\, v \in \mathcal{V} \wedge (u, v) \in \mathcal{E}\}$) is called the set of followees of $u$; and the set of individuals that follow $u$ (i.e., $\Gamma_{in}(u) = \{v \,|\, v \in \mathcal{V} \wedge (v, u) \in \mathcal{E}\}$) is called the set of followers of $u$. The number of users who follow $u$ is called the in-degree of $u$, and the number of users followed by $u$ is called the out-degree of $u$ in the network. Users with large out-degrees are called the hubs [49] in the network, while those with large in-degrees are called the authorities [49] in the network.
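As a quick illustration of these neighborhood and degree notions, the following minimal Python sketch computes followees, followers, in-degree and out-degree on a toy directed network; the networkx library and the toy edges are our own assumptions for illustration, not part of the surveyed works.

```python
import networkx as nx

# A toy directed "follow" network; edge (u, v) means u follows v.
G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])

u = "a"
followees = set(G.successors(u))    # Gamma_out(u): users that u follows
followers = set(G.predecessors(u))  # Gamma_in(u): users who follow u

print(G.out_degree(u), followees)   # out-degree of u and its followee set
print(G.in_degree(u), followers)    # in-degree of u and its follower set
```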

2.2 Heterogeneous Social Network

Definition 3. (Heterogeneous Social Network): For an online social network $G$, if there exist multiple types of nodes or links in the network (i.e., $|\mathcal{N}| > 1$ or $|\mathcal{R}| > 1$), then the network is called a heterogeneous social network.

Most of the graph-structured networks in the real world may contain very complex information involving multiple types of nodes and connections. Representative examples include heterogeneous social networks involving users, posts, check-ins, words and timestamps, as well as the friendship links, write links and contain links among these nodes; bibliographic networks including authors, papers and conferences, and the write, cite, and publish-in links among them; and movie knowledge libraries containing movies, casts, reviewers, reviews and ratings, as well as the complex links among these nodes. The neighbor, degree, hub and authority concepts introduced before for homogeneous networks can be applied to heterogeneous networks as well.

Formally, the online social network mentioned above can be defined as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of nodes and $\mathcal{E}$ represents the set of links in $G$. The node set $\mathcal{V}$ can be divided into several subsets $\mathcal{V} = \mathcal{U} \cup \mathcal{P} \cup \mathcal{L} \cup \mathcal{W} \cup \mathcal{T}$, involving the user nodes, post nodes, location nodes, word nodes and timestamp nodes respectively. The link set $\mathcal{E}$ can be divided into several subsets as well, $\mathcal{E} = \mathcal{E}_{u,u} \cup \mathcal{E}_{u,p} \cup \mathcal{E}_{p,l} \cup \mathcal{E}_{p,w} \cup \mathcal{E}_{p,t}$, containing the links among users, the links between users and posts, and those between posts and location check-ins, words, and timestamps.

In heterogeneous social networks, each node can be connected with a set of nodes belonging to different categories via various types of connections. For example, given a user $u \in \mathcal{U}$, the set of user nodes incident to $u$ via the friend links can be represented as the online friends of $u$, denoted as set $\{v \,|\, v \in \mathcal{U}, (u, v) \in \mathcal{E}_{u,u}\}$; the set of post nodes incident to $u$ via the write links can be represented as the posts written by $u$, denoted as set $\{w \,|\, w \in \mathcal{P}, (u, w) \in \mathcal{E}_{u,p}\}$. The location check-in nodes, word nodes and timestamp nodes are not directly connected to the user node, but via the post nodes we can also obtain the set of locations/words/timestamps that are visited/used/active-at by user $u$ in the network.

Such an indirect connection can be described more clearly by the meta path concept in Section 3.
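To make the typed node and link subsets above concrete, here is a minimal sketch of a heterogeneous network with typed nodes and links, including the indirect user-to-location reachability via post nodes; the networkx representation and the toy identifiers are our own illustrative assumptions.

```python
import networkx as nx

# Toy heterogeneous network: node "type" attributes encode the node type
# set N; edge "type" attributes encode the link type set R.
G = nx.Graph()
G.add_node("u1", type="user"); G.add_node("p1", type="post")
G.add_node("l1", type="location"); G.add_node("w1", type="word")
G.add_edge("u1", "p1", type="write")
G.add_edge("p1", "l1", type="checkin")
G.add_edge("p1", "w1", type="contain")

# Posts written by u1: direct neighbors reached via "write" links.
posts = [v for v in G.neighbors("u1") if G.edges["u1", v]["type"] == "write"]

# Locations visited by u1, reached indirectly through the post nodes.
locations = {l for p in posts for l in G.neighbors(p)
             if G.edges[p, l]["type"] == "checkin"}
print(posts, locations)
```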

2.3 Aligned Heterogeneous Social Networks

Definition 4. (Multiple Aligned Heterogeneous Networks): Formally, the multiple aligned heterogeneous networks involving $n$ networks can be defined as $\mathcal{G} = ((G^{(1)}, G^{(2)}, \cdots, G^{(n)}), (\mathcal{A}^{(1,2)}, \mathcal{A}^{(1,3)}, \cdots, \mathcal{A}^{(n-1,n)}))$, where $G^{(1)}, G^{(2)}, \cdots, G^{(n)}$ denote these $n$ heterogeneous social networks and the sets $\mathcal{A}^{(1,2)}, \mathcal{A}^{(1,3)}, \cdots, \mathcal{A}^{(n-1,n)}$ represent the undirected anchor links aligning these networks respectively.

Anchor links actually refer to the mappings of information entities across different sources which correspond to the same information entity in the real world, e.g., users in online social networks, authors in different bibliographic networks, and movies in the movie knowledge libraries.

Definition 5. (Anchor Link): Given two heterogeneous networks $G^{(i)}$ and $G^{(j)}$ which share some common information entities, the set of anchor links connecting $G^{(i)}$ and $G^{(j)}$ can be represented as set $\mathcal{A}^{(i,j)} = \{(u^{(i)}_m, u^{(j)}_n) \,|\, u^{(i)}_m \in \mathcal{V}^{(i)} \wedge u^{(j)}_n \in \mathcal{V}^{(j)} \wedge u^{(i)}_m, u^{(j)}_n \text{ denote the same information entity}\}$.

The anchor links depict a transitive relationship among the information entities across different networks. Given 3 information entities $u^{(i)}_m$, $u^{(j)}_n$, $u^{(k)}_o$ from networks $G^{(i)}$, $G^{(j)}$ and $G^{(k)}$ respectively, if $u^{(i)}_m$, $u^{(j)}_n$ are connected by an anchor link and $u^{(j)}_n$, $u^{(k)}_o$ are connected by an anchor link, then the pair $u^{(i)}_m$, $u^{(k)}_o$ will be connected by an anchor link by default. For more detailed definitions of other related terms, like anchor users, non-anchor users, full alignment, partial alignment and non-alignment, please refer to [146].

3. META PATH

To deal with social networks, especially heterogeneous social networks, a very useful tool is the meta path [98; 146]. The meta path is a concept defined based on the network schema, outlining the connections among nodes belonging to different categories. For the nodes which are not directly connected, their relationships can be depicted with the meta path concept. In this part, we will define the meta path concept, and introduce a set of meta paths within and across real-world heterogeneous social networks respectively.

3.1 Network Schema

Given a network $G = (\mathcal{V}, \mathcal{E})$, we can define its corresponding network schema to describe the categories of nodes and links involved in $G$.

Definition 6. (Network Schema): Formally, the network schema of network $G$ can be represented as $S_G = (\mathcal{N}, \mathcal{R})$, where $\mathcal{N}$ and $\mathcal{R}$ denote the node type set and link type set of network $G$ respectively.

The network schema provides a meta-level description of networks. Meanwhile, if a network $G$ can be outlined by the network schema $S_G$, $G$ is also called a network instance of the network schema. For a given node $u \in \mathcal{V}$, we can represent its corresponding node type as $\phi(u) = N \in \mathcal{N}$, and call $u$ an instance of node type $N$, which can also be denoted as $u \in N$ for simplicity. Similarly, for a link $(u, v)$, we can denote its link type as $\psi((u, v)) = R \in \mathcal{R}$, or $(u, v) \in R$ for short. The inverse relation $R^{-1}$ denotes a new link type with reversed direction. Generally, $R$ is not equal to $R^{-1}$, unless $R$ is symmetric.

3.2 Meta Path in Heterogeneous Social Networks

The meta path is a concept defined based on the network schema, denoting the correlation of nodes based on the heterogeneous information (i.e., different types of nodes and links) in the networks.

Definition 7. (Meta Path): A meta path $P$ defined based on the network schema $S_G = (\mathcal{N}, \mathcal{R})$ can be represented as $P = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots N_{k-1} \xrightarrow{R_{k-1}} N_k$, where $N_i \in \mathcal{N}, i \in \{1, 2, \cdots, k\}$ and $R_i \in \mathcal{R}, i \in \{1, 2, \cdots, k-1\}$.

Furthermore, depending on the categories of node and link types involved in the meta path, we can specify the meta path concept into several more refined groups, like homogeneous meta paths and heterogeneous meta paths, or social meta paths and other meta paths.

Definition 8. (Homogeneous/Heterogeneous Meta Path): Let $P = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots N_{k-1} \xrightarrow{R_{k-1}} N_k$ denote a meta path defined based on the network schema $S_G = (\mathcal{N}, \mathcal{R})$. If all the node types and link types involved in $P$ are of the same category, $P$ is called a homogeneous meta path; otherwise, $P$ is called a heterogeneous meta path.

Meta paths can connect any kind of node type pair; specifically, the meta paths starting and ending with user node types are called social meta paths.

Definition 9. (Social Meta Path): Let $P = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots N_{k-1} \xrightarrow{R_{k-1}} N_k$ denote a meta path defined based on the network schema $S_G = (\mathcal{N}, \mathcal{R})$. If the starting and ending node types $N_1$ and $N_k$ are both the user node type, $P$ is called a social meta path.

Users are usually the focus in social network studies, and social meta paths are frequently used in both research and real-world applications and services. If all the node types in a meta path are the user node type and the link types are also of a common category, then the meta path is called a homogeneous social meta path. The number of path segments in a meta path is called the meta path length. For instance, the length of meta path $P = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots N_{k-1} \xrightarrow{R_{k-1}} N_k$ is $k - 1$. Meta paths can also be concatenated together with the meta path composition operator.

Definition 10. (Meta Path Composition): Meta paths $P^1 = N^1_1 \xrightarrow{R^1_1} N^1_2 \xrightarrow{R^1_2} \cdots N^1_{k-1} \xrightarrow{R^1_{k-1}} N^1_k$ and $P^2 = N^2_1 \xrightarrow{R^2_1} N^2_2 \xrightarrow{R^2_2} \cdots N^2_{l-1} \xrightarrow{R^2_{l-1}} N^2_l$ can be concatenated together to form a longer meta path $P = P^1 \circ P^2 = N^1_1 \xrightarrow{R^1_1} \cdots \xrightarrow{R^1_{k-1}} N^1_k \xrightarrow{R^2_1} N^2_2 \xrightarrow{R^2_2} \cdots \xrightarrow{R^2_{l-1}} N^2_l$, if the ending node type of $P^1$ is the same as the starting node type of $P^2$, i.e., $N^1_k = N^2_1$. The new composed meta path is of length $k + l - 2$.

Meta path $P = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots N_{k-1} \xrightarrow{R_{k-1}} N_k$ can also be treated as the concatenation of simple meta paths $N_1 \xrightarrow{R_1} N_2$, $N_2 \xrightarrow{R_2} N_3$, $\cdots$, $N_{k-1} \xrightarrow{R_{k-1}} N_k$, which can be represented as $P = R_1 \circ R_2 \circ \cdots \circ R_{k-1}$.

3.3 Meta Path across Aligned Heterogeneous Social Networks

Besides the meta paths within one single heterogeneous network, meta paths can also be defined across multiple aligned heterogeneous networks via the anchor meta paths.

Definition 11. (Anchor Meta Path): Let $G^{(1)}$ and $G^{(2)}$ be two aligned heterogeneous networks sharing common anchor information entities of types $N^{(1)} \in \mathcal{N}^{(1)}$ and $N^{(2)} \in \mathcal{N}^{(2)}$ respectively. The anchor meta path between the schemas of networks $G^{(1)}$ and $G^{(2)}$ can be represented as the meta path $\Phi = N^{(1)} \xleftrightarrow{Anchor} N^{(2)}$ of length 1.

The anchor meta path is the simplest meta path across aligned networks, and a set of inter-network meta paths can be defined based on the intra-network meta paths and the anchor meta path.

Definition 12. (Inter-Network Meta Path): A meta path $\Psi = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots N_{k-1} \xrightarrow{R_{k-1}} N_k$ is called an inter-network meta path between networks $G^{(1)}$ and $G^{(2)}$ iff $\exists m \in \{1, 2, \cdots, k-1\}$, $R_m = Anchor$.

The inter-network meta paths can be viewed as compositions of intra-network meta paths and the anchor meta path via the user node types. An inter-network meta path can be a meta path starting with an anchor meta path followed by intra-network meta paths, or one with anchor meta paths in the middle. Here, we would like to introduce several categories of inter-network meta paths involving the anchor meta paths at different positions, as defined in [146]:

• $\Psi(G^{(1)}, G^{(2)}) = \Phi(G^{(1)}, G^{(2)})$, which denotes the simplest inter-network meta path, composed of the anchor meta path only between networks $G^{(1)}$ and $G^{(2)}$.

• $\Psi(G^{(1)}, G^{(2)}) = \Phi(G^{(1)}, G^{(2)}) \circ P(G^{(2)})$, which denotes the inter-network meta path starting with an anchor meta path and followed by the intra-network social meta path in network $G^{(2)}$.

• $\Psi(G^{(1)}, G^{(2)}) = P(G^{(1)}) \circ \Phi(G^{(1)}, G^{(2)})$, which denotes the inter-network meta path starting with the intra-network social meta path in network $G^{(1)}$ followed by an anchor meta path between networks $G^{(1)}$ and $G^{(2)}$.

• $\Psi(G^{(1)}, G^{(2)}) = P(G^{(1)}) \circ \Phi(G^{(1)}, G^{(2)}) \circ P(G^{(2)})$, which denotes the inter-network meta path starting and ending with the intra-network social meta paths in networks $G^{(1)}$ and $G^{(2)}$ respectively, connected by an anchor meta path between networks $G^{(1)}$ and $G^{(2)}$.

• $\Psi(G^{(1)}, G^{(2)}) = P(G^{(1)}) \circ \Phi(G^{(1)}, G^{(2)}) \circ P(G^{(2)}) \circ \Phi(G^{(2)}, G^{(1)})$, which denotes the inter-network meta path starting and ending with node types in network $G^{(1)}$ and traversing across the networks twice via the anchor meta paths.

• $\Psi(G^{(1)}, G^{(2)}) = P(G^{(1)}) \circ \Phi(G^{(1)}, G^{(2)}) \circ P(G^{(2)}) \circ \Phi(G^{(2)}, G^{(1)}) \circ P(G^{(1)})$, which denotes the inter-network meta path starting and ending with the intra-network social meta paths in network $G^{(1)}$ and traversing across the networks twice via the anchor meta paths between them.

The meta path concepts introduced in this section will be widely used in the various social network broad learning tasks to be introduced later.

4. NETWORK ALIGNMENT

Network alignment is an important research problem, and dozens of papers have been published on this topic in the past decades. Depending on the specific disciplines, the studied networks can be social networks in data mining [140; 141; 51; 133; 138; 126], protein-protein interaction (PPI) networks and gene regulatory networks in bioinformatics [41; 90; 60; 93], chemical compounds in chemistry [95], data schemas in data warehouses [68], ontologies in web semantics [24], graph matching in combinatorial mathematics [66], as well as graphs in computer vision [19; 7].

In bioinformatics, the network alignment problem aims at predicting the best mapping between two biological networks based on the similarity of the molecules and their interaction patterns. By studying the cross-species variations of biological networks, network alignment can be applied to predict conserved functional modules [88] and infer the functions of proteins [76]. Graemlin [30] conducts pairwise network alignment by maximizing an objective function based on a set of learned parameters. Some works have been done on aligning multiple networks in bioinformatics. IsoRank, proposed in [94], can align multiple networks greedily based on pairwise node similarity scores calculated with spectral graph theory. IsoRankN [60] further extends IsoRank by exploiting a spectral clustering scheme in the alignment model.

In recent years, with the rapid development of online social networks, researchers' attention has started to shift to the alignment of social networks. Enlightened by the homogeneous network alignment method in [106], Koutra et al. [54] propose to align two bipartite graphs with a fast alignment algorithm. Zafarani et al. [118] propose to match users across social networks based on various node attributes, e.g., username, typing patterns and language patterns. Kong et al. formulate the heterogeneous social network alignment problem as an anchor link prediction problem: a two-step supervised method, MNA, is proposed in [51] to infer potential anchor links across networks with the heterogeneous information in the networks. However, social networks in the real world are mostly partially aligned, and many users are not anchor users. Zhang et al. have proposed a partial network alignment method specifically in [133].

In building social network alignment models, the anchor links are very expensive to label manually, and obtaining a large-sized anchor link training set can be extremely challenging. In [138], Zhang et al. propose to study the network alignment problem based on the PU (Positive and Unlabeled) learning setting instead, where the model is built based on a small positive set and a large unlabeled set. Furthermore, for the case when no training data is available, via inferring the potential anchor user mappings across networks, Zhang et al. have introduced an unsupervised network alignment model for multiple (more than 2) social networks in [140] and an unsupervised network concurrent alignment model via multiple shared information entities simultaneously in [141].

In this section, we will introduce the social network alignment methods based on the supervised learning, unsupervised learning and semi-supervised learning settings respectively.

4.1 Supervised Network Alignment

Formally, let $G^{(1)} = (\mathcal{V}^{(1)}, \mathcal{E}^{(1)})$ and $G^{(2)} = (\mathcal{V}^{(2)}, \mathcal{E}^{(2)})$ denote two online social networks, where $\mathcal{V}^{(1)}/\mathcal{V}^{(2)}$ and $\mathcal{E}^{(1)}/\mathcal{E}^{(2)}$ denote the sets of nodes and links involved in these two networks respectively. Let set $\mathcal{A}_{train}$ denote the set of labeled anchor links connecting networks $G^{(1)}$ and $G^{(2)}$; we can represent the set of anchor links without known labels as the test set $\mathcal{A}_{test} \subseteq \mathcal{U}^{(1)} \times \mathcal{U}^{(2)} \setminus \mathcal{A}_{train}$.

In the supervised network alignment problem, a set of features will be extracted for the anchor links with the heterogeneous information available across the social networks. Meanwhile, the existing and non-existing anchor links will be labeled as positive and negative instances respectively. Based on the training set $\mathcal{A}_{train}$, we can represent the feature vectors and labels of links in the set as a group of tuples $\{(\mathbf{x}_l, y_l)\}_{l \in \mathcal{A}_{train}}$, where $\mathbf{x}_l$ represents the feature vector extracted for anchor link $l$ and $y_l \in \{-1, +1\}$ denotes its label. Based on the training set, we aim at building a mapping $f : \mathcal{A}_{test} \to \{-1, +1\}$ to determine the labels of the anchor links in the test set. To address the problem, we will take the supervised network alignment model proposed in [51] as an example to illustrate the problem setting and potential solutions.

4.1.1 Anchor Link Feature Extraction

The supervised network alignment model proposed in [51] involves three main phases: (1) feature extraction, (2) classification model building, and (3) network matching. One of the main goals in supervised network alignment is to extract discriminative social features for a pair of user accounts between two disjoint social networks. Intuitively, the social neighbors of each user account can only involve users from the same social network, so two accounts from different networks will have no common neighbors. For example, the neighbors of a Facebook user will only involve other users in Facebook, and have no overlap with his/her neighbors in Twitter (which contains Twitter users only). However, in the anchor link prediction problem, we need to extract a set of features for the anchor links between two different networks, which can be a challenging problem. In the following, we will introduce several social features proposed in [51] specifically for the multi-network setting.

Let $(u^{(1)}_i, u^{(2)}_j)$ be a potential anchor link between these two networks, and $\mathcal{A}^{+}_{train} \subset \mathcal{A}_{train}$ be the set of positively labeled anchor links in the training set. [51] proposes to extend the definitions of some commonly used social features in link prediction, i.e., "common neighbors", "Jaccard's coefficient" and the "Adamic/Adar measure", to extract effective features for these anchor links based on the known anchor links in set $\mathcal{A}^{+}_{train}$.

Extended Common Neighbor

The extended common neighbor measure $ECN(u^{(1)}_i, u^{(2)}_j)$ represents the number of "common" neighbors between $u^{(1)}_i$ in network $G^{(1)}$ and $u^{(2)}_j$ in network $G^{(2)}$. We denote the neighbors of $u^{(1)}_i$ in network $G^{(1)}$ as $\Gamma(u^{(1)}_i)$, and the neighbors of $u^{(2)}_j$ in network $G^{(2)}$ as $\Gamma(u^{(2)}_j)$. It is easy to see that the sets $\Gamma(u^{(1)}_i)$ and $\Gamma(u^{(2)}_j)$ contain users from two different networks respectively, and are thus disjoint without any common entries.

Meanwhile, based on the existing anchor links in $\mathcal{A}^{+}_{train}$, some of the users in $\Gamma(u^{(1)}_i)$ and $\Gamma(u^{(2)}_j)$ can correspond to accounts of the same users in these two networks, who are actually connected by the anchor links in $\mathcal{A}^{+}_{train}$. Based on such an intuition, [51] defines the extended common neighbor measure between these two users as the number of shared anchor users in their respective neighbor sets.

Definition 13. (Extended Common Neighbor): The measure of extended common neighbor is defined as the number of known anchor links between $\Gamma(u^{(1)}_i)$ and $\Gamma(u^{(2)}_j)$:

$$ECN(u^{(1)}_i, u^{(2)}_j) = \left|\left\{(u^{(1)}_p, u^{(2)}_q) \,\middle|\, (u^{(1)}_p, u^{(2)}_q) \in \mathcal{A}^{+}_{train}, u^{(1)}_p \in \Gamma(u^{(1)}_i), u^{(2)}_q \in \Gamma(u^{(2)}_j)\right\}\right| = \left|\Gamma(u^{(1)}_i) \cap_{\mathcal{A}^{+}_{train}} \Gamma(u^{(2)}_j)\right|.$$

Extended Jaccard's Coefficient

[51] also extends the measure of Jaccard's coefficient to the multi-network setting, using a method similar to that for the extended common neighbor. $EJC(u^{(1)}_i, u^{(2)}_j)$ is a normalized version of the extended common neighbor measure, i.e., $ECN(u^{(1)}_i, u^{(2)}_j)$ divided by the total number of distinct users in $\Gamma(u^{(1)}_i) \cup \Gamma(u^{(2)}_j)$.

Definition 14. (Extended Jaccard's Coefficient): Given the neighborhood sets of users $u^{(1)}_i$ and $u^{(2)}_j$ in networks $G^{(1)}$ and $G^{(2)}$ respectively, the extended Jaccard's coefficient of the user pair $u^{(1)}_i$ and $u^{(2)}_j$ can be represented as

$$EJC(u^{(1)}_i, u^{(2)}_j) = \frac{\left|\Gamma(u^{(1)}_i) \cap_{\mathcal{A}^{+}_{train}} \Gamma(u^{(2)}_j)\right|}{\left|\Gamma(u^{(1)}_i) \cup_{\mathcal{A}^{+}_{train}} \Gamma(u^{(2)}_j)\right|},$$

where

$$\left|\Gamma(u^{(1)}_i) \cup_{\mathcal{A}^{+}_{train}} \Gamma(u^{(2)}_j)\right| = |\Gamma(u^{(1)}_i)| + |\Gamma(u^{(2)}_j)| - \left|\Gamma(u^{(1)}_i) \cap_{\mathcal{A}^{+}_{train}} \Gamma(u^{(2)}_j)\right|.$$

Extended Adamic/Adar Index

Similarly, [51] also extends the Adamic/Adar measure to the multi-network setting, where the common neighbors are weighted by their average degrees in both social networks.

Definition 15. (Extended Adamic/Adar Index): The extended Adamic/Adar index of the user pair $u^{(1)}_i$ and $u^{(2)}_j$ across networks can be represented as

$$EAA(u^{(1)}_i, u^{(2)}_j) = \sum_{(u^{(1)}_p, u^{(2)}_q) \in \Gamma(u^{(1)}_i) \cap_{\mathcal{A}^{+}_{train}} \Gamma(u^{(2)}_j)} \log^{-1}\left(\frac{|\Gamma(u^{(1)}_p)| + |\Gamma(u^{(2)}_q)|}{2}\right).$$

In the EAA definition, for a common neighbor shared by $u^{(1)}_i$ and $u^{(2)}_j$, its degree is defined as the average of its degrees in networks $G^{(1)}$ and $G^{(2)}$. Considering that different networks are of different scales (one network can be far larger than the other), the node degree measure can be dominated by the degree in the larger network. Some other weighted forms of the degree measure, like $\alpha \cdot |\Gamma(u^{(1)}_p)| + (1 - \alpha) \cdot |\Gamma(u^{(2)}_q)|$ ($\alpha \in [0, 1]$), can be applied to replace $\frac{|\Gamma(u^{(1)}_p)| + |\Gamma(u^{(2)}_q)|}{2}$ in the definition.

In addition to the social features mentioned above, heterogeneous social networks also involve abundant information about where, when and what. A number of features exploiting the spatial, temporal and text content information can also be extracted to facilitate anchor link prediction, which have been introduced in detail in [51].
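As a concrete illustration of Definitions 13-15, the sketch below computes the three extended features for one candidate pair. The dict-of-neighbor-sets representation and the guard against degenerate logarithms are our own assumptions, not part of [51].

```python
import math

def extended_features(nbrs1, nbrs2, u1, u2, anchors_pos):
    """ECN, EJC and EAA (Definitions 13-15) for a candidate pair (u1, u2).

    nbrs1 / nbrs2: dicts mapping each user to his/her neighbor set in
    G1 / G2; anchors_pos: set of known positive anchor links (a1, a2).
    """
    g1, g2 = nbrs1[u1], nbrs2[u2]
    # Known anchor links whose two endpoints fall inside the neighborhoods.
    shared = [(a1, a2) for (a1, a2) in anchors_pos if a1 in g1 and a2 in g2]
    ecn = len(shared)
    union = len(g1) + len(g2) - ecn          # extended union size
    ejc = ecn / union if union else 0.0
    eaa = 0.0
    for a1, a2 in shared:
        avg_deg = (len(nbrs1[a1]) + len(nbrs2[a2])) / 2
        if avg_deg > 1:                      # guard: log(1) = 0
            eaa += 1.0 / math.log(avg_deg)
    return ecn, ejc, eaa
```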

4.1.2 Anchor Link Model Building

Given the multiple aligned social networks, via manual labeling, the sets of identified existing and non-existing anchor links can be denoted as $\mathcal{A}^{+}_{train}$ and $\mathcal{A}^{-}_{train}$ respectively. The anchor links in sets $\mathcal{A}^{+}_{train}$ and $\mathcal{A}^{-}_{train}$ are assigned the positive and negative labels in $\{-1, +1\}$ respectively, depending on whether they exist or not. For instance, given a link $l \in \mathcal{A}^{+}_{train}$, it will be associated with a positive label, i.e., $y_l = +1$; while if link $l \in \mathcal{A}^{-}_{train}$, it will be associated with a negative label, $y_l = -1$. With the information in these aligned heterogeneous social networks, the set of features introduced in the previous subsection can be extracted for the links in sets $\mathcal{A}^{+}_{train}$ and $\mathcal{A}^{-}_{train}$. For instance, for a link $l$ in the training set $\mathcal{A}^{+}_{train}$ (or $\mathcal{A}^{-}_{train}$), we can represent its feature vector as $\mathbf{x}_l$, which will be called an anchor link instance, and each feature is an attribute of the anchor link. With these anchor link instances and their labels, a classification model, like an SVM (support vector machine), a decision tree, or a neural network, can be trained. Meanwhile, in the test procedure, for each link $l$ in the test set $\mathcal{A}_{test}$, a similar set of features (or attributes) can be extracted, denoted as its feature vector $\mathbf{x}_l$. However, its label is unknown, and the main objective of Step (2) is to determine whether each potential anchor link in set $\mathcal{A}_{test}$ exists or not (i.e., whether its label is positive or negative). By applying the trained classifier to the feature vector of the anchor link, we will obtain a predicted label, which will be returned as the result of Step (2).

4.1.3 Network Matching

However, in the inference process, the predictions of the binary classifier cannot be directly used as anchor links due to the following issues:

• The inference of conventional classifiers is designed for constraint-free settings, and the one-to-one constraint [51; 126] on anchor links may not necessarily hold in the label predictions of the classifier (SVM).

• Most classifiers also produce output scores, which can be used to rank the data points in the test set. However, these ranking scores are uncalibrated in scale for the anchor link prediction task, and previous classifier calibration methods [117] apply only to classification problems without any constraint.

In order to tackle the above issues, [51] introduces an inference process, called MNA (Multi-Network Anchoring), to infer anchor links based upon the ranking scores of the classifier. This model is motivated by the stable marriage problem [26] in mathematics.

[Figure 1: An example of anchor link inference by different methods. (a) the input ranking scores; (b) link prediction; (c) maximizing the sum of weights (one-to-one constrained); (d) the MNA method.]

We first use a toy example in Figure 1 to illustrate the main idea of MNA. Suppose in Figure 1(a), we are given the ranking scores from the classifier for the 4 user pairs between two networks (i.e., network $G^{(1)}$ and network $G^{(2)}$). We can see in Figure 1(b) that link prediction methods with a fixed threshold may not be able to predict well, because the predicted links do not satisfy the one-to-one constraint: one user account in network $G^{(1)}$ can be linked with multiple accounts in network $G^{(2)}$. In Figure 1(c), weighted maximum matching methods can find a set of links with the maximum sum of weights. However, it is worth noting that the input scores are uncalibrated, so the maximum weight matching may not be a good solution for anchor link prediction problems. The input scores only indicate the ranking of different user pairs, i.e., the preference relationship among different user pairs.

Here we say "node x prefers node y over node z" if the score of pair (x, y) is larger than the score of pair (x, z). For example, in Figure 1(c), the weight of pair a, i.e., Score(a) = 0.8, is larger than Score(c) = 0.6. It shows that user $u^{(1)}_1$ (the first user in network $G^{(1)}$) prefers $u^{(2)}_1$ over $u^{(2)}_2$. The problem with the prediction result in Figure 1(c) is that the pair $(u^{(1)}_1, u^{(2)}_1)$ should be more likely to be an anchor link due to the following reasons: (1) $u^{(1)}_1$ prefers $u^{(2)}_1$ over $u^{(2)}_2$; (2) $u^{(2)}_1$ also prefers $u^{(1)}_1$ over $u^{(1)}_2$.

By following such an intuition, we can obtain the final stable matching result in Figure 1(d), where anchor links $(u^{(1)}_1, u^{(2)}_1)$ and $(u^{(1)}_2, u^{(2)}_2)$ are selected in the matching process.

Definition 16. (Matching): Mapping $\mu : \mathcal{U}^{(1)} \cup \mathcal{U}^{(2)} \to \mathcal{U}^{(1)} \cup \mathcal{U}^{(2)}$ is defined to be a matching iff (1) $|\mu(u_i)| = 1, \forall u_i \in \mathcal{U}^{(1)}$ and $\mu(u_i) \in \mathcal{U}^{(2)}$; (2) $|\mu(v_j)| = 1, \forall v_j \in \mathcal{U}^{(2)}$ and $\mu(v_j) \in \mathcal{U}^{(1)}$; (3) $\mu(u_i) = v_j$ iff $\mu(v_j) = u_i$.

Algorithm 1 Multi-Network Stable Matching

Input: two heterogeneous social networks $G^s$ and $G^t$; a set of known anchor links $\mathcal{A}$
Output: a set of inferred anchor links $\mathcal{A}'$
1: Construct a training set of user account pairs with known labels using $\mathcal{A}$.
2: For each pair $(u^s_i, u^t_j)$, extract four types of features.
3: Train classification model $C$ on the training set.
4: Perform classification using model $C$ on the test set.
5: For each unlabeled user account, sort the ranking scores into a preference list of the matching accounts.
6: Initialize all unlabeled $u^s_i$ in $G^s$ and $u^t_j$ in $G^t$ as free
7: $\mathcal{A}' = \emptyset$
8: while $\exists$ free $u^s_i$ in $G^s$ whose preference list is non-empty do
9:   Remove the top-ranked account $u^t_j$ from $u^s_i$'s preference list
10:  if $u^t_j$ is free then
11:    $\mathcal{A}' = \mathcal{A}' \cup \{(u^s_i, u^t_j)\}$
12:    Set $u^s_i$ and $u^t_j$ as occupied
13:  else
14:    $\exists u^s_p$ that $u^t_j$ is occupied with.
15:    if $u^t_j$ prefers $u^s_i$ to $u^s_p$ then
16:      $\mathcal{A}' = (\mathcal{A}' - \{(u^s_p, u^t_j)\}) \cup \{(u^s_i, u^t_j)\}$
17:      Set $u^s_p$ as free and $u^s_i$ as occupied
18:    end if
19:  end if
20: end while

Definition 17. (Blocking Pair): A pair $(u^{(1)}_i, u^{(2)}_j)$ is a blocking pair iff $u^{(1)}_i$ and $u^{(2)}_j$ both prefer each other over their current assignments respectively in the predicted set of anchor links $\mathcal{A}'$.

Definition 18. (Stable Matching): An inferred anchor link set $\mathcal{A}'$ is stable if there is no blocking pair.

Based on the result from the previous step, the MNA method introduced in [51] formulates the anchor link pruning problem as a stable matching problem between user accounts in network $G^{(1)}$ and accounts in network $G^{(2)}$. Assume that we have two sets of unlabeled user accounts, i.e., $\mathcal{U}^{(1)}$ in network $G^{(1)}$ and $\mathcal{U}^{(2)}$ in network $G^{(2)}$. Each user $u^{(1)}_i$ has a ranking list or preference list $P(u^{(1)}_i)$ over all the user accounts in network $G^{(2)}$ ($u^{(2)}_j \in \mathcal{U}^{(2)}$) based upon the input scores of different pairs. For example, in Figure 1(a), the preference list of node $u^{(1)}_1$ is $P(u^{(1)}_1) = (u^{(2)}_1 > u^{(2)}_2)$, indicating that node $u^{(2)}_1$ is preferred by $u^{(1)}_1$ over $u^{(2)}_2$. The preference list of node $u^{(1)}_2$ is also $P(u^{(1)}_2) = (u^{(2)}_1 > u^{(2)}_2)$. Similarly, we also build a preference list for each user account in network $G^{(2)}$. In Figure 1(a), $P(u^{(2)}_1) = P(u^{(2)}_2) = (u^{(1)}_1 > u^{(1)}_2)$.

The proposed MNA method for anchor link prediction is shown in Algorithm 1. In each iteration, MNA first randomly selects a free user account $u^{(1)}_i$ from network $G^{(1)}$. Then MNA gets the user node $u^{(2)}_j$ most preferred by $u^{(1)}_i$ in its preference list $P(u^{(1)}_i)$. The most preferred user $u^{(2)}_j$ will be removed from the preference list, i.e., $P(u^{(1)}_i) = P(u^{(1)}_i) - \{u^{(2)}_j\}$. If $u^{(2)}_j$ is also a free account, MNA will add the pair of accounts $(u^{(1)}_i, u^{(2)}_j)$ into the current solution set $\mathcal{A}'$. Otherwise, $u^{(2)}_j$ is already occupied with some $u^{(1)}_p$ in $\mathcal{A}'$. MNA then examines the preference of $u^{(2)}_j$. If $u^{(2)}_j$ also prefers $u^{(1)}_i$ over $u^{(1)}_p$, it means that the pair $(u^{(1)}_i, u^{(2)}_j)$ is a blocking pair. MNA removes the blocking pair by replacing the pair $(u^{(1)}_p, u^{(2)}_j)$ in the solution set $\mathcal{A}'$ with the pair $(u^{(1)}_i, u^{(2)}_j)$. Otherwise, if $u^{(2)}_j$ prefers $u^{(1)}_p$ over $u^{(1)}_i$, MNA will start the next iteration to reach the next free node in network $G^{(1)}$. The algorithm stops when all the users in network $G^{(1)}$ are occupied, or all the preference lists of free accounts in network $G^{(1)}$ are empty.
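The matching loop of Algorithm 1 (lines 6-20) is essentially one-sided Gale-Shapley deferred acceptance over the classifier scores. Below is a minimal Python sketch of that pruning step under our own simplifying assumptions (scores given as a dict, both sides' preferences derived from the same scores); it is not the authors' implementation.

```python
def stable_matching(scores):
    """Greedy stable-matching pruning in the spirit of Algorithm 1 (MNA).

    scores: dict mapping candidate pair (u1, u2) -> classifier ranking score.
    Returns a one-to-one set of anchor links with no blocking pair.
    """
    users1 = {u1 for u1, _ in scores}
    # Preference lists sorted by descending score.
    pref = {u1: sorted((v for a, v in scores if a == u1),
                       key=lambda v: -scores[(u1, v)]) for u1 in users1}
    match = {}                       # u2 -> currently matched u1
    free = list(users1)
    while free:
        u1 = free.pop()
        while pref[u1]:
            u2 = pref[u1].pop(0)     # u1's most preferred remaining account
            if u2 not in match:
                match[u2] = u1
                break
            # u2 keeps the higher-scoring partner; the displaced u1 re-enters.
            if scores[(u1, u2)] > scores[(match[u2], u2)]:
                free.append(match[u2])
                match[u2] = u1
                break
    return {(u1, u2) for u2, u1 in match.items()}

# The toy scores of Figure 1(a); the result is {("u1","v1"), ("u2","v2")}.
print(stable_matching({("u1", "v1"): 0.8, ("u1", "v2"): 0.6,
                       ("u2", "v1"): 0.4, ("u2", "v2"): 0.1}))
```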

Finally, the selected anchor links in set $\mathcal{A}'$ will be returned as the final positive instances, while the remaining ones in the test set $\mathcal{A}_{test}$ will be labeled as negative instances. Another variant of the supervised network alignment model has been proposed in [133], which adds an extra threshold on the user preference lists to make the matching algorithm applicable to non-anchor users as well.

4.2 Pairwise Unsupervised Homogeneous Network Alignment

In this part, we will study the network alignment problem based on the unsupervised learning setting, which needs no labeled training data. Given two heterogeneous online social networks, represented as $G^{(1)} = (\mathcal{V}^{(1)}, \mathcal{E}^{(1)})$ and $G^{(2)} = (\mathcal{V}^{(2)}, \mathcal{E}^{(2)})$ respectively, the unsupervised network alignment problem aims at inferring the anchor links between networks $G^{(1)}$ and $G^{(2)}$. Let $\mathcal{U}^{(1)} \subset \mathcal{V}^{(1)}$ and $\mathcal{U}^{(2)} \subset \mathcal{V}^{(2)}$ be the user sets in these two networks respectively; we can represent the set of potential anchor links between networks $G^{(1)}$ and $G^{(2)}$ as $\mathcal{A} = \mathcal{U}^{(1)} \times \mathcal{U}^{(2)}$. In the unsupervised network alignment problem, among all the potential anchor links in set $\mathcal{A}$, we want to infer which ones exist in the real world.

Given two homogeneous networks $G^{(1)}$ and $G^{(2)}$, mapping the nodes between them is an extremely challenging task, which is also called the graph isomorphism problem [82; 31]. Graph isomorphism is known to be in NP, but it is still unknown whether it belongs to P or is NP-complete. So far, no efficient algorithm exists that can address the problem in polynomial time. In this part, we will introduce several heuristics-based methods to solve the pairwise homogeneous network alignment problem.

4.2.1 Heuristic Measure based Network Alignment Model

The information generated by users' online social activities can indicate their personal characteristics. The features introduced in the previous subsection, like ECN, EJC and EAA based on social connection information, similarity/distance measures based on location check-in information, temporal activity closeness, and text word usage similarity, can all be used as predictors indicating whether a cross-network user pair corresponds to the same user or not. Besides these measures, in this part we will introduce a category of new measures, the Relative Centrality Difference (RCD), which can also be applied to solve the unsupervised network alignment problem.

The centrality concept can denote the importance of users in online social networks. Here, we assume that important users in one social network (like celebrities, movie stars and politicians) will be important in other networks as well. Based on such an assumption, the centrality of users in different networks can be an important signal for inferring the anchor links across networks.

Definition 19. (Relative Centrality Difference): Given two users $u^{(1)}_i$, $u^{(2)}_j$ from networks $G^{(1)}$ and $G^{(2)}$ respectively, let $C(u^{(1)}_i)$ and $C(u^{(2)}_j)$ denote the centrality scores of the users; we can define the relative centrality difference (RCD) as

$$RCD(u^{(1)}_i, u^{(2)}_j) = \left(1 + \frac{|C(u^{(1)}_i) - C(u^{(2)}_j)|}{\left(C(u^{(1)}_i) + C(u^{(2)}_j)\right)/2}\right)^{-1}. \quad (9)$$

Depending on the centrality measures applied, different types of relative centrality difference measures can be defined. For instance, if we use node degree as the centrality measure, the relative degree difference can be represented as

$$RDD(u^{(1)}_i, u^{(2)}_j) = \left(1 + \frac{|D(u^{(1)}_i) - D(u^{(2)}_j)|}{\left(D(u^{(1)}_i) + D(u^{(2)}_j)\right)/2}\right)^{-1}. \quad (10)$$

Meanwhile, if the PageRank scores of the nodes are used to define their centrality, we can represent the relative centrality difference measure as

$$RCD(u^{(1)}_i, u^{(2)}_j) = \left(1 + \frac{|S(u^{(1)}_i) - S(u^{(2)}_j)|}{\left(S(u^{(1)}_i) + S(u^{(2)}_j)\right)/2}\right)^{-1}. \quad (11)$$

In the above equations, $D(u)$ and $S(u)$ denote the node degree and PageRank score of node $u$ within each network respectively.
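The sketch below computes degree-based and PageRank-based RCD scores for a candidate cross-network pair; the networkx toolkit and the random toy graphs are our own assumptions for illustration. RCD equals 1 for identical centrality scores and decreases toward 1/3 as the scores diverge.

```python
import networkx as nx

def rcd(c1, c2):
    """Relative centrality difference, Eq. (9); assumes positive scores."""
    return 1.0 / (1.0 + abs(c1 - c2) / ((c1 + c2) / 2.0))

# Toy stand-ins for two aligned networks G(1) and G(2).
G1 = nx.gnp_random_graph(50, 0.1, seed=1)
G2 = nx.gnp_random_graph(50, 0.1, seed=2)

u, v = 0, 0  # a candidate cross-network user pair
print(rcd(G1.degree(u), G2.degree(v)))              # RDD, Eq. (10)
print(rcd(nx.pagerank(G1)[u], nx.pagerank(G2)[v]))  # PageRank RCD, Eq. (11)
```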

4.2.2 IsoRank

The IsoRank model [94], initially proposed to align biomedical networks, like protein-protein interaction (PPI) networks and gene expression networks, can be used to solve the unsupervised social network alignment problem as well. The IsoRank algorithm has two stages. It first associates a score with each possible anchor link between nodes of the two networks. For instance, we can denote $r(u^{(1)}_i, u^{(2)}_j)$ as the reliability score of a potential anchor link $(u^{(1)}_i, u^{(2)}_j)$ between the networks $G^{(1)}$ and $G^{(2)}$, and all such scores can be organized into a vector $\mathbf{r}$ of length $|\mathcal{U}^{(1)}| \times |\mathcal{U}^{(2)}|$. In the second stage, IsoRank constructs the mapping for the networks by extracting high-scoring pairs from $\mathbf{r}$.

Definition 20. (Reliability Score): The reliability score $r(u^{(1)}_i, u^{(2)}_j)$ of anchor link $(u^{(1)}_i, u^{(2)}_j)$ is highly correlated with the support provided by the mapping scores of the neighborhoods of users $u^{(1)}_i$ and $u^{(2)}_j$. Therefore, we can define the score $r(u^{(1)}_i, u^{(2)}_j)$ as

$$r(u^{(1)}_i, u^{(2)}_j) = \sum_{u^{(1)}_m \in \Gamma(u^{(1)}_i)} \sum_{u^{(2)}_n \in \Gamma(u^{(2)}_j)} \frac{1}{|\Gamma(u^{(1)}_i)||\Gamma(u^{(2)}_j)|} \, r(u^{(1)}_m, u^{(2)}_n), \quad (12)$$

where the sets $\Gamma(u^{(1)}_i)$ and $\Gamma(u^{(2)}_j)$ represent the neighborhoods of users $u^{(1)}_i$ and $u^{(2)}_j$ respectively in networks $G^{(1)}$ and $G^{(2)}$.

If the networks are weighted, and each intra-network connection like $(u^{(1)}_i, u^{(1)}_m)$ is associated with a weight $w(u^{(1)}_i, u^{(1)}_m)$, we can represent the reliability measure $r(u^{(1)}_i, u^{(2)}_j)$ in the weighted networks as

$$r(u^{(1)}_i, u^{(2)}_j) = \sum_{u^{(1)}_m \in \Gamma(u^{(1)}_i)} \sum_{u^{(2)}_n \in \Gamma(u^{(2)}_j)} w(u^{(1)}_i, u^{(2)}_j) \, r(u^{(1)}_m, u^{(2)}_n), \quad (14)$$

where the weight term

$$w(u^{(1)}_i, u^{(2)}_j) = \frac{w(u^{(1)}_i, u^{(1)}_m) \, w(u^{(2)}_j, u^{(2)}_n)}{\sum_{u^{(1)}_p \in \Gamma(u^{(1)}_i)} w(u^{(1)}_i, u^{(1)}_p) \sum_{u^{(2)}_q \in \Gamma(u^{(2)}_j)} w(u^{(2)}_j, u^{(2)}_q)}. \quad (15)$$

As we can see, Equation 12 is a special case of Equation 14 with link weight $w(u^{(1)}_i, u^{(1)}_m) = 1$ for all intra-network links. Equation 12 can also be rewritten in linear algebra as

$$\mathbf{r} = \mathbf{A} \mathbf{r}, \quad (17)$$

where the matrix $\mathbf{A} \in \mathbb{R}^{|\mathcal{U}^{(1)}||\mathcal{U}^{(2)}| \times |\mathcal{U}^{(1)}||\mathcal{U}^{(2)}|}$ has entries

$$\mathbf{A}\big((i,j),(p,q)\big) = \begin{cases} \dfrac{1}{|\Gamma(u^{(1)}_i)||\Gamma(u^{(2)}_j)|}, & \text{if } (u^{(1)}_i, u^{(1)}_p) \in \mathcal{E}^{(1)} \text{ and } (u^{(2)}_j, u^{(2)}_q) \in \mathcal{E}^{(2)}, \\ 0, & \text{otherwise.} \end{cases} \quad (18)$$

The matrix $\mathbf{A}$ is of dimension $|\mathcal{U}^{(1)}||\mathcal{U}^{(2)}| \times |\mathcal{U}^{(1)}||\mathcal{U}^{(2)}|$, where the row and column indexes correspond to different potential anchor links across the networks, and the entry $\mathbf{A}((i,j),(p,q))$ relates the anchor links $(u^{(1)}_i, u^{(2)}_j)$ and $(u^{(1)}_p, u^{(2)}_q)$. As we can see, the above equation denotes a random walk across the graphs $G^{(1)}$ and $G^{(2)}$ via the social links and anchor links in them. The solution to the above equation is the principal eigenvector of the matrix $\mathbf{A}$ corresponding to the eigenvalue 1. For more information about the random walk model, please refer to [94].
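Equation 17 can be solved by power iteration. The sketch below does this for the unweighted case of Equation 12, keeping the scores as a matrix instead of materializing the huge matrix $\mathbf{A}$; numpy and the normalization details are our own assumptions for a minimal illustration, not the full IsoRank pipeline.

```python
import numpy as np

def isorank_scores(A1, A2, iters=100):
    """Power iteration for r = A r (Eq. 17) on two unweighted networks.

    A1, A2: binary adjacency matrices of G(1) and G(2).
    Returns R with R[i, j] approximating r(u_i^(1), u_j^(2)).
    """
    # Row-normalize so each update averages over neighbor pairs, matching
    # the 1 / (|Gamma(u_i)| |Gamma(u_j)|) factor of Eq. 12.
    P1 = A1 / np.maximum(A1.sum(axis=1, keepdims=True), 1)
    P2 = A2 / np.maximum(A2.sum(axis=1, keepdims=True), 1)
    R = np.full((A1.shape[0], A2.shape[0]), 1.0)
    for _ in range(iters):
        R = P1 @ R @ P2.T   # one step of the cross-network random walk
        R /= R.sum()        # rescale: r is only defined up to a constant
    return R
```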

4.2.3 IsoRankN

The IsoRankN algorithm [60] is an extension of IsoRank. Based on the learning results of IsoRank, IsoRankN further adopts a spectral clustering method on the induced graph of pairwise alignment scores to achieve the final alignment results, which provides significant advantages over both the original IsoRank and other alignment methods. IsoRankN has 4 main steps: (1) initial network alignment with IsoRank, (2) star spread, (3) spectral partition, and (4) star merging, where steps (3) and (4) repeat until all the nodes are assigned to a cluster.

Initial Network Alignment: Given $k$ isolated networks $G^{(1)}, G^{(2)}, \cdots, G^{(k)}$, IsoRankN computes the local alignment scores of node pairs across networks with the IsoRank algorithm. For instance, if the networks are unweighted, the alignment score between nodes $u^{(i)}_l$ and $u^{(j)}_m$ of networks $G^{(i)}$ and $G^{(j)}$ can be denoted as

$r(u^{(i)}_l, u^{(j)}_m) = \sum_{u^{(i)}_p \in \Gamma(u^{(i)}_l)} \sum_{u^{(j)}_q \in \Gamma(u^{(j)}_m)} \frac{1}{|\Gamma(u^{(i)}_l)|\,|\Gamma(u^{(j)}_m)|}\, r(u^{(i)}_p, u^{(j)}_q).$ (20)

This will lead to a weighted $k$-partite graph, where the links denote the anchor links across networks, weighted by the scores calculated above. If the networks $G^{(1)}, \cdots, G^{(k)}$ were all complete graphs, the alignment results would be the maximum weighted cliques. However, in the real world, such an assumption can hardly be met, and IsoRankN proposes to use the "Star Spread" technique to select a subgraph with high weights.

Star Spread: For each node in a network, e.g., $u^{(i)}_l$ in network $G^{(i)}$, the set of nodes connected with $u^{(i)}_l$ via potential anchor links can be denoted as the set $\Gamma(u^{(i)}_l)$. The nodes in $\Gamma(u^{(i)}_l)$ can be further pruned by removing the nodes connected with weak anchor links, where "weak" denotes the anchor links with a low score calculated by IsoRank. Formally, among all the nodes in $\Gamma(u^{(i)}_l)$, we can denote the node connected to $u^{(i)}_l$ with the strongest link as $v^* = \arg\max_{v \in \Gamma(u^{(i)}_l)} r(u^{(i)}_l, v)$. All the nodes with weights lower than $\beta \cdot r(u^{(i)}_l, v^*)$ will be removed from $\Gamma(u^{(i)}_l)$ (where $\beta$ is a threshold parameter), and the remaining nodes together with $u^{(i)}_l$ will form a star-structured graph $S_{u^{(i)}_l}$.

Spectral Partition: For each node $u^{(i)}_l$, IsoRankN aims at selecting a subgraph $S^*_{u^{(i)}_l}$ from $S_{u^{(i)}_l}$, which contains the highly weighted neighbors of $u^{(i)}_l$. To achieve such an objective, IsoRankN proposes to identify a subgraph with low conductance from $S_{u^{(i)}_l}$ instead. Formally, given a network $G = (\mathcal{V}, \mathcal{E})$, let $\mathcal{S} \subset \mathcal{V}$ denote a node subset of $G$. The conductance of the subgraph involving $\mathcal{S}$ can be represented as

$\phi(\mathcal{S}) = \frac{\sum_{u \in \mathcal{S}} \sum_{v \in \bar{\mathcal{S}}} w_{u,v}}{\min(vol(\mathcal{S}), vol(\bar{\mathcal{S}}))},$ (22)

where $\bar{\mathcal{S}} = \mathcal{V} \setminus \mathcal{S}$ and $vol(\mathcal{S}) = \sum_{u \in \mathcal{S}} \sum_{v \in \mathcal{V}} w_{u,v}$. IsoRankN points out that a node subset $\mathcal{S}$ containing node $u^{(i)}_l$ can be computed effectively and efficiently with the personalized PageRank algorithm starting from node $u^{(i)}_l$.

Star Merging: Considering that the links in the star graph $S^*_{u^{(i)}_l}$ are all anchor links across networks, there exist no intra-network links (e.g., links inside network $G^{(i)}$ only) in $S^*_{u^{(i)}_l}$ at all. However, in many cases, there may exist multiple nodes corresponding to the same entity inside one network as well. To solve such a problem, IsoRankN proposes a star merging step to combine several star graphs, e.g., $S^*_{u^{(i)}_l}$ and $S^*_{u^{(j)}_m}$, together.

Formally, given two star graphs $S^*_{u^{(i)}_l}$ and $S^*_{u^{(j)}_m}$, if the following conditions both hold, $S^*_{u^{(i)}_l}$ and $S^*_{u^{(j)}_m}$ can be merged into one star graph:

$\forall v \in S^*_{u^{(j)}_m} \setminus \{u^{(j)}_m\}: \; r(v, u^{(i)}_l) \geq \beta \cdot \max_{v' \in \Gamma(u^{(i)}_l)} r(v', u^{(i)}_l),$ (23)

$\forall v \in S^*_{u^{(i)}_l} \setminus \{u^{(i)}_l\}: \; r(v, u^{(j)}_m) \geq \beta \cdot \max_{v' \in \Gamma(u^{(j)}_m)} r(v', u^{(j)}_m).$ (24)

4.2.4 Matrix Inference based Network Alignment

Formally, given a homogeneous network $G^{(1)}$, its structure can be organized as the adjacency matrix $\mathbf{A}_{G^{(1)}} \in \mathbb{R}^{|\mathcal{U}^{(1)}| \times |\mathcal{U}^{(1)}|}$. If network $G^{(1)}$ is unweighted, then matrix $\mathbf{A}_{G^{(1)}}$ will be a binary matrix, and entry $\mathbf{A}_{G^{(1)}}(i, p) = 1$ (or $\mathbf{A}_{G^{(1)}}(u^{(1)}_i, u^{(1)}_p) = 1$) iff the corresponding social link $(u^{(1)}_i, u^{(1)}_p)$ exists. In the case that the network is weighted, the entry $\mathbf{A}_{G^{(1)}}(i, p)$ denotes the weight of link $(u^{(1)}_i, u^{(1)}_p)$, and equals 0 if $(u^{(1)}_i, u^{(1)}_p)$ doesn't exist. In a similar way, we can also represent the social adjacency matrix $\mathbf{A}_{G^{(2)}}$ for network $G^{(2)}$.

The network alignment problem aims at inferring a one-to-one node mapping function that can project nodes from one network to the other. For instance, we can denote the mapping from network $G^{(1)}$ to $G^{(2)}$ as $f: \mathcal{U}^{(1)} \to \mathcal{U}^{(2)}$. Via the mapping $f$, besides the nodes, the network structure can be projected across networks as well. For instance, given a social connection $(u^{(1)}_i, u^{(1)}_p)$ in $G^{(1)}$, we can represent its corresponding connection in $G^{(2)}$ as $(f(u^{(1)}_i), f(u^{(1)}_p))$.

Via the mapping $f$, we can denote the network structure difference between $G^{(1)}$ and $G^{(2)}$ as the summation of the link projection differences between them:

$\mathcal{L}(G^{(1)}, G^{(2)}, f) = \sum_{u^{(1)}_i \in \mathcal{U}^{(1)}} \sum_{u^{(1)}_p \in \mathcal{U}^{(1)}} \Big(\mathbf{A}_{G^{(1)}}(u^{(1)}_i, u^{(1)}_p) - \mathbf{A}_{G^{(2)}}(f(u^{(1)}_i), f(u^{(1)}_p))\Big)^2.$ (25)

Formally, the one-to-one projection can be represented as a matrix $\mathbf{P}$ as well, where entry $\mathbf{P}(i, j) = 1$ iff anchor link $(u^{(1)}_i, u^{(2)}_j)$ exists between networks $G^{(1)}$ and $G^{(2)}$. Via the matrix $\mathbf{P}$, we can represent the above loss term as

$\mathcal{L}(\mathbf{A}_{G^{(1)}}, \mathbf{A}_{G^{(2)}}, \mathbf{P}) = \big\|\mathbf{P}^\top \mathbf{A}_{G^{(1)}} \mathbf{P} - \mathbf{A}_{G^{(2)}}\big\|^2.$ (27)

If there exists a perfect mapping of users across networks, we can obtain a mapping matrix $\mathbf{P}$ introducing zero loss in the above function, i.e., $\mathcal{L}(\mathbf{A}_{G^{(1)}}, \mathbf{A}_{G^{(2)}}, \mathbf{P}) = 0$. Inferring the optimal mapping matrix $\mathbf{P}^*$ which introduces the minimum loss can be represented as the following objective function:

$\mathbf{P}^* = \arg\min_{\mathbf{P}} \big\|\mathbf{P}^\top \mathbf{A}_{G^{(1)}} \mathbf{P} - \mathbf{A}_{G^{(2)}}\big\|^2,$ (28)

where the matrix $\mathbf{P}$ is usually subject to certain constraints, e.g., $\mathbf{P}$ is binary and each row and each column contains at most one entry filled with value 1.

In general, it is not easy to find the optimal solution to the above objective function, as it is a purely combinatorial problem: identifying the optimal solution requires enumerating all the potential user mappings across the networks. In [106], Umeyama provides an algorithm that can solve the function with a nearly optimal solution.
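For intuition, the sketch below evaluates the loss of Equation (28) and finds the optimal mapping by exhaustive enumeration; this is only feasible for toy networks with equal node counts, which is exactly why nearly optimal methods like Umeyama's are needed at realistic scales.

# A brute-force sketch of objective (28), assuming tiny networks with the
# same number of nodes, so that P ranges over full permutation matrices.
import itertools
import numpy as np

def alignment_loss(A1, A2, perm):
    """||P^T A1 P - A2||^2 where P[i, perm[i]] = 1 maps node i of G1
    to node perm[i] of G2."""
    n = len(perm)
    P = np.zeros((n, n))
    P[np.arange(n), perm] = 1.0
    return float(np.linalg.norm(P.T @ A1 @ P - A2) ** 2)

def brute_force_align(A1, A2):
    """Enumerate all n! candidate mappings and keep the loss minimizer."""
    n = A1.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda p: alignment_loss(A1, A2, p))
    return best, alignment_loss(A1, A2, best)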

4.3 Global Unsupervised Alignment of Multiple Social Networks

The works introduced in the previous section are all about pairwise network alignment, which focuses on the alignment of two networks only. However, in the real world, people are normally involved in multiple (usually more than two) social networks simultaneously. In this section, we will focus on the simultaneous alignment problem of multiple (more than two) networks, which is formally called the "multiple anonymized social networks alignment" problem [140].

To help illustrate the multi-network alignment problem more clearly, we also give an example in Figure 2, which involves 3 different social networks (i.e., networks I, II and III). Users in these 3 networks are all anonymized and their names are replaced with randomly generated identifiers. Each pair of these 3 anonymized networks can actually share some common users, e.g., "David" participates in both networks I and II simultaneously, "Bob" is using networks I and III concurrently, and "Charles" is involved in all these 3 networks at the same time.

Figure 2: An example of multiple anonymized partially aligned social networks.

Besides these shared anchor users, in these 3 partially aligned networks, some users are involved in one single network only (i.e., the non-anchor users [146]), e.g., "Alice" in network I, "Eva" in network II and "Frank" in network III. The problem studied in this part aims at discovering the anchor links (i.e., the dashed bi-directional red lines in Figure 2) connecting the anchor users across these 3 social networks.

The significant difference of the studied problem from existing two-network alignment problems is due to the "transitivity law" that anchor links follow. In traditional set theory, a relation $\mathcal{R}$ is defined to be a transitive relation in domain $\mathcal{X}$ iff $\forall a, b, c \in \mathcal{X}$, $(a, b) \in \mathcal{R} \wedge (b, c) \in \mathcal{R} \rightarrow (a, c) \in \mathcal{R}$. If we treat the union of the user account sets of all these social networks as the target domain $\mathcal{X}$ and treat anchor links as the relation $\mathcal{R}$, then anchor links depict a "transitive relation" among users across networks. We can take the networks shown in Figure 2 as an example. Let $u$ be a user involved in networks I, II and III simultaneously, whose accounts in these networks are $u^{I}$, $u^{II}$ and $u^{III}$ respectively. If anchor links $(u^{I}, u^{II})$ and $(u^{II}, u^{III})$ are identified in aligning networks (I, II) and networks (II, III) respectively (i.e., $u^{I}$, $u^{II}$ and $u^{III}$ are discovered to be the same user), then anchor link $(u^{I}, u^{III})$ should also exist in the alignment result of networks (I, III) as well. In the multi-network alignment problem, we need to guarantee that the inferred anchor links meet the transitivity law. Formally, the multi-network alignment problem can be represented as follows.

Given $n$ isolated social networks $G^{(1)}, G^{(2)}, \cdots, G^{(n)}$, the multi-network alignment problem aims at discovering the anchor links among these $n$ networks, i.e., the anchor link sets $\mathcal{A}^{(1,2)}, \mathcal{A}^{(1,3)}, \cdots, \mathcal{A}^{(n-1,n)}$. These $n$ social networks $G^{(1)}, G^{(2)}, \cdots, G^{(n)}$ are partially aligned, and the constraint on the anchor links in $\mathcal{A}^{(1,2)}, \mathcal{A}^{(1,3)}, \cdots, \mathcal{A}^{(n-1,n)}$ is one-to-one, which also follows the transitivity law.

To solve the multi-network alignment problem, a novel network alignment framework Uma (Unsupervised Multi-network Alignment) is proposed in [140]. Uma addresses the multi-network alignment problem with two steps: (1) unsupervised transitive anchor link inference across multiple networks, and (2) transitive multi-network matching to maintain the one-to-one constraint.

4.3.1 Unsupervised Network Alignment Loss Function

Anchor links between any two given networks $G^{(i)}$ and $G^{(j)}$ actually define a one-to-one mapping (of users and social links) between $G^{(i)}$ and $G^{(j)}$. To evaluate the quality of different inferred mappings (i.e., the inferred anchor links), Uma introduces the concept of cross-network Friendship Consistency/Inconsistency in [140]. The optimal inferred anchor links are those which can maximize the friendship consistency (or, equivalently, minimize the friendship inconsistency) across networks. Formally, given two partially aligned social networks $G^{(i)} = (\mathcal{U}^{(i)}, \mathcal{E}^{(i)})$ and $G^{(j)} = (\mathcal{U}^{(j)}, \mathcal{E}^{(j)})$, we can represent their corresponding social adjacency matrices as $\mathbf{S}^{(i)} \in \mathbb{R}^{|\mathcal{U}^{(i)}| \times |\mathcal{U}^{(i)}|}$ and $\mathbf{S}^{(j)} \in \mathbb{R}^{|\mathcal{U}^{(j)}| \times |\mathcal{U}^{(j)}|}$ respectively.

Meanwhile, given the anchor link set $\mathcal{A}^{(i,j)} \subset \mathcal{U}^{(i)} \times \mathcal{U}^{(j)}$ between networks $G^{(i)}$ and $G^{(j)}$, the binary transitional matrix from $G^{(i)}$ to $G^{(j)}$ can be represented as $\mathbf{T}^{(i,j)} \in \{0, 1\}^{|\mathcal{U}^{(i)}| \times |\mathcal{U}^{(j)}|}$, where $\mathbf{T}^{(i,j)}(l, m) = 1$ iff link $(u^{(i)}_l, u^{(j)}_m) \in \mathcal{A}^{(i,j)}$, $u^{(i)}_l \in \mathcal{U}^{(i)}$, $u^{(j)}_m \in \mathcal{U}^{(j)}$. The binary transitional matrix from $G^{(j)}$ to $G^{(i)}$ can be defined in a similar way and represented as $\mathbf{T}^{(j,i)} \in \{0, 1\}^{|\mathcal{U}^{(j)}| \times |\mathcal{U}^{(i)}|}$, where $(\mathbf{T}^{(i,j)})^\top = \mathbf{T}^{(j,i)}$ as the anchor links between $G^{(i)}$ and $G^{(j)}$ are undirected. Considering that anchor links have an inherent one-to-one constraint, each row and each column of the binary transitional matrices $\mathbf{T}^{(i,j)}$ and $\mathbf{T}^{(j,i)}$ should have at most one entry filled with 1, which greatly constrains the inference space of the potential binary transitional matrices $\mathbf{T}^{(i,j)}$ and $\mathbf{T}^{(j,i)}$.

Uma defines the friendship inconsistency as the number of non-shared social links between those mapped from $G^{(i)}$ and those in $G^{(j)}$. Based on the inferred anchor transitional matrix $\mathbf{T}^{(i,j)}$, the introduced friendship inconsistency between matrices $(\mathbf{T}^{(i,j)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,j)}$ and $\mathbf{S}^{(j)}$ can be represented as

$\big\|(\mathbf{T}^{(i,j)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,j)} - \mathbf{S}^{(j)}\big\|^2_F,$ (29)

where $\|\cdot\|_F$ denotes the Frobenius norm. The optimal binary transitional matrix $\mathbf{T}^{(i,j)}$, which leads to the minimum friendship inconsistency, can be represented as

$\mathbf{T}^{(i,j)} = \arg\min_{\mathbf{T}^{(i,j)}} \big\|(\mathbf{T}^{(i,j)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,j)} - \mathbf{S}^{(j)}\big\|^2_F$ (30)

$s.t. \; \mathbf{T}^{(i,j)} \in \{0, 1\}^{|\mathcal{U}^{(i)}| \times |\mathcal{U}^{(j)}|},$ (31)

$\mathbf{T}^{(i,j)} \mathbf{1}^{|\mathcal{U}^{(j)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(i)}| \times 1},$ (32)

$(\mathbf{T}^{(i,j)})^\top \mathbf{1}^{|\mathcal{U}^{(i)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(j)}| \times 1},$ (33)

where the last two equations are added to maintain the one-to-one constraint on anchor links, and $\mathbf{X} \preceq \mathbf{Y}$ iff $\mathbf{X}$ is of the same dimensions as $\mathbf{Y}$ and every entry in $\mathbf{X}$ is no greater than the corresponding entry in $\mathbf{Y}$.

4.3.2 Transitivity Constraint on Alignment Results

Isolated network alignment can work well in addressing the alignment problem of two social networks. However, in the multi-network alignment problem studied in this part, multiple (more than two) social networks are to be aligned simultaneously. Besides minimizing the friendship inconsistency between each pair of networks, the transitivity property of anchor links also needs to be preserved in the transitional matrix inference.

The transitivity property should hold for the alignment of any $n$ networks, where the minimum of $n$ is 3. To help illustrate the transitivity property more clearly, here we will use the 3-network alignment as an example to introduce the multi-network alignment problem and the Uma model, which can be easily generalized to the case of $n$-network alignment. Let $G^{(i)}$, $G^{(j)}$ and $G^{(k)}$ be 3 social networks to be aligned concurrently. To accommodate the alignment results and preserve the transitivity property, Uma introduces the following alignment transitivity penalty:

Definition 21. (Alignment Transitivity Penalty): Formally, let $\mathbf{T}^{(i,j)}$, $\mathbf{T}^{(j,k)}$ and $\mathbf{T}^{(i,k)}$ be the inferred binary transitional matrices from $G^{(i)}$ to $G^{(j)}$, from $G^{(j)}$ to $G^{(k)}$ and from $G^{(i)}$ to $G^{(k)}$ respectively among these 3 networks. The alignment transitivity penalty $C(G^{(i)}, G^{(j)}, G^{(k)})$ introduced by the inferred transitional matrices can be quantified as the number of inconsistent social links being mapped from $G^{(i)}$ to $G^{(k)}$ via the two different alignment paths $G^{(i)} \to G^{(j)} \to G^{(k)}$ and $G^{(i)} \to G^{(k)}$, i.e.,

$C(G^{(i)}, G^{(j)}, G^{(k)}) = \big\|(\mathbf{T}^{(j,k)})^\top (\mathbf{T}^{(i,j)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,j)} \mathbf{T}^{(j,k)} - (\mathbf{T}^{(i,k)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,k)}\big\|^2_F.$ (34)

The alignment transitivity penalty is a general penalty concept and can be applied to $n$ networks $G^{(1)}, G^{(2)}, \cdots, G^{(n)}$, $n \geq 3$, as well, where it can be defined as the summation of the penalty introduced by any three networks in the set, i.e.,

$C(G^{(1)}, G^{(2)}, \cdots, G^{(n)}) = \sum_{\forall \{G^{(i)}, G^{(j)}, G^{(k)}\} \subset \{G^{(1)}, G^{(2)}, \cdots, G^{(n)}\}} C(G^{(i)}, G^{(j)}, G^{(k)}).$ (36)

The optimal binary transitional matrices $\mathbf{T}^{(i,j)}$, $\mathbf{T}^{(j,k)}$ and $\mathbf{T}^{(k,i)}$, which minimize the friendship inconsistency and the alignment transitivity penalty at the same time, can be represented as

$\mathbf{T}^{(i,j)}, \mathbf{T}^{(j,k)}, \mathbf{T}^{(k,i)} = \arg\min_{\mathbf{T}^{(i,j)}, \mathbf{T}^{(j,k)}, \mathbf{T}^{(k,i)}} \big\|(\mathbf{T}^{(i,j)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,j)} - \mathbf{S}^{(j)}\big\|^2_F + \big\|(\mathbf{T}^{(j,k)})^\top \mathbf{S}^{(j)} \mathbf{T}^{(j,k)} - \mathbf{S}^{(k)}\big\|^2_F + \big\|(\mathbf{T}^{(k,i)})^\top \mathbf{S}^{(k)} \mathbf{T}^{(k,i)} - \mathbf{S}^{(i)}\big\|^2_F + \alpha \big\|(\mathbf{T}^{(j,k)})^\top (\mathbf{T}^{(i,j)})^\top \mathbf{S}^{(i)} \mathbf{T}^{(i,j)} \mathbf{T}^{(j,k)} - \mathbf{T}^{(k,i)} \mathbf{S}^{(i)} (\mathbf{T}^{(k,i)})^\top\big\|^2_F$ (38)

$s.t. \; \mathbf{T}^{(i,j)} \in \{0, 1\}^{|\mathcal{U}^{(i)}| \times |\mathcal{U}^{(j)}|}, \; \mathbf{T}^{(j,k)} \in \{0, 1\}^{|\mathcal{U}^{(j)}| \times |\mathcal{U}^{(k)}|}, \; \mathbf{T}^{(k,i)} \in \{0, 1\}^{|\mathcal{U}^{(k)}| \times |\mathcal{U}^{(i)}|},$ (42)

$\mathbf{T}^{(i,j)} \mathbf{1}^{|\mathcal{U}^{(j)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(i)}| \times 1}, \; (\mathbf{T}^{(i,j)})^\top \mathbf{1}^{|\mathcal{U}^{(i)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(j)}| \times 1},$ (44)

$\mathbf{T}^{(j,k)} \mathbf{1}^{|\mathcal{U}^{(k)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(j)}| \times 1}, \; (\mathbf{T}^{(j,k)})^\top \mathbf{1}^{|\mathcal{U}^{(j)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(k)}| \times 1},$ (45)

$\mathbf{T}^{(k,i)} \mathbf{1}^{|\mathcal{U}^{(i)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(k)}| \times 1}, \; (\mathbf{T}^{(k,i)})^\top \mathbf{1}^{|\mathcal{U}^{(k)}| \times 1} \preceq \mathbf{1}^{|\mathcal{U}^{(i)}| \times 1},$ (46)

where the parameter $\alpha$ denotes the weight of the alignment transitivity penalty term, which is set as 1 by default.

The above objective function aims at obtaining the hard mappings among users across different networks, where the entries in all these transitional matrices are binary, which can lead to a fatal drawback: hard assignments can be neither possible nor realistic for networks with star structures as proposed in [54], and the hard subgraph isomorphism [55] is NP-hard. To address the function, Uma proposes to first relax the hard binary constraints on the variables and solve the function with gradient descent. Furthermore, based on the learning results, Uma keeps the one-to-one constraint on anchor links by selecting those which can maximize the overall existence probabilities while maintaining the matching transitivity property at the same time.

4.4 Semi-Supervised Network Alignment

As mentioned before, in real-world online social networks, the anchor links are extremely difficult to label manually. The training set we can obtain is usually of a small size compared with the network scale. For instance, given the Facebook and Twitter networks containing billions and millions of users respectively, identifying a training set with thousands of correct anchor links is not an easy task. Meanwhile, between Facebook and Twitter, the total number of potential anchor links could be of the scale $10^{15}$. Therefore, besides the small set of identified anchor links, there usually exists a very large number of unlabeled anchor links, which are extremely hard to predict.

In this part, we will focus on the network alignment problem under the semi-supervised learning setting. Besides the identified anchor links, we also try to make use of the unlabeled anchor links in the model building. Given two heterogeneous online social networks $G^{(1)}$ and $G^{(2)}$, a set of labeled anchor link instances $\mathcal{A}_{train}$ as well as a large number of unlabeled anchor link instances $\mathcal{A}_{unlabeled} = \mathcal{U}^{(1)} \times \mathcal{U}^{(2)} \setminus \mathcal{A}_{train}$, we aim at building a model with the labeled and unlabeled anchor link sets $\mathcal{A}_{train}$ and $\mathcal{A}_{unlabeled}$. In our network alignment task, the test set is a subset of or equal to the unlabeled set, i.e., $\mathcal{A}_{test} \subseteq \mathcal{A}_{unlabeled}$. The built model will be further applied to the test set to infer the potential labels of these anchor links.

To address the problem, in this part we will introduce the semi-supervised network alignment model proposed in [126], which formulates the task as an optimization problem and models the one-to-one cardinality constraint on the anchor links as a mathematical constraint.

4.4.1 Loss Function for Anchor Links

Let the set $\mathcal{L} = \mathcal{U}^{(1)} \times \mathcal{U}^{(2)}$ denote all the potential anchor links between networks $G^{(1)}$ and $G^{(2)}$, where $\mathcal{L} = \mathcal{A}_{train} \cup \mathcal{A}_{unlabeled}$. Based on the whole link set $\mathcal{L}$, as introduced in the previous sections, a set of features can be extracted for these links with the information available in the information network $G$, which can be represented as the set $\mathcal{X} = \{\mathbf{x}_l\}_{l \in \mathcal{L}}$ ($\mathbf{x}_l \in \mathbb{R}^m, \forall l \in \mathcal{L}$). Given the link existence label set $\mathcal{Y} = \{0, 1\}$, the objective of the problem studied in this part is to achieve a general link inference function $f: \mathcal{X} \to \mathcal{Y}$ to map the link feature vectors to their corresponding labels, where 0 denotes the label of the negative class. Depending on the specific application setting and the information available in the networks, the feature vectors extracted for links in $\mathcal{L}$ can be very diverse.

Formally, the loss introduced by the mapping $f(\cdot)$ can be represented as a function $\mathcal{L}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ over the link feature vector/label pairs. Meanwhile, for a certain input feature vector $\mathbf{x}_l$ of link $l \in \mathcal{L}$, we can denote its inferred label introducing the minimum loss as $y_l$:

$y_l = \arg\min_{y_l \in \mathcal{Y}, \mathbf{w}} \mathcal{L}(\mathbf{x}_l, y_l; \mathbf{w}),$ (47)

where the vector $\mathbf{w}$ denotes the parameters involved in the mapping function $f(\cdot)$. Therefore, given the pre-defined loss function $\mathcal{L}(\cdot)$, the general form of the objective mapping $f: \mathcal{X} \to \mathcal{Y}$ parameterized by the vector $\mathbf{w}$ can be represented as:

$f(\mathbf{x}; \mathbf{w}) = \arg\min_{y \in \mathcal{Y}} \mathcal{L}(\mathbf{x}, y; \mathbf{w}).$ (48)


In many cases (e.g., when the links are not linearly separable), the feature vector $\mathbf{x}_l$ of link $l$ needs to be transformed as $g(\mathbf{x}_l) \in \mathbb{R}^k$ ($k$ is the transformed feature number), and the transformation function $g(\cdot)$ can be different kernel projections depending on the separability of the instances. Here we assume the loss function $\mathcal{L}(\cdot)$ to be linear in some combined representation of the transformed link feature vector $g(\mathbf{x}_l)$ and the label $y_l$, i.e.,

$\mathcal{L}(\mathbf{x}_l, y_l; \mathbf{w}) = (\langle\mathbf{w}, g(\mathbf{x}_l)\rangle - y_l)^2 = (\mathbf{w}^\top g(\mathbf{x}_l) - y_l)^2.$ (49)

Furthermore, based on all the links in $\mathcal{L}$, we can represent the extracted feature vectors of these links as the matrix $\mathbf{X} = [g(\mathbf{x}_{l_1}), g(\mathbf{x}_{l_2}), \cdots, g(\mathbf{x}_{l_{|\mathcal{L}|}})]^\top \in \mathbb{R}^{|\mathcal{L}| \times k}$ (for simplicity, the linear kernel projection is used here, i.e., $g(\mathbf{x}_l) = \mathbf{x}_l$). Meanwhile, their existence labels can be represented as the vector $\mathbf{y} = [y_{l_1}, y_{l_2}, \cdots, y_{l_{|\mathcal{L}|}}]^\top$, where $y_l \in \{0, 1\}, \forall l \in \mathcal{L}$. Specifically, for the existing links in $\mathcal{E}$, we know their labels to be positive in advance, i.e., $y_l = 1, \forall l \in \mathcal{E}$. According to the above loss function definition, based on $\mathbf{X}$ and $\mathbf{y}$, the loss introduced by all the links in $\mathcal{L}$ can be represented as

$\mathcal{L}(\mathbf{X}, \mathbf{y}; \mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2_2.$ (50)

To learn the parameter vector $\mathbf{w}$ and infer the potential label vector $\mathbf{y}$, [126] proposes to minimize the loss term introduced by all the links in $\mathcal{L}$. Meanwhile, to avoid overfitting the training set, besides minimizing the loss function $\mathcal{L}(\mathbf{X}, \mathbf{y}; \mathbf{w})$, a regularization term $\|\mathbf{w}\|^2_2$ on the parameter vector $\mathbf{w}$ is added to the objective function:

$\min_{\mathbf{w}, \mathbf{y}} \frac{1}{2}\|\mathbf{w}\|^2_2 + \frac{c}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2_2,$ (51)

$s.t. \; \mathbf{y} \in \{0, 1\}^{|\mathcal{L}| \times 1}, \text{ and } y_l = 1, \forall l \in \mathcal{E},$ (52)

where the constant $c$ denotes the weight of the loss term in the function.

4.4.2 Cardinality Constraint on Anchor Links

The cardinality constraints define both the limit on the link cardinality and the limit on the degrees of the nodes that those links are incident to. To be general, the links studied here can be either uni-directed or bi-directed, where undirected links are treated as bi-directed. For each node $u \in \mathcal{V}$ in the network, we can represent the potential links going out from $u$ as the set $\Gamma_{out}(u) = \{l \,|\, l \in \mathcal{L}, \exists v \in \mathcal{V}, l = (u, v)\}$, and those going into $u$ as the set $\Gamma_{in}(u) = \{l \,|\, l \in \mathcal{L}, \exists v \in \mathcal{V}, l = (v, u)\}$. Furthermore, with the link label variables $\{y_l\}_{l \in \mathcal{L}}$, we can represent the out-degree and in-degree of node $u \in \mathcal{V}$ as $D_{out}(u) = \sum_{l \in \Gamma_{out}(u)} y_l$ and $D_{in}(u) = \sum_{l \in \Gamma_{in}(u)} y_l$ respectively. Considering that the node degrees cannot be negative, besides the upper bounds introduced by the cardinality constraints, a lower bound "$\geq 0$" is also added by default to guarantee the validity of the node degrees.

One-to-One Cardinality Constraint

For the bi-directed anchor links with the 1 : 1 cardinality constraint, the nodes in the information networks can be attached with at most one such kind of link. In other words, for all the nodes (e.g., $u \in \mathcal{V}$) in the network, the in-degree and out-degree cannot exceed 1, i.e.,

$0 \leq \sum_{l \in \Gamma_{out}(u)} y_l \leq 1, \text{ and } 0 \leq \sum_{l \in \Gamma_{in}(u)} y_l \leq 1, \forall u \in \mathcal{V}.$ (53)

One-to-Many Cardinality Constraint

Algorithm 2 Greedy Link Selection

Input: link estimation result $\bar{\mathbf{y}}$, parameter $k$
Output: link label vector $\mathbf{y}$
1: initialize the link label vector $\mathbf{y} = \mathbf{0}$
2: for $l \in \mathcal{E}$ do
3:   $y_l = 1$
4: end for
5: for $l \in \mathcal{L} \setminus \mathcal{E}$ with $\bar{y}_l < 0.5$ do
6:   $y_l = 0$
7: end for
8: Let $L = \{l \,|\, l \in \mathcal{L} \setminus \mathcal{E}, \bar{y}_l \geq 0.5\}$
9: while $L \neq \emptyset$ do
10:   select and remove the link $l \in L$ with the highest estimation score
11:   if adding $l$ as a positive instance violates the cardinality constraint, or more than $k$ links have been selected, then
12:     $y_l = 0$
13:   else
14:     $y_l = 1$
15:   end if
16: end while
17: return $\mathbf{y}$

Meanwhile, for the uni-directed supervision links with the $N$ : 1 cardinality constraint, the manager nodes can have multiple ($N$) links going out from them, while the subordinate nodes should have exactly one link going into them (except the CEO). In other words, for all the nodes (e.g., $u \in \mathcal{V}$) in the network, the out-degree cannot exceed $N$ and the in-degree should be exactly 1, i.e.,

$0 \leq \sum_{l \in \Gamma_{out}(u)} y_l \leq N, \text{ and } 1 \leq \sum_{l \in \Gamma_{in}(u)} y_l \leq 1, \forall u \in \mathcal{V}.$ (54)

Many-to-Many Cardinality Constraint

In many cases, there exist no specific cardinality constraints on links, and nodes can be connected with each other freely. Simply, we can assume the node in-degrees and out-degrees to be limited by the maximum degree parameter $N = |\mathcal{V}| - 1$, i.e.,

$0 \leq \sum_{l \in \Gamma_{out}(u)} y_l \leq N, \text{ and } 0 \leq \sum_{l \in \Gamma_{in}(u)} y_l \leq N, \forall u \in \mathcal{V}.$ (55)

The cardinality constraints on links can be generally represented with linear algebra equations. The relationship between the nodes $\mathcal{V}$ and the links $\mathcal{L}$ can actually be represented as matrices $\mathbf{T}_{out} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{L}|}$ and $\mathbf{T}_{in} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{L}|}$, where entry $\mathbf{T}_{out}(u, l) = 1$ iff $l \in \Gamma_{out}(u)$ and $\mathbf{T}_{in}(u, l) = 1$ iff $l \in \Gamma_{in}(u)$. Based on the link label vector $\mathbf{y}$, the node out-degrees and in-degrees can be formally represented as the vectors $\mathbf{T}_{out} \cdot \mathbf{y}$ and $\mathbf{T}_{in} \cdot \mathbf{y}$ respectively. The general representation of the cardinality constraints introduced above can be rewritten as follows:

$\underline{\mathbf{b}}_{out} \leq \mathbf{T}_{out} \cdot \mathbf{y} \leq \bar{\mathbf{b}}_{out}, \text{ and } \underline{\mathbf{b}}_{in} \leq \mathbf{T}_{in} \cdot \mathbf{y} \leq \bar{\mathbf{b}}_{in},$ (56)

where the vectors $\underline{\mathbf{b}}_{out}$, $\bar{\mathbf{b}}_{out}$, $\underline{\mathbf{b}}_{in}$ and $\bar{\mathbf{b}}_{in}$ can take different values depending on the cardinality constraint on the links (e.g., for the 1 : 1 constraint, we have $\underline{\mathbf{b}}_{out} = \underline{\mathbf{b}}_{in} = \mathbf{0}$ and $\bar{\mathbf{b}}_{out} = \bar{\mathbf{b}}_{in} = \mathbf{1}$).

4.4.3 Joint Objective Function Solution

Algorithm 3 Cardinality Constrained Anchor Link Prediction Framework

Input: link feature matrix $\mathbf{X}$, weight parameter $c$
Output: parameter vector $\mathbf{w}$, link label vector $\mathbf{y}$
1: Initialize the label vector $\mathbf{y} = \frac{1}{2} \cdot \mathbf{1}$
2: For links in $\mathcal{E}$, assign their labels as 1
3: Initialize the parameter vector $\mathbf{w} = \mathbf{0}$
4: Initialize convergence-tag = False
5: while convergence-tag == False do
6:   Update the vector $\mathbf{w}$ with the equation $\mathbf{w} = c(\mathbf{I} + c\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$
7:   Calculate the link estimation result $\bar{\mathbf{y}} = \mathbf{X}\mathbf{w}$
8:   Update the vector $\mathbf{y}$ with Algorithm Greedy($\bar{\mathbf{y}}$)
9:   if $\mathbf{w}$ and $\mathbf{y}$ both converge then
10:    convergence-tag = True
11:  end if
12: end while

For simplicity, we assume the weight scalars $c_1$ and $c_2$ to both be $c$, i.e., all the links in the networks are assumed to be of similar importance in training. The new loss term of all the links in $\mathcal{E} \cup \mathcal{U}$ can then be simplified as

$\frac{c}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2_2,$ (57)

where the matrix $\mathbf{X} = [\mathbf{x}_{l_1}, \mathbf{x}_{l_2}, \cdots, \mathbf{x}_{l_{|\mathcal{L}|}}]^\top$ denotes the feature matrix of all the links in $\mathcal{L}$.

Based on the above remarks, the constrained optimization objective function of the problem can be represented as

$\min_{\mathbf{w}, \mathbf{y}} \frac{1}{2}\|\mathbf{w}\|^2_2 + \frac{c}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2_2,$ (58)

$s.t. \; \mathbf{y} \in \{0, 1\}^{|\mathcal{L}| \times 1}, \; y_l = 1, \forall l \in \mathcal{E},$ (59)

$\underline{\mathbf{b}}_{out} \leq \mathbf{T}_{out} \cdot \mathbf{y} \leq \bar{\mathbf{b}}_{out}, \; \underline{\mathbf{b}}_{in} \leq \mathbf{T}_{in} \cdot \mathbf{y} \leq \bar{\mathbf{b}}_{in}.$ (60)

The above objective function involves the variables $\mathbf{w}$ and $\mathbf{y}$ at the same time, which is actually not jointly convex and can be very challenging to solve. In [126], the proposed model solves the function with an alternating updating framework, fixing one variable and updating the other one iteratively. The framework involves two steps:

Step 1: Fix y and Update w

By fixing $\mathbf{y}$ (i.e., treating $\mathbf{y}$ as a constant vector), the objective function about $\mathbf{w}$ can be simplified as

$\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2_2 + \frac{c}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2_2.$ (61)

Let $h(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2_2 + \frac{c}{2}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2_2$. By taking the derivative of the function $h(\mathbf{w})$ with regard to $\mathbf{w}$, we have

$\frac{dh(\mathbf{w})}{d\mathbf{w}} = \mathbf{w} + c\mathbf{X}^\top\mathbf{X}\mathbf{w} - c\mathbf{X}^\top\mathbf{y}.$ (62)

By setting the derivative to zero, the optimal vector $\mathbf{w}$ can be represented as

$\mathbf{w} = c(\mathbf{I} + c\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},$ (63)

and the minimum value of the function will be $\frac{c}{2}\mathbf{y}^\top\mathbf{y} - \frac{c^2}{2}\mathbf{y}^\top\mathbf{X}(\mathbf{I} + c\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.

Step 2: Fix w and Update y

When fixing $\mathbf{w}$ and treating it as a constant vector, the objective function about $\mathbf{y}$ can be represented as

$\min_{\mathbf{y}} \frac{c}{2}\|\mathbf{y} - \bar{\mathbf{y}}\|^2_2,$ (64)

$s.t. \; \mathbf{y} \in \{0, 1\}^{|\mathcal{L}| \times 1}, \; y_l = 1, \forall l \in \mathcal{E},$ (65)

$\underline{\mathbf{b}}_{out} \leq \mathbf{T}_{out} \cdot \mathbf{y} \leq \bar{\mathbf{b}}_{out}, \; \underline{\mathbf{b}}_{in} \leq \mathbf{T}_{in} \cdot \mathbf{y} \leq \bar{\mathbf{b}}_{in},$ (66)

where $\bar{\mathbf{y}} = \mathbf{X}\mathbf{w}$ denotes the inference results of the links in $\mathcal{L}$ with the updated parameter vector $\mathbf{w}$ from Step 1. The objective function is a constrained non-linear integer programming problem about the variable $\mathbf{y}$. Formally, the above optimization sub-problem is named the "Cardinality Constrained Link Selection" problem. The problem is shown to be NP-hard, and achieving the optimal solution to it is very time consuming. To preserve the cardinality constraints on the variables and minimize the loss term, one brute-force way to achieve the optimal solution $\mathbf{y}$ is to enumerate all the feasible combinations of link candidates to be selected as the positive instances, which leads to a very high time complexity. In [126], a greedy link selection algorithm is adopted to resolve the problem, whose pseudo-code is available in Algorithm 2. Meanwhile, the overall framework is illustrated with the pseudo-code available in Algorithm 3. The framework updates the vectors $\mathbf{w}$ and $\mathbf{y}$ alternately until both of them converge, where the vector $\mathbf{y}$ will be returned as the final prediction result.
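A compact sketch of this alternating framework is given below; the greedy rounding routine stands in for Algorithm 2 and is passed in as a callable, so the names here are illustrative stand-ins rather than the exact implementation of [126].

# A compact sketch of Algorithm 3, assuming a feature matrix X (|L| x k),
# the index set `known_pos` of labeled links in E, and a `greedy_select`
# callable implementing Algorithm 2 (both hypothetical stand-ins).
import numpy as np

def alternating_link_prediction(X, known_pos, greedy_select, c=1.0, max_iter=50):
    n, k = X.shape
    y = np.full(n, 0.5)           # initialize y = (1/2) * 1
    y[known_pos] = 1.0            # labels of links in E are fixed to 1
    w = np.zeros(k)
    for _ in range(max_iter):
        # Step 1: closed-form update of w given y (Equation (63)).
        w_new = c * np.linalg.solve(np.eye(k) + c * X.T @ X, X.T @ y)
        # Step 2: estimate scores and round y under the cardinality constraints.
        y_new = greedy_select(X @ w_new)
        y_new[known_pos] = 1.0
        if np.allclose(w_new, w) and np.allclose(y_new, y):
            break
        w, y = w_new, y_new
    return w, y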

5. LINK PREDICTION

Given a snapshot of an online social network, the problem of inferring the missing links, or the links to be formed in the future, is called the link prediction problem. The link prediction problem has concrete applications in the real world, and many social network services can be cast as link prediction problems. For instance, the friend recommendation problem in online social networks can be modeled as the social link prediction problem among users. Users' trajectory prediction can be formulated as the prediction of potential check-in links between users and offline POIs (points of interest) in location based social networks. The user identifier resolution problem across networks (i.e., the network alignment problem introduced in the previous section) can be modeled as the anchor link prediction problem of user accounts across different online social networks.

In this section, we will introduce the general link prediction problems in online social networks. Formally, given a training set $\mathcal{T}_{train}$ involving links belonging to different classes ($\mathcal{Y} = \{+1, -1\}$, denoting the links that have been/will be formed and those that will never be formed) and a test set $\mathcal{T}_{test}$ (with unknown labels for the links), the link prediction problem aims at building a mapping $f: \mathcal{T}_{test} \to \mathcal{Y}$ to infer the potential labels of the links in the test set $\mathcal{T}_{test}$.

Depending on the scenarios of the link prediction problems, the existing link prediction works can be divided into several different categories. Traditional link prediction problems are mainly focused on inferring the links in one single homogeneous network, like inferring the friendship links among users in online social networks or the co-author links in bibliographic networks. As the network structures become more and more complicated, many of them are modeled as heterogeneous networks involving different types of nodes and complex connections among them. The heterogeneity of the networks leads to many new link prediction problems, like predicting the links between nodes belonging to different categories and the concurrent inference of multiple types of links in heterogeneous networks. In recent years, many online social networks have appeared, and lots of new research opportunities exist for researchers and practitioners to study the link prediction problem from the cross-network perspective.

Meanwhile, depending on the learning settings used in the link prediction problem formulations and models, the existing link prediction works can be categorized in another way. Some link prediction models calculate the user-pair closeness as the prediction result without needing any training data; these are referred to as the unsupervised link prediction models. Some other models label the known links into different classes and use them as the training set to learn a supervised classification model as the base model instead; these are called the supervised link prediction models. Usually, manually labeling the links is very expensive and tedious. In recent years, many works have proposed to apply semi-supervised learning techniques to the link prediction problem to utilize the links without labels.

In this part, we will introduce the link prediction problems in online social networks, including the traditional homogeneous link prediction, cold start link prediction, and cross-network link prediction, which covers the PU link prediction and the sparse and low rank matrix estimation based link prediction.

5.1 Traditional Homogeneous Network Link Prediction

Traditional link prediction problems are mainly studied based on one homogeneous network, involving one single type of nodes and links. In this section, we will first briefly introduce how to use the social closeness measures for link prediction tasks. To integrate different social closeness measures together in the link prediction task, we will then talk about the supervised link prediction model. Finally, we will introduce some models which formulate the link prediction task as a recommendation problem and apply the matrix factorization method to address it.

5.1.1 Unsupervised Link Prediction

Given a snapshot of a homogeneous network $G = (\mathcal{V}, \mathcal{E})$, the unsupervised link prediction methods [61] aim at inferring the potential links that will be formed in the future. Usually, the unsupervised link prediction models calculate scores for the links, which will be used as the predicted confidence scores of these links. Depending on the specific scenario and the link formation assumptions applied, different measures have been proposed for the link prediction models.

Local Neighbor based Predicators: Local neighbor based predicators are based on regional social network information, i.e., the neighbors of users in the network. Consider, for example, a social link $(u, v)$ in network $G$, where $u$ and $v$ are both users in $G$; the neighbor sets of $u$ and $v$ can be represented as $\Gamma(u)$ and $\Gamma(v)$ respectively. Based on $\Gamma(u)$ and $\Gamma(v)$, the following predicators measuring the proximity of users $u$ and $v$ in network $G$ can be obtained.

1. Preferential Attachment Index (PA) [6]:

$PA(u, v) = |\Gamma(u)| \cdot |\Gamma(v)|.$ (67)

$PA(u, v)$ uses the product of the degrees of users $u$ and $v$ in the network as the proximity measure, considering that new links are more likely to appear between users who have large numbers of social connections.

2. Common Neighbor (CN) [38]:

$CN(u, v) = |\Gamma(u) \cap \Gamma(v)|.$ (68)

$CN(u, v)$ uses the number of shared neighbors as the proximity score of users $u$ and $v$. The larger $CN(u, v)$ is, the closer users $u$ and $v$ are in the network.

3. Jaccard's Coefficient (JC) [38]:

$JC(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}.$ (69)

$JC(u, v)$ takes the total number of neighbors of $u$ and $v$ into account, considering that $CN(u, v)$ can be large simply because each user has a lot of neighbors, rather than because they are strongly related to each other.

4. Adamic/Adar Index (AA) [2]:

$AA(u, v) = \sum_{w \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log|\Gamma(w)|}.$ (70)

Different from $JC(u, v)$, $AA(u, v)$ further gives each common neighbor $w$ of users $u$ and $v$ a weight, $\frac{1}{\log|\Gamma(w)|}$, to denote its importance.

5. Resource Allocation Index (RA) [150]:

$RA(u, v) = \sum_{w \in \Gamma(u) \cap \Gamma(v)} \frac{1}{|\Gamma(w)|}.$ (71)

$RA(u, v)$ gives each common neighbor $w$ a weight $\frac{1}{|\Gamma(w)|}$ to represent its importance, where those with larger degrees will have a smaller weight.

All these predicators are called local neighbor based predicators, as they are all based on users' local social network information; they can be computed directly from the neighbor sets, as shown in the sketch below.
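The following is a minimal sketch computing all five measures for a node pair, assuming an undirected unweighted networkx graph; the helper name is hypothetical.

# A minimal sketch of the local neighbor based predicators (Equations
# (67)-(71)), assuming an undirected networkx graph G and nodes u, v.
import math
import networkx as nx

def local_predicators(G, u, v):
    Nu, Nv = set(G[u]), set(G[v])
    common, union = Nu & Nv, Nu | Nv
    return {
        "PA": len(Nu) * len(Nv),                           # preferential attachment
        "CN": len(common),                                 # common neighbors
        "JC": len(common) / len(union) if union else 0.0,  # Jaccard's coefficient
        # Adamic/Adar: skip degree-1 neighbors to avoid dividing by log(1) = 0.
        "AA": sum(1.0 / math.log(len(G[w])) for w in common if len(G[w]) > 1),
        "RA": sum(1.0 / len(G[w]) for w in common),        # resource allocation
    }

G = nx.karate_club_graph()
print(local_predicators(G, 0, 33))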

Global Path based Predicators: In addition to the local neighbor based predicators, many other predicators based on paths in the network have also been proposed to measure the proximity among users.

1. Shortest Path (SP) [37]:

$SP(u, v) = \min |p_{u \rightsquigarrow v}|,$ (72)

where $p_{u \rightsquigarrow v}$ denotes a path from $u$ to $v$ in the network and $|p|$ represents the length of path $p$.

2. Katz [45]:

$Katz(u, v) = \sum_{l=1}^{\infty} \beta^l \big|p^l_{u \rightsquigarrow v}\big|,$ (73)

where $p^l_{u \rightsquigarrow v}$ is the set of paths of length $l$ from $u$ to $v$ and the parameter $\beta \in [0, 1]$ is a regularizer of the predicator. Normally, a small $\beta$ favors shorter paths, as $\beta^l$ decays very quickly when $\beta$ is small, in which case $Katz(u, v)$ behaves like the predicators based on local neighbors.

Random Walk based Link Prediction: In addition to the unsupervised link predicators which can be obtained from the networks directly, there exists another category of link prediction methods which calculate the proximity scores among users based on random walks [34; 32; 52; 5; 103; 64; 38]. In this part, we will first introduce the concept of the random walk. Next, we will introduce the proximity measures based on random walks, which include the commute time [32; 64; 38], the hitting time [32; 64; 38] and the cosine similarity [32; 64; 38].

Let matrix $\mathbf{A}$ be the adjacency matrix of network $G$, where $\mathbf{A}(i, j) = 1$ iff social link $(u_i, u_j) \in \mathcal{E}$, $u_i, u_j \in \mathcal{V}$. The row-normalized matrix of $\mathbf{A}$ is $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$, where the diagonal matrix $\mathbf{D}$ has value $\mathbf{D}(i, i) = \sum_j \mathbf{A}(i, j)$ on its diagonal, and $\mathbf{P}(i, j)$ stores the probability of stepping onto node $u_j \in \mathcal{V}$ from node $u_i \in \mathcal{V}$. Let the entry $\mathbf{x}^{(\tau)}(i)$ denote the probability that a random walker is at user node $u_i \in \mathcal{V}$ at time $\tau$. Then we have the following updating equation of entry $\mathbf{x}^{(\tau)}(i)$ via the random walk:

$\mathbf{x}^{(\tau+1)}(i) = \sum_j \mathbf{x}^{(\tau)}(j)\mathbf{P}(j, i).$ (74)

In other words, the updating equation of the vector $\mathbf{x}$ can be represented as

$\mathbf{x}^{(\tau+1)} = \mathbf{P}^\top\mathbf{x}^{(\tau)}.$ (75)

By keeping updating $\mathbf{x}$ according to this equation until convergence, i.e., until $\mathbf{x}^{(\tau+1)} = \mathbf{x}^{(\tau)}$, we obtain the stationary vector, which satisfies

$\mathbf{v} = \mathbf{P}^\top\mathbf{v},$ (77)

where the vector $\mathbf{v}$ denotes the stationary random walk probability vector.

This equation shows that the final stationary distribution vector $\mathbf{v}$ is actually an eigenvector of the matrix $\mathbf{P}^\top$ corresponding to eigenvalue 1. Some existing works have pointed out that if a Markov chain is irreducible [32] and aperiodic [32], then the largest eigenvalue of the transition matrix will be equal to 1 and all the other eigenvalues will be strictly less than 1. In addition, under such a condition, there exists a single unique stationary distribution, which is the vector $\mathbf{v}$ obtained at the convergence of the updating equations.

Definition 22. (Irreducible): Network $G$ is irreducible if there exists a path from every node to every other node in $G$ [32].

Definition 23. (Aperiodic): Network $G$ is aperiodic if the greatest common divisor of the lengths of its cycles is 1, where this greatest common divisor is also called the period of $G$ [32].

Proximity Measures based on Random Walk

1. Hitting Time (HT):

$HT(u, v) = \mathbb{E}\big(\min\{\tau \,|\, \tau \in \mathbb{N}^+, X(\tau) = v \wedge X(0) = u\}\big),$ (78)

where $X(\tau) = v$ denotes that a random walker is at node $v$ at time $\tau$. $HT(u, v)$ counts the average number of steps that a random walker takes to reach node $v$ from node $u$. According to the definition, the hitting time measure is usually asymmetric, i.e., $HT(u, v) \neq HT(v, u)$. Based on the matrix $\mathbf{P}$ defined before, $HT(u, v)$ can be redefined recursively as [32]:

$HT(u, v) = 1 + \sum_{w \in \Gamma(u)} \mathbf{P}(u, w)\, HT(w, v).$ (79)

2. Commute Time (CT):

$CT(u, v) = HT(u, v) + HT(v, u).$ (80)

$CT(u, v)$ counts the expected number of steps needed to reach node $v$ from $u$ plus those needed to return to node $u$ from $v$. According to existing works, the commute time $CT(u, v)$ can be obtained as

$CT(u, v) = 2m\big(\mathbf{L}^\dagger_{u,u} + \mathbf{L}^\dagger_{v,v} - 2\mathbf{L}^\dagger_{u,v}\big),$ (81)

where $\mathbf{L}^\dagger$ is the pseudo-inverse of the Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$ and $m$ denotes the number of links in the network.

3. Cosine Similarity based on $\mathbf{L}^\dagger$ (CS):

$CS(u, v) = \frac{\mathbf{x}_u^\top\mathbf{x}_v}{\sqrt{(\mathbf{x}_u^\top\mathbf{x}_u)(\mathbf{x}_v^\top\mathbf{x}_v)}},$ (82)

where $\mathbf{x}_u = (\mathbf{L}^\dagger)^{\frac{1}{2}}\mathbf{e}_u$ and the vector $\mathbf{e}_u$ is a vector of 0s except for the entry corresponding to node $u$, which is filled with 1. According to existing works [32; 64], the cosine similarity based on $\mathbf{L}^\dagger$, $CS(u, v)$, can be obtained as

$CS(u, v) = \frac{\mathbf{L}^\dagger_{u,v}}{\sqrt{\mathbf{L}^\dagger_{u,u}\mathbf{L}^\dagger_{v,v}}}.$ (83)

4. Random Walk with Restart (RWR): Based on the definition of the random walk, if the walker is allowed to return to the starting point with a probability of $1 - c$, where $c \in [0, 1]$, then the new random walk method is formally defined as random walk with restart, whose updating equation is:

$\mathbf{x}^{(\tau+1)}_u = c\mathbf{P}^\top\mathbf{x}^{(\tau)}_u + (1 - c)\mathbf{e}_u.$ (84)

Updating $\mathbf{x}_u$ until convergence (i.e., $\mathbf{x}^{(\tau+1)}_u = \mathbf{x}^{(\tau)}_u$), the stationary distribution vector $\mathbf{x}_u$ satisfies

$\mathbf{x}_u = (1 - c)(\mathbf{I} - c\mathbf{P}^\top)^{-1}\mathbf{e}_u.$ (85)

The proximity measure based on random walk with restart between users $u$ and $v$ is then

$RWR(u, v) = \mathbf{x}_u(v),$ (86)

where $\mathbf{x}_u(v)$ denotes the entry corresponding to $v$ in vector $\mathbf{x}_u$; a small computational sketch is given below.
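The sketch below computes the stationary RWR distribution directly via Equation (85); it assumes an unweighted adjacency matrix and leaves the rows of isolated nodes at zero.

# A minimal sketch of Random Walk with Restart (Equation (85)), assuming
# an unweighted adjacency matrix A given as a numpy array.
import numpy as np

def rwr_scores(A, u, c=0.85):
    """Return x_u = (1 - c)(I - c P^T)^{-1} e_u, so that RWR(u, v) = x_u[v]."""
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1)  # row-normalized
    e_u = np.zeros(n)
    e_u[u] = 1.0
    return (1 - c) * np.linalg.solve(np.eye(n) - c * P.T, e_u)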


5.1.2 Supervised Link Prediction

In some cases, links in the networks are explicitly categorized into different groups, like links denoting friends vs. those representing enemies, or friends (formed connections) vs. strangers (no connections). Given a set of labeled links, e.g., the set $\mathcal{E}$, containing links belonging to different classes, the supervised link prediction problem [37] aims at building a supervised learning model with the labeled set. The learnt model will be applied to determine the labels of the links in the test set. In this part, we still take the link formation problem as an example to illustrate the supervised link prediction model.

To represent each of the social links, like link $l = (u, v) \in \mathcal{E}$ between nodes $u$ and $v$, a set of features representing the characteristics of the link $l$ or the nodes $u$, $v$ will be extracted in the model building. Normally, the features that can be extracted for links in the prediction task can be divided into two categories:

Link Feature Extraction

• Features of Nodes: The characteristics of the nodes can be denoted by various measures, like the various node centrality measures. For instance, for the link $(u, v)$, based on the known links in the training set, centrality measures computed from the degree, normalized degree, eigenvector, Katz, PageRank and betweenness of nodes $u$ and $v$ can serve as part of the features for link $(u, v)$.

• Features of Links: The characteristics of the links in the networks can be calculated by computing the closeness between the nodes composing the links. For instance, for link $(u, v)$, based on the known links in the training set, closeness measures computed from the reciprocity, common neighbors, Jaccard's coefficient, Adamic/Adar, shortest path, Katz, hitting time, commute time, etc., between nodes $u$ and $v$ can serve as the features for link $(u, v)$.

We can append the features for nodes $u$, $v$ and those for link $(u, v)$ together and represent the extracted feature vector for link $l = (u, v)$ as the vector $\mathbf{x}_l \in \mathbb{R}^{k \times 1}$, whose total length is $k$.

Link Prediction Model

With the training set $\mathcal{L}_{train}$, the feature vectors and labels for the links in $\mathcal{L}_{train}$ can be represented as the training data $\{(\mathbf{x}_l, y_l)\}_{l \in \mathcal{L}_{train}}$. Meanwhile, with the test set $\mathcal{L}_{test}$, the features extracted for the links in it can be represented as $\{\mathbf{x}_l\}_{l \in \mathcal{L}_{test}}$. Different classification models can be used as the base model for the link prediction task, like the Decision Tree, Artificial Neural Network and Support Vector Machine (SVM). The model can be trained with the training data, and the labels of the links in the test set can be determined by applying the trained model to the test set.

Depending on the specific model being applied, the output of the link prediction result can include (1) the predicted labels of the links, and (2) the prediction confidence scores/probability scores of the links in the test set.

5.1.3 Matrix Factorization based Link Prediction

Besides the unsupervised link predicators and the classification based supervised link prediction models, many other methods based on matrix factorization can also be applied to solve the link prediction task in homogeneous networks [1; 101; 27].

Given a homogeneous social network $G = (\mathcal{V}, \mathcal{E})$ with the existing social links among users in the set $\mathcal{E}$, the links can be represented with the social adjacency matrix $\mathbf{A} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{V}|}$. Given the adjacency matrix $\mathbf{A}$ of network $G$, [125] proposes to use a low-rank compact representation, $\mathbf{U} \in \mathbb{R}^{|\mathcal{V}| \times d}$, $d < |\mathcal{V}|$, to store the social information of each user in the network. Matrix $\mathbf{U}$ can be obtained by solving the following optimization objective function:

$\min_{\mathbf{U}, \mathbf{V}} \big\|\mathbf{A} - \mathbf{U}\mathbf{V}\mathbf{U}^\top\big\|^2_F,$ (87)

where $\mathbf{U}$ is the low rank matrix, matrix $\mathbf{V}$ saves the correlation among the rows of $\mathbf{U}$, and $\|\mathbf{X}\|_F$ is the Frobenius norm of matrix $\mathbf{X}$.

To avoid overfitting, the regularization terms $\|\mathbf{U}\|^2_F$ and $\|\mathbf{V}\|^2_F$ are added to the objective function as follows:

$\min_{\mathbf{U}, \mathbf{V}} \big\|\mathbf{A} - \mathbf{U}\mathbf{V}\mathbf{U}^\top\big\|^2_F + \alpha\|\mathbf{U}\|^2_F + \beta\|\mathbf{V}\|^2_F,$ (88)

$s.t. \; \mathbf{U} \geq 0, \mathbf{V} \geq 0,$ (89)

where $\alpha$ and $\beta$ are the weights of the terms $\|\mathbf{U}\|^2_F$ and $\|\mathbf{V}\|^2_F$ respectively.

It is very hard to achieve the globally optimal result of this objective function for both $\mathbf{U}$ and $\mathbf{V}$. An alternating optimization schema can be used here, which updates $\mathbf{U}$ and $\mathbf{V}$ alternately. The Lagrangian function of the objective is:

$\mathcal{F} = Tr(\mathbf{A}\mathbf{A}^\top) - Tr(\mathbf{A}\mathbf{U}\mathbf{V}^\top\mathbf{U}^\top) - Tr(\mathbf{U}\mathbf{V}\mathbf{U}^\top\mathbf{A}^\top) + Tr(\mathbf{U}\mathbf{V}\mathbf{U}^\top\mathbf{U}\mathbf{V}^\top\mathbf{U}^\top) + \alpha Tr(\mathbf{U}\mathbf{U}^\top) + \beta Tr(\mathbf{V}\mathbf{V}^\top) - Tr(\mathbf{\Theta}\mathbf{U}) - Tr(\mathbf{\Omega}\mathbf{V}),$ (90)

where $\mathbf{\Theta}$ and $\mathbf{\Omega}$ are the multipliers for the constraints on $\mathbf{U}$ and $\mathbf{V}$ respectively.

By taking the derivatives of $\mathcal{F}$ with regard to $\mathbf{U}$ and $\mathbf{V}$ respectively, the partial derivatives of $\mathcal{F}$ will be

$\frac{\partial\mathcal{F}}{\partial\mathbf{U}} = -2\mathbf{A}^\top\mathbf{U}\mathbf{V} - 2\mathbf{A}\mathbf{U}\mathbf{V}^\top + 2\mathbf{U}\mathbf{V}^\top\mathbf{U}^\top\mathbf{U}\mathbf{V} + 2\mathbf{U}\mathbf{V}\mathbf{U}^\top\mathbf{U}\mathbf{V}^\top + 2\alpha\mathbf{U} - \mathbf{\Theta}^\top,$ (93)

$\frac{\partial\mathcal{F}}{\partial\mathbf{V}} = -2\mathbf{U}^\top\mathbf{A}\mathbf{U} + 2\mathbf{U}^\top\mathbf{U}\mathbf{V}\mathbf{U}^\top\mathbf{U} + 2\beta\mathbf{V} - \mathbf{\Omega}^\top.$ (95)

By setting $\frac{\partial\mathcal{F}}{\partial\mathbf{U}} = 0$ and $\frac{\partial\mathcal{F}}{\partial\mathbf{V}} = 0$ and using the KKT complementary condition, we can get the multiplicative update rules:

$\mathbf{U}(i, j) \leftarrow \mathbf{U}(i, j)\sqrt{\frac{(\mathbf{A}^\top\mathbf{U}\mathbf{V} + \mathbf{A}\mathbf{U}\mathbf{V}^\top)(i, j)}{(\mathbf{U}\mathbf{V}^\top\mathbf{U}^\top\mathbf{U}\mathbf{V} + \mathbf{U}\mathbf{V}\mathbf{U}^\top\mathbf{U}\mathbf{V}^\top + \alpha\mathbf{U})(i, j)}},$

$\mathbf{V}(i, j) \leftarrow \mathbf{V}(i, j)\sqrt{\frac{(\mathbf{U}^\top\mathbf{A}\mathbf{U})(i, j)}{(\mathbf{U}^\top\mathbf{U}\mathbf{V}\mathbf{U}^\top\mathbf{U} + \beta\mathbf{V})(i, j)}}.$ (96)

The low-rank matrix $\mathbf{U}$ captures the information of each user from the adjacency matrix and can be used in different ways. For instance, each row of $\mathbf{U}$ represents the latent feature vector of a user in the network, which can be used in many link prediction models, e.g., the supervised link prediction models. Meanwhile, based on the matrix $\mathbf{V}$ learnt from the model, the predicted score of link $(u, v)$ can be represented as $\mathbf{U}_u\mathbf{V}\mathbf{U}_v^\top$, where the notations $\mathbf{U}_u$ and $\mathbf{U}_v$ represent the rows in matrix $\mathbf{U}$ corresponding to users $u$ and $v$ respectively.

5.2 Cold Start Link Prediction for New Users

The previous works on link prediction focus on predicting potential links that will appear among all the users, based upon a snapshot of the social network. These works treat all users equally and try to predict social links for all users in the network. However, in real-world social networks, many new users join the service every day. Predicting social links for new users is more important than for the existing active users in the network, as the recommendations will leave the first impression on the new users. The first impression often has a lasting impact on a new user and may decide whether he will become an active user; a bad first impression can turn a new user away. So it is important to make meaningful recommendations to a new user to create a good first impression and attract him to participate more. For simplicity, we refer to users that have been actively using the network for a long time as "old users". It has been shown in previous works that there is a negative correlation between the age of nodes in the network and their link attachment rates: the distribution of the linkage formation probability follows a power-law decay with the age of the nodes [50]. So, new users are more likely to accept the recommended links compared with the existing old users, and predicting links for new users could lead to more social connections. In this part, we will introduce a recent research work on link prediction for new users, which is based on [128].

A natural challenge inherent in using the historical links in social networks to predict social links for new users lies in the differences between the information distributions of new users and old users, as mentioned before. To address this problem, [128] proposes a method to accommodate old users' and new users' sub-networks by using a within-network personalized sampling method to process the old users' information. By sampling the old users' sub-network, we want to meet the following objectives:

• Maximizing Relevance: We aim at maximizing the relevance of the old users' sub-network and the new users' sub-network, to accommodate the differences in the information distributions of new users and old users in the heterogeneous target network.

• Information Diversity: The diversity of the old users' information after sampling is still of great significance and should be preserved.

• Structure Maintenance: Some old users possessing sparse social links should have a higher probability to survive the sampling so as to maintain their links, and thereby the network structure.

Let the target network be $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \mathcal{V}_{old} \cup \mathcal{V}_{new}$ is the set of user nodes (i.e., the sets of old users and new users) in the target network. Personalized sampling is conducted on the old users' part, $G_{old} = (\mathcal{V}_{old}, \mathcal{E}_{old})$, in which each node is sampled independently with the sampling rate distribution vector $\vec{\delta} = (\delta_1, \delta_2, \cdots, \delta_n)$, where $n = |\mathcal{V}_{old}|$, $\sum_{i=1}^{n}\delta_i = 1$ and $\delta_i \geq 0$. The old users' sub-network after sampling is denoted as $\bar{G}_{old} = (\bar{\mathcal{V}}_{old}, \bar{\mathcal{E}}_{old})$. We aim at making the old users' sub-network as relevant to the new users' as possible. To measure the similarity between a user $u_i$ and a heterogeneous network $G$, we define a relevance function as follows:

$R(u_i, G) = \frac{1}{|\mathcal{V}|}\sum_{u_j \in \mathcal{V}} S(u_i, u_j),$ (97)

where the set $\mathcal{V}$ is the user set of network $G$ and $S(u_i, u_j)$ measures the similarity between users $u_i$ and $u_j$ in the network.

Each user has social relationships as well as other heterogeneous auxiliary information, and $S(u_i, u_j)$ is defined as the average of the similarity scores of these two parts:

$S(u_i, u_j) = \frac{1}{2}\big(S_{aux}(u_i, u_j) + S_{social}(u_i, u_j)\big).$ (98)

There are many different methods for measuring the similarities of the auxiliary information in different aspects, e.g., the cosine similarity. As to the social similarity, Jaccard's Coefficient can be used to depict how similar two users are in their social relationships.

The relevance between the sampled old users' network and the new users' network can be defined as the expectation of the function $R(u_{old}, G_{new})$:

$R(\bar{G}_{old}, G_{new}) = \mathbb{E}\big(R(u_{old}, G_{new})\big) = \frac{1}{|\mathcal{V}_{new}|}\sum_{j=1}^{|\mathcal{V}_{new}|}\mathbb{E}\big(S(u_{old}, u_{new,j})\big) = \frac{1}{|\mathcal{V}_{new}|}\sum_{j=1}^{|\mathcal{V}_{new}|}\sum_{i=1}^{|\mathcal{V}_{old}|}\delta_i \cdot S(u_{old,i}, u_{new,j}) = \vec{\delta}^\top\mathbf{s},$ (99)

where the vector $\mathbf{s}$ equals

$\mathbf{s} = \frac{1}{|\mathcal{V}_{new}|}\Big[\sum_{j=1}^{|\mathcal{V}_{new}|}S(u_{old,1}, u_{new,j}), \cdots, \sum_{j=1}^{|\mathcal{V}_{new}|}S(u_{old,n}, u_{new,j})\Big]^\top$ (103)

and $|\mathcal{V}_{old}| = n$. Besides the relevance, we also need to ensure that the diversity of the information in the sampled old users' sub-network is preserved. Similarly, it also includes the diversities of the auxiliary information and the social relationships. The diversity of the auxiliary information is determined by the sampling rates $\delta_i$, and can be defined with the averaged Simpson Index [92] over the old users' sub-network:

$D_{aux}(\bar{G}_{old}) = \frac{1}{|\mathcal{V}_{old}|}\sum_{i=1}^{|\mathcal{V}_{old}|}\delta_i^2.$ (104)

As to the diversity in the social relationships, the existence probability of a certain social link $(u_i, u_j)$ after sampling is proportional to $\delta_i \cdot \delta_j$. So, the diversity of the social links in the sampled network can be defined as the average existence probability of all the links in the old users' sub-network:

$D_{social}(\bar{G}_{old}) = \frac{1}{|\mathcal{S}_{old}|}\sum_{i=1}^{|\mathcal{V}_{old}|}\sum_{j=1}^{|\mathcal{V}_{old}|}\delta_i \cdot \delta_j \times I(u_i, u_j),$ (105)

where $|\mathcal{S}_{old}|$ is the size of the social link set of the old users' sub-network and $I(u_i, u_j)$ is an indicator function $I: (u_i, u_j) \to \{0, 1\}$ showing whether a certain social link exists originally before sampling. For example, if link $(u_i, u_j)$ is a social link in the target network originally before sampling, then $I(u_i, u_j) = 1$; otherwise, it equals 0.

Figure 3: Personalized Network Sampling with Preservation of the Network Structure Properties.

Considering these two terms simultaneously, we can define the diversity of the information in the sampled old users' sub-network as the average of the diversities of these two parts:

$D(\bar{G}_{old}) = \frac{1}{2}\big(D_{social}(\bar{G}_{old}) + D_{aux}(\bar{G}_{old})\big) = \vec{\delta}^\top \cdot \Big(\frac{1}{2|\mathcal{S}_{old}|}\cdot\mathbf{A}_{old} + \frac{1}{2|\mathcal{V}_{old}|}\cdot\mathbf{I}_{|\mathcal{V}_{old}|}\Big) \cdot \vec{\delta},$ (106)

where the matrix $\mathbf{I}_{|\mathcal{V}_{old}|}$ is the diagonal identity matrix of size $|\mathcal{V}_{old}| \times |\mathcal{V}_{old}|$ and $\mathbf{A}_{old}$ is the adjacency matrix of the old users' sub-network.

To ensure that the structure of the original old users' sub-network is not destroyed, we need to ensure that users with few links can also preserve their links. So, we can add a regularization term to increase the sampling rates of these users as well as their neighbors by maximizing the following term:

$Reg(\bar{G}_{old}) = \sum_i \min\{N_i, \min_{u_j \in \mathcal{N}_i} N_j\} \times \delta_i^2 = \vec{\delta}^\top \cdot \mathbf{M} \cdot \vec{\delta},$ (110)

where the matrix $\mathbf{M}$ is a diagonal matrix with element $\mathbf{M}(i, i) = \min\{N_i, \min_{u_j \in \mathcal{N}_i} N_j\}$, $\mathcal{N}_i$ denotes the neighbor set of user $u_i$, and $N_j = |\Gamma(u_j)|$ is the size of user $u_j$'s neighbor set. So, if a user or his/her neighbors have few links, then this user as well as his/her neighbors should have higher sampling rates so as to preserve the links between them.

For example, in Figure 3, we have 6 users. To decide the sampling rate of user $u_1$, we need to consider his/her social structure. We find that $u_1$'s neighbor $u_2$ has no other neighbor except $u_1$. To preserve the social link between $u_1$ and $u_2$, we need to increase the sampling rate of $u_2$. However, the existence probability of link $(u_1, u_2)$ is also decided by the sampling rate of user $u_1$, which needs to be increased too. Combining the diversity term and the structure preservation term, we can define the regularized diversity of the information after sampling as

$D_{Reg}(\bar{G}_{old}) = D(\bar{G}_{old}) + Reg(\bar{G}_{old}) = \vec{\delta}^\top \cdot \mathbf{N} \cdot \vec{\delta},$ (111)

where $\mathbf{N} = \frac{1}{2|\mathcal{V}_{old}|}\cdot\mathbf{I}_{|\mathcal{V}_{old}|} + \frac{1}{2|\mathcal{S}_{old}|}\cdot\mathbf{A}_{old} + \mathbf{M}$.

The optimal value of $\vec{\delta}$ should maximize both the relevance between the new users' sub-network and the old users' sub-network and the regularized diversity of the old users' information in the target network:

$\vec{\delta}^* = \arg\max_{\vec{\delta}} R(\bar{G}_{old}, G_{new}) + \theta \cdot D_{Reg}(\bar{G}_{old}) = \arg\max_{\vec{\delta}} \vec{\delta}^\top\mathbf{s} + \theta \cdot \vec{\delta}^\top \cdot \mathbf{N} \cdot \vec{\delta}$ (112)

$s.t. \; \sum_{i=1}^{|\mathcal{V}_{old}|}\delta_i = 1 \text{ and } \delta_i \geq 0,$ (114)

where the parameter $\theta$ is used to weight the importance of the regularized information diversity term. The learned sampling rates can be applied to randomly sample the old users' historical information, so as to utilize their information in building the model for predicting social links for the new users.

5.3 Link Prediction across Multiple Aligned Social Networks

Besides the link prediction problems in one single target network, some research works have studied simultaneous link prediction in multiple aligned online social networks. In the supervised link prediction models introduced before, among all the non-existing social links, a subset of the links can be identified and labeled as the negative instances. However, in the real world, labeling the links which will never be formed can be extremely hard and almost impossible, since new links keep being formed. In this section, we will introduce the cross-network concurrent link prediction problem with the PU learning setting.

Let $G^{(i)}$, $i \in \{1, 2, \cdots, n\}$, be a heterogeneous online social network among the multiple aligned networks. The user set and the existing social link set of $G^{(i)}$ can be represented as $\mathcal{U}^{(i)}$ and $\mathcal{E}^{(i)}_{u,u}$ respectively. In network $G^{(i)}$, all the existing links are the formed links, and, as a result, the formed links of $G^{(i)}$ can be represented as the positive set $\mathcal{P}^{(i)}$, where $\mathcal{P}^{(i)} = \mathcal{E}^{(i)}_{u,u}$. Furthermore, a large set of unconnected user pairs, referred to as the unconnected links $\mathcal{U}^{(i)}$, can be extracted from network $G^{(i)}$: $\mathcal{U}^{(i)} = \mathcal{U}^{(i)} \times \mathcal{U}^{(i)} \setminus \mathcal{P}^{(i)}$. However, no information about the links that will never be formed can be obtained from the network. With $\mathcal{P}^{(i)}$ and $\mathcal{U}^{(i)}$, we formulate the link formation prediction as the PU (Positive and Unlabeled) link prediction problem.

Formally, let the notations $\{\mathcal{P}^{(1)}, \cdots, \mathcal{P}^{(n)}\}$, $\{\mathcal{U}^{(1)}, \cdots, \mathcal{U}^{(n)}\}$ and $\{\mathcal{L}^{(1)}, \cdots, \mathcal{L}^{(n)}\}$ be the sets of formed links, unconnected links, and links to be predicted of the networks $G^{(1)}, G^{(2)}, \cdots, G^{(n)}$ respectively. With the formed and unconnected links of $G^{(1)}, G^{(2)}, \cdots, G^{(n)}$, the multi-network link prediction problem can be formulated as a multi-PU link prediction problem.

In this part, we will introduce the Mli model proposed in [146] to solve the multi-network link prediction problem. The Mli model includes 3 parts: (1) social meta path based feature extraction and selection; (2) PU link prediction; and (3) the multi-network link prediction framework, where the feature extraction is done based on the inter-network meta paths defined in Section 3. Next, we will mainly focus on introducing Steps (2) and (3) of the Mli model.

5.3.1 PU Link Prediction

In this subsection, we will introduce a method to solve the PU link prediction problem in one single network. As introduced in the problem formulation at the beginning, from a given network, e.g., $G$, two disjoint sets of links can be obtained: the connected (i.e., formed) links $\mathcal{P}$ and the unconnected links $\mathcal{U}$.

Figure 4: PU Link Prediction.

tained. To differentiate these links, Mli uses a new concept“connection state”, z, to show whether a link is connected(i.e., formed) or unconnected in network G. For a given linkl, if l is connected in the network, then z(l) = +1; otherwise,z(l) = −1. As a result, Mli can have the “connection states”of links in P and U to be: z(P) = +1 and z(U) = −1.

Besides the “connection state”, links in the network can alsohave their own “labels”, y, which can represent whether alink is to be formed or will never be formed in the network.For a given link l, if l has been formed or to be formed, theny(l) = +1; otherwise, y(l) = −1. Similarly, Mli can havethe “labels” of links in P and U to be: y(P) = +1 but y(U)can be either +1 or −1, as U can contain both links to beformed and links that will never be formed.

By using P and U as the positive and negative training sets,Mli can build a link connection prediction model Mc, whichcan be applied to predict whether a link exists in the originalnetwork, i.e., the connection state of a link. Let l be a linkto be predicted, by applyingMc to classify l, the connectionprobability of l can be represented to be:

Definition 24. (Connection Probability): The probabil-ity that link l’s connection states is predicted to be connected(i.e., z(l) = +1) is formally defined as the connection prob-ability of link l: p(z(l) = +1|x(l)), where x(l) denotes thefeature vector extracted for link l based on meta path.

Meanwhile, if we can obtain a set of links that “will never be formed”, i.e., “−1” links, from the network, then together with P (the “+1” links) they can be used to build a link formation prediction model Mf, which can be used to get the formation probability of l:

Definition 25. (Formation Probability): The probability that link l’s label is predicted to be formed or will be formed (i.e., y(l) = +1) is formally defined as the formation probability of link l: p(y(l) = +1|x(l)).

However, from the network, we have no information about “links that will never be formed” (i.e., “−1” links). As a result, the formation probabilities of the potential links that we aim to obtain can be very challenging to calculate. Meanwhile, the correlation between link l’s connection probability and formation probability has been proved in existing works [28] to be:

p(y(l) = +1|x(l)) ∝ p(z(l) = +1|x(l)). (115)

In other words, for links whose connection probabilities are low, their formation probabilities will be relatively low as well. This rule can be utilized to extract the links that are most likely to be reliable “−1” links from the network.

Figure 5: Multi-PU Link Prediction Framework (for each network G(i), features x(P(i)), x(U(i)), x(L(i)) are extracted, the PU link prediction model M(i) and meta path selection model MS(i) are built on P(i) and U(i) with labels y(P(i)), y(U(i)), the predicted probabilities p(L(i)) are used to update the network, and the procedure moves on to the next network).

Mli proposes to apply the link connection prediction model Mc built with P and U to classify the links in U, so as to extract the reliable negative link set.

Definition 26. (Reliable Negative Link Set): The reliable negative links in the unconnected link set U are those whose connection probabilities predicted by the link connection prediction model Mc are lower than a threshold ε ∈ [0, 1]:

RN = {l | l ∈ U, p(z(l) = +1|x(l)) < ε}. (116)

Some heuristic methods have been proposed to set the optimal threshold ε, e.g., the spy technique proposed in [63]. As shown in Figure 4, Mli proposes to randomly select a subset of links in P as the spies, SP, whose proportion is controlled by s%; s% = 15% is used as the default sample rate in [146]. The sets (P − SP) and (U ∪ SP) are used as the positive and negative training sets of the spy prediction model Ms. By applying Ms to classify the links in (U ∪ SP), their connection probabilities can be represented as:

p(z(l) = +1|x(l)), l ∈ (U ∪ SP), (117)

and the parameter ε is set as the minimal connection probability of the spy links in SP:

ε = min_{l∈SP} p(z(l) = +1|x(l)). (118)

With the extracted reliable negative link set RN, Mli can solve the PU link prediction problem with classification based link prediction methods, where P and RN are used as the positive and negative training sets respectively. Meanwhile, when applying the built model to predict the links in L(i), the optimal labels Y(i) of L(i) should be those which maximize the following formation probabilities:

Y(i) = arg max_{Y(i)} p(y(L(i)) = Y(i) | G(1), G(2), · · · , G(n)) (119)
= arg max_{Y(i)} p(y(L(i)) = Y(i) | [x_Φ(L(i))^⊤, x_Ψ(L(i))^⊤]^⊤), (120)

where y(L(i)) = Y(i) represents that the links in L(i) have labels Y(i).

5.3.2 Multi-Network Link Prediction Framework

Method Mli proposed in [146] is a general link prediction framework and can be applied to predict social links in n partially aligned networks simultaneously. When it comes to


n partially aligned networks, the optimal labels of the potential links L(1), L(2), · · · , L(n) of networks G(1), · · · , G(n) will be:

(Y(1), Y(2), · · · , Y(n)) = arg max_{Y(1),··· ,Y(n)} p(y(L(1)) = Y(1), · · · , y(L(n)) = Y(n) | G(1), · · · , G(n)). (121)–(122)

The above target function is very complex to solve and, in [146], Mli proposes to obtain the solution by updating one variable, e.g., Y(1), while fixing the other variables, e.g., Y(2), · · · , Y(n), alternately with the following equations [129]:

(Y(1))^(τ) = arg max_{Y(1)} p(y(L(1)) = Y(1) | G(1), G(2), · · · , G(n), (Y(2))^(τ−1), (Y(3))^(τ−1), · · · , (Y(n))^(τ−1)),

(Y(2))^(τ) = arg max_{Y(2)} p(y(L(2)) = Y(2) | G(1), G(2), · · · , G(n), (Y(1))^(τ), (Y(3))^(τ−1), · · · , (Y(n))^(τ−1)),

· · · · · ·

(Y(n))^(τ) = arg max_{Y(n)} p(y(L(n)) = Y(n) | G(1), G(2), · · · , G(n), (Y(1))^(τ), (Y(2))^(τ), · · · , (Y(n−1))^(τ)). (123)

The structure of framework Mli is shown in Figure 5. When predicting the social links in network G(i), Mli can extract features based on the intra-network social meta paths extracted from G(i) and those extracted based on the inter-network social meta paths across G(1), G(2), · · · , G(i−1), G(i+1), · · · , G(n) for the links in P(i), U(i) and L(i). The feature vectors x(P), x(U), as well as the labels y(P), y(U), of the links in P and U are passed to the PU link prediction model M(i) and the meta path selection model MS(i). The formation probabilities of the links in L(i) predicted by model M(i) are used to update the network by replacing the weights of L(i) with the newly predicted formation probabilities. The initial weights of these potential links in L(i) are set as 0. After finishing these steps on G(i), we will move on to conduct similar operations on G(i+1). Mli iteratively predicts the links in G(1) to G(n) in sequence until the results in all of these networks converge.
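The iterative procedure can be summarized with the following schematic Python loop. It is only a sketch of the control flow in Eq. (123): the network interface (attributes P, RN, L and set_link_weights) and the feature extractor are hypothetical placeholders, and the reliable negatives are assumed to come from the PU step described above.

import numpy as np
from sklearn.linear_model import LogisticRegression

def mli_alternating_prediction(networks, extract_features, max_rounds=10, tol=1e-4):
    # networks: list of objects with P (formed links), RN (reliable negative
    # links from the PU step), L (links to predict) and set_link_weights();
    # extract_features(link_set, networks) builds intra- plus inter-network
    # meta path features. Both interfaces are hypothetical.
    prev = [np.zeros(len(g.L)) for g in networks]
    for _ in range(max_rounds):
        deltas = []
        for i, g in enumerate(networks):
            X_pos = extract_features(g.P, networks)
            X_neg = extract_features(g.RN, networks)
            y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_neg))]
            clf = LogisticRegression(max_iter=1000).fit(np.vstack([X_pos, X_neg]), y)
            probs = clf.predict_proba(extract_features(g.L, networks))[:, 1]
            g.set_link_weights(g.L, probs)         # update the network weights
            deltas.append(np.abs(probs - prev[i]).max())
            prev[i] = probs
        if max(deltas) < tol:                       # results in all networks converge
            break
    return prev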

5.4 Sparse and Low-Rank Matrix Estimation based Inter-Network Link Prediction

Different online social networks usually have different functionalities, and the information in them follows quite different distributions. When predicting links across multiple aligned online social networks, the aforementioned link prediction models do not address this domain difference problem at all. In this section, we will introduce a cross-network link prediction model introduced in [125], which embeds the feature vectors of links from the aligned networks into a shared feature space. Via the shared feature space, knowledge from the source networks can be effectively transferred to the target network.

5.4.1 Link Prediction Objective Function

Link Prediction Loss Term

Given the target network G^t involving users U^t, the observed social connections among the users can be represented with the binary social adjacency matrix A^t ∈ {0, 1}^{|U^t|×|U^t|}, where entry A^t(i, j) = 1 iff the corresponding social link (u^t_i, u^t_j) exists between users u^t_i and u^t_j in G^t. In the problem studied here, our objective is to infer the potential unobserved social links of the target network, which can be achieved by finding a sparse and low-rank predictor matrix S ∈ S from some convex admissible set S ⊂ R^{|U^t|×|U^t|}. Meanwhile, the inconsistency between the inferred matrix S and the observed social adjacency matrix A^t can be represented as the loss function l(S, A^t). The optimal social link predictor for the target network can be obtained by minimizing this loss term, i.e.,

arg min_{S∈S} l(S, A^t). (124)

The loss function l(S, A^t) can be defined in many different ways; in [125], it is approximated by counting the loss introduced by the existing social links in E^t_u, i.e.,

l(S, A^t) = (1/|E^t_u|) Σ_{(u^t_i, u^t_j)∈E^t_u} 1((A^t(i, j) − 1/2) · S(i, j) ≤ 0). (125)

Intra-Network Attribute based Intimacy Term

Besides the connection information, there also exists a large amount of attribute information available in the target network, e.g., location check-in records, online social activity temporal patterns, and text usage patterns. Based on the attribute information, a set of features can be extracted for all the potential user pairs to denote their closeness, which are formally called the intimacy features. For instance, given user pair (u^t_i, u^t_j) in the target network, its intimacy features can be represented as a vector x^t_{i,j} ∈ R^{d^t} (d^t denotes the number of extracted intimacy features).

More generally, the feature vectors extracted for the user pairs can be represented as a 3-way tensor X^t ∈ R^{d^t×|U^t|×|U^t|}, where slice X^t(k, :, :) denotes the kth intimacy feature among all the user pairs. In online social networks, the homophily principle [67] has been widely observed to structure users’ online social connections: users who are close to each other are more likely to be friends. Based on such an intuition, the potential social connection matrix S can be inferred by maximizing the overall intimacy scores of the inferred new social connections, i.e.,

arg max_{S∈S} int(S, X^t). (126)

In [125], the introduced model defines the intimacy score term int(S, X^t) by enumerating and summing the intimacy scores of the inferred social connections, i.e.,

int(S, X^t) = Σ_{k=1}^{d^t} ||S ⊙ X^t(k, :, :)||_1, (127)

where the operator ⊙ denotes the Hadamard product (i.e., entrywise product) of matrices.

Inter-Network Attribute based Intimacy Term

Furthermore, with the information from the external source networks, more knowledge can be obtained about the users and their social patterns. By projecting the link instances into a shared feature space as introduced in [125], the adapted features from the target network and the external sources can be represented as tensors X̃^t, X̃^1, · · · , X̃^K. Formally, the intimacy scores of the potential social links based on the adapted features from the external source networks can be represented as

int(S, X̃^1, · · · , X̃^K) = Σ_{i=1}^{K} α_i · int(S, X̃^i), (128)


where the term int(S, X̃^i) = ||S ⊙ X̃^i||_1, and the users in X̃^i are organized in the same order as in X̃^t. The parameter α_i denotes the importance of the information transferred from the source network G^i.

Joint Objective Function

By adding the intimacy terms about the source networks into the objective function, the equation can be rewritten as follows:

arg min_{S∈S} l(S, A^t) − α_t · int(S, X̃^t) − Σ_{i=1}^{K} α_i · int(S, X̃^i) + γ · ||S||_1 + τ · ||S||_*, (129)–(130)

where ||S||_1 and ||S||_* denote the L1-norm and the trace norm of matrix S respectively.

5.4.2 Proximal Operator based CCCP Algorithm

By studying the objective function, we observe that the intimacy terms are convex while the empirical loss term l(S, A^t) is non-convex. In [125], the introduced model proposes to approximate it with other classical loss functions (e.g., the hinge loss and the Frobenius norm), and the convex squared Frobenius norm loss function is adopted in [125] (i.e., l(S, A^t) = ||S − A^t||_F^2). Therefore, the above objective function can be represented as a convex loss term minus another convex term, together with two convex non-differentiable regularizers, which renders the objective function non-trivial to optimize. According to the existing works [116; 96], this kind of objective function can be addressed with the concave-convex procedure (CCCP). CCCP is a majorization-minimization algorithm that solves a difference-of-convex-functions problem as a sequence of convex problems. Meanwhile, the regularization terms can be effectively handled with proximal operators in each iteration of the CCCP process.

CCCP Algorithm

Formally, the objective function can be decomposed into two convex functions:

u(S) = l(S, A^t) + γ · ||S||_1 + τ · ||S||_*, (131)

v(S) = α_t · int(S, X̃^t) + Σ_{i=1}^{K} α_i · int(S, X̃^i). (132)

With u(S) and v(S), the objective function can be rewritten as

arg min_{S∈S} u(S) − v(S). (133)

The CCCP algorithm can address the objective function with an iterative procedure that solves the following sequence of convex problems:

S^(h+1) = arg min_{S∈S} u(S) − S^⊤ ∇v(S^(h)). (134)

It is easy to show that the function v(S) is differentiable, and the derivative of v(S) is actually a constant term:

∇v(S) = Σ_{k∈{t,1,··· ,K}} α_k Σ_{i} X̃^k(i, :, :), (135)

where the inner sum runs over the feature slices of each adapted tensor.

By relying on Zangwill’s global convergence theory of iterative algorithms [119], it is theoretically proven in [96] that, as such a procedure continues, the generated sequence of

Algorithm 4 Proximal Operator based CCCP Algorithm

Input: social adjacency matrix A; projected feature tensors X̃^t, X̃^1, · · · , X̃^K
Output: link predictor matrix S
1: Initialize matrix S_cccp = A
2: Initialize CCCP convergence flag CCCP-tag = False
3: while CCCP-tag == False do
4:   Initialize proximal convergence flag Proximal-tag = False
5:   Formulate the optimization problem min_{S∈S} u(S) − S^⊤∇v(S_cccp)
6:   Initialize S_po = S_cccp
7:   while Proximal-tag == False do
8:     S_po = S_po − θ∇_S (l(S_po, A) − S_po^⊤ ∇v(S_cccp))
9:     S_po = prox_{θτ||·||_*}(S_po)
10:    S_po = prox_{θγ||·||_1}(S_po)
11:    if S_po converges then
12:      Proximal-tag = True
13:      S_cccp = S_po
14:    end if
15:  end while
16:  if S_cccp converges then
17:    CCCP-tag = True
18:  end if
19: end while
20: Return S_cccp

variables {S^(h)}_{h=0}^{∞} will converge to some stationary point S* in the inference space S.

Proximal Operators

Meanwhile, in each iteration of the CCCP updating process, the objective function is not easy to address due to the non-differentiable regularizers. Several works have studied objective functions involving such non-smooth terms. The Forward-Backward splitting method proposed in [18] can handle this kind of optimization problem with one single non-smooth regularizer based on proximal operators. More specifically, as introduced in [18], the proximal operators for the trace norm and the L1 norm can be represented as follows:

prox_{τ||·||_*}(S) = U diag((σ_i − τ)_+)_i V^⊤, (136)

prox_{γ||·||_1}(S) = sgn(S) ⊙ (|S| − γ)_+, (137)

where S = U diag(σ_i)_i V^⊤ denotes the singular value decomposition of matrix S, and diag(σ_i)_i represents the diagonal matrix with values σ_i on its diagonal.

Recently, some works have proposed the generalized Forward-Backward algorithm to tackle the case with q (q ≥ 2) non-differentiable convex regularizers [80]. These methods alternate the gradient step and the proximal steps to update the variables. For instance, given the above objective function in iteration h of the CCCP, the alternating updating equations in step k can be represented as follows:

S^(k) = S^(k−1) − θ · ∇_S (l(S, A) − S^⊤∇v(S^(h))),
S^(k) = prox_{θτ||·||_*}(S^(k)),
S^(k) = prox_{θγ||·||_1}(S^(k)), (138)

where the parameter θ denotes the learning rate; it is assigned a very small value to ensure the convergence of the above updates [83]. The pseudo-code of the proximal operator based CCCP algorithm is available in Algorithm 4.

6. COMMUNITY DETECTION

In the real-world online social networks, users tend to form


different social groups [4]. Users belonging to the same group usually have more frequent interactions with each other, while those in different groups have fewer interactions [149]. Formally, such social groups formed by users in online social networks are called online social communities [139]. Online social communities partition the network into a number of connected components, where the intra-community social connections are usually far denser than the inter-community social connections [139]. Meanwhile, from the mathematical representation perspective, due to these online social communities, the social network adjacency matrix tends to be not only sparse but also low-rank [143].

Identifying the social communities formed by users in online social networks is formally defined as the community detection problem [139; 137; 40]. Community detection is a very important problem in online social network studies, as it is a crucial prerequisite for numerous concrete social network services: (1) better organization of users’ friends in online social networks (e.g., Facebook and Twitter), which can be achieved by applying community detection techniques to partition users’ friends into different categories, e.g., schoolmates, family, celebrities, etc. [29]; (2) better recommender systems for users with common shopping preferences in e-commerce social sites (e.g., Amazon and Epinions), which can be addressed by grouping users with similar purchase records into the same clusters prior to building the recommender system [85]; and (3) better identification of influential users [104] for advertising campaigns in online social networks, which can be attained by selecting the most influential users in each community as the seed users in viral marketing [84].

In this section, we will focus on the social community detection problem in online social networks. Given a heterogeneous network G with node set V, the user nodes involved in network G can be represented as the set U ⊂ V. Based on both the social structures among users and the diverse attribute information in network G, the social community detection problem aims at partitioning the user set U into several subsets C = {U_1, U_2, · · · , U_k}, where each subset U_i, i ∈ {1, 2, · · · , k}, is called a social community. The term k formally denotes the total number of partitioned communities, which is usually provided as a hyper-parameter of the problem.

Depending on whether the users are allowed to be partitioned into multiple communities simultaneously or not, the social community detection problem can be categorized into two different types:

• Hard Social Community Detection: In the hard social community detection problem, each user will be partitioned into one single community, and all the social communities are disjoint without any overlap. In other words, given the communities C = {U_1, U_2, · · · , U_k} detected from network G, we have U = ⋃_i U_i and U_i ∩ U_j = ∅, ∀i, j ∈ {1, 2, · · · , k} ∧ i ≠ j.

• Soft Social Community Detection: In the soft social community detection problem, users can belong to multiple social communities simultaneously. For instance, if we apply the Mixture-of-Gaussian soft clustering algorithm as the base community detection model [148; 113], each user can belong to multiple communities with certain probabilities. In the soft social community detection result, the communities are no longer disjoint and will share some common users with other communities.

Meanwhile, depending on the network connection structures, the community detection problem can be categorized into directed network community detection [65] and undirected network community detection [149]. Based on the heterogeneity of the network information, the community detection problem can be divided into homogeneous network community detection [108] and heterogeneous network community detection [87; 99; 127; 143]. Furthermore, according to the number of networks involved, the community detection problem covers single network community detection [58] and multiple network community detection [139; 137; 40; 127; 143]. In this section, we will take the hard community detection problem as an example to introduce the existing models proposed for conventional (one single) homogeneous social networks, and especially the recent broad learning based (multiple aligned) heterogeneous social networks [51; 128; 129; 146].

This section is organized as follows. At the beginning, in Section 6.1, we will introduce the community detection problem and the existing methods proposed for a traditional single homogeneous network. After that, we will talk about the latest research works on social community detection across multiple aligned heterogeneous networks. The cold start community detection problem [137] is introduced in Section 6.2, in which we will talk about a new information transfer algorithm to propagate information from other developed source networks to the emerging target network. In Section 6.3, we will focus on the concurrent mutual community detection [139] across multiple aligned heterogeneous networks simultaneously, where information from the aligned networks is applied to mutually refine their community detection results. Finally, in Section 6.4, we talk about synergistic community detection across multiple large-scale networks based on distributed computing platforms [40].

6.1 Traditional Homogeneous Network Community Detection

The social community detection problem has been studied for a long time, and many community detection models have been proposed based on different types of techniques. In this section, we will talk about the social community detection problem for one single homogeneous network G, whose objective is to partition the user set U of network G into k disjoint subsets C = {U_1, U_2, · · · , U_k}, where U = ⋃_i U_i and U_i ∩ U_j = ∅, ∀i, j ∈ {1, 2, · · · , k}. Several different community detection methods will be introduced, including node proximity based community detection, modularity maximization based community detection, and spectral clustering based community detection.

6.1.1 Node Proximity based Community Detection

The node proximity based community detection method assumes that “close nodes tend to be in the same communities, while nodes far away from each other belong to different communities”. Therefore, the node proximity based community detection model partitions the nodes into different clusters based on node proximity measures [61]. Various node proximity measures can be used here, including the node structural equivalence to be introduced as follows,


as well as the various node closeness measures introduced in Section 5.1.1.

In a homogeneous network G, the proximity of two nodes, like u and v, can be calculated based on their positions and connections in the network structure.

Definition 27. (Structural Equivalence): Given a network G = (V, E), two nodes u, v ∈ V are said to be structurally equivalent iff

1. Nodes u and v are not connected and u and v share the same set of neighbors (i.e., (u, v) ∉ E ∧ Γ(u) = Γ(v)),

2. Or u and v are connected and, excluding themselves, u and v share the same set of neighbors (i.e., (u, v) ∈ E ∧ Γ(u) \ {v} = Γ(v) \ {u}).

Nodes which are structurally equivalent are substitutable, and switching their positions will not change the overall network structure. The structural equivalence concept can be applied to partition the nodes into different communities: structurally equivalent nodes can be grouped into the same communities, while nodes which are not equivalent in their positions will be partitioned into different groups. However, structural equivalence can be too restrictive for practical use in detecting the communities of real-world social networks. Computing the structural equivalence relationships among all the node pairs in the network can lead to a very high time cost. What’s more, the structural equivalence relationship will partition the social network structure into lots of small-sized fragments, since users have different social patterns in making friends online and few users actually have identical neighbors.

To avoid the weaknesses mentioned above, some other measures have been proposed to quantify the proximity among nodes in the networks. For instance, as introduced in Section 5.1.1, the node closeness measures based on the social connections can all be applied here to compute the node proximity, e.g., “common neighbor” and “Jaccard’s coefficient”. If we use “common neighbor” as the proximity measure, by applying it to the network G, the network can be transformed into a set of instances V with mutual closeness scores {c(u, v)}_{u,v∈V}. Some existing similarity/distance based clustering algorithms, like k-Medoids, can then be applied to partition the users into different communities.
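As a minimal illustration of this pipeline, the sketch below computes the common-neighbor proximity matrix of a binary undirected network and clusters its rows; scikit-learn’s KMeans is used here as a readily available stand-in for the k-Medoids algorithm mentioned above.

import numpy as np
from sklearn.cluster import KMeans  # stand-in for k-Medoids

def proximity_communities(A, k):
    """Node proximity based community detection with common neighbors.

    A: binary symmetric adjacency matrix; entry (u, v) of A @ A counts
    the common neighbors of u and v, serving as the proximity c(u, v).
    """
    C = A @ A                      # common-neighbor counts c(u, v)
    np.fill_diagonal(C, 0)         # ignore self-proximity
    return KMeans(n_clusters=k, n_init=10).fit_predict(C)  # cluster row vectors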

6.1.2 Modularity Maximization based Community Detection

Besides the pairwise proximity of nodes in the network, the connection strength of a community is also very important in the community detection process. Different measures have been proposed to compute the strength of a community, like the modularity measure [72] to be introduced in this part.

The modularity measure takes the node degree distribution into account. For instance, given the network G, the expected number of links existing between nodes u and v with degrees D(u) and D(v) can be represented as D(u)·D(v)/(2|E|).

Meanwhile, the real number of links existing between u and v in the network can be denoted as the entry A[u, v] of the social adjacency matrix A. For a user pair (u, v) with a low expected connection confidence score, being connected in the real world indicates that u and v have a relatively strong relationship with each other. If the community detection algorithm can partition such user pairs into the same group, it will be able to identify very strong social communities in the network.

Based on such an intuition, the strength of a community, e.g., U_i ∈ C, can be defined as

Σ_{u,v∈U_i} (A[u, v] − D(u)·D(v)/(2|E|)). (139)

Furthermore, the strength of the overall community detection result C = {U_1, U_2, · · · , U_k} can be defined as the modularity of the communities as follows.

Definition 28. (Modularity): Given the community detection result C = {U_1, U_2, · · · , U_k}, the modularity of the community structure is defined as

Q(C) = (1/(2|E|)) Σ_{U_i∈C} Σ_{u,v∈U_i} (A[u, v] − D(u)·D(v)/(2|E|)). (140)

The modularity concept effectively measures the strength of the detected community structure. Generally, a larger modularity score indicates a better community detection result.

Another way to interpret the modularity is via the number of links within and across communities. By rewriting the above modularity equation, we can have

Q(C) = (1/(2|E|)) Σ_{U_i∈C} Σ_{u,v∈U_i} (A[u, v] − D(u)·D(v)/(2|E|)) (141)–(142)
= (1/(2|E|)) ( Σ_{U_i∈C} Σ_{u,v∈U_i} A[u, v] − Σ_{U_i∈C} Σ_{u,v∈U_i} D(u)·D(v)/(2|E|) ) (143)
= (1/(2|E|)) Σ_{U_i∈C} Σ_{u,v∈U_i} A[u, v] − (1/(2|E|)²) Σ_{U_i∈C} (Σ_{u∈U_i} D(u)) (Σ_{v∈U_i} D(v)) (144)
= (1/(2|E|)) Σ_{U_i∈C} Σ_{u,v∈U_i} A[u, v] − (1/(2|E|)²) Σ_{U_i∈C} (Σ_{u∈U_i} D(u))². (145)

In the above equation, the term Σ_{u,v∈U_i} A[u, v] denotes the number of links connecting users within community U_i (which will be 2 times the number of intra-community links for undirected networks, as each link is counted twice). The term Σ_{u∈U_i} D(u) denotes the sum of node degrees in community U_i, which equals the number of intra-community and inter-community links connected to nodes in community U_i. If there exist lots of inter-community links, the modularity measure will have a smaller value; on the other hand, if the inter-community links are very rare, the modularity measure will have a larger value. Therefore, maximizing the community modularity measure is equivalent to minimizing the number of inter-community links.

The modularity measure can also be represented with linear algebra notation. Let matrix A denote the adjacency matrix of the network, and let vector d ∈ R^{|V|×1} denote the degrees of the nodes in the network. The modularity matrix can be defined as

B = A − (d d^⊤)/(2|E|). (146)

Let matrix H ∈ {0, 1}^{|V|×k} denote the communities that the users in V belong to. In real applications, such a binary


constraint can be relaxed to allow real-valued solutions for matrix H. The optimal community detection result can be obtained by solving the following objective function:

max_H (1/(2|E|)) Tr(H^⊤ B H) (147)
s.t. H^⊤ H = I, (148)

where the constraint H^⊤H = I ensures there is no overlap in the community detection result.

The above objective function looks very similar to the objective function of spectral clustering to be introduced in the next subsection. After obtaining the optimal H, the communities can be obtained by applying the k-Means algorithm to the rows of H to determine the cluster label of each node in the network.
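A minimal sketch of this relaxed modularity maximization: build the modularity matrix B of Eq. (146), take the k leading eigenvectors as the relaxed H, and run k-Means on its rows.

import numpy as np
from sklearn.cluster import KMeans

def modularity_communities(A, k):
    """Relaxed modularity maximization (Eqs. 146-148).

    A: symmetric binary adjacency matrix. Returns community labels.
    """
    d = A.sum(axis=1)
    two_m = d.sum()                        # equals 2|E| for an undirected network
    B = A - np.outer(d, d) / two_m         # modularity matrix (Eq. 146)
    vals, vecs = np.linalg.eigh(B)         # eigen-decomposition of symmetric B
    H = vecs[:, -k:]                       # top-k eigenvectors maximize Tr(H^T B H)
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)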

6.1.3 Spectral Clustering based Community Detection

In the community detection process, besides maximizing the proximity of nodes belonging to the same communities (as introduced in Section 6.1.1), minimizing the connections among nodes in different clusters is also an important factor. Different from the previous proximity based community detection algorithms, another way to address the community detection problem is from the cost perspective. Partitioning the nodes into different clusters will cut the links among the clusters. To ensure that the nodes partitioned into different clusters have fewer connections with each other, the number of links to be cut in the community detection process should be as small as possible [89; 107].

Cut

Formally, given the community structure C = {U_1, U_2, · · · , U_k} detected from network G, the number of links cut [89] between communities U_i, U_j ∈ C can be represented as

cut(U_i, U_j) = Σ_{u∈U_i} Σ_{v∈U_j} I(u, v), (149)

where the function I(u, v) = 1 if (u, v) ∈ E; otherwise, it will be 0.

The total number of links cut in the partition process can be represented as

cut(C) = Σ_{U_i∈C} cut(U_i, Ū_i), (150)

where the set Ū_i = U \ U_i denotes the nodes in the remaining communities except U_i. By minimizing the cut cost introduced in the partition process, the optimal community detection result can be obtained with the minimum number of cross-community links. However, as introduced in [89; 107], minimizing the cut of edges across clusters may produce highly imbalanced communities; some community may contain one single node. Such a problem becomes much more severe on real-world social network data. In the following part of this section, we will introduce two other cost measures that can help achieve more balanced community detection results.

Ratio-Cut and Normalized-Cut

The minimum cut cost treats all the links in the network equally, and it can often yield very imbalanced partition results (e.g., a singleton node as a cluster) when applied to the real-world community detection problem. To overcome this disadvantage, some models have been proposed to take the community size into consideration. The community size can be calculated by counting the number of nodes or links in each community, which leads to two new cost measures: ratio-cut and normalized-cut [89; 107].

Formally, given the community detection result C = {U_1, U_2, · · · , U_k} of network G, the ratio-cut and normalized-cut costs introduced by the community detection result can be defined as follows respectively:

ratio-cut(C) = (1/k) Σ_{U_i∈C} cut(U_i, Ū_i) / |U_i|, (151)

where |U_i| denotes the number of nodes in community U_i;

ncut(C) = (1/k) Σ_{U_i∈C} cut(U_i, Ū_i) / vol(U_i), (152)

where vol(U_i) denotes the degree sum of the nodes in community U_i. Compared with the plain cut cost, both ratio-cut and normalized-cut prefer a balanced partition of the social network.

Spectral Clustering

Actually, the objective functions of both ratio-cut and normalized-cut can be unified into the following linear algebra formulation:

min_{H∈{0,1}^{|V|×k}} Tr(H^⊤ L̃ H), (153)

where matrix H ∈ {0, 1}^{|V|×k} denotes the communities that the users in V belong to.

Let A ∈ {0, 1}^{|V|×|V|} denote the social adjacency matrix of the network, and let D be the corresponding diagonal matrix of A, where D(i, i) = Σ_j A(i, j). The Laplacian matrix of the network adjacency matrix A can be represented as L = D − A. Depending on the specific cost measure applied, matrix L̃ can be represented as

L̃ = L for the ratio-cut measure, and L̃ = D^{−1/2} L D^{−1/2} for the normalized-cut measure. (154)

The binary constraint on the variable H renders the problem a non-linear integer program, which is very hard to solve. One common practice to learn the variable H is to apply spectral relaxation, replacing the binary constraint with an orthogonality constraint:

min_H Tr(H^⊤ L̃ H), (155)
s.t. H^⊤ H = I. (156)

As proposed in [89], the optimal solution H* to the above objective function consists of the eigenvectors corresponding to the k smallest eigenvalues of matrix L̃.
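The full spectral clustering pipeline then amounts to a few lines of numpy: form the (possibly normalized) Laplacian of Eq. (154), take the k smallest eigenvectors, and cluster the rows with k-Means.

import numpy as np
from sklearn.cluster import KMeans

def spectral_communities(A, k, normalized=True):
    """Spectral clustering via the ratio-cut / normalized-cut relaxation (Eqs. 153-156)."""
    d = A.sum(axis=1)
    L = np.diag(d) - A                                   # Laplacian L = D - A
    if normalized:                                       # L~ = D^{-1/2} L D^{-1/2}
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = L * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    H = vecs[:, :k]                                      # k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)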

6.2 Emerging Network Community Detection

The community detection algorithms introduced in the previous section are mostly proposed for one single homogeneous network. However, in the real world, most of the


online social networks are actually heterogeneous and contain very complex information. In recent years, lots of new online social networks have emerged and started to provide services, and the information available for the users in these emerging networks is usually very limited. Meanwhile, many of the users are involved in multiple online social networks simultaneously. Users of these emerging networks may also have been involved in other developed social networks for a long time [137; 121]. The abundant information available in these mature networks can actually be useful for community detection in the emerging networks. In this section, we will introduce cross-network community detection for emerging networks with information transferred from other mature social networks [137].

In this part, we will introduce social community detection for emerging networks with information propagated across multiple partially aligned social networks, which is formally defined as the “emerging network community detection” problem. Especially, when the network is brand new, the problem becomes the “cold start community detection” problem. The cold start problem is mostly prevalent in recommender systems [128], where the system cannot draw any inferences for users or items for which it has not yet gathered sufficient information, but few works have studied the cold start problem in clustering/community detection. The “emerging network community detection” and “cold start community detection” problems studied in this section are both novel problems and very different from existing works on community detection with abundant information.

The networks studied in this section can be formulated as two partially aligned attribute augmented heterogeneous networks G = ((G^t, G^s), (A^{t,s}, A^{s,t})), where G^t and G^s are the emerging target network and the well-developed source network respectively, and A^{t,s}, A^{s,t} are the sets of anchor links between G^t and G^s. Both G^t and G^s can be formulated as attribute augmented heterogeneous social networks, e.g., G^t = (V^t, E^t, A^t) (where the sets V^t, E^t and A^t denote the user nodes, social links and diverse attributes of the network). With information propagated across G, the intimacy matrix H among the users in V^t can be computed. The emerging network community detection problem aims at partitioning the user set V^t of the emerging network G^t into K disjoint clusters, C = {C_1, C_2, · · · , C_K}, based on the intimacy matrix H, where ⋃_{i=1}^{K} C_i = V^t and C_i ∩ C_j = ∅, ∀i, j ∈ {1, 2, · · · , K}, i ≠ j. When the target network G^t is brand new, i.e., E^t = ∅ and A^t = ∅, the problem becomes the cold start community detection problem.

To solve the above challenges, we will introduce a novel community detection method, Cad, proposed in [137]. Cad introduces a new concept, intimacy, to measure the closeness among users with both link and attribute information in online social networks. Useful information from the aligned well-developed networks is propagated via Cad to the emerging network to alleviate its shortage of information.

6.2.1 Intimacy Matrix of Homogeneous Network

The Cad model is built based on the closeness scores among users, which are formally called the intimacy scores in this section. Here, we will introduce the intimacy scores and the intimacy matrix used in Cad from an information propagation perspective.

For a given homogeneous network, e.g., G = (V, E), where V is the set of users and E is the set of social links among the users in V, the adjacency matrix of G can be defined as A ∈ R^{|V|×|V|}, where A(i, j) = 1 iff (u_i, u_j) ∈ E. Meanwhile, via the social links in E, information can propagate among the users within the network, and the propagation paths can reflect the closeness among users [75]. Formally, the term

p_{ji} = A(j, i) / sqrt((Σ_m A(j, m)) (Σ_n A(n, i))) (157)

is called the information transition probability from u_j to u_i, which equals the proportion of information propagated from u_j to u_i in one step.

We can use an example to illustrate how information propagates within the network more clearly. Let’s assume that user u_i ∈ V injects a stimulation into network G initially, and the information is propagated to the other users in G via the social interactions afterwards. During the propagation process, users receive stimulation from their neighbors, and the amount is proportional to the difference between the amount of information reaching the user and that reaching his neighbors. Let the vector f^(τ) ∈ R^{|V|} denote the states of all users in V at time τ, i.e., the proportion of stimulation at the users in V at τ. The change of stimulation at u_i at time τ + Δt is defined as follows:

(f^(τ+Δt)(i) − f^(τ)(i)) / Δt = α Σ_{u_j∈V} p_{ji} (f^(τ)(j) − f^(τ)(i)), (158)

where the coefficient α can be set as 1. The transition probabilities p_{ij}, i, j ∈ {1, 2, · · · , |V|}, can be represented with the transition matrix

X = D^{−1/2} A D^{−1/2} (159)

of network G, where X ∈ R^{|V|×|V|}, X(i, j) = p_{ij}, and the diagonal matrix D ∈ R^{|V|×|V|} has D(i, i) = Σ_{j=1}^{|V|} A(i, j) on its diagonal.

Definition 29. (Social Transition Probability Matrix): The social transition probability matrix of network G can be represented as Q = X − D_X, where X is the transition matrix defined above and the diagonal matrix D_X has D_X(i, i) = Σ_{j=1}^{|V|} X(i, j) on its diagonal.

Furthermore, by setting Δt = 1, i.e., the stimulation propagates step by step in discrete time through the network, the propagation updating equation can be rewritten as:

f^(τ) = f^(τ−1) + α(X − D_X) f^(τ−1) = (I + αQ) f^(τ−1) (160)
= (I + αQ)^τ f^(0). (161)

Such a propagation process will stop when f^(τ) = f^(τ−1), i.e.,

(I + αQ)^τ = (I + αQ)^(τ−1). (162)

The smallest τ that can stop the propagation is defined as the stop step. To obtain the stop step τ, Cad needs to keep checking the powers of (I + αQ) until the matrix no longer changes as τ increases, i.e., the stop criterion.

Definition 30. (Intimacy Matrix): The matrix

H = (I + αQ)^τ ∈ R^{|V|×|V|} (163)

is defined as the intimacy matrix of the users in V, where τ is the stop step and H(i, j) denotes the intimacy score between u_i and u_j ∈ V in the network.
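The stop-step search translates directly into repeated matrix powering. The following numpy sketch computes the intimacy matrix of a homogeneous network under the definitions above; max_steps and the numerical tolerance are practical safeguards not specified in [137].

import numpy as np

def intimacy_matrix(A, alpha=1.0, max_steps=50, tol=1e-8):
    """Propagation-based intimacy matrix H = (I + alpha*Q)^tau (Eq. 163).

    A: binary adjacency matrix. Iterates powers of (I + alpha*Q) until
    the stop criterion (Eq. 162) is met or max_steps is reached.
    """
    d = np.maximum(A.sum(axis=1), 1e-12)
    X = A / np.sqrt(np.outer(d, d))        # transition matrix D^{-1/2} A D^{-1/2}
    Q = X - np.diag(X.sum(axis=1))         # social transition probability matrix
    M = np.eye(len(A)) + alpha * Q
    H = M.copy()
    for _ in range(max_steps):
        H_next = H @ M                     # next power of (I + alpha*Q)
        if np.abs(H_next - H).max() < tol: # stop criterion: powers stop changing
            break
        H = H_next
    return H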


6.2.2 Intimacy Matrix of Attributed Heterogeneous Network

Real-world social networks usually contain various kinds of information, e.g., links and attributes, and can be formulated as G = (V, E, A). The attribute set A = {a_1, a_2, · · · , a_m}, with a_i = {a_{i1}, a_{i2}, · · · , a_{in_i}}, can have n_i different values for i ∈ {1, 2, · · · , m}. An example of an attribute augmented heterogeneous network is given in Figure 6, where Figure 6(a) is the input attribute augmented heterogeneous network and Figures 6(b)-6(d) show the attribute information in the network, which includes timestamps, text and location check-ins. Including the attributes as a special type of node in the graph definition provides a conceptual framework to handle social links and node attributes in a unified way. The resulting increase in the dimensionality of the network will be handled in a lower dimensional space, as in Lemma 1.

Definition 31. (Attribute Transition Probability Matrix): The connections between users and an attribute, e.g., a_i, can be represented as the attribute adjacency matrix A_{a_i} ∈ R^{|V|×n_i}. Based on A_{a_i}, Cad formally defines the attribute transition probability matrix from users to attribute a_i as R_i ∈ R^{|V|×n_i}, where

R_i(j, k) = A_{a_i}(j, k) / sqrt((Σ_{m=1}^{n_i} A_{a_i}(j, m)) (Σ_{n=1}^{|V|} A_{a_i}(n, k))). (164)

Similarly, Cad defines the attribute transition probability matrix from attribute a_i to the users in V as S_i = R_i^⊤.

The importance of different information types in calculating the closeness measure among users can be different. To handle the network heterogeneity problem, the Cad model applies micro-level control by giving the different information sources distinct weights: ω = [ω_0, ω_1, · · · , ω_m]^⊤, where Σ_{i=0}^{m} ω_i = 1.0, ω_0 is the weight of the link information, and ω_i is the weight of attribute a_i for i ∈ {1, 2, · · · , m}.

Definition 32. (Weighted Attribute Transition Probability Matrix): With the weights ω, Cad defines the matrices

R = [ω_1 R_1, · · · , ω_m R_m], and S = [ω_1 S_1, · · · , ω_m S_m]^⊤ (165)

to be the weighted attribute transition probability matrices between users and all attributes, where R ∈ R^{|V|×(n_aug−|V|)}, S ∈ R^{(n_aug−|V|)×|V|}, and n_aug = |V| + Σ_{i=1}^{m} n_i is the number of all user and attribute nodes in the augmented network.

Definition 33. (Network Transition Probability Matrix): Furthermore, the transition probability matrix of the whole attribute augmented heterogeneous network G is defined as

Q_aug = [ Q̄  R
          S  0 ], (166)

where Q_aug ∈ R^{n_aug×n_aug} and the block matrix Q̄ = ω_0 Q is the weighted social transition probability matrix of the social links in E.

In the real world, heterogeneous social networks can contain large amounts of attributes, i.e., n_aug can be extremely large. The weighted transition probability matrix Q_aug can thus be of extremely high dimensions and can hardly fit in memory. As a result, it would be infeasible to keep updating the matrix until the stop criterion is met so as to obtain the stop step and the intimacy matrix. To solve this problem, Cad obtains the stop step and the intimacy matrix by applying partitioned block matrix operations with the following Lemma 1.

Lemma 1. (Q_aug)^k = [ Q_k  Q_{k−1}R
                        S Q_{k−1}  S Q_{k−2} R ], k ≥ 2, where

Q_k = I, if k = 0;  Q_k = Q̄, if k = 1;  Q_k = Q̄ Q_{k−1} + R S Q_{k−2}, if k ≥ 2, (167)

and the intimacy matrix among the users in V can be represented as

H_aug = ((I + αQ_aug)^τ)(1 : |V|, 1 : |V|) (168)
= (Σ_{t=0}^{τ} C(τ, t) α^t (Q_aug)^t)(1 : |V|, 1 : |V|) (169)
= Σ_{t=0}^{τ} C(τ, t) α^t ((Q_aug)^t (1 : |V|, 1 : |V|)) (170)
= Σ_{t=0}^{τ} C(τ, t) α^t Q_t, (171)

where C(τ, t) denotes the binomial coefficient, X(1 : |V|, 1 : |V|) is the sub-matrix of X with indexes in the range [1, |V|], and τ is the stop step, achieved when Q_τ = Q_{τ−1}, i.e., the stop criterion; Q_τ is called the stationary matrix of the attribute augmented heterogeneous network.

Proof. The lemma can be proved by induction on k. Considering that (RS) ∈ R^{|V|×|V|} can be precomputed in advance, the space cost of Lemma 1 is O(|V|^2), where |V| ≪ n_aug.

Since we are only interested in the intimacy and transition matrices among the user nodes, instead of those between the augmented attribute items and the users, for the community detection task, Cad creates a reduced dimensional representation involving only the users for Q_k and H, such that it can capture the effect of the “user-attribute” and “attribute-user” transitions on the “user-user” transition. Q_k is a reduced dimension representation of Q^k_aug; while eliminating the augmented items, it can still capture the “user-user” transitions effectively.
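A sketch of that reduced-dimension computation, assuming the blocks Q̄, R and S have been built as in Definitions 31-33: only |V|×|V| matrices are ever multiplied, matching the O(|V|^2) space cost stated in the proof.

import numpy as np

def reduced_user_blocks(Q_bar, R, S, max_steps=50, tol=1e-8):
    """Reduced-dimension powers Q_k from Lemma 1, without forming Q_aug.

    Q_bar: |V| x |V| weighted social transition block; R, S: weighted
    user-attribute / attribute-user transition blocks. Only the |V| x |V|
    product RS is materialized.
    """
    RS = R @ S                                   # precompute (RS), |V| x |V|
    q_prev2, q_prev1 = np.eye(len(Q_bar)), Q_bar # Q_0 = I, Q_1 = Q_bar
    blocks = [q_prev2, q_prev1]
    for _ in range(2, max_steps):
        q_k = Q_bar @ q_prev1 + RS @ q_prev2     # Q_k recursion (Eq. 167)
        blocks.append(q_k)
        if np.abs(q_k - q_prev1).max() < tol:    # stop criterion Q_tau = Q_{tau-1}
            break
        q_prev2, q_prev1 = q_prev1, q_k
    return blocks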

6.2.3 Intimacy Matrix across Aligned Heterogeneous Networks

When G^t is new, the intimacy matrix H among users calculated based on the information in G^t alone can be very sparse. To solve this problem, Cad propagates useful information from other well-developed aligned networks to the emerging network. Information propagated from the aligned well-developed networks can help solve the information shortage problem in the emerging network [128; 129]. However, as proposed in [74], different networks can have different properties, and the information propagated from other well-developed aligned networks can also be very different from that of the emerging network.

To handle this problem, the Cad model applies a macro-level control technique, using weights ρ^{s,t}, ρ^{t,s} ∈ [0, 1] to control the proportion of information propagated


Figure 6: An example of attribute augmented heterogeneous network. (a): attribute augmented heterogeneous network, (b): timestamp attribute, (c): text attribute, (d): location checkin attribute.

between the developed network G^s and the emerging network G^t. If the information from G^s is helpful for improving the community detection results in G^t, Cad can set a higher ρ^{s,t} to propagate more information from G^s; otherwise, Cad can set a lower ρ^{s,t}. The weights ρ^{s,t} and ρ^{t,s} can be adjusted automatically with the method introduced in [137].

Definition 34. (Anchor Transition Matrix): To propagate information across the networks, Cad introduces the anchor transition matrices between G^t and G^s, T^{t,s} ∈ R^{|V^t|×|V^s|} and T^{s,t} ∈ R^{|V^s|×|V^t|}, where the entries T^{t,s}(i, j) = T^{s,t}(j, i) = 1 iff (u^t_i, u^s_j) ∈ A^{t,s}, u^t_i ∈ V^t, u^s_j ∈ V^s.

Meanwhile, with the weights ρ^{s,t} and ρ^{t,s}, the weighted network transition probability matrices of G^t and G^s are represented as

Q^t_aug = (1 − ρ^{t,s}) [ Q̄^t  R^t
                          S^t  0 ],  Q^s_aug = (1 − ρ^{s,t}) [ Q̄^s  R^s
                                                               S^s  0 ], (172)

where Q^t_aug ∈ R^{n^t_aug×n^t_aug} and Q^s_aug ∈ R^{n^s_aug×n^s_aug}, and n^t_aug and n^s_aug are the numbers of all nodes in G^t and G^s respectively.

Furthermore, to accommodate the dimensions, Cad introduces the weighted anchor transition matrices between G^t and G^s as

T̄^{t,s} = ρ^{t,s} [ T^{t,s}  0
                     0       0 ], and T̄^{s,t} = ρ^{s,t} [ T^{s,t}  0
                                                           0       0 ], (173)

where T̄^{t,s} ∈ R^{n^t_aug×n^s_aug} and T̄^{s,t} ∈ R^{n^s_aug×n^t_aug}. The nodes corresponding to the entries in T̄^{t,s} and T̄^{s,t} are of the same order as those in Q^t_aug and Q^s_aug respectively.

By combining the weighted intra-network transition probability matrices with the weighted anchor transition matrices, Cad defines the transition probability matrix across the aligned networks as

Q_align = [ Q^t_aug  T̄^{t,s}
            T̄^{s,t}  Q^s_aug ], (174)

where Q_align ∈ R^{n_align×n_align} and n_align = n^t_aug + n^s_aug is the number of all nodes across the aligned networks.

Definition 35. (Aligned Network Intimacy Matrix): According to the previous remarks, with Q_align, Cad can obtain the intimacy matrix H_align of the users in G^t as

H_align = ((I + αQ_align)^τ)(1 : |V^t|, 1 : |V^t|), (175)

where H_align ∈ R^{|V^t|×|V^t|} and τ is the stop step.

Meanwhile, the structure of (I + αQ_align) does not meet the requirements of Lemma 1, as it does not have a zero square matrix at the bottom right corner. As a result, the method introduced in Lemma 1 cannot be applied. To obtain the stop step, there would be no choice but to keep calculating the powers of (I + αQ_align) until the stop criterion is met, which can be very time consuming. In this part, we will introduce Lemma 2, adopted by the Cad model for the efficient computation of high-order powers of the matrix (I + αQ_align).

Lemma 2. For the given matrix (I + αQ_align), its kth power satisfies

(I + αQ_align)^k P = P Λ^k, k ≥ 1, (176)

where the matrices P and Λ contain the eigenvectors and eigenvalues of (I + αQ_align): the ith column of matrix P is the eigenvector of (I + αQ_align) corresponding to its ith eigenvalue λ_i, and the diagonal matrix Λ has Λ(i, i) = λ_i on its diagonal.

The lemma can be proved by induction on k [79]. The time cost of calculating Λ^k is O(n_align), which is far less than that required to calculate (I + αQ_align)^k directly.

Definition 36. (Eigen-decomposition based Aligned Network Intimacy Matrix): In addition, if P is invertible, we have

(I + αQ_align)^k = P Λ^k P^{−1}, (177)

where Λ^k has Λ(i, i)^k on its diagonal. The intimacy matrix calculated based on the eigenvalue decomposition will then be

H_align = (P Λ^τ P^{−1})(1 : |V^t|, 1 : |V^t|), (178)

where the stop step τ is obtained when P Λ^τ P^{−1} = P Λ^{τ−1} P^{−1}, i.e., the stop criterion.

Based on the computed matrix H_align, various clustering methods, e.g., k-Medoids, can be adopted to identify the social communities.
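Under the (assumed) invertibility of P, the eigen-decomposition trick of Lemma 2 and Definition 36 can be sketched as follows; note that np.linalg.eig may return complex values for a non-symmetric Q_align, so the real part is taken at the end.

import numpy as np

def aligned_intimacy(Q_align, n_t, alpha=1.0, max_steps=50, tol=1e-8):
    """Eigen-decomposition based aligned intimacy matrix (Eqs. 177-178).

    Q_align: cross-network transition matrix; n_t: number of target-network
    nodes whose user block is returned.
    """
    vals, P = np.linalg.eig(np.eye(len(Q_align)) + alpha * Q_align)
    P_inv = np.linalg.inv(P)                 # assumes P is invertible
    H_prev = None
    for tau in range(1, max_steps):
        H = (P * vals**tau) @ P_inv          # P diag(lambda^tau) P^{-1}
        if H_prev is not None and np.abs(H - H_prev).max() < tol:
            break                            # stop criterion met at step tau
        H_prev = H
    return H.real[:n_t, :n_t]                # user block of the target network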

6.3 Mutual Community Detection

Besides the knowledge transfer from developed networks to emerging networks to overcome the cold start problem, information in developed networks can also be transferred mutually to help refine the community structures detected from each of them. In this section, we will introduce the mutual community detection problem across multiple aligned heterogeneous networks together with a new cross-network mutual community detection model, Mcd. To refine the community structures, a new concept named discrepancy is introduced to help preserve the consensus of the community detection results of the shared anchor users according to [139].

For the given multiple aligned heterogeneous networks G, the mutual community detection problem aims to obtain the optimal communities C^(1), C^(2), · · · , C^(n) for G^(1), G^(2), · · · , G^(n) simultaneously, where C^(i) = {U^(i)_1, U^(i)_2, . . . , U^(i)_{k^(i)}} is a partition of the user set U^(i) of G^(i), k^(i) = |C^(i)|, U^(i)_l ∩ U^(i)_m = ∅ for all l, m ∈ {1, 2, . . . , k^(i)}, and ⋃_{j=1}^{k^(i)} U^(i)_j = U^(i).

Users in each detected social community are more densely connected with each other than with users in other communities. In this section, we focus on studying the hard (i.e., non-overlapping) community detection of users in online social networks, and we will illustrate the model proposed in [139].

Instead of the propagation based social intimacy scores, Mcd uses the meta paths introduced in Section 3 to exploit both direct and indirect connections among users in the closeness score calculation. With full consideration of the network characteristics, Mcd exploits the information in the aligned networks to refine and disambiguate the community structures of the multiple networks concurrently. More details of the Mcd model are introduced as follows.

6.3.1 Meta Path based Social Proximity Measure

Many existing similarity measures, e.g., “Common Neighbor” [38] and “Jaccard’s Coefficient” [38], defined for homogeneous networks cannot capture all the connections among users in heterogeneous networks. To use both direct and indirect connections among users when calculating their similarity scores in a heterogeneous information network, Mcd introduces the meta path based similarity measure HNMP-Sim, which is described as follows.

In heterogeneous networks, pairs of nodes can be connected by different paths, which are sequences of links in the network. Meta paths [98; 99] in heterogeneous networks, i.e., heterogeneous network meta paths (HNMPs), can capture both direct and indirect connections among the nodes in a network. The length of a meta path is defined as the number of links that constitute it. Meta paths can start and end with various node types; in this section, however, we are mainly concerned with those starting and ending with users, which are formally defined as the social HNMPs. A formal definition of social HNMPs is available in [146; 139; 145]. The notation, definition and semantics of the 7 different social HNMPs used in Mcd are listed in Table 1. To extract the social meta paths, prior domain knowledge about the network structure is required.

These 7 different social HNMPs in Table 1 can cover a large portion of the connections among users in the networks. Some meta path based similarity measures have been proposed so far, e.g., PathSim [98], which is defined for undirected networks and considers different meta paths to be of the same importance. To measure the social closeness among users in directed heterogeneous information networks, PathSim is extended into the following new closeness measure.

Definition 37. (HNMP-Sim): Let P_i(x ⇝ y) and P_i(x ⇝ ·) be the sets of path instances of HNMP #i going from x to y and from x to other nodes in the network respectively. The HNMP-Sim (HNMP based Similarity) of a node pair (x, y) is defined as

HNMP-Sim(x, y) = Σ_i ω_i ((|P_i(x ⇝ y)| + |P_i(y ⇝ x)|) / (|P_i(x ⇝ ·)| + |P_i(y ⇝ ·)|)), (179)

where ω_i is the weight of the ith HNMP and Σ_i ω_i = 1. In Mcd, the weights of the different HNMPs can be automatically adjusted by applying a greedy search technique, as introduced in [139; 137].

Let A_i be the adjacency matrix corresponding to the ith HNMP among the users in the network, where A_i(m, n) = k iff there exist k different path instances of the ith HNMP from user m to user n. Furthermore, the similarity score matrix among the users for HNMP #i can be represented as S_i = B_i ⊙ (A_i + A_i^⊤), where A_i^⊤ denotes the transpose of A_i and B_i stores the reciprocal of the summed out-degrees of users x and y, i.e., B_i(x, y) = 1 / (Σ_m A_i(x, m) + Σ_m A_i(y, m)). The symbol ⊙ represents the Hadamard product of two matrices. The HNMP-Sim matrix of the network, which can capture all possible connections among the users, is represented as:

S = Σ_i ω_i S_i = Σ_i ω_i (B_i ⊙ (A_i + A_i^⊤)). (180)

6.3.2 Network Characteristic Preservation Clustering

Clustering each network independently can preserve each network’s characteristics effectively, as no information from external networks will interfere with the clustering results. Partitioning the users of a certain network into several clusters will inevitably cut connections in the network and lead to some costs. Optimal clustering results can be achieved by minimizing these clustering costs.

For a given network G, let C = {U_1, U_2, . . . , U_k} be the community structure detected from G, and let Ū_i = U − U_i be the complement of set U_i in G. Various cost measures of the partition C can be used, e.g., the cut and normalized cut introduced in Section 6.1.3:

cut(C) = (1/k) Σ_{i=1}^{k} S(U_i, Ū_i) = (1/k) Σ_{i=1}^{k} Σ_{u∈U_i, v∈Ū_i} S(u, v), (181)

ncut(C) = (1/k) Σ_{i=1}^{k} S(U_i, Ū_i) / S(U_i, ·) = (1/k) Σ_{i=1}^{k} cut(U_i, Ū_i) / S(U_i, ·), (182)

where the term S(u, v) denotes the HNMP-Sim between u and v, and S(U_i, ·) = S(U_i, U) = S(U_i, U_i) + S(U_i, Ū_i).

For all the users in U, the clustering result can be represented with the result confidence matrix H = [h_1, h_2, . . . , h_n]^⊤, where n = |U|, h_i = (h_{i,1}, h_{i,2}, . . . , h_{i,k}), and h_{i,j} denotes the confidence that u_i ∈ U is in cluster U_j ∈ C. The optimal H that minimizes the normalized-cut cost can be obtained by solving the following objective function [107]:

min_H Tr(H^⊤ L H), (183)
s.t. H^⊤ D H = I, (184)

where L = D − S, the diagonal matrix D has D(i, i) = Σ_j S(i, j) on its diagonal, and I is the identity matrix.

6.3.3 Discrepancy based Clustering of Multiple Networks


Table 1: Summary of HNMPs.

ID | Notation | Heterogeneous Network Meta Path | Semantics
1 | U → U | User --follow--> User | Follow
2 | U → U → U | User --follow--> User --follow--> User | Follower of Follower
3 | U → U ← U | User --follow--> User --follow^{-1}--> User | Common Out Neighbor
4 | U ← U → U | User --follow^{-1}--> User --follow--> User | Common In Neighbor
5 | U → P → W ← P ← U | User --write--> Post --contain--> Word --contain^{-1}--> Post --write^{-1}--> User | Posts Containing Common Words
6 | U → P → T ← P ← U | User --write--> Post --contain--> Time --contain^{-1}--> Post --write^{-1}--> User | Posts Containing Common Timestamps
7 | U → P → L ← P ← U | User --write--> Post --attach--> Location --attach^{-1}--> Post --write^{-1}--> User | Posts Attaching Common Location Check-ins

Besides the shared information due to common network construction purposes and similar network features [137], anchor users can also have unique information (e.g., social structures) in each of the aligned networks, which can provide us with a more comprehensive knowledge about the community structures formed by these users. Meanwhile, by maximizing the consensus (i.e., minimizing the “discrepancy”) of the clustering results for the anchor users across the multiple partially aligned networks, the Mcd model is able to mutually refine the clustering results of the anchor users with the information in the other aligned networks. The clustering results achieved in G^(1) and G^(2) can be represented as C^(1) = {U^(1)_1, U^(1)_2, · · · , U^(1)_{k^(1)}} and C^(2) = {U^(2)_1, U^(2)_2, · · · , U^(2)_{k^(2)}} respectively.

Let u_i and u_j be two anchor users in the networks, whose accounts in G^(1) and G^(2) are u^(1)_i, u^(2)_i, u^(1)_j and u^(2)_j respectively. If users u^(1)_i and u^(1)_j are partitioned into the same cluster in G^(1) but their corresponding accounts u^(2)_i and u^(2)_j are partitioned into different clusters in G^(2), then it will lead to a discrepancy [139; 87] between the clustering results of u^(1)_i, u^(2)_i, u^(1)_j and u^(2)_j in the aligned networks G^(1) and G^(2).

Definition 38. (Discrepancy): The discrepancy between the clustering results of u_i and u_j across the aligned networks G^(1) and G^(2) is defined as the difference of the confidence scores of u_i and u_j being partitioned into the same cluster across the aligned networks. In the clustering results, the confidence scores of u^(1)_i and u^(1)_j (u^(2)_i and u^(2)_j) being partitioned into the k^(1) (k^(2)) clusters can be represented as the vectors h^(1)_i and h^(1)_j (h^(2)_i and h^(2)_j) respectively, while the confidence that u_i and u_j are in the same cluster in G^(1) and G^(2) can be denoted as h^(1)_i (h^(1)_j)^⊤ and h^(2)_i (h^(2)_j)^⊤. Formally, the discrepancy of the clustering results about u_i and u_j is defined as d_{ij}(C^(1), C^(2)) = (h^(1)_i (h^(1)_j)^⊤ − h^(2)_i (h^(2)_j)^⊤)^2 if u_i and u_j are both anchor users, and d_{ij}(C^(1), C^(2)) = 0 otherwise. Furthermore, the discrepancy of C^(1) and C^(2) will be:

d(C^(1), C^(2)) = Σ_{i=1}^{n^(1)} Σ_{j=1}^{n^(2)} d_{ij}(C^(1), C^(2)), (185)

where n^(1) = |U^(1)| and n^(2) = |U^(2)|. By this definition, the non-anchor users are not involved in the discrepancy calculation.

However, considering that d(C^(1), C^(2)) is highly dependent on the number of anchor users and anchor links between G^(1) and G^(2), minimizing d(C^(1), C^(2)) favors highly consented clustering results when the anchor users are abundant, but has no significant effect when the anchor users are very rare. To solve this problem, the Mcd model proposes to minimize the normalized discrepancy instead.

Definition 39. (Normalized Discrepancy): The normalized discrepancy measure computes the differences of the clustering results in two aligned networks as a fraction of the discrepancy with regard to the number of anchor users across the partially aligned networks:

nd(C^(1), C^(2)) = d(C^(1), C^(2)) / ( |A^(1,2)| (|A^(1,2)| − 1) ). (186)

The optimal consensus clustering results Ĉ^(1), Ĉ^(2) of G^(1) and G^(2) will be:

Ĉ^(1), Ĉ^(2) = arg min_{C^(1), C^(2)} nd(C^(1), C^(2)). (187)

Similarly, the normalized-discrepancy objective function can also be represented with the clustering result confidence matrices H^(1) and H^(2). Meanwhile, considering that the networks studied in this section are partially aligned, matrices H^(1) and H^(2) contain the results of both anchor users and non-anchor users, while non-anchor users should not be involved in the discrepancy calculation according to the definition of discrepancy. The introduced model proposes to prune the results of the non-anchor users with the following anchor transition matrix first.

Definition 40. (Anchor Transition Matrix): Binary matrix T^(1,2) (or T^(2,1)) is defined as the anchor transition matrix from network G^(1) to G^(2) (or from G^(2) to G^(1)), where T^(1,2) = (T^(2,1))^T, T^(1,2)(i, j) = 1 if (u_i^(1), u_j^(2)) ∈ A^(1,2) and 0 otherwise. The row indexes of T^(1,2) (or T^(2,1)) are of the same order as those of H^(1) (or H^(2)). Considering that the constraint on anchor links is "one-to-one" in this section, each row/column of T^(1,2) and T^(2,1) contains at most one entry filled with 1.
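Putting Definitions 38-40 together, the normalized discrepancy can be computed directly from the confidence matrices and an anchor transition matrix. The sketch below uses our own convention (T12[i, j] = 1 iff (u_i^(1), u_j^(2)) ∈ A^(1,2); rows of H1 and H2 are the per-user confidence vectors) and normalizes by |A^(1,2)|(|A^(1,2)| − 1) as in Eq. (186):

import numpy as np

def normalized_discrepancy(H1, H2, T12):
    idx1, idx2 = np.nonzero(T12)   # aligned anchor accounts in both networks
    A1, A2 = H1[idx1], H2[idx2]    # prune away the non-anchor users
    diff = A1 @ A1.T - A2 @ A2.T   # pairwise co-cluster confidence differences
    np.fill_diagonal(diff, 0.0)    # the definition sums over pairs i != j
    n = len(idx1)                  # n = |A^(1,2)|
    return (diff ** 2).sum() / (n * (n - 1))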


Figure 7: An example to illustrate the clustering discrepancy, showing the anchor transition matrices T^(1,2) and T^(2,1), the clustering confidence matrices H^(1) and H^(2), and the pruned confidence matrices (T^(1,2))^T · H^(1) and (T^(2,1))^T · H^(2) for two aligned networks.

Algorithm 5 Curvilinear Search Method (CSM)

Input: X_k, C_k, Q_k and function F;
       parameters ε = {ρ, η, δ, τ, τ_m, τ_M}
Output: X_{k+1}, C_{k+1}, Q_{k+1}
1: Y(τ) = (I + (τ/2)A)^{-1} (I − (τ/2)A) X_k
2: while F(Y(τ)) ≥ C_k + ρτF′(Y(0)) do
3:    τ = δτ
4:    Y(τ) = (I + (τ/2)A)^{-1} (I − (τ/2)A) X_k
5: end while
6: X_{k+1} = Y(τ)
7: Q_{k+1} = ηQ_k + 1
8: C_{k+1} = (ηQ_k C_k + F(X_{k+1})) / Q_{k+1}
9: τ = max(min(τ, τ_M), τ_m)

Furthermore, the objective function for inferring the clustering confidence matrices that minimize the normalized discrepancy can be represented as follows:

min_{H^(1), H^(2)} ||H^(1)(H^(1))^T − H^(2)(H^(2))^T||_F^2 / ( ||T^(1,2)||_F^2 ( ||T^(1,2)||_F^2 − 1 ) ), (188)

s.t. (H^(1))^T D^(1) H^(1) = I, (H^(2))^T D^(2) H^(2) = I, (189)

where D^(1), D^(2) are the corresponding diagonal matrices of the HNMP-Sim matrices of networks G^(1) and G^(2) respectively. Note that, since T^(1,2) is binary with one entry per anchor link, ||T^(1,2)||_F^2 = |A^(1,2)| counts the anchor links, which matches the denominator of Eq. (186).

6.3.4 Joint Mutual Clustering of Multiple Networks

The normalized-cut objective function favors clustering results that preserve the characteristics of each network, while the normalized-discrepancy objective function favors consensus results which are mutually refined with information from the other aligned networks. Taking both of these two issues into consideration, the optimal mutual community detection results C^(1) and C^(2) of aligned networks G^(1) and G^(2) can be achieved as follows:

arg min_{C^(1), C^(2)} α · ncut(C^(1)) + β · ncut(C^(2)) + θ · nd(C^(1), C^(2)), (190)

where α, β and θ represent the weights of these terms and, for simplicity, α, β are both set as 1 in Mcd.


Algorithm 6 Mutual Community Detector (Mcd)

Input: aligned networks: G = ((G^(1), G^(2)), (A^(1,2), A^(2,1)));
       numbers of clusters in G^(1) and G^(2): k^(1) and k^(2);
       HNMP-Sim matrix weights: ω;
       parameters: ε = {ρ, η, δ, τ, τ_m, τ_M};
       function F and consensus term weight θ
Output: H^(1), H^(2)
1: Calculate the HNMP-Sim matrices S_i^(1) and S_i^(2)
2: S^(1) = Σ_i ω_i S_i^(1), S^(2) = Σ_i ω_i S_i^(2)
3: Initialize X^(1) and X^(2) with the Kmeans clustering results on S^(1) and S^(2)
4: Initialize C_0^(1) = 0, Q_0^(1) = 1 and C_0^(2) = 0, Q_0^(2) = 1
5: converge = False
6: while converge = False do
7:    /* update X^(1) and X^(2) with CSM */
      X_{k+1}^(1), C_{k+1}^(1), Q_{k+1}^(1) = CSM(X_k^(1), C_k^(1), Q_k^(1), F, ε)
      X_{k+1}^(2), C_{k+1}^(2), Q_{k+1}^(2) = CSM(X_k^(2), C_k^(2), Q_k^(2), F, ε)
8:    if X_{k+1}^(1) and X_{k+1}^(2) both converge then
9:       converge = True
10:   end if
11: end while
12: H^(1) = ((D^(1))^{-1/2})^T X^(1), H^(2) = ((D^(2))^{-1/2})^T X^(2)

By replacing ncut(C^(1)), ncut(C^(2)) and nd(C^(1), C^(2)) with the objective equations derived above, the joint objective function can be rewritten as follows:

min_{H^(1), H^(2)} α · Tr((H^(1))^T L^(1) H^(1)) + β · Tr((H^(2))^T L^(2) H^(2)) (191)

   + θ · ||H^(1)(H^(1))^T − H^(2)(H^(2))^T||_F^2 / ( ||T^(1,2)||_F^2 ( ||T^(1,2)||_F^2 − 1 ) ), (192)

s.t. (H^(1))^T D^(1) H^(1) = I, (H^(2))^T D^(2) H^(2) = I, (193)

where L^(1) = D^(1) − S^(1), L^(2) = D^(2) − S^(2), and matrices S^(1), S^(2) and D^(1), D^(2) are the HNMP-Sim matrices and their corresponding diagonal matrices defined before.

The objective function is a complex optimization problem with orthogonality constraints, which can be very difficult to solve because the constraints are not only non-convex but also numerically expensive to preserve during iterations. Mcd adopts the curvilinear search method (i.e., Algorithm 5) with the Barzilai-Borwein step [110] to solve the problem, where the learning process can also converge quickly. The pseudo-code of the Mcd model is available in Algorithm 6, which calls Algorithm 5 to update the variables iteratively.
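For intuition, one update of the curvilinear search can be sketched as follows. Under the standard substitution X = (D)^{1/2} H (which turns the constraint H^T D H = I into X^T X = I, matching line 12 of Algorithm 6), the Cayley-transform step of Algorithm 5 preserves orthogonality exactly; the skew-symmetric matrix A is built from the Euclidean gradient G as in [110]. This is a minimal sketch, not the full line search:

import numpy as np

def cayley_step(X, G, tau):
    # A = G X^T - X G^T is skew-symmetric, so the Cayley transform
    # Y(tau) = (I + tau/2 A)^(-1) (I - tau/2 A) X keeps Y^T Y = I.
    n = X.shape[0]
    A = G @ X.T - X @ G.T
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ X)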

6.4 Large-Scale Network Synergistic Community Detection

The community detection algorithm introduced in the previous section involves complicated matrix operations and works well for small-sized network data. However, when applied to real-world online social networks involving millions or even billions of users, it suffers severely from its time complexity. The problem introduced here follows the same formulation as the one in Section 6.3, but the involved networks are of far larger sizes in terms of both the node number and the social connection number. Synergistic partitioning across multiple large-scale social networks is difficult due to the following challenges:

• Social Network: Distinct from generic data, a social network usually contains intricate interactions, and multiple heterogeneous networks mean that the relationships across multiple networks should also be taken into consideration.

• Network Scale: The network scale makes it difficult for stand-alone programs to apply traditional partitioning methods, and parallelizing the existing stand-alone network partitioning algorithms is itself a difficult task.

• Distributed Framework: For distributed algorithms, load balance should be taken into consideration, and how to generate balanced partitions is another challenge.

Algorithm 7 Edge Weight based Matching (EWM)

Input: Network G_h;
       maximum weight of a node maxVW = n/k
Output: A coarser network G_{h+1}
1: map() Function:
2: for node i in current data block do
3:    if match[i] == −1 then
4:       maxIdx = −1
5:       sortByEdgeWeight(NN(i))
6:       for v_j ∈ NN(i) do
7:          if match[j] == −1 and VW(i) + VW(j) < maxVW then
8:             maxIdx = j, break
9:          end if
10:      end for
11:      match[i] = maxIdx
12:      match[maxIdx] = i
13:   end if
14: end for
15: reduce() Function:
16: new newNodeID[n + 1]
17: new newVW[n + 1]
18: set idx = 1
19: for i ∈ {1, 2, · · · , n} do
20:    if i < match[i] then
21:       set newNodeID[match[i]] = idx
22:       set newNodeID[i] = idx
23:       set newVW[i] = newVW[match[i]] = VW(i) + VW(match[i])
24:       idx++
25:    end if
26: end for

To address these challenges, in this section we will introduce a network structure based distributed network partitioning framework, namely Spmn [40]. The Spmn model identifies the anchor nodes among the multiple networks, selects one network as the datum network, divides it into k balanced partitions, and generates 〈anchor node ID, partition ID〉 pairs as the main objective. Based on this objective, Spmn coarsens the other networks (called the synergistic networks) into smaller ones, further divides the smallest networks into k balanced initial partitions, and tries to assign anchor nodes of the same kind into the same initial partition as much as possible. Here, anchor nodes of the same kind are those divided into the same partition in the datum network. Finally, Spmn projects the initial partitions back onto the original networks.

6.4.1 Distributed Multilevel k-way Partitioning

In this section, we describe the heuristic framework for synergistic partitioning among multiple large-scale social networks, which we call Spmn. For large-sized networks, data processing in Spmn can be roughly divided into two stages: the datum generation stage and the network alignment stage.

Given the anchor node set A^(1,2) between networks G^(1) and G^(2), the Spmn framework applies a distributed multilevel k-way partitioning method to the datum network to generate k balanced partitions. During this process, the anchor nodes are ignored and all the nodes are treated identically. We call this process the datum generation stage. When it finishes, the partition result of the anchor nodes will be generated, and Spmn stores it in a set Map〈anidx, pidx〉, where anidx is the anchor node ID and pidx represents the partition ID the anchor node belongs to. After the datum generation stage, the synergistic networks will be partitioned into k partitions according to Map〈anidx, pidx〉 so as to align the synergistic networks to the datum network; during this process, discrepancy and cut are the objectives to be minimized. We call this process the network alignment stage.

Algorithms guaranteed to find near-optimal partitions in a single network have been studied for a long period, but most of these methods are stand-alone, and their performance is limited by the server's capacity. Inspired by the multilevel k-way partitioning (MKP) method proposed by Karypis and Kumar [44; 43] and based on our previous work [3], Spmn uses MapReduce [22] to speed up the MKP method. As with other multilevel methods, MapReduce based MKP also includes three phases: coarsening, initial partitioning and un-coarsening.

The coarsening phase is a multilevel process in which a sequence of smaller approximate networks G_i = (V_i, E_i) are constructed from the original network G_0 = (V, E) and so forth, where |V_i| < |V_{i−1}|, i ∈ {1, 2, ..., n}. To construct coarser networks, node combination and edge collapsing should be performed. The task can be formally defined in terms of matchings inside the networks [12]. An intra-network matching can be represented as a set of node pairs M = {(v_i, v_j) | i ≠ j, (v_i, v_j) ∈ E}, in which each node can appear no more than once. For a network G_i with a matching M_i, if (v_j, v_k) ∈ M_i then v_j and v_k will form a new node v_q ∈ V_{i+1} in the network G_{i+1} coarsened from G_i. The weight of v_q equals the sum of the weights of v_j and v_k; besides, all the links connected to v_j or v_k in G_i will be connected to v_q in G_{i+1}. The total weight of nodes remains unchanged during the coarsening phase, but the total weight of edges and the number of nodes will be greatly reduced. Let us define W(·) to be the sum of edge weights in the input set and N(·) to be the number of nodes/components in the input set. In the coarsening process, we have

W(E_{i+1}) = W(E_i) − W(M_i), (194)

N(V_{i+1}) = N(V_i) − N(M_i). (195)
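One level of coarsening can be sketched as follows (our own minimal illustration, not code from [40]; `edges` maps node pairs to edge weights, `VW` holds vertex weights, and `match[i]` is the node matched with i, with `match[i] == i` for unmatched nodes). The toy example at the end checks Eqs. (194)-(195) on a triangle:

def coarsen(edges, VW, match):
    n = len(VW)
    new_id, idx = [-1] * n, 0
    for i in range(n):                      # assign super-node ids per pair
        if new_id[i] == -1:
            new_id[i] = new_id[match[i]] = idx
            idx += 1
    new_VW = [0.0] * idx
    for i in range(n):
        new_VW[new_id[i]] += VW[i]          # weights of matched pairs add up
    new_E = {}
    for (u, v), w in edges.items():
        a, b = new_id[u], new_id[v]
        if a != b:                          # collapsed matching edges vanish
            key = (min(a, b), max(a, b))
            new_E[key] = new_E.get(key, 0.0) + w
    return new_E, new_VW

# Toy triangle: match nodes 0 and 1, leave node 2 unmatched.
E1, VW1 = coarsen({(0, 1): 2.0, (1, 2): 1.0, (0, 2): 1.0},
                  [1.0, 1.0, 1.0], match=[1, 0, 2])
# E1 == {(0, 1): 2.0}: W drops by W(M) = 2; VW1 == [2.0, 1.0]: one node fewer.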

Analysis in [42] shows that, for the same coarser network, a smaller edge weight corresponds to a smaller edge-cut. With the help of the MapReduce framework, Spmn uses a local search method to implement an edge-weight based matching (EWM) scheme to collect larger edge weights during the coarsening phase. For the convenience of MapReduce, Spmn designs a network representation format in which each line contains the essential information about a node and all its neighbors (NN), such as the node ID, vertex weight (VW) and edge weight (W). The whole network data are distributed in a distributed file system, such as HDFS [91], and each data block only contains a part of the node set and the corresponding connection information. Function map() takes a data block as input and searches locally to find node pairs to match according to the edge weight. Function reduce() is in charge of node combination, renaming and sorting.


Algorithm 8 Synergistic Partitioning (SP)

Input: Network G_h;
       anchor link map Map〈anidx, pidx〉;
       maximum weight of a node maxVW = n/k
Output: A coarser network G_{h+1}
1: Call the Synergistic Partitioning-Map function (Algorithm 9)
2: Call the Synergistic Partitioning-Reduce function (Algorithm 10)

With the new node IDs and the matching, a simple MapReduce job is able to update the edge information and write the coarser network back onto HDFS. The complexity of EWM is O(|E|) in each iteration, and the pseudo code of EWM is shown in Algorithm 7.

After several iterations, a coarsest weighted network G_s consisting of only hundreds of nodes will be generated. At the network size of G_s, stand-alone algorithms with high computational complexity become acceptable for initial partitioning. Meanwhile, the weights of the nodes and edges of the coarser networks are set to reflect the weights of the finer networks during the coarsening phase, so G_s contains sufficient information to satisfy the balanced partition and minimum edge-cut requirements. Plenty of traditional bisection methods are quite qualified for this task. Spmn adopts the KL method, with an O(|E|^3) computational complexity, to divide G_s into two partitions, and then invokes the KL method recursively on the partitions to generate k balanced partitions.

The un-coarsening phase is the inverse of the coarsening phase. With the initial partitions and the matchings from the coarsening phase, it is easy to run the un-coarsening process on the MapReduce cluster.

6.4.2 Distributed Synergistic Partitioning Process

In this section, we will talk about the synergistic partitioning process in Spmn, which partitions the synergistic networks given the partition results of the anchor nodes from the datum network. The synergistic partitioning is also an MKP process, but it is quite different from general MKP methods.

In the coarsening phase, anchor nodes are endowed with a higher priority than non-anchor nodes. When choosing nodes to pair, Spmn assumes that anchor nodes and non-anchor nodes have different tendencies. Let G_d be the datum network. For an anchor node v_i in another aligned network, at the top of its preference list it would like to be matched with another anchor node v_j which has the same partition ID in the datum network, i.e., pidx(G_d, v_i) = pidx(G_d, v_j) (here pidx(G_d, v_i) denotes the partition label that v_i belongs to in G_d). Second, if there is no appropriate anchor node, it would try to find a non-anchor node to pair with. When trying to find a non-anchor node to pair with, the anchor node v_i would like to find a correct direction: it prefers to match with the non-anchor node v_j which has many anchor-node neighbors with the same pidx as v_i. When they are matched together, the new node is given the same pidx as the anchor node. To improve the accuracy of synergistic partitioning among multiple social networks, an anchor node will never be combined with another anchor node with a different pidx.

For a non-anchor node, it would prefer to be matched with an anchor-node neighbor which belongs to the dominant partition among the non-anchor node's neighbors. Here, the dominant partition among a node's neighbors is the one for which the number of anchor-node neighbors with that partition ID is the largest. Next, a non-anchor node would choose a general non-anchor node to pair with.

Algorithm 9 Synergistic Partitioning-Map

Input: Network G_h;
       anchor link map Map〈anidx, pidx〉;
       maximum weight of a node maxVW = n/k
Output: A coarser network G_{h+1}
1: map() Function:
2: for node i in current data block do
3:    if match[i] == −1 then
4:       set flag = false
5:       sortByEdgeWeight(NN(i))
6:       if v_i ∈ Map〈anidx, pidx〉 then
7:          for v_j ∈ NN(i) with match[j] == −1 do
8:             if v_j ∈ Map〈anidx, pidx〉 & Map.get(v_i) == Map.get(v_j) & VW(i) + VW(j) < maxVW then
9:                match[i] = j, match[j] = i
10:               flag = true, break
11:            end if
12:         end for
13:         if flag == false then /* no suitable anchor node */
14:            for v_j ∈ NN(i) with match[j] == −1 & VW(i) + VW(j) < maxVW do
15:               indirectNeighbor = NN(v_j)
16:               sortByEdgeWeight(indirectNeighbor)
17:               for v_k ∈ indirectNeighbor do
18:                  if v_k ∈ Map〈anidx, pidx〉 & Map.get(v_i) == Map.get(v_k) then
19:                     match[i] = j, match[j] = i
20:                     flag = true, break
21:                  end if
22:               end for
23:               if flag == true then
24:                  break
25:               end if
26:            end for
27:         end if
28:      else
29:         for v_j ∈ NN(i) with v_j ∉ Map〈anidx, pidx〉 & VW(i) + VW(j) < maxVW & match[j] == −1 do
30:            match[i] = j, match[j] = i, break
31:         end for
32:      end if
33:   end if
34: end for

Finally, a non-anchor node would avoid combining with an anchor node belonging to a partition in subordinate status. After they are combined, the new node is given the same pidx as the anchor node. To ensure the balance among the partitions, about 1/3 of the nodes in the coarsest network are left unlabeled.

In addition to minimizing both the discrepancy and the cut discussed before, Spmn also aims to balance the sizes of the partitions in the synergistic partitioning process. However, it is impossible to achieve all of these objectives simultaneously, so Spmn makes a compromise among them and develops a heuristic method to tackle the problem.

• First, according to the conclusion that a smaller edge weight corresponds to a smaller edge-cut, and following the pairing tendencies, Spmn proposes a modified EWM (MEWM) method to find a matching in the coarsening phase whose edge weight is as large as possible. At the end of the coarsening phase, there is no impurity in any node, meaning that each node contains no more than one type of anchor node. Besides, a "purity" vector attribute and a pidx attribute are added to each node to represent the percentage of each kind of anchor node swallowed up by it and the pidx of the new node, respectively.

• Then, during the initial partitioning phase, Spmn treats the anchor nodes as labeled nodes and uses a modified label propagation algorithm to deal with the non-anchor nodes in the coarsest network.


Algorithm 10 Synergistic Partitioning-Reduce

Input: Network G_h;
       anchor link map Map〈anidx, pidx〉;
       maximum weight of a node maxVW = n/k
Output: A coarser network G_{h+1}
1: reduce() Function:
2: new newNodeID[n + 1]
3: new newVW[n + 1]
4: set idx = 1
5: for i ∈ {1, 2, · · · , n} do
6:    if i < match[i] then
7:       set newNodeID[match[i]] = idx
8:       set newNodeID[i] = idx
9:       set newVW[i] = newVW[match[i]] = VW(i) + VW(match[i])
10:      idx++
11:   end if
12: end for
13: new newPurity[idx + 1]
14: new newPidx[idx + 1]
15: for i ∈ [1, idx], with j = match[i] do
16:    newPurity[i] = (purity[i] · VW(i) + purity[j] · VW(j)) / (VW(i) + VW(j))
17:    newPidx[i] = max{pidx[i], pidx[match[i]]}
18: end for

Figure 8: An example of the synergistic partitioning process. In the coarsening phase, the networks are stored on two servers: V_1^i = {v_i(j) | j ≤ |V^i|/2} are stored on one server and the others on the other server. Anchor nodes are colored, and different colors represent different partitions. Node pairs encircled by dotted chains represent the matchings. Numbers on the chains denote the order of pairing.

• At the end of the initial partitioning phase, Spmn will be able to generate k balanced partitions and to maximize the number of anchor nodes of the same kind that are divided into the same partitions.

• Finally, Spmn projects the coarsest network back to the original network, which is the same as in the traditional MKP process.

The pseudo code of the coarsening phase in the synergistic partitioning process is available in Algorithm 8, which calls the Map() and Reduce() functions in Algorithms 9 and 10 respectively.

7. INFORMATION DIFFUSION

Social influence can be widely spread among people, and information exchange has become one of the most important social activities in the real world. The creation of the Internet and online social networks has greatly facilitated the communication among people. Via the interactions among users in online social networks, information can be propagated from one user to other users. For instance, in recent years, online social networks have become the most important social occasion for news acquisition, and many breaking social events can spread widely in online social networks at a very fast speed. People, as multi-functional "sensors", can detect different kinds of signals happening in the real world and write posts to report their discoveries to the rest of the world via online social networks.

In this section, we will study the information diffusion process in online social networks. Diffusion denotes the spreading process of certain entities (like information, ideas, innovations, even heat in physics and diseases in bio-medical science) through certain channels among the target object group in a system. The entities to be spread, the channels available, the target object group and the system can all affect the diffusion process and lead to different diffusion observations. Therefore, different types of diffusion models have been proposed, which will be introduced in this section.

Depending on the system where the diffusion process is originally studied, the diffusion models can be divided into (1) information diffusion models in social networks [47; 145], (2) viral spreading models in bio-medical systems [81; 20], and (3) heat diffusion models in physical systems [70; 10]. We will take the information diffusion in online social networks as one example. The channels for information diffusion belong to certain sources, like online-world and offline-world diffusion channels, or diffusion channels in different social networks. Meanwhile, depending on the diffusion channels and sources available, the diffusion models include (1) the single-channel diffusion model [134; 47], (2) the single-source multi-channel diffusion model [124], (3) the multi-source single-channel diffusion model [123; 120], and (4) the multi-source multi-channel diffusion model [121; 145; 122]. Based on the categories of topics to be spread in the online social networks, the diffusion models can be categorized into (1) single-topic diffusion [47; 121] and (2) concurrent diffusion of multiple intertwined topics [145; 134; 53; 21; 9].

In the following part of this section, we will introduce different kinds of diffusion models proposed to depict how information propagates among users in online social networks. We will first talk about the classic diffusion models proposed for the single-network single-channel scenario, including the threshold based models, cascade based models, heat diffusion based models and viral diffusion based models. After that, several cross-network diffusion models will be introduced, including the network coupling based diffusion model, the multi-source multi-channel diffusion model, and the cross-network random walk based diffusion model.

7.1 Traditional Information Diffusion Models

The "diffusion" phenomenon has been observed in different disciplines, like social science, physics, and bio-medical science, and various diffusion models have been proposed in these areas already. In this part, we will provide a brief introduction to these models and introduce how to apply or adapt them to describe the information diffusion process in online social networks.

Let G = (V, E) represent the network structure, based on which we want to study the information diffusion problem. Formally, given a user node u ∈ V, we can represent the set of neighbors of u as Γ(u). Each user node in the network G will have an indicator denoting whether the user has been activated or not. We will use the notation s(u) = 1 to denote that user u has been activated, and s(u) = 0 to represent that u is still inactive. Initially, all the users are inactive to a certain piece of information. Information can be propagated from an initial influence seed user set S ⊂ V who are exposed to and activated by the information at the very beginning. At a timestamp in the diffusion process, given user u's neighbors, we can represent the subset of active neighbors as Γ_a(u) = {v | v ∈ Γ(u), s(v) = 1}. The set of inactive neighbors can be represented as Γ_i(u) = Γ(u) \ Γ_a(u). Generally, the information diffusion process will stop if no new activation is available.

7.1.1 Linear Threshold (LT) Models

In this subsection, we will introduce the threshold models, using the linear threshold model as an example to illustrate this kind of model. Several different variants of the linear threshold model will be briefly introduced here as well.

Generally, the threshold models assume that each individual has a unique threshold indicating the minimum amount of information required for them to be activated. Information can propagate among the users, and the information amount is determined by the closeness of the users: close friends can influence each other much more than regular friends and strangers. If the information propagated from other users in the network surpasses the threshold of a certain user, the user will turn to the activated status and also start to influence other users. Therefore, the threshold values can determine the behavior of users in the online social networks. Depending on the setting of the thresholds as well as the amount of information propagated among the users, the threshold models have different variants.

LT Model

In the linear threshold (LT) model [47], each user has a unique threshold denoting the minimum information required to activate the user. Formally, the threshold of user u can be represented as θ_u ∈ [0, 1]. In simulation experiments, the threshold values are normally drawn from the uniform distribution U(0, 1). Meanwhile, for each user pair, like u, v ∈ V, information can be propagated between them. As mentioned before, close friends will have a larger influence on each other compared with regular friends and strangers. Formally, the amount of information user u can send to v is denoted by the weight w_{u,v} ∈ [0, 1]. Generally, the total amount of information a user can send out is bounded. For instance, in the LT model, the total amount of information user u can send out is bounded by 1, i.e., Σ_{v ∈ Γ(u)} w_{u,v} ≤ 1. Different ways have been proposed to define the specific value of the weight w_{u,v}, and in many cases w_{u,v} can be different from w_{v,u}, since the information each user can send out can be different. However, in many other cases, to simplify the setting, w_{u,v} and w_{v,u} are assigned the same value for the same user pair. For instance, in some LT models, the Jaccard's Coefficient is applied to calculate the closeness between user pairs, which is then used as the weight value.

In the LT model, the information sent from the neighbors to user u is aggregated by linear summation. For instance, the total amount of information user u can receive from his/her neighbors can be denoted as Σ_{v ∈ Γ(u)} w(v, u)s(v), or equivalently Σ_{v ∈ Γ_a(u)} w(v, u). To check whether a user can be activated or not, the LT model only needs to check whether the following inequality holds:

Σ_{v ∈ Γ_a(u)} w(v, u) ≥ θ_u. (196)

It denotes whether the received information surpasses the activation threshold of user u or not. Here, we also need to notice that inactive neighbors will not send out information; only the active neighbors can. The information provided so far covers the critical details of the LT model. Next, we will show the general framework of the LT model to illustrate how it works.

In the LT model, the initial activated seed user set can be represented as S, whose users start the propagation of information to their neighbors. Generally, information propagates within the network step by step.

• Diffusion Starts: At step 0, only the seed users in S are active, and all the remaining users have the inactive status.

• Diffusion Spreads: At step t (t > 0), for each user u, if the information propagated from u's active neighbors surpasses the threshold of u, i.e., Σ_{v ∈ Γ_a(u)} w(v, u) ≥ θ_u, u will be activated and have status s(u) = 1. All the activated users will remain active in the coming rounds and can send out information to their neighbors. Active users cannot be activated again.

• Diffusion Ends: If no new activation happens in step t, the diffusion process will stop.

Specifically, at step t of the diffusion process we do not need to check all the users to see whether they will be activated or not. The reason is that, for most of the inactive users, if the status of their neighbors did not change in the previous step, i.e., step t − 1, the influence they can receive in step t is still the same as in step t − 1, and they will keep the same status as in the previous step, i.e., "inactive". Let V_a(t − 1) denote the set of users who were newly activated in step t − 1; we can represent the set of users they can influence as ∪_{u ∈ V_a(t−1)} Γ(u). In step t, these recently activated users change the information their neighbors can receive. Therefore, we only need to check whether the inactive users in the set ∪_{u ∈ V_a(t−1)} Γ(u) meet the activation criterion or not.

After the diffusion process stops, the group of users with the active status indicates the influence these seed users have spread, which can be represented as the set V_a. Generally, there will exist a mapping σ : S → |V_a|, which is formally called the influence function. Given the influence function, different seed user sets as input usually achieve different influence. Choosing the optimal seed user set that leads to the maximum influence is named the influence maximization problem.
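The LT process above can be simulated in a few lines. The sketch below is a minimal illustration with hypothetical weights and thresholds (not code from the surveyed works); it also applies the frontier optimization just described, re-checking only the neighbors of the users activated in the previous step:

def run_lt(neighbors, w, theta, seeds):
    active, frontier = set(seeds), set(seeds)
    while frontier:
        # only neighbors of the recently activated users need re-checking
        candidates = {v for u in frontier for v in neighbors[u]} - active
        newly = set()
        for v in candidates:
            received = sum(w[(u, v)] for u in active if v in neighbors[u])
            if received >= theta[v]:        # activation check of Eq. (196)
                newly.add(v)
        active |= newly
        frontier = newly
    return active                           # the influenced set V_a

nbrs = {0: [1, 2], 1: [2], 2: []}
w = {(0, 1): 0.6, (0, 2): 0.3, (1, 2): 0.5}
print(run_lt(nbrs, w, theta={0: 0.1, 1: 0.5, 2: 0.7}, seeds={0}))
# step 1 activates user 1 (0.6 >= 0.5); step 2 activates 2 (0.3 + 0.5 >= 0.7)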

Other Threshold Models

The LT model assumes cumulative effects of the information propagated from the neighbors, and it can illustrate the basic information diffusion process among users in online social networks. The LT model has been well analyzed, and many variant models have been proposed as well. Depending on the assignment of the threshold and weight values, many other diffusion models can be reduced to special cases of the LT model.

• Majority Threshold Model: Different from the LT model, in the majority threshold model [15], an inactive user u can be activated if the majority of his/her neighbors are activated. The majority threshold model can be reduced to the LT model in the case that: (1) the influence weight between any friends (u, v) in the network is assigned the value 1; and (2) the threshold of any user u is set as (1/2)·D(u), where D(u) denotes the degree of node u in the network. For nodes with large degrees, like the central node in a star-structured diagram, their activation will lead to the activation of lots of surrounding nodes in the network.

• k-Threshold Model: Another diffusion model similar to the LT model is the k-threshold diffusion model [15], in which a user can be activated if at least k of his/her neighbors are active. The k-threshold model is equivalent to the LT model with the settings that (1) the influence weight between any friend pair (u, v) in the network is assigned the value 1; and (2) the activation thresholds of all the users are assigned a shared value k. For each user u, if k of his/her neighbors have been activated, u will be activated.

Depending on the value of k, the k-threshold model behaves differently. When k = 1, a user will be activated if at least one of his/her neighbors is active; in such a case, all the users in the same connected component as the initial seed users will eventually be activated. When k is a very large value, even greater than the largest node degree, e.g., k > max_{u ∈ V} D(u), no nodes can be activated. When k is a medium value, some of the users will be activated as the information propagates, but users with fewer than k neighbors will never be activated.

7.1.2 Independent Cascade (IC) Model

An information cascade occurs when people observe the actions of others and then engage in the same acts. A cascade clearly illustrates the information propagation routes and the activating actions performed by users on their neighbors. In the cascade model, the information propagation dynamics are carried out in a step-by-step fashion. At each step, users can make trials to activate their neighbors to change their opinions with certain probabilities. If they succeed, the neighbors will change their status to follow the initiators. In the case that multiple users all have the chance to activate a certain target user, the activation trials are performed sequentially in an arbitrary order.

Depending on the activation trials and users' reactions to them, different cascade models have been proposed. In this section, we will talk about the cascade based models and use the independent cascade (IC) model as an example to illustrate the model architecture.

IC Model

In the diffusion process, multiple activation trials can be performed on a certain target user by his/her neighbors. In the independent cascade model [47], each activation trial is performed independently, regardless of the historical unsuccessful trials. The activation trials are performed step by step. When user u, who was activated in the previous step, tries to activate user v in the current step, the success probability is denoted as p_{u,v} ∈ [0, 1]. Generally, if users u and v are close friends, the activation probability will be larger compared with regular friends and strangers. The specific activation probability value is usually correlated with the social closeness between users u and v, which can also be defined based on the Jaccard's Coefficient in simulations. The activation trials only happen among users who are friends. If u succeeds in activating v, then user v will change his/her status to "active" and will remain in that status in the following steps. However, if u fails to activate v, u loses the chance and cannot perform activation trials on v any more.

In the IC model, we can represent the initial seed users as the set S ⊂ V, who will spread the information to the remaining users. We illustrate the general information propagation procedure as follows:

• Diffusion Starts: Initially, the seed users send out the information and start to activate their neighbors. For the users in set S, the activation trials start from them in a random order. For instance, if we pick user u ∈ S as the first user, u will try to activate his/her inactive friends in Γ(u) in a random order as well.

• Diffusion Spreads: In step t, only the users who were activated in the previous step can activate other users. We can denote the users who were newly activated in the previous step as the set S(t − 1). Users in S(t − 1) perform activation trials, and the users activated by them will remain active in the following steps and will be added to the set S(t), from which the activation trials in the next step will start.

• Diffusion Ends: If no activation happens in a step, the diffusion process stops.

In the IC model, the activation trials are performed by flipping a coin with certain probabilities, so the result is uncertain. Even with the same initial seed user set S, the number of users activated by the seed users can differ if we run the IC model twice. Formally, we can represent the set of users activated by the seed users as V_a ⊂ V. Therefore, in experimental simulations, we usually run the diffusion model multiple times and calculate the average number of activated users, i.e., |V_a|, to estimate the expected influence achieved by the seed user set S.
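A Monte Carlo estimate of the expected influence works exactly this way: simulate the cascade many times and average |V_a|. A minimal sketch with hypothetical probabilities (names are ours, not from the surveyed works):

import random

def run_ic(neighbors, p, seeds, rng):
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:                     # only newly activated users try
            for v in neighbors[u]:
                if v not in active and rng.random() < p[(u, v)]:
                    active.add(v)              # one independent coin flip per edge
                    nxt.append(v)
        frontier = nxt
    return active

def expected_influence(neighbors, p, seeds, runs=1000):
    rng = random.Random(0)
    return sum(len(run_ic(neighbors, p, seeds, rng))
               for _ in range(runs)) / runs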

Other Cascade Models

Generally, the independent activation assumption renders the IC model the simplest cascade based diffusion model. In the real world, the diffusion process is more complicated. For a user whom many other users have failed to activate, this probably indicates that the user is not interested in the information. Viewed from such a perspective, the probability for the user to be activated should decrease as more activation trials are performed. In this part, we will introduce another cascade based diffusion model, the decreasing cascade (DC) model [48].

To illustrate the DC model more clearly and show its difference from the IC model, we use the notation P(u → v | T) to represent the probability for user u to activate v given that a set of users T have already performed and failed activation trials on v. Let T, T′ denote two historical activation trial user sets, where T ⊆ T′. In the IC model, we have

P(u → v | T) = P(u → v | T′). (197)

In other words, every activation trial is independent of the others, and the activation probability does not change as more activation trials are performed.

As introduced at the beginning of this subsection, the fact that the users in set T fail to activate v indicates that v is probably not interested in the information, and the chance for v to be activated afterwards will be lower. Furthermore, as more activation trials are performed, e.g., by the users in T′, the probability for u to activate v will decrease, i.e.,

P(u → v | T) ≥ P(u → v | T′). (198)

Intuitively, this restriction states that a contagious node's probability of activating some v decreases if more nodes have already attempted to activate v, and v is hence more "marketing-saturated". The DC model incorporates the IC model as a special case, and is more general than the IC model in modeling the information diffusion process.

7.1.3 Epidemic Diffusion Models

The threshold and cascade based diffusion models introduced in the previous parts mostly assume that "once a user is activated, he/she will remain in the active status forever". However, in the real world, activated users can change their minds and still have the chance to recover to their original status. In bio-medical science, diffusion models have been studied for many years to model the spread of diseases, and several epidemic diffusion models have been introduced. In disease propagation, people who are susceptible to a disease can get infected by other people. After some time, many of these infected people recover and become immune to the disease, while many others recover and become susceptible to the disease again. Depending on people's reactions to the disease after recovery, several different epidemic diffusion models [77] have been proposed.

In this subsection, we will introduce the epidemic diffusion models and use them to model the diffusion of information in online social networks.

Susceptible-Infected-Recovered (SIR) Diffusion Model

The SIR model was proposed by W. O. Kermack and A. G. McKendrick in 1927 to model infectious diseases. It considers a fixed population with three main categories: susceptible (S), infected (I), and recovered (R). As the disease propagates, an individual's status can change among S, I, R following the flow:

S → I → R. (199)

In other words, the individuals who are susceptible to thedisease can get infected, while those infected individuals alsohave the chance to recover from the disease as well.

In this part, we will use the SIR model to describe the information cascading process in online social networks. Let V denote the set of users in the network. We introduce the following notations to represent the numbers of users in the different categories:

• S(t): the number of users who are susceptible to the information at time t, but have not gotten infected yet.

• I(t): the number of users who are currently infected by the information and can spread the information to others in the susceptible category.

• R(t): the number of users who have been infected and have already recovered from the information infection. These users are immune to the information and will not be infected again.

Based on the above notations, the following equations hold in the SIR model:

S(t) + I(t) + R(t) = |V|, (200)

dS(t)/dt + dI(t)/dt + dR(t)/dt = 0, (201)

where

dS(t)/dt = −βS(t)I(t),
dI(t)/dt = βS(t)I(t) − γI(t), (203)
dR(t)/dt = γI(t).

In the above equations, the parameter β denotes the infection rate of the susceptible users by the infected users in unit time, and γ represents the recovery rate. Generally, all the users in the social network belong to these three categories, and the total number of users in the three categories sums to |V| at any time in the diffusion process. Therefore, the derivative of this summation with regard to the time parameter t is 0. In a unit of time, the number of users transiting from the susceptible status to the infected status depends on the susceptible and infected users available at that time. For each infected user, the number of users he/she can infect is proportional to the available susceptible users, which can be denoted as βS(t); for all the infected users, the total number of users who get infected will be βS(t)I(t). The number of users who recover in unit time depends on the total number of infected users I(t) as well as the recovery rate γ, and can be represented as γI(t). Meanwhile, the change in the number of infected users in unit time is determined by both the number of susceptible users who get infected and the number of infected users who recover.

We have parameters β, γ ≥ 0, and the numbers S(t), I(t), R(t) ≥ 0 at any time. Therefore, we know that (1) dS(t)/dt ≤ 0, so the susceptible group is non-increasing; (2) dR(t)/dt ≥ 0, so the recovered group is non-decreasing; while (3) the sign of the term dI(t)/dt can be either positive, zero or negative, depending on the parameters β, γ and the users in the susceptible and infected groups:

• positive: if βS(t) > γ;

• zero: if βS(t) = γ or I(t) = 0;

• negative: if βS(t) < γ.
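The SIR dynamics can be integrated numerically, e.g., with a simple forward-Euler scheme, as a quick way to observe these regimes. This is a minimal sketch with hypothetical rates (note that βS(0) > γ here, so the infection initially grows):

def simulate_sir(S, I, R, beta, gamma, dt=0.01, steps=10000):
    history = [(S, I, R)]
    for _ in range(steps):
        dS = -beta * S * I                 # Eq. (203)
        dI = beta * S * I - gamma * I
        dR = gamma * I
        S, I, R = S + dS * dt, I + dI * dt, R + dR * dt
        history.append((S, I, R))          # S + I + R stays constant
    return history

trace = simulate_sir(S=990.0, I=10.0, R=0.0, beta=0.0005, gamma=0.1)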

Susceptible-Infected-Susceptible (SIS) Diffusion Model

In some cases, users cannot become immune to the information, and the recovered status actually does not exist. Users who get infected return to the susceptible status and can get infected again in the future. To model such a phenomenon, another diffusion model very similar to the SIR model has been proposed, which is called the Susceptible-Infected-Susceptible (SIS) model.

In the SIS model, the individual status flow is provided as follows:

S → I → S. (204)

Such a status flow will continue, and individuals will switch their status between susceptible and infected in the information diffusion process. Therefore, the absolute number changes of individuals in these two categories will be the same in unit time:

dS(t)/dt = −βS(t)I(t) + γI(t), (205)

dI(t)/dt = βS(t)I(t) − γI(t). (206)

Susceptible-Infected-Recovered-Susceptible (SIRS) Diffusion Model

The Susceptible-Infected-Recovered-Susceptible (SIRS) diffusion model introduced in this part is another type of epidemic model, in which the individuals in the recovered category can lose their immunity, transit to the susceptible category, and have the potential to be infected again. The individual status flow is therefore

S → I → R → S. (207)

We can denote the rate of individuals who lose their immunity as f, so the total number of individuals who may lose their immunity is f · R(t). Therefore, the derivatives of the numbers of individuals belonging to the different categories are

dS(t)/dt = −βS(t)I(t) + fR(t), (208)

dI(t)/dt = βS(t)I(t) − γI(t), (209)

dR(t)/dt = γI(t) − fR(t). (210)

Besides the epidemic diffusion models introduced in this subsection, there also exist many other versions of epidemic diffusion models, which consider further factors in the diffusion process, like the birth/death of individuals. This is also very common in real-world online social networks, since new users join the social network, and existing users delete their accounts and get removed from the social network. Involving such factors makes the diffusion models more complex, and we will not introduce them here due to the limited space. More information about these different epidemic diffusion models is available in [73; 77].

7.1.4 Heat Diffusion Models

Heat diffusion is a well observed physical phenomenon. Generally, in a medium, heat always diffuses from regions with a higher temperature to regions with a lower temperature. Recently, many works have applied heat diffusion to model the information propagation in online social networks. In this subsection, we will talk about the heat diffusion model and introduce how to adapt it to model the information diffusion process in online social networks.

General Heat Diffusion

Throughout a geometric manifold, let the function f(x, t) denote the temperature at location x at time t, and let f_0(x) represent the initial temperature at the different locations. The heat flow with these initial conditions can be described by the following second order differential equation:

∂f(x, t)/∂t − Δf(x, t) = 0,
f(x, 0) = f_0(x), (211)

where Δf(x, t) is a Laplace-Beltrami operator applied to the function f(x, t).

Many existing works on heat diffusion mainly focus on the heat kernel matrix. Formally, let K_t denote the heat kernel matrix at timestamp t, which describes the heat diffusion among different regions in the medium. In the matrix, the entry K_t(x, y) denotes the heat diffused from the original position y to position x at time t. However, it is very difficult to represent the medium as a regular geometry with a known dimension. In the next part, we will introduce how to apply the heat diffusion observations to model information diffusion in network-structured graph data.

Heat Diffusion Model

Given a homogeneous network G = (V, E), for each node u ∈ V in the network, we can represent the information at u at timestamp t as f(u, t). The initial information available at each of the nodes can be denoted as f(u, 0). Information can be propagated among the nodes in the network if there exists a pipe (i.e., a link) between them. For instance, with a link (u, v) ∈ E in the network, information can be propagated between u and v.

Generally, in the diffusion process, the amount of information propagated between different nodes in the network depends on (1) the difference in the information available at them, and (2) the thermal conductivity, i.e., the heat diffusion coefficient α. For instance, at timestamp t, we can represent the amount of information reaching nodes u, v ∈ V as f(u, t) and f(v, t). If f(u, t) > f(v, t), information tends to propagate from u to v in the network, and the amount of information propagated is α · (f(u, t) − f(v, t)); the propagation direction is reversed if f(u, t) < f(v, t). The information amount change at node u between timestamps t and t + Δt can be represented as

(f(u, t + Δt) − f(u, t)) / Δt = − Σ_{v ∈ Γ(u)} α · (f(u, t) − f(v, t)). (212)

Let us use the vector f(t) to represent the amount of information available at all the nodes in the network at timestamp t. The above information amount change can be rewritten as

(f(t + Δt) − f(t)) / Δt = αHf(t), (213)

where in the matrix H ∈ R^{|V|×|V|}, the entry H(u, v) has the value

H(u, v) = 1, if (u, v) ∈ E ∨ (v, u) ∈ E;
H(u, v) = −D(u), if u = v; (214)
H(u, v) = 0, otherwise,

where D(u) denotes the degree of node u in the network.

In the limit case Δt → 0, we can rewrite the equation as

df(t)/dt = αHf(t). (215)

Solving this equation, we can represent the amount of information at each node in the network as

f(t) = e^{αtH} f(0) (216)
     = (I + αtH + (α²t²/2!)H² + (α³t³/3!)H³ + · · ·) f(0), (217)

where the term e^{αtH} is called the diffusion kernel matrix, which can be expanded according to Taylor's theorem.
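Numerically, f(t) follows from a matrix exponential. Below is a minimal sketch on a toy path graph (hypothetical α and t; scipy's expm computes e^{αtH} directly instead of truncating the Taylor series):

import numpy as np
from scipy.linalg import expm

def heat_diffusion(adj, f0, alpha, t):
    H = adj - np.diag(adj.sum(axis=1))   # Eq. (214): 1 on edges, -D(u) on diagonal
    return expm(alpha * t * H) @ f0      # Eq. (216)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
f_t = heat_diffusion(A, f0=np.array([1.0, 0.0, 0.0]), alpha=0.5, t=1.0)
# heat initially at node 0 spreads towards nodes 1 and 2 over time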

7.2 Intertwined Diffusion Models

The models introduced in the previous section are all proposed for modeling the diffusion of information in online social networks involving one single type of connection propagating one type of information only. However, in the real world, multiple types of information can propagate within the network simultaneously, and the relationships among them can be quite intertwined, including competitive, complementary and independent relationships. Furthermore, even when the network structure is homogeneous, the social links among users may be associated with polarities indicating the relationships among the users. For instance, some social links denote friendship, while other links indicate that the user pairs are enemies. Formally, a social network structure with polarities associated with the social links is called a signed network, where the link polarities can greatly affect the information diffusion.

In this section, we will introduce the intertwined diffusion models to describe the information propagation process for both (1) information entities with intertwined relationships, and (2) network structures with links of different polarities. The models to be introduced in this section are based on [134; 124] respectively.

7.2.1 Intertwined Diffusion Models for Multiple Topics

Traditional information diffusion studies mainly focus on one single online social network and have extensive concrete applications in the real world, e.g., product promotion [21; 71] and opinion spread [17]. In the traditional viral marketing setting [25; 47], only one product/idea is to be promoted. However, in real scenarios, the promotions of multiple products can co-exist in the social networks at the same time, which is referred to as the intertwined information diffusion problem.

The relationships among the products to be promoted in the network can be very complicated. For example, in Figure 9, we show 4 different products to be promoted in an online social network, with the HP printer as our target product. At the product level, the relationships among these products can be:

• independent: promotion activities of some products (e.g., HP printer and Pepsi) can be independent of each other.

• competing: products having common functions will compete for the market share [9; 13] (e.g., HP printer and Canon printer). Users who have bought an HP printer are less likely to buy a Canon printer again.

• complementary: product cross-sell is also very common in marketing [71]. Users who have bought a certain product (e.g., a PC) will be more likely to buy another product (e.g., an HP printer), and the promotion of the PC is said to be complementary to that of the HP printer.

Figure 9: Intertwined relationships among products (the HP printer is the target product; the Canon printer is competing, the PC complementary, and Pepsi independent).

In this section, we will study the information diffusion problem in online social networks where multiple products are being promoted simultaneously. The relationships among these products, which can be independent, competing or complementary, can be obtained in advance via effective market research. A novel information diffusion model, interTwined Linear Threshold (Tlt), will be introduced in this section. Tlt quantifies the impacts among products with an intertwined threshold updating strategy and can handle the intertwined diffusions of these products at the same time.

Diffusion Setting Description and Concept Definition

Definition 41. (Social Network): An online social network can be represented as G = (V, E), where V is the set of users and E contains the interactions among the users in V. The set of n different products to be promoted in the network G can be represented as P = {p_1, p_2, ..., p_n}.

Definition 42. (User Status Vector): For a given product p_j ∈ P, users who are influenced to buy p_j are defined to be "active" to p_j, while the remaining users who have not bought p_j are defined to be "inactive" to p_j. User u_i's status towards all the products in P can be represented as the "user status vector" s_i = (s_i^1, s_i^2, ..., s_i^n), where s_i^j is u_i's status with regard to product p_j. Users can be activated by multiple products at the same time (even competing products), i.e., multiple entries in the status vector s_i can be "active" concurrently.

Definition 43. (Independent, Competing and Complementary Products): Let P(s_i^j = 1) (or P(s_i^j) for simplicity) denote the probability that u_i is activated by product p_j, and let P(s_i^j | s_i^k) be the conditional probability given that u_i has already been activated by p_k. For products p_j, p_k ∈ P, the promotion of p_k is defined to be (1) independent of that of p_j if ∀u_i ∈ V, P(s_i^j | s_i^k) = P(s_i^j); (2) competing with that of p_j if ∀u_i ∈ V, P(s_i^j | s_i^k) < P(s_i^j); and (3) complementary to that of p_j if ∀u_i ∈ V, P(s_i^j | s_i^k) > P(s_i^j).

TLT Diffusion Model

To depict the intertwined diffusions of multiple independent/competing/complementary products, a new information diffusion model, Tlt, is introduced in [134]. In the presence of multiple products P, user u_i's influence on his neighbor u_k in promoting product p_j can be represented as w_{i,k}^j ≥ 0. Similar to the traditional LT model, in Tlt the influence of different products propagates within the network step by step. User u_i's threshold for product p_j can be represented as θ_i^j, and u_i will be activated by his neighbors to buy product p_j if

Σ_{u_l ∈ Γ_out(u_i)} w_{l,i}^j ≥ θ_i^j. (218)

Different from the traditional LT model, in Tlt users in online social networks can be activated by multiple products at the same time, which can be either independent, competing or complementary. As shown in Figure 9, users' chances of buying the HP printer will be (1) unchanged given that they have bought Pepsi (i.e., the product independent of the HP printer), (2) increased if they own PCs (i.e., the product complementary to the HP printer), and (3) decreased if they already have the Canon printer (i.e., the product competing with the HP printer).

To model such a phenomenon in Tlt, the following intertwined threshold updating strategy has been introduced in [134], where users' thresholds for different products change dynamically as the influence of other products propagates in the network.

Definition 44. (Intertwined Threshold Updating Strategy): Assuming that user u_i has been activated by m products p_{τ_1}, p_{τ_2}, ..., p_{τ_m} ∈ P \ {p_j} in a sequence, then u_i's threshold towards product p_j will be updated as follows:

(θ_i^j)^{τ_1} = θ_i^j · P(s_i^j) / P(s_i^j | s_i^{τ_1}),
(θ_i^j)^{τ_2} = (θ_i^j)^{τ_1} · P(s_i^j | s_i^{τ_1}) / P(s_i^j | s_i^{τ_1}, s_i^{τ_2}), · · · (219)

(θ_i^j)^{τ_m} = (θ_i^j)^{τ_{m−1}} · P(s_i^j | s_i^{τ_1}, ..., s_i^{τ_{m−1}}) / P(s_i^j | s_i^{τ_1}, ..., s_i^{τ_{m−1}}, s_i^{τ_m}), (220)

where (θ_i^j)^{τ_k} denotes u_i's threshold for p_j after he has been activated by p_{τ_1}, p_{τ_2}, ..., p_{τ_k}, k ∈ {1, 2, ..., m}.

In this section, we do not focus on the order of the products that activate users [17]; to simplify the calculation of the threshold updating strategy, we assume that only the most recent activation has an effect on updating the current thresholds, i.e.,

P(s_i^j | s_i^{τ_1}, ..., s_i^{τ_{m−1}}) / P(s_i^j | s_i^{τ_1}, ..., s_i^{τ_{m−1}}, s_i^{τ_m}) ≈ P(s_i^j) / P(s_i^j | s_i^{τ_m}) = φ_i^{τ_m → j}. (221)

Definition 45. (Threshold Updating Coefficient): The term φ_i^{l→j} = P(s_i^j) / P(s_i^j | s_i^l) is formally defined as the "threshold updating coefficient" of product p_l with regard to product p_j for user u_i, where

φ_i^{l→j} < 1, if p_l is complementary to p_j;
φ_i^{l→j} = 1, if p_l is independent of p_j; (222)
φ_i^{l→j} > 1, if p_l is competing with p_j.

The intertwined threshold updating strategy can be rewritten based on the threshold updating coefficients as follows:

(θ_i^j)^{τ_m} ≈ θ_i^j · φ_i^{τ_1→j} · φ_i^{τ_2→j} · · · φ_i^{τ_m→j}. (223)
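Numerically, the strategy is just a running product of coefficients. A minimal sketch with hypothetical values (names are ours):

def updated_threshold(theta, phis):
    # Eq. (223): multiply the base threshold by the coefficient of each
    # product that has activated the user so far.
    for phi in phis:
        theta *= phi
    return theta

# base threshold 0.4; a complementary product (phi = 0.8) lowers it,
# then a competing product (phi = 1.5) raises it again:
print(updated_threshold(0.4, [0.8, 1.5]))   # 0.48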

7.2.2 Diffusion Models for Signed Networks

In recent years, signed networks [147; 100] have gained increasing attention because of their ability to represent diverse and contrasting social relationships. Some examples of such contrasting relationships include friends vs enemies [112], trust vs distrust [114], positive attitudes vs negative attitudes [115], and so on.

Figure 10: Example of the information diffusion problem in signed networks (social links carry trust or distrust polarities; user states can be trust, distrust, inactive or unknown).

These contrasting relationships can be represented as links of different polarities, which result in signed networks. Signed social networks can provide a meaningful perspective on a wide range of social network studies, like user sentiment analysis [111], social interaction pattern extraction [57], trustworthy friend recommendation [56], and so on.

Information dissemination is common in social networks [86]. Due to the extensive social links among users, information on certain topics, e.g., politics, celebrities and product promotions, can propagate rapidly, leading to a large number of nodes reporting the same (possibly incorrect) observations in online social networks. In particular, the links in signed networks are of different polarities and can denote trust and distrust relationships among users [59], which inevitably has an impact on information propagation.

In Figure 10, an example is provided to help illustrate the information diffusion problem in signed networks more clearly. In the example, users are connected to one another with signed links, depending on their trust and distrust relations. It is noteworthy that the conventions used for the direction of information diffusion in this network are slightly different from traditional influence analysis, because they represent signed links. For instance, if Alice trusts (or follows) Bob, a directed edge exists from Alice to Bob, but the information diffusion direction will be from Bob to Alice. Via the signed links, inactive users in the network can get infected by certain information propagated from their neighbors with either a positive or negative opinion about the information (i.e., the green or red states in the figure). Considering the fact that it is often difficult to directly identify all the user infection states in real settings, we allow for the possibility of some user states in the network to be unknown. Activated users can propagate the information to other users. In general, if a user is activated with a positive or negative opinion about the information, she might activate one or more of her incoming neighbors to trust or distrust the information, depending on the sign of the incoming link.

In this subsection, we will focus on studying the information diffusion problem in signed networks. The edges in the network are directed and signed, and they represent trust or distrust relationships. For example, when node i trusts or distrusts node j, we will have a corresponding positive or negative link from node i to node j. In this setting, nodes are associated with states corresponding to a prevailing opinion about the truth of a fact. These states can be drawn from {−1, +1, 0, ?}, where +1 indicates agreement with a specific fact, −1 indicates disagreement, 0 indicates that the node has no opinion of the fact at hand, and ? indicates that the node's opinion is unknown.


The last of these states is necessary to model the fact that the states of many nodes in large-scale networks are often unknown. Note that the use of multiple states of nodes in the network is different from traditional influence analysis. Users are influenced with varying opinions of the fact in question, based on their observation of their neighbors (i.e., states of neighborhood nodes), and their trust or distrust of their neighbors' opinions (i.e., signs of links with them). This model is essentially a signed version of influence propagation models, because the sign of the link plays a critical role in how a specific bit of information is transmitted.

Most existing information diffusion models are designed for unsigned networks. In signed networks, information diffusion is also related to actor-centric trust and distrust, in which notions of node states and the signs on links play an important role. To depict how information propagates in signed networks, a new diffusion model, namely the asyMmetric Flipping Cascade (MFC), has been introduced for signed networks in [124].

Traditional social networks are unsigned in the sense that the links are assumed, by default, to be positive links. Signed social networks are a generalization of this basic concept.

Definition 46. (Weighted Signed Social Network): Formally, a weighted signed social network can be represented as a graph G = (V, E, s, w), where V and E represent the nodes (users) and directed edges (social links), respectively. In signed networks, each social link has its own polarity (i.e., the sign) and is associated with a weight indicating the intimacy among users, which can be represented with the mappings s : E → {−1, +1} and w : E → [0, 1] respectively.

As discussed before, we interpret the signs from a trust-centric point of view. Information propagated among users is highly associated with the intimacy scores [137] among them: information tends to propagate among close users. To represent the information diffusion process in trust-centric networks, the concept of the weighted signed diffusion network is defined as follows:

Definition 47. (Weighted Signed Diffusion Network): Formally, given a signed social network G, its corresponding weighted signed diffusion network can be represented as $G_D = (\mathcal{V}_D, \mathcal{E}_D, s_D, w_D)$, where $\mathcal{V}_D = \mathcal{V}$ and $\mathcal{E}_D = \{(v, u)\}_{(u,v) \in \mathcal{E}}$. Diffusion links in $\mathcal{E}_D$ share the same sign and weight mappings as those in $\mathcal{E}$, which can be obtained via the mappings $s_D : \mathcal{E}_D \to \{-1, +1\}$, $s_D(v, u) = s(u, v), \forall (v, u) \in \mathcal{E}_D$ and $w_D : \mathcal{E}_D \to [0, 1]$, $w_D(v, u) = w(u, v), \forall (v, u) \in \mathcal{E}_D$. For any directed diffusion link $(u, v) \in \mathcal{E}_D$, we can represent its sign and weight as $s_D(u, v)$ and $w_D(u, v)$ respectively.

Note that we have reversed the direction of the links because of the trust-centric interpretation, in which information diffuses from A to B when B trusts A. However, in networks with other semantic interpretations, this reversal does not need to be performed. The overall algorithm is agnostic to the specific preprocessing performed in order to fit a particular semantic interpretation of the signed network.

MFC Diffusion Model

The IC model, which assumes that social links are all of the same polarity, works for unsigned networks, but it cannot be applied to signed networks with node states that reflect beliefs of different polarities. To overcome such a shortcoming, a novel diffusion model, MFC, will be introduced in this section.

Algorithm 11 MFC Information Diffusion Model

Input: rumor initiators I with states S; diffusion network G_D = (V_D, E_D, s_D, w_D)
Output: infected diffusion network G_I
1: initialize infected user set U = I, state set S_U = S
2: let recently infected user set R = I
3: while R ≠ ∅ do
4:   new recently infected user set N = ∅
5:   for u ∈ R do
6:     let the set of users that u can activate be Γ(u)
7:     for v ∈ Γ(u) do
8:       if s(v) = 0 or (s_D(u, v) = +1 and s(u) ≠ s(v)) then
9:         if s_D(u, v) = +1 then
10:          p = min{1.0, α · w_D(u, v)}
11:        else
12:          p = w_D(u, v)
13:        end if
14:        if u activates v with probability p then
15:          U = U ∪ {v}, S_U = S_U ∪ {s(v) = s(u) · s_D(u, v)}
16:          N = N ∪ {v}
17:        end if
18:      end if
19:    end for
20:  end for
21:  R = N
22: end while
23: extract infected diffusion network G_I consisting of infected users U

The signs associated with diffusion links denote the "positive" and "negative" relationships, e.g., trust and distrust, among users. In everyday life, people tend to believe information from people they trust and not to believe information from those they distrust. For example, if someone we trust says that "Hillary Clinton will be the new president", we believe it to be true. However, if someone we distrust says the same thing, we might not believe it. In addition, when receiving contradictory messages, information obtained from trusted people is usually given higher weight. In other words, the effects of trust and distrust diffusion links are asymmetric in activating users. For instance, when various actors assert that "Hillary Clinton will be the new president", we may tend to follow those we trust, even though the distrusted ones also say it. In addition, if someone we distrust says that "Hillary Clinton will be the new president", we may think it to be false and will not believe it. However, after being activated to distrust it, if we are exposed to contradictory information from a trusted party, we might be willing to change our minds. To model such cases, which are unique to signed and state-centric networks, [124] proposes to follow two basic principles in the MFC model: (1) the effect of positive links in activating users is boosted to give them higher weights, and (2) users who are activated already will stay active in the subsequent rounds, but their activation states can be flipped by the people they trust.

In MFC, users have 3 unique known states in the information diffusion process: {+1, −1, 0} (i.e., trust, distrust and inactive respectively). Users with unknown states are automatically taken into account during the model construction process by assuming states as necessary. For simplicity, we use s(·) to represent both the sign of links as well as the states of users. If user u trusts the information, then user u is said to have a positive state s(u) = +1 towards the information. The initial states of all users in MFC are assigned a value of 0 (i.e., inactive to the information).


Figure 11: Example of "simultaneous activation" (users B-E activating A at the same step) and "sequential activation" (G activated by F, then possibly flipped by the trusted user H) under the IC model and the MFC model.

A set of information seed users I ⊆ V activated by the information at the very beginning will have their own attitudes towards the information based on their judgements, which can be represented as $S \in \{+1, -1\}^{|I|}$. Information seed users in I spread the information to other users in the signed network step by step. At step τ, user u (activated at τ − 1) is given only one chance to activate (1) an inactive neighbor v, as well as (2) an active neighbor v whose state differs from u's and who trusts u, with the success probability

$$p(u, v) = \begin{cases} \min\{\alpha \cdot w_D(u, v), 1\}, & \text{if } s_D(u, v) = +1, \\ w_D(u, v), & \text{otherwise}. \end{cases} \quad (224)$$

In the above equation, the parameter α > 1 denotes the boosting of information from u to v and is called the asymmetric boosting coefficient.

If u succeeds, v will become active in step τ + 1, and v's state can be represented as s(v) = s(u) · s_D(u, v). For example, if user u thinks the information to be real (i.e., s(u) = +1) and v trusts u (i.e., s_D(u, v) = +1), once v gets activated by u successfully, the state of v will be s(v) = +1 (i.e., v believes the information to be true). Otherwise, v will keep its original state (either inactive or activated) and u cannot make any further attempts to activate v in subsequent rounds. All activated users will stay active in the following rounds and the process continues until no more activations are possible.

MFC can model the information diffusion process in signed social networks much better than traditional diffusion models, such as IC. To illustrate the advantages of MFC, we also give an example in Figure 11, where two different cases, "simultaneous activation" (i.e., the left two plots) and "sequential activation" (i.e., the right two plots), are shown. In the "simultaneous activation" case, multiple users (B, C, D and E) are all just activated at step τ, all of whom think a piece of information to be true, and at step τ + 1, B-E will try to activate their inactive neighbor A. Among these users, A trusts E and distrusts the remaining users. In traditional IC models, signs on links are ignored and B-E are given equal chances to activate A in random order with activation probabilities $w_D(\cdot, A), \cdot \in \{B, C, D, E\}$. However, in the MFC model, signs of links are utilized and the activation probability of the positive diffusion link (E, A) will be boosted to $\min\{\alpha \cdot w_D(E, A), 1\}$. As a result, user A is more likely to be activated by E in MFC. Meanwhile, in the "sequential activation" case, once a user (e.g., F) succeeds in activating G, G will remain active and other users (e.g., H) cannot reactivate G any longer in the traditional IC model. However, in the MFC model, we allow users to flip their activation states following the people they trust. For example, if G has been activated by F with state s(G) = −1 already, the trusted user H can still have the chance to flip G's state with probability $\min\{\alpha \cdot w_D(H, G), 1\}$. The pseudo-code of the MFC diffusion model is provided in Algorithm 11.

7.3 Inter-Network Information Diffusion via Network Coupling

The information diffusion models introduced in the previous sections are mostly based on one single network, assuming that information will propagate within that network only. However, in the real world, users are involved in multiple social sites simultaneously, and cross-site information diffusion is happening all the time. Acting as bridges, users can receive information in one social site and share it with their friends in another network. Sometimes, due to the social network settings, the activities happening in one social site (e.g., Foursquare) can be reposted to other social sites (e.g., Twitter) automatically.

In this section and the following two sections, we will study information diffusion across multiple social sites, and several existing cross-network information diffusion models will be introduced. Generally, different networks will create different information diffusion sources, and the interactions available among users in each of the sources can all propagate information among users. Two cross-network information diffusion models based on network coupling and random walk will be introduced first. Meanwhile, in each of the diffusion sources, there usually exist different types of diffusion channels, since users can interact with each other via different types of services provided by the network service providers. A diffusion model named Muse will then be introduced to depict how information belonging to different topics diffuses via multiple channels across multiple sources.

Cross-network information sharing and reposting render inter-network information diffusion ubiquitous in real-world online social networks. By participating in multiple online social networks simultaneously, users can also be exposed to more information from multiple social sites at the same time. Generally, once a user has been activated in one of the social sites, the account owner will receive the information and can diffuse it to other users in the other networks. The network coupling model proposes to combine multiple social networks together and treat the information diffusion in each of the networks independently.

7.3.1 Single Network Diffusion Model

Formally, let $G^{(1)}, G^{(2)}, \cdots, G^{(k)}$ denote the k online social networks that we are focusing on in the information diffusion model, whose network structures are all homogeneous, involving users and friendship links only. Each network $G^{(i)}$ can be represented as $G^{(i)} = (\mathcal{V}^{(i)}, \mathcal{E}^{(i)})$, where $\mathcal{V}^{(i)}$ denotes the set of users and $\mathcal{E}^{(i)}$ the set of friendship links in the network. The information diffusion process in network $G^{(i)}$ can be modeled with some existing models; in this part, we will use the LT model as the base diffusion model for each of the networks.

Based on network $G^{(i)}$, each user u in the network is associated with a threshold $\theta_u^{(i)}$ indicating the minimal amount of information required to activate the user. Meanwhile, the amount of information sent between users u and v can be denoted as the weight $w_{u,v}^{(i)}$, whose value can be determined in the same way as in the LT model introduced before. An inactive user u can be activated iff the amount of information propagated from his/her friends reaches u's threshold, i.e.,

$$\sum_{v \in \Gamma(u)} I(v, t) \cdot w_{v,u}^{(i)} \ge \theta_u^{(i)}, \quad (225)$$

where Γ(u) represents the neighbors of user u and I(v, t) indicates whether v has been activated or not at time t.

7.3.2 Network Coupling Scheme

Generally, in the real world, among these k different online social sites $G^{(1)}, G^{(2)}, \cdots, G^{(k)}$, if there exists one network $G^{(i)}$ in which the above inequality holds, user u will become active. In other words, to determine whether user u has been activated or not, we need to check his/her status in all these k networks one by one. To reduce the activation checking work, in the lossy network coupling scheme, the activation criterion is relaxed to

$$\sum_{i=1}^{k} \alpha^{(i)} \cdot \sum_{v \in \Gamma(u)} I(v, t) \cdot w_{v,u}^{(i)} \ge \sum_{i=1}^{k} \alpha^{(i)} \cdot \theta_u^{(i)}, \quad (226)$$

where $\alpha^{(1)}, \alpha^{(2)}, \cdots, \alpha^{(k)} > 0$ denote the parameters representing the importance of the different networks.

Theorem 1. Given the k networks $G^{(1)}, G^{(2)}, \cdots, G^{(k)}$, if the inequality

$$\sum_{i=1}^{k} \alpha^{(i)} \cdot \sum_{v \in \Gamma(u)} I(v, t) \cdot w_{v,u}^{(i)} \ge \sum_{i=1}^{k} \alpha^{(i)} \cdot \theta_u^{(i)} \quad (227)$$

holds, user u will be activated.

Proof. The theorem can be proven by contradiction. Let us assume the inequality holds but u has not been activated in any of the networks $G^{(1)}, G^{(2)}, \cdots, G^{(k)}$; then we have

$$\sum_{v \in \Gamma(u)} I(v, t) \cdot w_{v,u}^{(i)} < \theta_u^{(i)} \quad (228)$$

for all these networks.

By multiplying both sides of this inequality by a positive weight $\alpha^{(i)}$ and summing over all k networks, we have

$$\sum_{i=1}^{k} \alpha^{(i)} \cdot \sum_{v \in \Gamma(u)} I(v, t) \cdot w_{v,u}^{(i)} < \sum_{i=1}^{k} \alpha^{(i)} \cdot \theta_u^{(i)}, \quad (229)$$

which contradicts the inequality in the theorem.

Therefore, if the new activation criterion holds, user u will be activated.

The relaxed activation criterion is actually a sufficient but not necessary condition for determining whether u is activated. In some cases, u has already been activated in some of the networks, but the criterion cannot be met, which will lead to some latency in status checking. One way to alleviate the problem is to assign an appropriate weight $\alpha^{(i)}$ by increasing its value in proportion to $\sum_{v \in \Gamma(u)} I(v, t) \cdot w_{v,u}^{(i)} - \theta_u^{(i)}$. In the special case that user u can already be activated in network $G^{(i)}$, we can assign the weight $\alpha^{(i)}$ a very large value, where $\alpha^{(i)} \gg \alpha^{(j)}, j \in \{1, 2, \cdots, k\}, j \ne i$. So far, there do not exist any methods to adjust these parameters automatically, and heuristics are applied in most cases.
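The relaxed criterion is easy to check in code. The following minimal Python sketch (names are ours) evaluates Equation (226) for a single user across k coupled networks:

def coupled_activation(neighbor_weights, active, thresholds, alphas):
    # neighbor_weights[i]: dict v -> w_{v,u}^(i) for network i; active: set of
    # active users; thresholds[i]: theta_u^(i); alphas[i]: network weight alpha^(i).
    lhs = sum(a * sum(w for v, w in wi.items() if v in active)
              for a, wi in zip(alphas, neighbor_weights))
    rhs = sum(a * t for a, t in zip(alphas, thresholds))
    return lhs >= rhs  # Eq. (226): sufficient condition for u's activation

# u is clearly activated in network 0, so network 0 is given a dominant weight.
print(coupled_activation([{"v1": 0.6}, {"v1": 0.1}],
                         active={"v1"}, thresholds=[0.5, 0.5], alphas=[10.0, 1.0]))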

7.4 Random Walk based Diffusion Model

Different online social networks usually have their own characteristics, and users tend to have different statuses regarding the same information. For instance, information about personal entertainment (like movies or pop stars) can be widely spread among users in Facebook, and users interested in it will be activated very easily and will also share the information with their friends. However, such information is relatively rare in the professional social network LinkedIn, where people seldom share personal entertainment with their colleagues, even though they may have been activated already in Facebook. What's more, the structures of these online social networks are usually heterogeneous, containing many different kinds of connections. Besides the direct follow relationships among the users, these diverse connections may create different types of communication channels for information diffusion. To model such observations in information diffusion across multiple heterogeneous online social sites, in this part, we will introduce a new information diffusion model, IPath, based on random walk.

7.4.1 Intra-Network Propagation

Traditional research works on homogeneous networks assume that information can only be spread via the social links among users. If user v follows user u, i.e., (v, u) ∈ E (where E is the edge set), a message can spread from u to v, i.e., u → v. However, in a heterogeneous network, multi-typed and interconnected entities, such as images, videos and locations, can create various information propagation relations among users. For instance, if user u recommends a good restaurant to his friend v by checking in at this place, information will flow from u to v through the location entity l, which can be expressed by the route u → l → v via the check-in links. Similarly, we can represent the information diffusion routes among users via other information entities, which can be formally represented as the diffusion route set R = {r_1, r_2, ..., r_m}, where m is the number of routes.

According to each diffusion route, we can represent the connections among users as an adjacency matrix. Take the source network $G^{(s)} = (\mathcal{V}^{(s)}, \mathcal{E}^{(s)})$ as an example. For any diffusion route $r_i \in \mathcal{R}$, the adjacency matrix of $r_i$ will be $\mathbf{A}_i^{(s)} \in \mathbb{R}^{|\mathcal{V}^{(s)}| \times |\mathcal{V}^{(s)}|}$, where $\mathbf{A}_i^{(s)}(u, v)$ is a binary variable and $\mathbf{A}_i^{(s)}(u, v) = 1$ iff u and v are connected with each other via relation $r_i$. The weighted diffusion matrix can be represented as the normalization $\mathbf{W}_i^{(s)} = \mathbf{A}_i^{(s)} \mathbf{D}^{-1}$, where $\mathbf{D}$ is a diagonal matrix with $\mathbf{D}(u, u) = \sum_{v}^{|\mathcal{V}^{(s)}|} \mathbf{A}_i^{(s)}(v, u)$ denoting the in-degree of u. The entry $\mathbf{W}_i^{(s)}(u, v)$ denotes the probability of going from v to u in one step. In a similar way, we can represent the weighted diffusion matrices for the other relations, which altogether can be represented as $\mathbf{W}_1^{(s)}, \mathbf{W}_2^{(s)}, \ldots, \mathbf{W}_m^{(s)}$. To fuse the information diffused through different relations, IPath linearly combines these weighted matrices as follows:

$$\mathbf{W}^{(s)} = \lambda_1 \times \mathbf{W}_1^{(s)} + \lambda_2 \times \mathbf{W}_2^{(s)} + \cdots + \lambda_m \times \mathbf{W}_m^{(s)}, \quad (230)$$

where $\lambda_i$ denotes the aggregation weight of the matrix corresponding to relation $r_i$.


Figure 12: The weight matrix and the information distribution vectors. (The integrated matrix combines the intra-network blocks αW^(s) and αW^(t) with the cross-network blocks (1−α)W^(s→t) and (1−α)W^(t→s), and π = [π^(s); π^(t)].)

In real scenarios, different relations play different roles in the information propagation for different users. However, to simplify the settings, in IPath, all these relations are treated as equally important, and the aggregated matrix W^(s) takes the average of all these weighted diffusion matrices. In a similar way, we can define the weight matrix W^(t) of the target network G^(t).
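A minimal numpy sketch of this construction (assuming equal weights λ_i = 1/m, as IPath does) column-normalizes each relation's adjacency matrix and averages the results:

import numpy as np

def aggregated_diffusion_matrix(adjacencies):
    # adjacencies: list of n x n 0/1 arrays, one per diffusion route r_i
    mats = []
    for A in adjacencies:
        deg = A.sum(axis=0).astype(float)  # in-degrees D(u, u), one per column
        deg[deg == 0] = 1.0                # guard against isolated nodes
        mats.append(A / deg)               # W_i = A_i D^{-1} (column-normalized)
    return sum(mats) / len(mats)           # Eq. (230) with lambda_i = 1/m

A_follow = np.array([[0., 1.], [1., 0.]])  # hypothetical follow-based route
A_checkin = np.array([[0., 0.], [1., 0.]]) # hypothetical check-in-based route
print(aggregated_diffusion_matrix([A_follow, A_checkin]))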

7.4.2 Inter-Network Propagation

Across the aligned networks, information can propagate not only within networks but also across networks. Based on the known anchor links between networks $G^{(t)}$ and $G^{(s)}$, i.e., the set $\mathcal{A}^{(s,t)}$, we can define the binary adjacency matrix $\mathbf{A}^{(s \to t)} \in \mathbb{R}^{|\mathcal{V}^{(s)}| \times |\mathcal{V}^{(t)}|}$, where $\mathbf{A}^{(s \to t)}(u, v) = 1$ if $(u^{(s)}, v^{(t)}) \in \mathcal{A}^{(s,t)}$. IPath assumes that each anchor user in $G^{(s)}$ only has one corresponding account in $G^{(t)}$. Therefore $\mathbf{A}^{(s \to t)}$ is already normalized, and the weight matrix $\mathbf{W}^{(s \to t)} = \mathbf{A}^{(s \to t)}$ denotes the chance of information propagating from $G^{(s)}$ to $G^{(t)}$. Furthermore, we can represent the weighted diffusion matrix from network $G^{(t)}$ to $G^{(s)}$ as $\mathbf{W}^{(t \to s)} = (\mathbf{W}^{(s \to t)})^\top$, considering that the anchor links are undirected.

7.4.3 The IPath Information Propagation Model

Both the intra-network propagation relations, represented by the weight matrices $\mathbf{W}^{(s)}$ and $\mathbf{W}^{(t)}$ in networks $G^{(s)}$ and $G^{(t)}$ respectively, and the inter-network propagation relations, represented by the weight matrices $\mathbf{W}^{(s \to t)}$ and $\mathbf{W}^{(t \to s)}$, have been constructed in the previous subsections. As shown in Figure 12, to model the cross-network information diffusion process involving both the intra- and inter-network relations simultaneously, IPath combines these weighted diffusion matrices to build an integrated matrix $\mathbf{W} \in \mathbb{R}^{(|\mathcal{V}^{(s)}|+|\mathcal{V}^{(t)}|) \times (|\mathcal{V}^{(s)}|+|\mathcal{V}^{(t)}|)}$. In the integrated matrix W, the parameter α ∈ [0, 1] denotes the probability that a message stays in the original network, and thus 1 − α represents the chance of it being transmitted across networks (i.e., the probability of an activated anchor user passing the influence to the target network). In real scenarios, the probabilities for different users to repost information across aligned networks can be quite diverse. However, to simplify the problem setting, in IPath, these probabilities are unified with the parameter α.

Let the vector $\pi_k \in \mathbb{R}^{|\mathcal{V}^{(s)}|+|\mathcal{V}^{(t)}|}$ represent the information that users in $G^{(s)}$ and $G^{(t)}$ can receive after k steps. As shown in Figure 12, the vector $\pi_k$ consists of two parts, $\pi_k = [\pi_k^{(s)}; \pi_k^{(t)}]$, where $\pi_k^{(s)} \in \mathbb{R}^{|\mathcal{V}^{(s)}|}$ and $\pi_k^{(t)} \in \mathbb{R}^{|\mathcal{V}^{(t)}|}$. The initial state of the vector can be denoted as $\pi_0$, which is defined based on the seed user set Z with the function g(·) as follows:

$$\pi_0 = g(Z), \text{ where } \pi_0[u] = \begin{cases} 1 & \text{if } u \in Z, \\ 0 & \text{otherwise}. \end{cases} \quad (231)$$

The seed set Z can also be recovered as $Z = g^{-1}(\pi_0)$. Users from $G^{(s)}$ and $G^{(t)}$ both have the chance of being selected as seeds, but when the structure information of $G^{(t)}$ is hard to obtain, the seed users will be chosen from $G^{(s)}$ only. In IPath, the information diffusion process is modeled by random walk, which is widely used and keeps the total probability of diffusing through the different relations constant at 1 [103; 33]. Therefore, in the information propagation process, the vector π will be updated stepwise with the following equation:

$$\pi_{k+1} = (1 - \alpha) \times \mathbf{W} \pi_k + \alpha \times \pi_0, \quad (232)$$

where the constant α denotes the probability of returning to the initial state. By updating π according to (232) until convergence, we obtain the stationary state of the vector π, denoted π*:

$$\pi^* = \alpha [\mathbf{I} - (1 - \alpha)\mathbf{W}]^{-1} \pi_0, \quad (233)$$

where $\mathbf{I} \in \{0, 1\}^{(|\mathcal{V}^{(s)}|+|\mathcal{V}^{(t)}|) \times (|\mathcal{V}^{(s)}|+|\mathcal{V}^{(t)}|)}$ is an identity matrix. The value of the entry π*[u] denotes the activation probability of u, and user u will be activated if π*[u] ≥ θ, where θ denotes the threshold of accepting the message. In IPath, the parameter θ is randomly sampled from the range [0, θ_bound]. The threshold bound θ_bound is a small constant value, as the amount of information each user can get at the stationary state in IPath can be very small (it is set to 0.01 in the experiments). In addition, we can further represent the activation status of user u with the vector π′, where

$$\pi'[u] = \begin{cases} 1 & \text{if } \pi^*[u] \ge \theta, \\ 0 & \text{otherwise}. \end{cases} \quad (234)$$

In Equation (234), π′[u] = 1 denotes that user u is activated. In practice, the value of π*[u] is usually in [0, 1] when the networks are sparse and the size of the seed set is small, so π′[u] can be approximated as follows:

$$\pi'[u] \approx \lfloor \pi^*[u] - \theta + 1 \rfloor. \quad (235)$$

Based on this, we define the mapping function h between the two vectors, where the floor function is applied to each element of the vector, i.e.,

$$\pi' = h(\pi^*) = \lfloor \pi^* + \mathbf{c} \rfloor, \quad (236)$$

where c is a constant vector whose entries all equal 1 − θ. To calculate the final number of activated users in $G^{(t)}$, we define a $(|\mathcal{V}^{(s)}| + |\mathcal{V}^{(t)}|)$-dimensional constant vector $\mathbf{b} = [0, 0, \cdots, 0, 1, 1, \cdots, 1]$, where the number of 0s is $|\mathcal{V}^{(s)}|$ and the number of 1s is $|\mathcal{V}^{(t)}|$. Thus the influence function of the IPath model can be denoted as

$$\sigma(Z) = \mathbf{b} \cdot h(\pi^*) = \mathbf{b} \cdot h\big(\alpha [\mathbf{I} - (1 - \alpha)\mathbf{W}]^{-1} \cdot g(Z)\big), \quad (237)$$

which effectively computes the number of users who could be activated by the model based on the seed user set Z.
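Putting Equations (233) and (237) together, a small numpy sketch (illustrative; the parameter values are placeholders) computes the stationary information vector and counts the activated target-network users:

import numpy as np

def ipath_influence(W, seed_indices, n_source, alpha=0.15, theta=0.01):
    # W: integrated (|V^(s)|+|V^(t)|)-dimensional weight matrix; n_source = |V^(s)|
    n = W.shape[0]
    pi0 = np.zeros(n)
    pi0[seed_indices] = 1.0                                  # pi_0 = g(Z), Eq. (231)
    pi_star = alpha * np.linalg.solve(np.eye(n) - (1 - alpha) * W, pi0)  # Eq. (233)
    activated = pi_star >= theta                             # pi', Eq. (234)
    return int(activated[n_source:].sum())                   # sigma(Z), Eq. (237)

W = np.full((4, 4), 0.25)   # toy integrated matrix: 2 source + 2 target users
print(ipath_influence(W, seed_indices=[0], n_source=2))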

7.5 MUSE Model across Online and Offline World

Besides propagating within the online world, information can actually propagate across the online and offline worlds simultaneously. In this section, we will use the workplace as one example to illustrate information diffusion via both the online and offline worlds simultaneously. On average, people nowadays spend more than 30% of their time at work every day.


Figure 13: An example of information diffusion at workplace. (Left: an online ESN with users and posts connected by follow, write, reply, like and notify links; right: the organizational chart with supervision links among peers, manager-subordinate and skip-level manager-subordinate pairs; the information diffusion channels span both.)

According to the statistical data in [46], the total amount of time people spend at the workplace in their life is tremendously large. For instance, a young man who is 20 years old now will spend 19.1% of his future time working [46]. Therefore, the workplace is actually an easily neglected yet important social occasion for effective communication and information exchange among people in our social life.

Besides the traditional offline contacts, like face-to-face communication, telephone calls and messaging, a new type of online social network named Enterprise Social Networks (ESNs) has been launched inside the firewalls of many companies to facilitate the cooperation and communication among employees [142; 130]. A representative example is Yammer, which is used by over 500,000 leading businesses around the world, including 85% of the Fortune 500 companies5. Yammer provides various online communication services for employees at the workplace, including instant online messaging, writing/replying to/liking posts, and file upload/download/share. In summary, the communication means among employees at workplaces are diverse and can generally be divided into two categories [105]: (1) offline communication means, and (2) online virtual communication means.

In this section, we will study how information diffuses via both online and offline communication means among employees at the workplace. To help illustrate the problem more clearly, we also give an example in Figure 13. The left plot of Figure 13 depicts an online ESN, in which employees can perform various social activities. For instance, employees can follow each other and can write/reply to/like posts online, and posts written by them can also @ certain employees to send notifications, which creates various online information diffusion channels (i.e., the green lines) among employees. Meanwhile, the relative management relationships among the employees in the company can be represented with the organizational chart (i.e., the right plot), which is a tree-structured diagram connecting employees via supervision links (from managers to subordinates). Colleagues who are close in the organizational chart (e.g., peers, manager-subordinate pairs) may have more chances to meet in the offline workplace. For example, subordinates need to report to their managers regularly, and peers may cooperate to finish projects together, which forms various offline information diffusion channels (i.e., the red lines) among employees at the workplace.

5 https://about.yammer.com/why-yammer/

Definition 48. (Enterprise Social Networks (ESNs)): Online enterprise social networks are a new type of online social network used in enterprises to facilitate employees' communications and daily work, which can be represented as heterogeneous information networks G = (V, E), where $\mathcal{V} = \bigcup_i \mathcal{V}_i$ is the set of different kinds of nodes and $\mathcal{E} = \bigcup_j \mathcal{E}_j$ is the union of complex links in the network.

In this section, we will use Yammer as an example of online ESNs. Yammer can be represented as G = (V, E), where the node set V = U ∪ O ∪ P, with U, O and P being the sets of users, groups and posts respectively; the link set $\mathcal{E} = \mathcal{E}_s \cup \mathcal{E}_j \cup \mathcal{E}_w \cup \mathcal{E}_r \cup \mathcal{E}_l$ denotes the union of social, group membership, write, reply and like links in Yammer respectively. In this section, we regard group participation as the target activity, information about which can diffuse among employees at the workplace. Groups in ESNs are usually of different themes (e.g., new products, state-of-the-art techniques, daily-life entertainment), which are treated as different information topics in this section.

Definition 49. (Organizational Chart): An organizational chart is a diagram outlining the structure of an organization as well as the relative ranks of employees' positions and jobs, which can be represented as a rooted tree C = (N, L, root), where N denotes the set of employees, L is the set of directed supervision links from managers to subordinates in the company, and root usually represents the CEO by default.

Each employee in the company can create exactly one account in Yammer with a valid employment ID, i.e., there is a one-to-one correspondence between the users in Yammer and the employees in the organizational chart. For simplicity, in this section, we assume the user set in the online ESN to be identical to the employee set in the organizational chart (i.e., U = N), and we will use "Employee" to denote individuals in both the online ESN and the offline organizational chart by default.

To address all the above challenges, we will introduce a novel information diffusion model, Muse (Multi-source Multi-channel Multi-topic diffUsion SElection), proposed in [145]. Muse extracts and infers sets of online, offline and hybrid (of online and offline) diffusion channels among employees across the online ESN and the offline organizational structure. Information propagated via different channels can be aggregated effectively in Muse. Different diffusion channels will be weighted according to their importance learned from the social activity log data with optimization techniques, and the top-K effective diffusion channels will be selected in Muse finally.

7.5.1 Preliminary

In this section, the Muse model will be introduced to depict the information propagation process of multiple topics via different diffusion channels across the online and offline worlds at the workplace. We denote the set of topics diffusing in the workplace as T. Three different diffusion sources will be our main focus in this section: the online source, the offline source and the hybrid source (across online and offline sources). The diffusion channel sets of these three sources can be represented as $\mathcal{C}^{(on)}$, $\mathcal{C}^{(off)}$ and $\mathcal{C}^{(hyb)}$ respectively, whose sizes are $|\mathcal{C}^{(on)}| = k^{(on)}$, $|\mathcal{C}^{(off)}| = k^{(off)}$ and $|\mathcal{C}^{(hyb)}| = k^{(hyb)}$.

In Muse, a set of users are activated initially, and their information will propagate in discrete steps within the network to other users. Let v be an employee at the workplace who has been activated by topic t ∈ T. At step τ, v will send an amount $w^{(on),i}(v, u, t)$ of information on topic t to u via the i-th channel in the online source (i.e., channel $c^{(on),i} \in \mathcal{C}^{(on)}$), where u is an employee following v in channel $c^{(on),i}$. The amount of information that u receives from v via all the channels in the online source at step τ can be represented as the vector $\mathbf{w}^{(on)}(v, u, t) = [w^{(on),1}(v, u, t), w^{(on),2}(v, u, t), \cdots, w^{(on),k^{(on)}}(v, u, t)]$. Similarly, we can also represent the vectors of information u receives from v through the channels in the offline source and the hybrid source as the vectors $\mathbf{w}^{(off)}(v, u, t)$ and $\mathbf{w}^{(hyb)}(v, u, t)$ respectively.

Meanwhile, users in Muse are associated with thresholds to different topics, which are selected at random from the uniform distribution over the range [0, 1]. Employee u gets activated by topic t if the information received from his active neighbors via the diffusion channels of all three sources exceeds his activation threshold θ(u, t) for topic t:

$$f\big(\mathbf{w}^{(on)}(\cdot, u, t), \mathbf{w}^{(off)}(\cdot, u, t), \mathbf{w}^{(hyb)}(\cdot, u, t)\big) \ge \theta(u, t), \quad (238)$$

where the aggregation function f(·) maps the information u receives from all the channels to u's activation probability in the range [0, 1]. Here, the vector $\mathbf{w}^{(on)}(\cdot, u, t) = [w^{(on),1}(\cdot, u, t), w^{(on),2}(\cdot, u, t), \cdots, w^{(on),k^{(on)}}(\cdot, u, t)]$, where $w^{(on),i}(\cdot, u, t)$ denotes the information received from all the employees u follows in channel $c^{(on),i}$, i.e.,

$$w^{(on),i}(\cdot, u, t) = \sum_{v \in \Gamma_{out}^{(on),i}(u)} w^{(on),i}(v, u, t). \quad (239)$$

The vectors $\mathbf{w}^{(off)}(\cdot, u, t)$ and $\mathbf{w}^{(hyb)}(\cdot, u, t)$ can be represented in a similar way. Once activated, a user will stay active in the remaining rounds, and each user can be activated at most once. Such a process will end when no new activations are possible.

Considering that individuals' activation thresholds θ(u, t) to topic t are pre-determined by the uniform distribution, next we will focus on studying the information received via the channels of the online, offline and hybrid sources and the aggregation function f(·) in detail.

7.5.2 Online and Offline Diffusion Channels Extraction

Both online ESNs and the offline organizational chart provide various communication means for employees to contact each other, where individuals who have no social connections can still pass information via many other connections. Each connection among employees can form an information diffusion channel across the online ESN and the offline organizational chart. In Muse, various diffusion channels among employees will be extracted based on a set of social meta paths [98] extracted across the online and offline worlds.

In enterprise social networks, individuals can (1) get information from employees they follow (i.e., their followees) and (2) people that their followees follow (i.e., 2nd-level followees), and obtain information from employees by (3) viewing and replying to their posts, (4) viewing and liking their posts, as well as (5) getting notified by their posts (i.e., being explicitly @-ed in posts). Muse proposes to extract 5 different online social meta paths from the online ESN, whose physical meanings, representations and abbreviated notations are listed as follows:

• Followee: Employee $\xleftarrow{\text{Social}^{-1}}$ Employee, whose notation is Φ1.

• Followee-Followee: Employee $\xleftarrow{\text{Social}^{-1}}$ Employee $\xleftarrow{\text{Social}^{-1}}$ Employee, whose notation is Φ2.

• Reply Post: Employee $\xleftarrow{\text{Reply}^{-1}}$ Post $\xleftarrow{\text{Write}}$ Employee, whose notation is Φ3.

• Like Post: Employee $\xleftarrow{\text{Like}^{-1}}$ Post $\xleftarrow{\text{Write}}$ Employee, whose notation is Φ4.

• Post Notification: Employee $\xleftarrow{\text{Notify}}$ Post $\xleftarrow{\text{Write}}$ Employee, whose notation is Φ5.

Meanwhile, in the offline workplace, the most common social interactions happen between close colleagues, e.g., peers, manager-subordinate and skip-level manager-subordinate pairs. The physical meanings and notations of the offline social meta paths extracted in this section are listed as follows:

• Manager: Employee $\xleftarrow{\text{Supervision}}$ Employee, whose notation is Ω1.

• Subordinate: Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee, whose notation is Ω2.

• Peer: Employee $\xleftarrow{\text{Supervision}}$ Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee, whose notation is Ω3.

• 2nd-Level Manager: Employee $\xleftarrow{\text{Supervision}}$ Employee $\xleftarrow{\text{Supervision}}$ Employee, whose notation is Ω4.

• 2nd-Level Subordinate: Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee, whose notation is Ω5.

Besides the pure online/offline diffusion channels, information can also propagate across the online and offline worlds simultaneously. Consider, for example, two employees v and u who are not connected by any diffusion channels in the online ESN or the offline workplace; v can still influence u by activating u's manager via online contacts, and the manager will further propagate the influence to u via offline interactions. To capture such relationships among the employees, the set of hybrid social meta paths extracted in Muse, together with their physical meanings and notations, is listed as follows:

• Followee-Manager: Employee $\xleftarrow{\text{Social}^{-1}}$ Employee $\xleftarrow{\text{Supervision}}$ Employee, whose notation is Ψ1.

• Followee-Subordinate: Employee $\xleftarrow{\text{Social}^{-1}}$ Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee, whose notation is Ψ2.

• Manager-Followee: Employee $\xleftarrow{\text{Supervision}}$ Employee $\xleftarrow{\text{Social}^{-1}}$ Employee, whose notation is Ψ3.

• Subordinate-Followee: Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee $\xleftarrow{\text{Social}^{-1}}$ Employee, whose notation is Ψ4.

• Followee-Peer: Employee $\xleftarrow{\text{Social}^{-1}}$ Employee $\xleftarrow{\text{Supervision}}$ Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee, whose notation is Ψ5.

• Peer-Followee: Employee $\xleftarrow{\text{Supervision}}$ Employee $\xleftarrow{\text{Supervision}^{-1}}$ Employee $\xleftarrow{\text{Social}^{-1}}$ Employee, whose notation is Ψ6.


The direction of the links denotes the information diffusion direction, and the end of the diffusion links (i.e., the first employee of the above paths) represents the target employee receiving the information. Each of the above social meta paths defines an information diffusion channel among individuals across the online and offline worlds.

Furthermore, let $\mathcal{P}_{\Phi_i}^{(on)}(v \rightsquigarrow \cdot)$ and $\mathcal{P}_{\Phi_i}^{(on)}(\cdot \rightsquigarrow u)$ be the sets of path instances of Φi going out from v and going into u respectively, with which we can define the amount of information propagating from v to u via diffusion channel $c^{(on),i} = \Phi_i$ to be

$$w^{(on),i}(v, u, t) = \frac{2 \left| \mathcal{P}_{\Phi_i}^{(on)}(v \rightsquigarrow u) \right| \cdot I(v, t)}{\left| \mathcal{P}_{\Phi_i}^{(on)}(v \rightsquigarrow \cdot) \right| + \left| \mathcal{P}_{\Phi_i}^{(on)}(\cdot \rightsquigarrow u) \right|}, \quad (240)$$

where the binary function I(v, t) = 1 if v has been activated by topic t and 0 otherwise.

Similarly, based on an offline social meta path, e.g., Ωi, or a hybrid diffusion channel, e.g., Ψi, the amount of information on topic t propagating from employee v to u can be represented as follows respectively:

$$w^{(off),i}(v, u, t) = \frac{2 \left| \mathcal{P}_{\Omega_i}^{(off)}(v \rightsquigarrow u) \right| \cdot I(v, t)}{\left| \mathcal{P}_{\Omega_i}^{(off)}(v \rightsquigarrow \cdot) \right| + \left| \mathcal{P}_{\Omega_i}^{(off)}(\cdot \rightsquigarrow u) \right|}, \quad (241)$$

$$w^{(hyb),i}(v, u, t) = \frac{2 \left| \mathcal{P}_{\Psi_i}^{(hyb)}(v \rightsquigarrow u) \right| \cdot I(v, t)}{\left| \mathcal{P}_{\Psi_i}^{(hyb)}(v \rightsquigarrow \cdot) \right| + \left| \mathcal{P}_{\Psi_i}^{(hyb)}(\cdot \rightsquigarrow u) \right|}. \quad (242)$$

7.5.3 Channel Aggregation

Different diffusion channels deliver various amounts of information among employees via the online communications in the ESN and the offline contacts. In this subsection, we will focus on aggregating the information propagated via different channels with the information aggregation function $f(\cdot) : \mathbb{R}^{n \times 1} \to [0, 1]$, which maps the amount of information received by employees to their activation probabilities. Generally, any function that can map real numbers to probabilities in the range [0, 1] can be applied, and without loss of generality, we will use the logistic function $f(x) = \frac{e^x}{1 + e^x}$ [23] in this section.

Based on the information on topic t received by u via the online, offline and hybrid diffusion channels, we can represent u's activation probability as:

$$f\big(\mathbf{w}^{(on)}(\cdot, u, t), \mathbf{w}^{(off)}(\cdot, u, t), \mathbf{w}^{(hyb)}(\cdot, u, t)\big) \quad (243)$$

$$= \frac{e^{\,g(\mathbf{w}^{(on)}(\cdot, u, t)) + g(\mathbf{w}^{(off)}(\cdot, u, t)) + g(\mathbf{w}^{(hyb)}(\cdot, u, t)) + \theta_0}}{1 + e^{\,g(\mathbf{w}^{(on)}(\cdot, u, t)) + g(\mathbf{w}^{(off)}(\cdot, u, t)) + g(\mathbf{w}^{(hyb)}(\cdot, u, t)) + \theta_0}}, \quad (244)$$

where the function g(·) linearly combines the information in the different channels belonging to a certain source and θ0 denotes the weight of the constant factor. The terms $g(\mathbf{w}^{(on)}(\cdot, u, t))$, $g(\mathbf{w}^{(off)}(\cdot, u, t))$ and $g(\mathbf{w}^{(hyb)}(\cdot, u, t))$ can be represented as follows (a short illustrative sketch of the aggregation is given after the list below):

$$g(\mathbf{w}^{(on)}(\cdot, u, t)) = \sum_{i=1}^{k^{(on)}} \alpha_i \cdot \sum_{v \in \Gamma_{out}^{(on),i}(u)} w^{(on),i}(v, u, t), \quad (245)$$

$$g(\mathbf{w}^{(off)}(\cdot, u, t)) = \sum_{i=1}^{k^{(off)}} \beta_i \cdot \sum_{v \in \Gamma_{out}^{(off),i}(u)} w^{(off),i}(v, u, t), \quad (246)$$

$$g(\mathbf{w}^{(hyb)}(\cdot, u, t)) = \sum_{i=1}^{k^{(hyb)}} \gamma_i \cdot \sum_{v \in \Gamma_{out}^{(hyb),i}(u)} w^{(hyb),i}(v, u, t), \quad (247)$$

where $\alpha_i, \beta_i, \gamma_i$ are the weights of the different online, offline and hybrid diffusion channels respectively and $\sum_{i=1}^{k^{(on)}} \alpha_i + \sum_{i=1}^{k^{(off)}} \beta_i + \sum_{i=1}^{k^{(hyb)}} \gamma_i + \theta_0 = 1$. Depending on the roles of the different diffusion channels, the weights can be

• > 0, if positive information in the channel will increase employees' activation probability;

• = 0, if positive information in the channel will not change employees' activation probability;

• < 0, if positive information in the channel will decrease employees' activation probability.
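As referenced above, the following Python sketch (with hypothetical inputs) illustrates the aggregation of Equations (243)-(247): a weighted linear combination of the per-channel information amounts, squashed by the logistic function:

import math

def activation_probability(w_on, w_off, w_hyb, alpha, beta, gamma, theta0):
    # w_*: per-channel aggregated information amounts w^(.),i(., u, t);
    # alpha/beta/gamma: channel weights; theta0: constant offset
    y = (sum(a * w for a, w in zip(alpha, w_on))    # g(w^(on)),  Eq. (245)
         + sum(b * w for b, w in zip(beta, w_off))  # g(w^(off)), Eq. (246)
         + sum(c * w for c, w in zip(gamma, w_hyb)) # g(w^(hyb)), Eq. (247)
         + theta0)
    return math.exp(y) / (1.0 + math.exp(y))        # logistic aggregation f

print(activation_probability([0.5], [0.2], [0.1], [0.4], [0.3], [0.2], 0.1))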

In Muse, the weights of certain diffusion channels can be negative. As a result, the likelihood for a node to become active will no longer grow monotonically in the Muse diffusion model. The optimal weights of the different diffusion channels can be learned from the group participation log data (i.e., the target social activity diffusing at the workplace). Different diffusion channels will be ranked according to their importance, and the top-k diffusion channels which can increase individuals' activation probabilities will be selected in the next subsection.

7.5.4 Channel Weighting and Selection

In Yammer, users can create and join groups of interest, which can be about very diverse topics, e.g., products (e.g., iPhone, Windows, Android, etc.), people (e.g., Bill Gates, Leslie Lamport, etc.), projects (e.g., Project Complete, Meeting, etc.) and personal life issues (e.g., Diablo Games, Work Life Balance, etc.). The users' group participation log data can be represented as a set of tuples $\{(u, t)\}_{u,t}$, where tuple (u, t) represents that user u gets activated by topic t (of groups). Such a tuple set can be split into three parts according to the ratio 3:1:1 in the order of the timestamps, where 3 folds are used as the training set, 1 fold is used as the validation set and 1 fold as the test set. We will use the training set data to calculate the activation probabilities of individuals getting activated by topics in both the validation set and the test set, while the validation set is used to learn the weights of the different diffusion channels and the test set is used to evaluate the learned model.

Let $\mathcal{V} = \{(u, t)\}_{u,t}$ be the validation set. Based on the amount of information propagating among the employees in the workplace calculated with the training set, we can infer the probability of user u (who has not been activated yet) getting activated by topic t, for ∀(u, t) ∈ V, which can be represented with the matrix $\mathbf{F} \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{T}|}$, where F(i, j) denotes the inferred activation probability of tuple $(u_i, t_j)$ in the validation set. Meanwhile, based on the validation set itself, we can get the ground truth of users' group participation activities, which can be represented as a binary matrix $\mathbf{H} \in \{0, 1\}^{|\mathcal{U}| \times |\mathcal{T}|}$. In matrix H, only the entries corresponding to tuples in the validation set are filled with value 1, and the remaining entries are all filled with 0. The optimal weights of the information delivered in the different diffusion channels (i.e., α*, β*, γ*, θ0*) can be obtained by solving the following objective function:

$$\alpha^*, \beta^*, \gamma^*, \theta_0^* = \arg\min_{\alpha, \beta, \gamma, \theta_0} \|\mathbf{F} - \mathbf{H}\|_F^2 \quad (248)$$

$$s.t. \sum_{i=1}^{k^{(on)}} \alpha_i + \sum_{i=1}^{k^{(off)}} \beta_i + \sum_{i=1}^{k^{(hyb)}} \gamma_i + \theta_0 = 1. \quad (249)$$


The final objective function is not convex and can have multiple local optima, as the aggregation function (i.e., the logistic function) is not convex. Muse proposes to solve the objective function and handle the non-convexity issue with a two-stage process, to ensure the robustness of the learning process as much as possible.

(1) Firstly, the above objective function can be solved by using the method of Lagrange multipliers [8], where the corresponding Lagrangian function of the objective can be represented as

$$L(\alpha, \beta, \gamma, \theta_0, \eta) = \|\mathbf{F} - \mathbf{H}\|_F^2 + \eta \Big( \sum_{i=1}^{k^{(on)}} \alpha_i + \sum_{i=1}^{k^{(off)}} \beta_i + \sum_{i=1}^{k^{(hyb)}} \gamma_i + \theta_0 - 1 \Big) \quad (250\text{-}251)$$

$$= Tr(\mathbf{F}\mathbf{F}^\top - \mathbf{F}\mathbf{H}^\top - \mathbf{H}\mathbf{F}^\top + \mathbf{H}\mathbf{H}^\top) + \eta \Big( \sum_{i=1}^{k^{(on)}} \alpha_i + \sum_{i=1}^{k^{(off)}} \beta_i + \sum_{i=1}^{k^{(hyb)}} \gamma_i + \theta_0 - 1 \Big). \quad (252\text{-}253)$$

By taking the partial derivative of the Lagrangian function with respect to variable $\alpha_i, i \in \{1, 2, \cdots, k^{(on)}\}$, we can get

$$\frac{\partial L(\alpha, \beta, \gamma, \theta_0, \eta)}{\partial \alpha_i} = \frac{\partial Tr(\mathbf{F}\mathbf{F}^\top)}{\partial \alpha_i} - \frac{\partial Tr(\mathbf{F}\mathbf{H}^\top)}{\partial \alpha_i} - \frac{\partial Tr(\mathbf{H}\mathbf{F}^\top)}{\partial \alpha_i} + \frac{\partial Tr(\mathbf{H}\mathbf{H}^\top)}{\partial \alpha_i} + \frac{\partial \eta \big( \sum_{i} \alpha_i + \sum_{i} \beta_i + \sum_{i} \gamma_i + \theta_0 - 1 \big)}{\partial \alpha_i}. \quad (254\text{-}256)$$

Here,

$$\frac{\partial \eta \big( \sum_{i} \alpha_i + \sum_{i} \beta_i + \sum_{i} \gamma_i + \theta_0 - 1 \big)}{\partial \alpha_i} = \eta, \quad (257)$$

$$\frac{\partial Tr(\mathbf{F}\mathbf{F}^\top)}{\partial \alpha_i} = \sum_{j=1}^{|\mathcal{U}|} \sum_{l=1}^{|\mathcal{T}|} \frac{\partial \mathbf{F}^2(j, l)}{\partial \alpha_i} = \sum_{j=1}^{|\mathcal{U}|} \sum_{l=1}^{|\mathcal{T}|} \Big( 2 f\big(\mathbf{w}^{(on)}(\cdot, u_j, t_l), \mathbf{w}^{(off)}(\cdot, u_j, t_l), \mathbf{w}^{(hyb)}(\cdot, u_j, t_l)\big) \Big) \cdot \Big( \frac{e^y}{(1 + e^y)^2} \cdot \frac{\partial y}{\partial \alpha_i} \Big), \quad (258\text{-}259)$$

where the introduced term y denotes $y = g(\mathbf{w}^{(on)}(\cdot, u_j, t_l)) + g(\mathbf{w}^{(off)}(\cdot, u_j, t_l)) + g(\mathbf{w}^{(hyb)}(\cdot, u_j, t_l)) + \theta_0$ and its derivative is $\frac{\partial y}{\partial \alpha_i} = \frac{\partial g(\mathbf{w}^{(on)}(\cdot, u_j, t_l))}{\partial \alpha_i} = \sum_{v \in \Gamma_{out}^{(on),i}(u_j)} w^{(on),i}(v, u_j, t_l)$.

Similarly, we can obtain the terms $\frac{\partial Tr(\mathbf{F}\mathbf{H}^\top)}{\partial \alpha_i}$, $\frac{\partial Tr(\mathbf{H}\mathbf{F}^\top)}{\partial \alpha_i}$ and $\frac{\partial Tr(\mathbf{H}\mathbf{H}^\top)}{\partial \alpha_i}$. By setting $\frac{\partial L(\alpha, \beta, \gamma, \theta_0, \eta)}{\partial \alpha_i} = 0$, we can obtain an equation involving the variables $\alpha_i, \beta_i, \gamma_i, \theta_0$ and η. Furthermore, we can calculate the partial derivatives of the Lagrangian function with respect to the variables $\beta_i, \gamma_i, \theta_0$ and η respectively and set them to 0, which will lead to an equation group about these variables. The equation group can be solved effectively with open source toolkits, e.g., the SciPy Nonlinear Solver6. By giving the variables different initial values, multiple solutions (i.e., multiple local optima) can be obtained by repeatedly solving the objective function.

(2) Secondly, the local optimal points obtained are further plugged into the objective function, and the one achieving the lowest objective function value is selected as the final result (i.e., the weights of the different channels).

According to the learned weights, the different diffusion channels can be ranked according to their importance in delivering information to activate employees in the workplace.

6 http://docs.scipy.org/doc/scipy-0.14.0/reference/optimize.nonlin.html

Considering that some diffusion channels may not perform very well in information propagation (e.g., those with negative or zero learned weights), the top-k channels that can increase employees' activation probabilities are finally selected as the effective channels used in the Muse model. In other words, k equals the number of diffusion channels with positive weights learnt from the above objective function. Such a process is formally called diffusion channel weighting and selection in this section. The rationale of channel weighting and selection is that, among all the diffusion channels, some channels can be useful while others may not be; three different sets of diffusion channels were introduced in the previous subsections, and we want to select the good ones.

8. NETWORK EMBEDDING

In the era of big data, information from diverse disciplines is generated at an extremely fast pace, much of which is highly structured and can be represented as massive and complex networks. Representative examples include online social networks, like Facebook and Twitter, academic retrieval sites, like DBLP and Google Scholar, as well as bio-medical data, e.g., human brain networks. These networks/graphs are usually very challenging to handle due to their extremely large scale (involving millions or even billions of nodes), complex structures (containing heterogeneous links) as well as the diverse attributes (attached to the nodes or links). For instance, the Facebook social network involves more than 1 billion active users; DBLP contains about 2.8 million papers; and the human brain has more than 16 billion neurons.

Great challenges exist when handling these network-structured data with traditional machine learning algorithms, which usually take feature vector representations as the input. A general representation of heterogeneous networks as feature vectors is desired for knowledge discovery from such complex network-structured data. In recent years, many research works have proposed to embed the online social network data into a lower-dimensional feature space, in which each user node is represented as a unique feature vector, and the network structure can be reconstructed from these feature vectors. With the embedded feature vectors, classic machine learning models can be applied to deal with the social network data directly, and the storage space can be greatly reduced.

In this section, we will talk about the network embedding problem, which aims at projecting the nodes and links of the network data into low-dimensional feature spaces. Depending on the application setting, existing graph embedding works can be categorized into the embedding of homogeneous networks, heterogeneous networks, and multiple aligned heterogeneous networks. Meanwhile, depending on the models being applied, current embedding works can be divided into matrix factorization based embedding, translation based embedding, and deep learning architecture based embedding.

In the following parts of this section, we will first introduce the translation based graph embedding models in Section 8.1, which are mainly proposed for multi-relational knowledge graphs, including TransE [11], TransH [109] and TransR [62]. After that, in Section 8.2, we will introduce three homogeneous network embedding models, including DeepWalk [78], LINE [102] and node2vec [35]. Two embedding models for heterogeneous networks will be introduced in Section 8.3, which project the nodes to feature vectors based on the heterogeneous information inside the networks [14; 16]. Finally, we will talk about the model proposed for multiple aligned heterogeneous networks [135] in Section 8.4, where the anchor links are utilized to transfer information across different sites for mutual refinement of the embedding results synergistically.

8.1 Relation Translation based Graph Entity Embedding

Multi-relational data refers to directed graphs whose nodes correspond to entities and whose links denote relationships. The multi-relational data can be represented as a graph G = (V, E), where V denotes the node set and E represents the link set. For a link in the graph, e.g., r = (h, t) ∈ E, the corresponding entity-relation can be represented as a triple (h, r, t), where h denotes the link initiator entity, t denotes the link recipient entity and r represents the link. The embedding problem studied in this section is to learn a feature representation of both the entities and relations in the triples, i.e., h, r and t.

Model TransE is the initial translation based embedding work, which projects the entities and relations into a common feature space. TransH improves TransE by considering the link cardinality constraints in the embedding process, while achieving comparable time complexity. In real-world multi-relational networks, the entities can have multiple aspects, and different relations can express different aspects of an entity; model TransR therefore proposes to build the entity and relation embeddings in separate entity and relation spaces instead. Next, we will introduce the embedding models TransE, TransH and TransR one by one. In these models, the relation acts like a translation of the entities in the embedding space, which is the reason why they are called translation based embedding models.

8.1.1 TransE

The TransE [11] model is an energy-based model for learning low-dimensional embeddings of entities and relations, where the relations are represented as translations of the entities in the embedding space. Given an entity-relation triple (h, r, t), the embedding feature representations of the entities and relations can be represented as vectors $\mathbf{h} \in \mathbb{R}^k$, $\mathbf{r} \in \mathbb{R}^k$ and $\mathbf{t} \in \mathbb{R}^k$ (k denotes the objective vector dimension). If the triple (h, r, t) holds, i.e., there exists a link r starting from h to t in the network, the corresponding embedding vector h + r should be as close to the vector t as possible.

Let $\mathcal{S}^+ = \{(h, r, t)\}_{r=(h,t) \in \mathcal{E}}$ represent the set of positive training data, which contains the triples existing in the network. The TransE model aims at learning the embedding feature vectors of the entities h, t and the relation r, i.e., h, r and t. For the triples in the positive training set, we want to ensure the learnt embedding vector h + r is very close to t. Let d(h + r, t) denote the distance between the vectors h + r and t. The loss introduced for the triples in the positive training set can be represented as

$$L(\mathcal{S}^+) = \sum_{(h,r,t) \in \mathcal{S}^+} d(\mathbf{h} + \mathbf{r}, \mathbf{t}). \quad (260)$$

Here the distance function can be defined in different ways, like the L2 norm of the difference between the vectors h + r and t, i.e.,

$$d(\mathbf{h} + \mathbf{r}, \mathbf{t}) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2. \quad (261)$$

By minimizing the above loss function, the optimal feature representations of the entities and relations can be learnt. To avoid trivial solutions, like all 0s for h, r and t, additional constraints that the L2-norms of the entity embedding vectors should be 1 are added to the function. Furthermore, a negative training set is also sampled to differentiate the learnt embedding vectors. For a triple $(h, r, t) \in \mathcal{S}^+$, the corresponding sampled negative training set can be denoted as $\mathcal{S}^-_{(h,r,t)}$, which contains the triples formed by replacing the initiator entity h or the recipient entity t with random entities. In other words, the negative training set $\mathcal{S}^-_{(h,r,t)}$ can be represented as

$$\mathcal{S}^-_{(h,r,t)} = \{(h', r, t) \,|\, h' \in \mathcal{V}\} \cup \{(h, r, t') \,|\, t' \in \mathcal{V}\}. \quad (262)$$

The loss function involving both the positive and negative training sets can be represented as

$$L(\mathcal{S}^+, \mathcal{S}^-) = \sum_{(h,r,t) \in \mathcal{S}^+} \sum_{(h',r,t') \in \mathcal{S}^-_{(h,r,t)}} \max\big(\gamma + d(\mathbf{h} + \mathbf{r}, \mathbf{t}) - d(\mathbf{h}' + \mathbf{r}, \mathbf{t}'), 0\big), \quad (263\text{-}264)$$

where γ is a margin hyperparameter and max(·, 0) counts the positive losses only.

The optimization is carried out by stochastic gradient descent (in minibatch mode). The embedding vectors of the entities and relations are initialized with a random procedure. At each iteration of the algorithm, the embedding vectors of the entities are normalized, and a small set of triples is sampled from the training set, which will serve as the training triples of the minibatch. The parameters are then updated by taking a gradient step with a constant learning rate.
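The following numpy sketch (an illustrative re-implementation, not the released TransE code) performs one such gradient step on a positive triple and a corrupted one, using the L2 distance of Equation (261) and the hinge loss of Equation (263):

import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, k, gamma, lr = 5, 2, 8, 1.0, 0.01
E = rng.normal(size=(n_ent, k))   # entity embeddings
R = rng.normal(size=(n_rel, k))   # relation embeddings

def transe_step(h, r, t, t_neg):
    global E
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # normalize entity vectors
    d_pos = E[h] + R[r] - E[t]
    d_neg = E[h] + R[r] - E[t_neg]                  # corrupted tail
    if gamma + np.linalg.norm(d_pos) - np.linalg.norm(d_neg) > 0:  # hinge active
        g_pos = d_pos / (np.linalg.norm(d_pos) + 1e-12)  # gradient of ||d_pos||
        g_neg = d_neg / (np.linalg.norm(d_neg) + 1e-12)  # gradient of ||d_neg||
        E[h] -= lr * (g_pos - g_neg)
        R[r] -= lr * (g_pos - g_neg)
        E[t] += lr * g_pos
        E[t_neg] -= lr * g_neg

transe_step(h=0, r=1, t=2, t_neg=3)
print(np.linalg.norm(E[0] + R[1] - E[2]))  # positive-triple distance after the step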

8.1.2 TransH

TransE is a promising recently proposed method, which is very efficient while achieving state-of-the-art predictive performance. However, in the embedding process, TransE fails to consider the cardinality constraints on the relations, like one-to-one, one-to-many and many-to-many. The TransH model [109] to be introduced in this part considers such properties of relations in the embedding process. Furthermore, different from other complex models, which can handle these properties but sacrifice efficiency, TransH achieves time complexity comparable to TransE. TransH models a relation as a hyperplane together with a translation operation on it, where the correlations among the entities can be effectively preserved.

In TransH, different from the embedding space of the entities, a relation, e.g., r, is denoted as a translation vector $\mathbf{d}_r$ in the hyperplane with normal vector $\mathbf{w}_r$. For each triple (h, r, t), the embedding vectors h and t are first projected onto the hyperplane of $\mathbf{w}_r$, and the corresponding projected vectors can be represented as $\mathbf{h}_\perp$ and $\mathbf{t}_\perp$ respectively. The vectors $\mathbf{h}_\perp$ and $\mathbf{t}_\perp$ can be connected by the translation vector $\mathbf{d}_r$ on the hyperplane. Depending on whether the triple appears in the positive or the negative training set, the distance $d(\mathbf{h}_\perp + \mathbf{d}_r, \mathbf{t}_\perp)$ should be either minimized or maximized.

Formally, given the hyperplane normal vector $\mathbf{w}_r$, the projected vectors $\mathbf{h}_\perp$ and $\mathbf{t}_\perp$ can be represented as

$$\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r, \quad (265)$$

$$\mathbf{t}_\perp = \mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \mathbf{w}_r. \quad (266)$$

Furthermore, the L2 norm based distance function can be represented as

$$d(\mathbf{h}_\perp + \mathbf{d}_r, \mathbf{t}_\perp) = \left\| (\mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r) + \mathbf{d}_r - (\mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \mathbf{w}_r) \right\|_2^2. \quad (267)$$
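The projection and distance of Equations (265)-(267) are straightforward to compute; the following is a minimal numpy sketch, assuming w_r is already a unit normal vector:

import numpy as np

def transh_distance(h, t, w_r, d_r):
    h_perp = h - (w_r @ h) * w_r   # project h onto the hyperplane of w_r, Eq. (265)
    t_perp = t - (w_r @ t) * w_r   # project t onto the hyperplane of w_r, Eq. (266)
    return float(np.sum((h_perp + d_r - t_perp) ** 2))  # Eq. (267)

w_r = np.array([1.0, 0.0])         # unit normal vector of the relation hyperplane
d_r = np.array([0.0, 0.5])         # translation vector lying in the hyperplane
print(transh_distance(np.array([0.3, 0.4]), np.array([0.1, 0.9]), w_r, d_r))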

The variables to be learnt in the TransH model include the embedding vectors of all the entities, as well as the hyperplane normal vector and translation vector of each relation. To learn these variables simultaneously, the objective function of TransH can be represented as

$$L(\mathcal{S}^+, \mathcal{S}^-) = \sum_{(h,r,t) \in \mathcal{S}^+} \sum_{(h',r',t') \in \mathcal{S}^-_{(h,r,t)}} \max\big(\gamma + d(\mathbf{h}_\perp + \mathbf{d}_r, \mathbf{t}_\perp) - d(\mathbf{h}'_\perp + \mathbf{d}'_r, \mathbf{t}'_\perp), 0\big), \quad (268\text{-}269)$$

where $\mathcal{S}^-_{(h,r,t)}$ denotes the negative set constructed for the triple (h, r, t). Different from TransE, TransH applies a different strategy to sample the negative training triples, taking the relation cardinality constraints into consideration. For one-to-many relations, TransH will give more chance to replacing the initiator node; and for many-to-one relations, TransH will give more chance to replacing the recipient node instead.

Besides the loss function, the variables to be learnt are subject to some constraints: the entity embedding vectors should have norm at most 1; w_r and d_r should be (approximately) orthogonal; and w_r itself should be a normalized normal vector. We summarize the constraints of the TransH model as follows:

‖h‖_2 ≤ 1, ‖t‖_2 ≤ 1, ∀ h, t ∈ V, (270)
|w_r^⊤ d_r| / ‖d_r‖_2 ≤ ε, ∀ r ∈ E, (271)
‖w_r‖_2 ≤ 1, ∀ r ∈ E. (272)

These constraints can be relaxed into penalty terms, which are added to the objective function with relatively large weights. The final objective function can then be optimized with stochastic gradient descent; by minimizing the loss function, the model variables are learned and the final embedding results are obtained.
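As a quick illustration of the projection step, the following minimal Python sketch computes the TransH distance of Equations 265-267 for a single triple. The function name and the random toy vectors are illustrative assumptions rather than the authors' implementation.

import numpy as np

def transh_distance(h, t, w_r, d_r):
    """Project h and t onto the hyperplane with normal w_r, then translate by d_r."""
    w_r = w_r / np.linalg.norm(w_r)                # enforce the unit-norm constraint
    h_perp = h - (w_r @ h) * w_r                   # Equation 265
    t_perp = t - (w_r @ t) * w_r                   # Equation 266
    return np.sum((h_perp + d_r - t_perp) ** 2)    # Equation 267

rng = np.random.default_rng(0)
h, t = rng.normal(size=50), rng.normal(size=50)    # toy entity embeddings
w_r, d_r = rng.normal(size=50), rng.normal(size=50)
print(transh_distance(h, t, w_r, d_r))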

8.1.3 TransR

Both TransE and TransH introduced in the previous subsections assume that the embeddings of entities and relations live in the same space R^k. However, entities and relations are actually totally different objects, and they may not be representable in a common semantic space. To address such a problem, TransR [62] is proposed, which models the entities and relations in distinct spaces, i.e., the entity space and the relation space, and performs the translation in the relation space.

In TransR, given a triple (h, r, t), the entities h and t are embedded as vectors h, t ∈ R^{k_e}, and the relation r is embedded as vector r ∈ R^{k_r}, where the dimensions of the entity space and the relation space are not necessarily the same, i.e., k_e ≠ k_r. To project the entities from the entity space to the relation space, a projection matrix M_r ∈ R^{k_e×k_r} is defined in TransR. With the projection matrix, the projected entity embedding vectors can be defined as

h_r = h M_r, (273)
t_r = t M_r. (274)

The loss function is defined as

d(h_r + r, t_r) = ‖h_r + r − t_r‖_2^2. (275)

The constraints involved in TransR include

‖h‖_2 = 1, ‖t‖_2 = 1, ∀ h, t ∈ V, (276)
‖h M_r‖_2 = 1, ‖t M_r‖_2 = 1, ∀ h, t ∈ V, (277)
‖r‖_2 ≤ 1, ∀ r ∈ E. (278)

The negative training set S^− in TransR can be obtained in a similar way as in TransH, and the variables can likewise be learnt with stochastic gradient descent. We omit the details here to avoid content duplication.

8.2 Homogeneous Network Embedding

Besides the translation based network embedding models, in this section we will introduce three embedding models for network data: DeepWalk, LINE and node2vec. Formally, the networks studied in this part are all homogeneous networks, represented as G = (V, E), where set V denotes the set of nodes in the homogeneous network and E represents the set of links among the nodes inside the network.

8.2.1 DeepWalk

The DeepWalk [78] algorithm consists of two main components: (1) a random walk generator, and (2) an update procedure. In the first step, the DeepWalk model randomly selects a node, e.g., u ∈ V, as the root of a random walk W_u. Random walk W_u samples uniformly from the neighbors of the node last visited, until the maximum length l is met. In the second step, the sampled walks are used to update the representations of the nodes inside the graph, where SkipGram [69] is applied.

The pseudo code of the DeepWalk algorithm is available in Algorithm 12, which illustrates the general architecture of the algorithm. In the algorithm, line 1 initializes the representation matrix X for all the nodes, and line 2 builds a binary tree involving all the nodes in the network as the leaves, which will be introduced in more detail later in this subsection. Lines 3-9 denote the main part of the DeepWalk algorithm, where a random walk starting at each node is generated for γ rounds by calling function WalkGenerator. For each node u, a random walk W_u is generated whose length is bounded by parameter l. The random walk is then applied to update the node representations with the SkipGram function introduced later in this subsection.

Algorithm 12 DeepWalk
Input: homogeneous network G = (V, E); window size s; embedding size d; walk length l; walks per node γ
Output: matrix of node representations X ∈ R^{|V|×d}
1: Initialize X with random values following the uniform distribution
2: Build a binary tree T from node set V
3: for round i = 1 to γ do
4:   O = shuffle(V)
5:   for node u ∈ O do
6:     W_u = WalkGenerator(G, u, l)
7:     SkipGram(X, W_u, s)
8:   end for
9: end for
10: Return X

Random Walk Generator

The random walk model has been introduced in Section 5.1.1. Formally, the random walk starting at node u ∈ V can be represented as W_u, which denotes a stochastic process with random states W_u^0, W_u^1, ..., W_u^k. At the very beginning, i.e., step 0, the random walk is at the initial node, i.e., W_u^0 = u. The state variable W_u^k denotes the node where the walk resides at step k.

Random walks can capture the local network structures effectively, where the neighborhood and the closeness of social connections affect the nodes that the walk will move to in the next step. Therefore, DeepWalk applies random walks to sample a stream of short node sequences as the tool for extracting information from a network. Besides the ability to capture local community structures, random walks provide two other very desirable properties. First, the random walk based local exploration is easy to parallelize: several random walks can simultaneously explore different parts of the same network in different threads, processes or machines. Second, with the information obtained from short random walks, it is possible to accommodate small changes in the network structure without the need for global recomputation.

SkipGram Technique

The updating procedure used in DeepWalk is very similar to the word appearance prediction problem in language modeling. In this part, we first provide some basic knowledge about the language modeling problem, and then introduce the SkipGram technique.

Formally, the objective of language modeling is to estimate the likelihood of a specific sequence of words appearing in a corpus. More specifically, given a sequence of words (w_1, w_2, ..., w_{n−1}), where word w_i ∈ V (V denotes the vocabulary), the word appearance prediction problem aims at inferring the word w_n that will appear next. An intuitive idea is to maximize the estimated likelihood of the next word w_n given w_1, w_2, ..., w_{n−1}, and the problem can be formally represented as

w*_n = arg max_{w_n∈V} P(w_n | w_1, w_2, ..., w_{n−1}), (279)

where term P(w_n | w_1, w_2, ..., w_{n−1}) denotes the conditional probability of having w_n follow the observed word sequence w_1, w_2, ..., w_{n−1}.

Meanwhile, in neural language models, each word has a latent vector representation, like x_{w_i} ∈ R^{d×1} for word w_i ∈ V. Furthermore, the computation of the above conditional probability becomes very challenging as the observed word sequence grows longer, i.e., as n gets large. Therefore, a window is used to limit the length of the word sequence in the probability computation, where s denotes the size of the window. The above objective function can thus be rewritten as

w*_n = arg max_{w_n∈V} P(w_n | x_{w_{n−s}}, x_{w_{n−s+1}}, ..., x_{w_{n−1}}). (280)

A recent relaxation to the above problem in language modeling turns the prediction problem on its head. Three big changes are applied to the model: (1) instead of predicting the objective word from the context, the relaxation predicts the context from the objective word; (2) the context denotes the words appearing both before and after the objective word, limited by the window size s; and (3) the order of words is removed, so the context denotes a set of words instead. Formally, the objective function can be rewritten as

w*_n = arg max_{w_n∈V} P({w_{n−s}, w_{n−s+1}, ..., w_{n+s}} \ {w_n} | x_{w_n}). (281)

SkipGram is a language model that maximizes the co-occurrence probability of the words appearing within the window of size s in a sentence. When applying the SkipGram technique to the DeepWalk model, the nodes u ∈ V in the network are regarded as the words w denoted in the aforementioned equations. Meanwhile, the nodes sampled by the random walk model within the window of size s before and after node v are treated as the words appearing ahead of and after v. Furthermore, SkipGram assumes the appearances of the words (or nodes for networks) to be independent, and the above probability equation can be rewritten as follows:

P({u_{n−s}, u_{n−s+1}, ..., u_{n+s}} \ {u_n} | x_{u_n}) = ∏_{i=n−s, i≠n}^{n+s} P(u_i | x_{u_n}), (282)

where u_{n−s}, u_{n−s+1}, ..., u_{n+s} denotes the sequence of nodes sampled by the random walk model.

The learning process of the SkipGram algorithm is provided in Algorithm 13, where we enumerate all the co-locations of nodes in the node sequence u_{n−s}, u_{n−s+1}, ..., u_{n+s} sampled by a random walk W_u (starting from node u in the network). The node representations are updated together with their neighbors' representations by stochastic gradient descent, and the derivatives are estimated with the back-propagation algorithm. However, the updates require the conditional probabilities of the nodes given the representations, and a concrete representation of this probability is a challenging problem. As proposed in [69], such a distribution can be learnt with some existing models, like logistic regression. However, since the labels used here denote the nodes in the network, this leads to a very large label space with |V| different labels, which renders the learning process extremely time consuming. To solve such a problem, some techniques, like hierarchical softmax, have been proposed, which represent the nodes in the network as a binary tree and can lower the time complexity of the probability computation from O(|V|) to O(log |V|).

Algorithm 13 SkipGram
Input: representations of nodes X; random walk W_u starting from node u; window size s
Output: updated matrix of node representations X
1: for each node u_i ∈ W_u do
2:   Extract from W_u the sequence of nodes before and after u_i bounded by window size s: (u_{i−s}, ..., u_{i+s})
3:   for each node u_j ∈ (u_{i−s}, ..., u_{i+s}) do
4:     J(X) = − log P(u_j | x_{u_i})
5:     X = X − α ∂J(X)/∂X
6:   end for
7: end for
8: Return X

Hierarchical Softmax

In the SkipGram algorithm, directly calculating the probability P(u_i | x_{u_n}) is infeasible for large networks. Therefore, in the DeepWalk model, hierarchical softmax is used to factorize the conditional probability. In hierarchical softmax, a binary tree is constructed, where the number of leaves equals the size of the network node set and each network node is assigned to a leaf. The prediction problem is then turned into a path probability maximization problem: if a path (b_0, b_1, ..., b_{⌈log |V|⌉}) is identified from the tree root to the leaf of node u_i, i.e., b_0 = root and b_{⌈log |V|⌉} = u_i, then the probability can be rewritten as

P(u_i | x_{u_n}) = ∏_{l=1}^{⌈log |V|⌉} P(b_l | x_{u_n}), (283)

where P(b_l | x_{u_n}) can be modeled by a binary classifier denoted as

P(b_l | x_{u_n}) = 1 / (1 + e^{−x_{b_l} · x_{u_n}}). (284)

Here, the parameters involved in the learning process include the representations of both the nodes in the network and the internal nodes of the constructed binary tree.
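As an illustration of the overall pipeline, the following minimal Python sketch generates truncated random walks over a toy graph and feeds them to gensim's Word2Vec implementation, whose sg=1 and hs=1 options select the skip-gram architecture with hierarchical softmax. The toy graph and all parameter values are illustrative assumptions; this is not the authors' implementation.

import random
from gensim.models import Word2Vec

# A toy homogeneous network as an adjacency list.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def walk_generator(graph, u, l):
    """Sample a random walk of length l from u, choosing neighbors uniformly."""
    walk = [u]
    while len(walk) < l:
        walk.append(random.choice(graph[walk[-1]]))
    return walk

gamma, l, s, d = 10, 8, 3, 16                      # walks/node, length, window, size
walks = [walk_generator(graph, u, l)
         for _ in range(gamma)
         for u in random.sample(list(graph), len(graph))]  # shuffle(V) each round

# sg=1 selects the skip-gram architecture; hs=1 selects hierarchical softmax,
# which builds the binary tree over the node "vocabulary".
model = Word2Vec(walks, vector_size=d, window=s, sg=1, hs=1, min_count=0)
print(model.wv["a"])                               # learned representation of node a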

8.2.2 LINE

To handle real-world information networks, embedding models need to meet several requirements: (1) preserve the first-order and second-order proximity between the nodes, (2) scale to networks of large size, and (3) handle networks with different types of links: directed and undirected, weighted and unweighted. In this part, we will introduce another homogeneous network embedding model, named LINE [102].

First-order Proximity

In the network embedding process, the network structure should be effectively preserved, where node closeness is defined via the node proximity concept in LINE. The first-order proximity in a network denotes the local pairwise proximity between nodes. For a link (u, v) ∈ E in the network, the first-order proximity between u and v is the weight of link (u, v) (or 1 if the network is unweighted). Meanwhile, if link (u, v) doesn't exist in the network, the first-order proximity between them is 0 instead. To model the first-order proximity, for a given link (u, v) ∈ E in the network G, LINE defines the joint probability between nodes u and v as

p_1(u, v) = 1 / (1 + e^{−x_u^⊤ x_v}), (285)

where x_u, x_v ∈ R^d denote the vector representations of nodes u and v respectively.

Function p_1(·, ·) defines the proximity distribution over the space V × V. Meanwhile, given a network G, the empirical proximity between nodes u and v can be denoted as

p̂_1(u, v) = w_{(u,v)} / ∑_{(u′,v′)∈E} w_{(u′,v′)}. (286)

To preserve the first-order proximity, LINE defines the objective function for the network embedding as

J_1 = d(p_1(·, ·), p̂_1(·, ·)), (287)

where function d(·, ·) denotes the distance between the introduced proximity distribution and the empirical proximity distribution. By replacing the distance function d(·, ·) with the KL-divergence and omitting some constants, the objective function can be rewritten as

J_1 = −∑_{(u,v)∈E} w_{(u,v)} log p_1(u, v). (288)

By minimizing this objective function, LINE can learn the feature representation x_u for each node u ∈ V in the network.

Second-order Proximity

In real-world social networks, the links among nodes can be very sparse, and the first-order proximity alone can hardly preserve the complete structure information of the network. LINE therefore introduces the concept of second-order proximity, which denotes the similarity between the neighborhood structures of nodes: given a user pair (u, v) in the network, the more common neighbors they share, the closer users u and v are in the network. Besides the original representation x_u for node u ∈ V, each node is also associated with a feature vector representing its context in the network, denoted as y_u ∈ R^d. Formally, for a given link (u, v) ∈ E, the probability of context y_v being generated by node u can be represented as

p_2(v | u) = e^{x_u^⊤ y_v} / ∑_{v′∈V} e^{x_u^⊤ y_{v′}}. (289)

Slightly different from the first-order case, the second-order empirical proximity is denoted as

p̂_2(v | u) = w_{(u,v)} / D(u), (290)

where D(u) denotes the (weighted) out-degree of node u.

By minimizing the difference between the introduced proximity distribution and the empirical proximity distribution, the objective function for the second-order proximity can be represented as

J_2 = ∑_{u∈V} λ_u d(p_2(· | u), p̂_2(· | u)), (291)

where λ_u denotes the prestige of node u in the network. Here, by replacing the distance function d(·, ·) with the KL-divergence and setting λ_u = D(u), the second-order proximity based objective function can be represented as

J_2 = −∑_{(u,v)∈E} w_{(u,v)} log p_2(v | u). (292)

Model Optimization

Instead of combining the first-order proximity and the second-order proximity into a joint optimization function, LINE learns the embedding vectors based on Equations 288 and 292 separately; the two results are then concatenated to obtain the final embedding vectors.

In optimizing objective function 292, LINE needs to calculate the conditional probability p_2(· | u) for all nodes u ∈ V in the network, which is computationally infeasible. To solve the problem, LINE adopts the negative sampling approach instead: for each link (u, v) ∈ E, LINE samples a set of negative links according to some noise distribution.

Formally, for link (u, v) ∈ E, the set of negative links sampled for it can be represented as L^−_{(u,v)} ⊂ V × V. The objective function defined for link (u, v) can be represented as

log σ(y_v^⊤ x_u) + ∑_{(u,v′)∈L^−_{(u,v)}} log σ(−y_{v′}^⊤ x_u), (293)

where σ(·) is the sigmoid function. The first term in the above equation models the observed link, and the second term models the negative links drawn from the noise distribution. A similar approach can also be applied to the objective function in Equation 288. The new objective function can be solved with the asynchronous stochastic gradient algorithm (ASGD), which samples a mini-batch of links and then updates the parameters.

8.2.3 node2vec

In LINE, the closeness among nodes in the network is preserved based on either the first-order or the second-order proximity. In a more recent work, node2vec [35], the authors propose to preserve the proximity between each node and a sampled set of neighboring nodes in the network.

node2vec Framework

The node2vec model is based on the SkipGram model in language modeling, and the objective function of node2vec can be formally represented as

max ∑_{u∈V} log P(Γ(u) | x_u), (294)

where x_u denotes the latent feature vector learnt for node u, and Γ(u) represents the neighborhood set of node u in the network.

To make the problem solvable, some assumptions are made to approximate the objective function with a simpler form.

• Conditional Independence Assumption: Given the latent feature vector x_u of node u, by assuming the observations of the nodes in set Γ(u) to be independent, the probability equation can be rewritten as

P(Γ(u) | x_u) = ∏_{v∈Γ(u)} P(v | x_u). (295)

• Symmetric Node Effect: Furthermore, by assuming that the source and neighbor nodes have a symmetric effect on each other in the feature space, the conditional probability P(v | x_u) can be rewritten as

P(v | x_u) = e^{x_v^⊤ x_u} / ∑_{v′∈V} e^{x_{v′}^⊤ x_u}. (296)

Therefore, the objective function can be simplified as

max_X ∑_{u∈V} [ −log Z_u + ∑_{v′∈Γ(u)} x_{v′}^⊤ x_u ], (297)

where Z_u = ∑_{v′∈V} e^{x_{v′}^⊤ x_u}. Term Z_u differs across the nodes u ∈ V and is expensive to compute for large networks, so node2vec applies the negative sampling technique instead. The main issue discussed in node2vec is how to sample the neighborhood set Γ(u) from the network.

BFS and DFS

In SkipGram, the neighborhood set Γ(u) denotes the direct neighbors of u in the network, i.e., the first-order proximity of the local network structure. Besides the local structure, node2vec can also capture other network structures with set Γ(u), depending on the sampling strategy being applied. To fairly compare different sampling strategies, the neighborhood set Γ(u) is usually limited to size k, i.e., |Γ(u)| = k. Two extreme sampling strategies for the neighborhood set Γ(u) are

• BFS: BFS first samples the nodes directly connected to node u and involves them in the neighborhood set Γ(u), and then goes to the second layer, where the nodes are two hops away from u in the network, until the size k is met. Generally, the Γ(u) sampled via BFS can sufficiently characterize the local neighborhood structure of the network. The node2vec model learnt based on the BFS sampling strategy provides a micro-view of the network structure.

• DFS: DFS samples the nodes which are sequentially reachable from u at increasing distances and involves them in the neighborhood set Γ(u). In DFS, the sampled nodes reflect a more global neighborhood of the network. The node2vec model learnt based on the DFS sampling strategy provides a macro-view of the network neighborhood structure, which can be essential for inferring communities based on homophily.

However, both the BFS and DFS sampling strategies may suffer from some shortcomings. For BFS, only a small proportion of the network surrounding node u is explored in the sampling. Meanwhile, for DFS, the sampled nodes far away from the source node u tend to involve complex dependency relationships.

Random Walk based Search

To overcome the shortcomings of BFS and DFS, node2vec proposes to apply a random walk to sample the neighborhood set Γ(u) instead. Given a random walk W, the node W resides at in step i can be represented as variable s_i ∈ V. The complete sequence of nodes that W has resided at can be represented as s_0, s_1, ..., s_k, where s_0 denotes the initial node starting the walk. The transition probability from node u to v at the ith step of W can be represented as

P(s_i = v | s_{i−1} = u) =
  w_{(u,v)}, if (u, v) ∈ E,
  0, otherwise, (298)

where w_{(u,v)} denotes the normalized weight of link (u, v) in the network (w_{(u,v)} = 1 if the network is unweighted).

The traditional random walk model doesn't take the network structure into account and can hardly explore different network neighborhoods. node2vec adapts the random walk model and introduces a second-order random walk with parameters p and q, which guide the walk. In node2vec, let's assume the walk has just traversed link (t, u) and can go to node v in the next step. Formally, the transition weight of link (u, v) is adjusted with parameter α_{p,q}(t, v), i.e., it becomes α_{p,q}(t, v) · w_{(u,v)}, where

α_{p,q}(t, v) =
  1/p, if d_{t,v} = 0,
  1, if d_{t,v} = 1,
  1/q, if d_{t,v} = 2, (299)

where d_{t,v} denotes the shortest distance between nodes t and v in the network. Since the walk can go from t to u, and then from u to v, the distance from t to v will be at most 2.

Parameters p and q effectively control the walk transition sequence, where parameter p is called the return parameter and q is called the in-out parameter in node2vec.

• Return Parameter p: In the case that d_{t,v} = 0, i.e., t = v, the adjusting parameter 1/p controls the chance of returning to node t. By assigning p a large value, the random walk will have a lower chance to go back to the node t that it has just visited. Meanwhile, by assigning p a small value, the random walk will backtrack a step and keep exploring locally around the nodes it has already visited.

• In-out Parameter q: In the case that d_{t,v} = 2, nodes t and v are not directly connected but are reachable via the intermediate node u. Therefore, parameter q controls the chance of exploring structures that are far away from the visited nodes. If q > 1, the random walk is biased to explore nodes that are closer to t, since 1/q is smaller than the weight of visiting nodes with d_{t,v} = 1. Meanwhile, if q < 1, the random walk will be inclined to visit nodes that are farther away from t in the network instead.

8.3 Heterogeneous Network Embedding

The embedding models introduced in the previous section are proposed for homogeneous networks and encounter great challenges when applied to heterogeneous networks. In this section, we will introduce the recent development of embedding methods for heterogeneous networks, including HNE (Heterogeneous Information Network Embedding) [14], Path-Augmented Heterogeneous Network Embedding [16], and HEBE (HyperEdge Based Embedding) [36].

8.3.1 HNE: Heterogeneous Information Network Embedding

Generally, the data available in online social networks doesn't exist in isolation, and different types of data may co-exist simultaneously. For instance, the posts and articles written by users online may contain both text and images. The co-existence interactions of text and images in the same articles can be formed either explicitly or implicitly by the linkages between text and images. Meanwhile, there also exist correlations among the text data as well as among the image data, due to the hyperlinks among the texts and the common tags/categories shared by different images. The HNE [14] model is proposed to embed such a heterogeneous information network involving text and images.

Terminology Definition and Problem Formulation

The network studied in HNE involves both text and images, and can be represented as the Text-Image Heterogeneous Information Network as follows:

Definition 50. (Text-Image Heterogeneous Information Network): Let G = (V, E) denote the heterogeneous information network involving text and images as the nodes, as well as diverse categories of links among them. Formally, the node set V can be decomposed into two disjoint subsets V = T ∪ I, where T denotes the text set and I represents the image set. Meanwhile, among the texts, among the images, as well as between texts and images, there may exist different kinds of connections, which can be denoted as the sets E_{T,T}, E_{I,I}, and E_{T,I} respectively in the link set E.

Furthermore, the text and image nodes are also associated with unique content information. For instance, each image i_k ∈ I can be represented as a tensor X_k ∈ R^{d_I×d_I×3}, where d_I denotes the dimension of the image in the RGB color space. Meanwhile, each text t_k ∈ T can be represented as a raw feature vector z_k ∈ R^{d_T}, where d_T denotes the dimension of the text represented as a bag-of-words vector normalized by TF-IDF. For the images in set I, the connections among them can be represented as matrix A_{I,I} ∈ {+1, −1}^{|I|×|I|}, where entry A_{I,I}(j, k) = +1 if there exists a link connecting nodes i_j and i_k in the network, and A_{I,I}(j, k) = −1 otherwise. In a similar way, the adjacency matrices A_{T,T} and A_{I,T} can be defined to represent the connections among texts as well as those between images and texts.

All the connections among nodes in set V can then be represented with matrix A ∈ {+1, −1}^{|V|×|V|}, where entry A(i, j) = +1 if the corresponding nodes are connected by a link in the network, and A(i, j) = −1 otherwise.

To handle the diverse information in the Text-Image Heterogeneous Information Network, a good way is to learn feature vector representations of the nodes inside the network. Formally, the network embedding problem studied here involves learning the mappings U : x → R^r and V : z → R^r, which project the images and texts into a shared feature space of dimension r. Furthermore, the network structure should be preserved in the embedding process, where connected nodes are projected to close regions.

HNE Model

For each image i_k ∈ I, HNE proposes to transform its representation from the 3-way tensor X_k into a column vector x_k ∈ R^{d′_I}, where d′_I denotes the dimension of the feature vector space. Different methods can be applied in the transformation. For instance, a simple way is to stack the columns of the image and append them together, in which case d′_I will be equal to d_I × d_I × 3. Some other advanced techniques have also been proposed, like feature extraction from the images as well as pre-embedding of the images, which will not be introduced here since they are not part of the network embedding problem studied in this section.

Formally, the linear mapping functions for the image and text data are denoted by matrices U ∈ R^{d′_I×r} and V ∈ R^{d_T×r}, which project the data into a feature space of dimension r. The embedding of image i_j ∈ I and text t_k ∈ T can be denoted as

x̃_j = U^⊤ x_j, (300)
z̃_k = V^⊤ z_k, (301)

where vectors x̃_j and z̃_k denote the embedded feature representations of image i_j and text t_k respectively.

The similarities between the embedded feature representations of images and texts can be defined as

s(x_j, x_k) = x̃_j^⊤ x̃_k = x_j^⊤ (U U^⊤) x_k = x_j^⊤ M_{I,I} x_k, (302)
s(z_j, z_k) = z̃_j^⊤ z̃_k = z_j^⊤ (V V^⊤) z_k = z_j^⊤ M_{T,T} z_k, (303)

respectively. Furthermore, since the images and texts are embedded into a common feature space, the similarity between nodes of different categories can be represented as

s(x_j, z_k) = x̃_j^⊤ z̃_k = x_j^⊤ (U V^⊤) z_k = x_j^⊤ M_{I,T} z_k. (304)

In the above equations, the matrices M_{I,I} = U U^⊤ and M_{T,T} = V V^⊤ are positive semi-definite, and via the matrices M_{I,I}, M_{T,T} and M_{I,T} the similarities among the texts and images can be effectively captured.


Meanwhile, based on the network structure, the empirical similarities of the nodes can be read off the adjacency matrices. For instance, the empirical similarity between images i_j, i_k ∈ I can be denoted as

ŝ(x_j, x_k) = A_{I,I}(j, k). (305)

The loss function introduced by the image pair i_j, i_k is defined as

L(x_j, x_k) = log(1 + e^{−A_{I,I}(j,k) · s(x_j, x_k)}). (306)

In a similar way, the loss functions for the text pairs and the image-text pairs can be defined. By combining these loss functions together, the objective function of HNE can be represented as

min_{U,V} (1/N_{I,I}) ∑_{i_j,i_k∈I} L(x_j, x_k) + (λ_1/N_{T,T}) ∑_{t_j,t_k∈T} L(z_j, z_k) (307)
+ (λ_2/N_{I,T}) ∑_{i_j∈I, t_k∈T} L(x_j, z_k) + λ_3 (‖U‖_F^2 + ‖V‖_F^2), (308)

where N_{I,I} = |I × I \ {(i_j, i_j)}_{i_j∈I}| denotes the number of image pairs (N_{T,T} and N_{I,T} are defined similarly), and λ_1, λ_2, λ_3 denote the weights of the loss terms introduced by the texts, the image-text pairs, and the regularization term respectively. The function can be solved with alternating coordinate descent, fixing one variable and updating the other. More detailed information about the solution is available in [14].
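The following minimal Python sketch evaluates the pairwise loss of Equation 306 for one image pair under a random projection matrix U. All sizes, names and values are illustrative assumptions, not the setup used in [14].

import numpy as np

rng = np.random.default_rng(0)
d_img, r = 200, 32
U = rng.normal(scale=0.01, size=(d_img, r))        # image projection matrix

def pair_loss(x_j, x_k, a_jk):
    """log(1 + exp(-A_{I,I}(j,k) * s(x_j, x_k))) with s = x_j^T U U^T x_k."""
    s = (U.T @ x_j) @ (U.T @ x_k)
    return np.log1p(np.exp(-a_jk * s))

x_j, x_k = rng.normal(size=d_img), rng.normal(size=d_img)
print(pair_loss(x_j, x_k, a_jk=+1))                # linked pair: small s is penalized
print(pair_loss(x_j, x_k, a_jk=-1))                # unlinked pair: large s is penalized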

8.3.2 Path-Augmented Heterogeneous Network Embedding

Most embedding models are based on the assumption that the node feature representations can be learnt from the neighborhood, where the neighborhood denotes either the set of nodes directly connected to the target node or the nodes accessible from the target node via random walks. In [16], a new heterogeneous network embedding model is introduced, which uses meta paths to exploit the rich information in heterogeneous networks.

In the path-augmented network embedding model, a set of meta paths is defined based on the heterogeneous network schema. For the node pairs in the network which are connected based on each of the meta paths, their correlation is represented with a meta path augmented adjacency matrix; for instance, based on the rth type of meta path, the corresponding adjacency matrix can be denoted as M^r. In heterogeneous networks, some of the meta paths will have lots of concrete meta path instances connecting nodes. For instance, in online social networks, the meta path "User −write→ Post −contain→ Word ←contain− Post ←write− User" will have lots of instances, since users write lots of posts and each post contains many words. Therefore, matrix M^r is usually normalized to ensure ∑_{i,j} M^r(i, j) = 1.

The learning framework used here is very similar to those introduced for LINE and node2vec in Sections 8.2.2 and 8.2.3. The proximity between nodes n_i, n_j ∈ V based on the rth meta path can be denoted as

P(n_j | n_i; r) = e^{x_i^⊤ x_j} / ∑_{j′∈DST(r)} e^{x_i^⊤ x_{j′}}, (309)

where x_i and x_j denote the embedding vectors of nodes n_i and n_j respectively, and DST(r) denotes the set of all possible nodes on the destination side of path r.

In the real world, set DST(r) is usually very large, which renders the above conditional probability very expensive to compute. In [16], the authors follow the techniques proposed in existing works and apply negative sampling to reduce the computation cost. Formally, the approximated objective function can be represented as

log P(n_j | n_i; r) (310)
≈ log σ(x_i^⊤ x_j) + ∑_{l=1}^{k} E_{n_{j′}∼P_n(n_{j′})} [log σ(−x_i^⊤ x_{j′} − b_r)], (311)

where n_{j′} denotes a negative node sampled from the predefined noise distribution P_n, k denotes the number of sampled nodes, and b_r is the bias term added for the rth meta path. The embedding vectors x_i for the nodes n_i in the network, as well as the bias terms b_r for the meta paths, can be learnt with the stochastic gradient descent method.
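The following minimal Python sketch illustrates the normalization of a meta path augmented adjacency matrix and the sampling of one positive node pair for a given meta path; the toy count matrix and the sampling helper are illustrative assumptions, not code from [16].

import numpy as np

rng = np.random.default_rng(0)
M_r = rng.integers(0, 5, size=(6, 6)).astype(float)   # toy instance counts

M_r /= M_r.sum()                  # normalize so that sum_{i,j} M^r(i, j) = 1

# Draw one positive pair (n_i, n_j) with probability proportional to M^r(i, j):
flat = rng.choice(M_r.size, p=M_r.ravel())
i, j = divmod(int(flat), M_r.shape[1])
print(f"positive pair for this meta path: ({i}, {j})")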

8.3.3 HEBE: HyperEdge Based Embedding

The embedding models introduced so far mostly consider only single-typed object interactions, while strongly-typed interactions involving multiple kinds of objects have attracted increasing interest in recent years. In this part, we will introduce a new embedding framework, HEBE (HyperEdge Based Embedding), which captures strongly-typed object interactions as a whole in the embedding process [36].

Terminology Definition and Problem Formulation

In HEBE, the subgraph centered at one certain type of target object in the whole network is defined as an event. Depending on the number of node types involved, events can be further categorized into homogeneous events and heterogeneous events.

Definition 51. (Event): Formally, the objects involved in the network can be represented as set X = {X_t}_{t=1}^{T}, where X_t denotes the set of objects belonging to the tth type. An event Q_i is denoted by the subset of nodes involved in it and can be represented as (V_i, w_i), where V_i denotes the set of involved objects and w_i is the occurrence count of event Q_i in the network. The object set V_i can be further divided into several subsets V_i = ⋃_{t=1}^{T} V_i^t depending on the object categories.

In the above event definition, the links connecting the nodes in the network are involved by default, and are not mentioned here for simplicity. For an event Q_i = (V_i, w_i), if more than one type of node is covered, it will be called a heterogeneous event; otherwise, it is a homogeneous event.

Formally, the set of events involved in the network can be represented as the event data D = {Q_i}_{i=1}^{N}. In the embedding problem, the objective is to learn a function f : X → R^d to project the different types of objects involved in the event data D into a shared feature space of dimension d. Meanwhile, the proximity of each event should be preserved, where the proximity of an event is defined as the likelihood of observing a target object given all the other participating objects in the same event.

Objective Function Introduction

Given an event Q_i = (V_i, w_i), let u ∈ V_i denote an object involved in the event. The remaining nodes in the event can be denoted as the context of u, i.e., C = V_i \ {u}. Assuming object u belongs to category X_1 (i.e., u ∈ X_1), the probability of predicting the target object u given its context C is defined as

P(u | C) = e^{S(u,C)} / ∑_{v∈X_1} e^{S(v,C)}, (312)

where S(u, C) denotes the similarity between u and context C, which can be calculated by summing the inner products of the object pairs in {u} × C. The loss function defined in HEBE is based on the Kullback-Leibler (KL) divergence between the conditional probability P(· | C) and the empirical probability P̂(· | C), which can be defined as

L = ∑_{t=1}^{T} ∑_{C_t∈P_t} λ_{C_t} KL(P̂(· | C_t), P(· | C_t)), (313)

where λ_{C_t} denotes the weight of context C_t, defined based on its occurrences in the event data D:

λ_{C_t} = ∑_{i=1}^{N} w_i I(C_t ∈ V_i) / |P_{i,t}|. (314)

In the above equation, P_t denotes the sample space of the type-t contexts C_t, and P_{i,t} is the sample space constrained by the object set V_i. Function I(·) is a binary function which takes value 1 if the condition holds and 0 otherwise. By replacing λ_{C_t}, the loss function can be rewritten as follows:

L = −∑_{i=1}^{N} w_i ∑_{t=1}^{T} (1/|P_{i,t}|) ∑_{C_t∈P_{i,t}} log P(u | C_t). (315)

Learning Algorithm Description

The conditional probability involved in the loss function is very hard to calculate, especially when the object set X_1 that u belongs to is very big. To address the problem, HEBE proposes to use noise pairwise ranking (NPR) to approximate the probability calculation instead.

Formally, the conditional probability function can be rewritten as

P(u | C) = (1 + ∑_{v∈X_1\{u}} e^{S(v,C)−S(u,C)})^{−1}. (316)

Instead of enumerating all the nodes v ∈ X_1 \ {u}, a small set of noise samples is selected from X_1 \ {u}, where an individual noise sample is denoted as v_n. HEBE proposes to maximize the following probability instead:

P(u > v_n | C) = σ(S(u, C) − S(v_n, C)). (317)

It can be shown that

P(u | C) > ∏_{v_n≠u} P(u > v_n | C). (318)

The conditional probability can thus be approximated as follows:

P(u | C) ∝ E_{v_n∼P_n} log P(u > v_n | C), (319)

where P_n denotes the noise distribution, set as P_n ∝ D(u)^{3/4} with regard to the degree D(u) of u. By replacing the probability in the loss function, the loss function becomes

L = −∑_{i=1}^{N} w_i ∑_{t=1}^{T} (1/|P_{i,t}|) ∑_{C_t∈P_{i,t}} E_{v_n∼P_n} log P(u > v_n | C_t). (320)

The objective function can be solved with the asynchronous stochastic gradient descent (ASGD) algorithm.
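The following minimal Python sketch evaluates the noise pairwise ranking term of Equation 317 for one target object against a few noise samples. The embeddings, the uniform noise sampling (in place of the D(u)^{3/4} distribution) and all sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(50, 16))         # embeddings for all objects

def S(u, context):
    """Similarity of u to the context: sum of inner products over {u} x C."""
    return sum(emb[u] @ emb[c] for c in context)

def npr_term(u, v_n, context):
    """sigma(S(u, C) - S(v_n, C)): probability that u outranks noise sample v_n."""
    z = S(u, context) - S(v_n, context)
    return 1.0 / (1.0 + np.exp(-z))

context = [2, 9, 31]                               # the other objects in the event
noise = rng.integers(0, 50, size=5)                # noise samples from X_1 \ {u}
print(np.mean([np.log(npr_term(4, vn, context)) for vn in noise]))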

8.4 Emerging Network Embedding across Networks

We have introduced several network embedding models in the previous sections already. However, when applied to real-world social network data, these existing embedding models can hardly work well. The main reason is that the internal social links are usually very sparse in online social networks [102], and can hardly preserve the complete network structure. For a pair of users who are not directly connected, these models will not be able to determine the closeness of these users' feature vectors in the embedding space. Such a problem will be more severe when it comes to emerging social networks [137], which denote the newly created online social networks containing very few social connections.

In this section, we will study the emerging network embedding problem across multiple aligned heterogeneous social networks simultaneously. In this concurrent embedding process, the emerging network embedding problem aims at distilling relevant information from both the emerging network and other aligned mature networks to derive complementary knowledge and learn good vector representations for the user nodes in the emerging network. Formally, the studied problem can be formulated as follows.

Given two aligned networks G = ((G^{(1)}, G^{(2)}), (A^{(1,2)})), where G^{(1)} is an emerging network and G^{(2)} is a mature network, the emerging network embedding problem aims at learning a mapping function f^{(i)} : U^{(i)} → R^{d^{(i)}} to project the user nodes in G^{(i)} to a feature space of dimension d^{(i)} (d^{(i)} ≪ |U^{(i)}|). The objective of the mapping functions f^{(i)} is to ensure that the embedding results preserve the network structural information, where similar user nodes are projected to close regions. Furthermore, in the embedding process, emerging network embedding also aims to transfer information between G^{(2)} and G^{(1)} to overcome the information sparsity problem in G^{(1)}.

To solve the problem, in this section, we will introduce a novel multiple aligned heterogeneous social network embedding framework, named DIME, proposed in [135]. To handle the heterogeneous link and attribute information in the networks in a unified analytic, DIME introduces the aligned attribute augmented heterogeneous network concept. From these networks, a set of meta paths is introduced to represent the diverse connections among users in online social networks, and a set of meta proximity measures is defined for each of the meta paths, denoting the closeness among users. The meta proximity information is fed into a deep learning framework, which takes the input information from multiple aligned heterogeneous social networks simultaneously, to obtain the embedding feature vectors for all the users in these aligned networks. Based on the connections among users, framework DIME aims at embedding close user nodes to a close area in the lower-dimensional feature space for each of the social networks respectively. Meanwhile, framework DIME also poses constraints on the feature vectors corresponding to the shared users across networks to map them to relatively close regions as well. In this way, information can be transferred from the mature networks to the emerging network to solve the information sparsity problem.

8.4.1 Proposed Methods

For each attributed heterogeneous social network, the closeness among users can be denoted by the friendship links among them, where friends tend to be closer compared with user pairs without connections. Meanwhile, for the users who are not directly connected by friendship links, few existing embedding methods can figure out their closeness, as these methods are mostly built based on the direct friendship links only. In this section, the potential closeness scores among the users are computed with the heterogeneous information in the networks based on the meta path concept [97], and are formally called the meta proximity in [135].

Friendship based Meta Proximity

In online social networks, the friendship links are the most obvious indicator of the social closeness among users. Online friends tend to be closer with each other compared with the user pairs who are not friends. Users' friendship links also carry important information about the local network structure, which should be preserved in the embedding results. Based on such an intuition, the friendship based meta proximity concept can be defined as follows.

Definition 52. (Friendship based Meta Proximity): For any two user nodes u_i^{(1)}, u_j^{(1)} in an online social network (e.g., G^{(1)}), if u_i^{(1)} and u_j^{(1)} are friends in G^{(1)}, the friendship based meta proximity between u_i^{(1)} and u_j^{(1)} in the network is 1; otherwise, the friendship based meta proximity score between them is 0 instead. To be more specific, the friendship based meta proximity score between users u_i^{(1)}, u_j^{(1)} can be represented as p^{(1)}(u_i^{(1)}, u_j^{(1)}) ∈ {0, 1}, where p^{(1)}(u_i^{(1)}, u_j^{(1)}) = 1 iff (u_i^{(1)}, u_j^{(1)}) ∈ E_{u,u}^{(1)}.

Based on the above definition, the friendship based meta proximity scores among all the users in network G^{(1)} can be represented as matrix P_{Φ_0}^{(1)} ∈ R^{|U^{(1)}|×|U^{(1)}|}, where entry P_{Φ_0}^{(1)}(i, j) equals p^{(1)}(u_i^{(1)}, u_j^{(1)}). Here Φ_0 denotes the simplest meta path of length 1, in the form U −follow→ U, and its formal definition will be introduced in the following subsection.

When network G^{(1)} is an emerging online social network which has just started to provide services, the friendship links among users in G^{(1)} tend to be very limited (the majority of users are isolated in the network, with few social connections). In other words, the friendship based meta proximity matrix P_{Φ_0}^{(1)} will be extremely sparse, with very few entries having value 1 and most being 0. With such a sparse matrix, most existing embedding models will fail to work. The reason is that the sparse friendship information available in the network can hardly characterize the relative closeness among the users (especially those who are not connected by friendship links), so these existing embedding models may project all the nodes to random regions.

To overcome such a problem, besides the social links, DIME proposes to calculate the potential proximity scores for the users with the diverse link and attribute information available in the heterogeneous networks. To handle the diverse links and attributes simultaneously in a unified analytic, DIME treats the attributes as nodes as well and introduces the attribute augmented network: if a node has a certain attribute, a new type of link "have" is added to connect the node and the newly added attribute node. By extending the meta path definition introduced in Section 3 to incorporate the attribute information, a set of different social meta paths {Φ_0, Φ_1, Φ_2, ..., Φ_7} can be extracted from the network, whose notations, concrete representations and physical meanings are illustrated in Table 1. Here, meta paths Φ_0−Φ_4 are all based on the user node type and the follow link type; meta paths Φ_5−Φ_7 involve the user and post node types, the attribute node types, as well as the write and have link types. Based on each of the meta paths, there will exist a set of concrete meta path instances connecting users in the networks. For instance, given a user pair u and v who have checked in at 5 different common locations, there will be 5 concrete meta path instances of meta path Φ_7 connecting u and v, indicating their strong closeness (in location check-ins). In the next subsection, we will introduce how to calculate the proximity scores for the users based on these extracted meta paths.

Heterogeneous Network Meta Proximity

The set of attribute augmented social meta paths {Φ_0, Φ_1, Φ_2, ..., Φ_7} extracted in the previous subsection creates different kinds of correlations among users (especially for those who are not directly connected by friendship links). With these social meta paths, different types of proximity scores among the users can be captured. For instance, users who are not friends but share lots of common friends may also know each other and can be close to each other; users who frequently check in at the same places tend to be closer to each other compared with the isolated ones with nothing in common. Therefore, these meta paths can help capture much broader network structures compared with the local structures captured by the friendship based meta proximity discussed above. In this part, we will introduce the method to calculate the proximity scores among users based on these social meta paths.

Similar to the meta paths shown in Table 1, all the social meta paths extracted from the networks can be represented as the set {Φ_1, Φ_2, ..., Φ_7}. Given a pair of users, e.g., u_i^{(1)} and u_j^{(1)}, based on meta path Φ_k ∈ {Φ_1, Φ_2, ..., Φ_7}, the set of meta path instances connecting u_i^{(1)} and u_j^{(1)} can be represented as P_{Φ_k}^{(1)}(u_i^{(1)}, u_j^{(1)}). Users u_i^{(1)} and u_j^{(1)} can have multiple meta path instances going into/out from them. Formally, all the meta path instances going out from user u_i^{(1)} (or going into u_j^{(1)}), based on meta path Φ_k, can be represented as the set P_{Φ_k}^{(1)}(u_i^{(1)}, ·) (or P_{Φ_k}^{(1)}(·, u_j^{(1)})). The proximity score between u_i^{(1)} and u_j^{(1)} based on meta path Φ_k can be formally represented as the following meta proximity concept.

Definition 53. (Meta Proximity): Based on social meta path Φ_k, the meta proximity between users u_i^{(1)} and u_j^{(1)} in network G^{(1)} can be represented as

p_{Φ_k}^{(1)}(u_i^{(1)}, u_j^{(1)}) = 2 |P_{Φ_k}^{(1)}(u_i^{(1)}, u_j^{(1)})| / (|P_{Φ_k}^{(1)}(u_i^{(1)}, ·)| + |P_{Φ_k}^{(1)}(·, u_j^{(1)})|). (321)

Meta proximity considers not only the meta path instances between the users but also penalizes the numbers of meta path instances going out from/into u_i^{(1)} and u_j^{(1)} at the same time. This is reasonable: for instance, sharing some common location check-ins with an extremely active user (who has tens of thousands of check-ins) may not necessarily indicate closeness with him, since he may have common check-ins with many other users due to his very large check-in record volume.

With the above meta proximity definition, the meta proximity scores among all the users in network G^{(1)} based on meta path Φ_k can be denoted as matrix P_{Φ_k}^{(1)} ∈ R^{|U^{(1)}|×|U^{(1)}|}, where entry P_{Φ_k}^{(1)}(i, j) = p_{Φ_k}^{(1)}(u_i^{(1)}, u_j^{(1)}). All the meta proximity matrices defined for network G^{(1)} can be represented as {P_{Φ_k}^{(1)}}_{Φ_k}. Based on the meta paths extracted for network G^{(2)}, similar matrices can be defined as well, denoted as {P_{Φ_k}^{(2)}}_{Φ_k}.

Deep DIME-SH Model

With the meta proximity scores calculated above, we will introduce the embedding framework DIME next. DIME is based on an aligned auto-encoder model, which extends the traditional deep auto-encoder model to the multiple aligned heterogeneous networks scenario. In this part, we will talk about the embedding component for one heterogeneous information network, which takes the various meta proximity matrices as the input. DIME effectively couples the embedding process of the emerging network with the other aligned mature networks, where cross-network information exchange and result refinement are achieved via a loss term defined based on the anchor links, which will be introduced in the next part.

When applying the auto-encoder model for single homogeneous network node embedding, e.g., for G^{(1)}, the model can be learned with the node meta proximity feature vectors, i.e., the rows corresponding to users in matrix P_{Φ_0}^{(1)} introduced above. In the case that G^{(1)} is heterogeneous, multiple node meta proximity matrices have been defined (i.e., P_{Φ_0}^{(1)}, P_{Φ_1}^{(1)}, ..., P_{Φ_7}^{(1)}), and how to fit these matrices simultaneously to the auto-encoder models is an open problem. In this part, we will introduce the single-heterogeneous-network version of framework DIME, namely DIME-SH, which will also be used as an important component of framework DIME. For each user node in the network, DIME-SH first computes the embedding vector based on each of the proximity matrices independently, and these are further fused to compute the final latent feature vector in the output hidden layer.

As shown in the architecture in Figure 14 (either the left component for network 1 or the right component for network 2), for the same instance, DIME-SH takes the different feature vectors extracted based on the meta paths {Φ_0, Φ_1, ..., Φ_7} as the input. For each meta path, a series of separate encoder and decoder steps are carried out simultaneously, whose latent vectors are fused together to calculate the final embedding vector z_i^{(1)} ∈ R^{d^{(1)}} for user u_i^{(1)} ∈ V^{(1)}. In the DIME-SH model, the input feature vector (based on meta path Φ_k ∈ {Φ_0, Φ_1, ..., Φ_7}) of user u_i^{(1)} can be represented as x_{i,Φ_k}^{(1)}, which denotes the row corresponding to user u_i^{(1)} in matrix P_{Φ_k}^{(1)} defined before. Meanwhile, the latent representations of the instance based on the feature vector extracted via meta path Φ_k at the different hidden layers can be represented as y_{i,Φ_k}^{(1),1}, y_{i,Φ_k}^{(1),2}, ..., y_{i,Φ_k}^{(1),o}.

One of the significant differences of model DIME-SH from the traditional auto-encoder model lies in (1) the combination of the various hidden vectors y_{i,Φ_0}^{(1),o}, y_{i,Φ_1}^{(1),o}, ..., y_{i,Φ_7}^{(1),o} to obtain the final embedding vector z_i^{(1)} in the encoder step, and (2) the dispatch of the embedding vector z_i^{(1)} back to the hidden vectors in the decoder step. As shown in the architecture, formally, these extra steps can be represented as

# extra encoder steps
y_i^{(1),o+1} = σ(∑_{Φ_k∈{Φ_0,...,Φ_7}} (W_{Φ_k}^{(1),o+1} y_{i,Φ_k}^{(1),o} + b_{Φ_k}^{(1),o+1})),
z_i^{(1)} = σ(W^{(1),o+2} y_i^{(1),o+1} + b^{(1),o+2}).

# extra decoder steps
ŷ_i^{(1),o+1} = σ(Ŵ^{(1),o+2} z_i^{(1)} + b̂^{(1),o+2}),
ŷ_{i,Φ_k}^{(1),o} = σ(Ŵ_{Φ_k}^{(1),o+1} ŷ_i^{(1),o+1} + b̂_{Φ_k}^{(1),o+1}). (322)

What's more, since the input feature vectors are extremely sparse (lots of the entries have value 0), simply feeding them to the model may lead to trivial solutions, like 0 vectors for both z_i^{(1)} and the decoded vectors x̂_{i,Φ_k}^{(1)}. To overcome such a problem, another significant difference of model DIME-SH from the traditional auto-encoder model lies in the loss function definition, where the loss introduced by the non-zero features is assigned a larger weight. In addition, by adding the loss functions for each of the meta paths, the final loss function of DIME-SH can be formally represented as

L^{(1)} = ∑_{Φ_k∈{Φ_0,...,Φ_7}} ∑_{u_i∈V} ‖(x_{i,Φ_k}^{(1)} − x̂_{i,Φ_k}^{(1)}) ⊙ b_{i,Φ_k}^{(1)}‖_2^2, (323)

where vector b_{i,Φ_k}^{(1)} is the weight vector corresponding to feature vector x_{i,Φ_k}^{(1)}. Entries in vector b_{i,Φ_k}^{(1)} are filled with value 1, except the entries corresponding to non-zero elements in x_{i,Φ_k}^{(1)}, which are assigned value γ (γ > 1, denoting a larger weight for fitting these features). In a similar way, the loss function for the embedding result in network G^{(2)} can be formally represented as L^{(2)}.

Deep DIME Framework

Even though DIME-SH has incorporated all the heterogeneous information in the model building, and the meta proximity calculated based on it can help differentiate the closeness among different users, for emerging networks which have just started to provide services, the information sparsity problem may still affect the performance of DIME-SH significantly. In this part, we will introduce DIME, which couples the embedding process of the emerging network with another mature aligned network. By accommodating the embedding between the aligned networks, information can be transferred from the aligned mature network to refine the embedding results in the emerging network effectively. The complete architecture of DIME is shown in Figure 14, which involves the DIME-SH components for each of the aligned networks, and the information transfer component aligns these separated DIME-SH models together.

To be more specific, given a pair of aligned heterogeneous networks G = ((G^{(1)}, G^{(2)}), A^{(1,2)}), where G^{(1)} is an emerging network and G^{(2)} is a mature network, the embedding results can be represented as matrices Z^{(1)} ∈ R^{|U^{(1)}|×d^{(1)}} and Z^{(2)} ∈ R^{|U^{(2)}|×d^{(2)}} for all the user nodes in G^{(1)} and G^{(2)} respectively. The ith row of matrix Z^{(1)} (or the jth row of matrix Z^{(2)}) denotes the encoded feature vector of user u_i^{(1)} in G^{(1)} (or u_j^{(2)} in G^{(2)}). If u_i^{(1)} and u_j^{(2)} are the same user, i.e., (u_i^{(1)}, u_j^{(2)}) ∈ A^{(1,2)}, by placing vectors Z^{(1)}(i, :) and Z^{(2)}(j, :) in a close region of the embedding space, the information from G^{(2)} can be used to refine the embedding result in G^{(1)}.

Figure 14: The DIME Framework. (Architecture diagram: the DIME-SH components for Network 1 and Network 2, coupled through the network fusion component.)

Information transfer is achieved based on the anchor links, and we only care about the anchor users. To adjust the rows of matrices Z^{(1)} and Z^{(2)} to remove the non-anchor users and make the same rows correspond to the same user, DIME introduces the binary inter-network transitional matrix T^{(1,2)} ∈ R^{|U^{(1)}|×|U^{(2)}|}. Entry T^{(1,2)}(i, j) = 1 iff the corresponding users are connected by an anchor link, i.e., (u_i^{(1)}, u_j^{(2)}) ∈ A^{(1,2)}.

Furthermore, the encoded feature vectors for users in these two networks can be of different dimensions, i.e., d^{(1)} ≠ d^{(2)}, which can be accommodated via the projection matrix W^{(1,2)} ∈ R^{d^{(1)}×d^{(2)}}.

Formally, the introduced information fusion loss between networks G^{(1)} and G^{(2)} can be represented as

L^{(1,2)} = ‖(T^{(1,2)})^⊤ Z^{(1)} W^{(1,2)} − Z^{(2)}‖_F^2. (324)

By minimizing the information fusion loss function L^{(1,2)}, the anchor users' embedding vectors from the mature network G^{(2)} can be used to adjust their embedding vectors in the emerging network G^{(1)}. Even though in such a process the embedding vectors in G^{(2)} may be undermined by G^{(1)}, this is not a problem, since G^{(1)} is the target network and DIME only cares about the embedding result of the emerging network G^{(1)} [135].

The complete objective function of the framework includes the loss terms introduced by the DIME-SH components for networks G^{(1)} and G^{(2)}, as well as the information fusion loss, which can be denoted as

L(G^{(1)}, G^{(2)}) = L^{(1)} + L^{(2)} + α · L^{(1,2)} + β · L_{reg}. (325)

Parameters α and β denote the weights of the information fusion loss term and the regularization term respectively. The term L_{reg} is added to the objective function to avoid overfitting, and can be formally represented as

L_{reg} = L_{reg}^{(1)} + L_{reg}^{(2)} + L_{reg}^{(1,2)},
L_{reg}^{(1)} = ∑_{i=1}^{o^{(1)}+2} ∑_{Φ_k∈{Φ_0,...,Φ_7}} (‖W_{Φ_k}^{(1),i}‖_F^2 + ‖Ŵ_{Φ_k}^{(1),i}‖_F^2),
L_{reg}^{(2)} = ∑_{i=1}^{o^{(2)}+2} ∑_{Φ_k∈{Φ_0,...,Φ_7}} (‖W_{Φ_k}^{(2),i}‖_F^2 + ‖Ŵ_{Φ_k}^{(2),i}‖_F^2),
L_{reg}^{(1,2)} = ‖W^{(1,2)}‖_F^2. (326)

To optimize the above objective function, stochastic gradient descent (SGD) is utilized. To be more specific, the training process involves multiple epochs. In each epoch, the training data is shuffled and a minibatch of instances is sampled to update the parameters with SGD. Such a process continues until convergence or until the training epochs finish.
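To make the joint objective concrete, the following minimal PyTorch sketch assembles the weighted reconstruction loss (Equation 323, for a single meta path), the information fusion loss (Equation 324) and part of the regularizer into the combined loss of Equation 325. All tensors, sizes and weight values are illustrative stand-ins for the actual model outputs, not the authors' implementation.

import torch

n1, n2, d1, d2 = 40, 60, 16, 16
x   = torch.rand(n1, 8)                            # meta proximity inputs (one path)
x_h = torch.rand(n1, 8, requires_grad=True)        # decoder reconstructions
Z1  = torch.rand(n1, d1, requires_grad=True)       # emerging-network embeddings
Z2  = torch.rand(n2, d2)                           # mature-network embeddings
T   = torch.zeros(n1, n2)                          # binary anchor-link matrix
T[0, 0] = T[1, 3] = 1.0
W12 = torch.rand(d1, d2, requires_grad=True)       # cross-network projection

gamma, alpha, beta = 5.0, 1.0, 0.01
b = torch.where(x > 0, torch.full_like(x, gamma), torch.ones_like(x))
loss_rec  = (((x - x_h) * b) ** 2).sum()           # Equation 323, one meta path
loss_fuse = ((T.t() @ Z1 @ W12 - Z2) ** 2).sum()   # Equation 324
loss_reg  = (W12 ** 2).sum()                       # part of Equation 326
loss = loss_rec + alpha * loss_fuse + beta * loss_reg   # Equation 325 (partial)
loss.backward()                                    # gradients for one SGD step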

9. CONCLUSION AND FUTURE DEVELOPMENTS

In this paper, we have introduced the current research works on broad learning and its applications in social media studies. This paper has covered 5 main research directions about broad learning based social media studies: (1) network alignment, (2) link prediction, (3) community detection, (4) information diffusion and (5) network embedding. The problems introduced in this paper are all very important for many concrete real-world social network applications and services. A number of nontrivial algorithms have been proposed to resolve these problems, and they have been discussed in great detail in this paper.

Both broad learning and social media mining are very promising research directions, and some potential future development directions are illustrated as follows.

1. Scalable Broad Learning Algorithms: Data generated nowadays is usually of very large scale, and the fusion of such big data from multiple sources together renders the problem even more challenging. For instance, online social networks (like Facebook) usually involve millions or even billions of active users, and the social data generated by these users each day consumes more than 600 TB of storage space (in Facebook). One major future development of broad learning based social media mining is to develop scalable data fusion and mining algorithms that can handle such a large-volume (big data) challenge. One tentative approach is to develop information fusion algorithms based on distributed platforms, like Spark and Hadoop [40], and handle the data with a large distributed computing cluster. Another method to resolve the scalability challenge is from the model optimization perspective: optimizing existing learning models and proposing new approximated learning algorithms with lower time complexity are desirable directions for future research projects. In addition, applying the latest deep learning models to fuse and mine large-scale datasets can be another alternative approach for scalable broad learning on social networks.

2. Multiple Sources Fusion and Mining: Current research works on multi-source data fusion and mining mainly focus on aligning entities in one single pair of data sources (i.e., two sources), where the information exchange between the sources mainly relies on the anchor links between the aligned entities. Meanwhile, when it comes to the fusion and mining of multiple (more than two) sources, the problem setting will be quite different and becomes more challenging. For example, in the alignment of more networks, the transitivity property of the inferred anchor links needs to be preserved [140]. Meanwhile, in the information transfer from multiple external aligned sources to the target source, the information sources should be weighted differently according to their importance. Therefore, the diverse variety of the multiple sources will lead to more research challenges and opportunities, which is also a great challenge in big data studies. New information fusion and mining algorithms for the multi-source scenarios can be another great opportunity to explore broad learning in the future.

3. Broader Learning Applications: Besides the research works on social network datasets, a third potential future development of broad learning and mining lies in its broader applications to various categories of datasets, like enterprise internal data [142; 130; 145; 144], geo-spatial data [131; 120; 132], knowledge base data, and pure text data. Some prior research works have already fused enterprise context information sources, like enterprise social networks, organizational charts and employee profile information [142; 130; 145; 144]. Several interesting problems, like organizational chart inference [142], enterprise link prediction [130], information diffusion at the workplace [145] and enterprise employee training [144], have been studied based on the fused enterprise internal information. In the future, these areas are still open for exploration. Applications of broad learning techniques to other problems, such as employee training, expert location and project team formation, will be interesting problems awaiting further investigation. In addition, analyzing the correlation of different traveling modalities (like shared bicycles [131; 120; 132], buses and metro trains) with city zoning in smart cities, and fusing multiple knowledge bases, like Douban and IMDB, for knowledge discovery and truth finding, are both good application scenarios for broad learning research.
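To make the first direction concrete, the sketch below is a toy illustration (not an algorithm from the surveyed literature) of distributed fusion on Spark: the user attribute tables of two networks are routed to a common representation by joining them through known anchor links. All table paths and column names (users_a, users_b, anchors, uid_a, uid_b) are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anchor-link-fusion").getOrCreate()

# Hypothetical inputs: per-network user attribute tables and a table of
# known anchor links mapping user ids across the two networks.
users_a = spark.read.parquet("hdfs:///networks/a/users")       # uid_a, features_a
users_b = spark.read.parquet("hdfs:///networks/b/users")       # uid_b, features_b
anchors = spark.read.parquet("hdfs:///networks/anchor_links")  # uid_a, uid_b

# Distributed fusion: the joins bring each anchor user's records from both
# networks together, yielding one fused row per anchor user on the cluster.
fused = (anchors.join(users_a, on="uid_a")
                .join(users_b, on="uid_b"))
fused.write.parquet("hdfs:///networks/fused_users")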
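For the second direction, one concrete requirement mentioned above is that anchor links inferred over more than two networks should be transitive [140]. The following sketch (a hypothetical illustration, with one-to-one anchor mappings stored as plain Python dicts) checks whether independently inferred mappings A-to-B, B-to-C and A-to-C are mutually consistent:

def transitivity_violations(ab, bc, ac):
    """Given inferred one-to-one anchor mappings ab: A->B, bc: B->C and
    ac: A->C, return the anchor users in A whose composed mapping
    bc[ab[u]] disagrees with the directly inferred mapping ac[u]."""
    violations = []
    for u in ac:
        composed = bc.get(ab.get(u))   # follow the A -> B -> C composition
        if composed is not None and composed != ac[u]:
            violations.append((u, composed, ac[u]))
    return violations

# Toy example: u1's composed mapping (via b1) conflicts with the direct one.
ab = {"u1": "b1", "u2": "b2"}
bc = {"b1": "c1", "b2": "c2"}
ac = {"u1": "c9", "u2": "c2"}
print(transitivity_violations(ab, bc, ac))  # [('u1', 'c1', 'c9')]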

10. REFERENCES

[1] A. K. Menon and C. Elkan. Link prediction via matrix factorization. In ECML/PKDD, 2011.

[2] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 2001.

[3] C. Aggarwal, Y. Xie, and P. Yu. Gconnect: A connectivity index for massive disk-resident graphs. VLDB Endowment, 2009.

[4] A. Arenas, L. Danon, A. Díaz-Guilera, P. M. Gleiser, and R. Guimera. Community analysis in social networks. The European Physical Journal B, 2004.

[5] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In WSDM, 2011.

[6] A.-L. Barabasi, H. Jeong, Z. Neda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaboration. Physica A, 2002.

[7] M. Bayati, M. Gerritsen, D. Gleich, A. Saberi, and Y. Wang. Algorithms for large, sparse network alignment problems. In ICDM, 2009.

[8] D. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods (Optimization and Neural Computation Series). Athena Scientific, 1996.

[9] S. Bharathi, D. Kempe, and M. Salek. Competitive influence maximization in social networks. In WINE, 2007.

[10] T. Blomberg. Heat conduction in two and three dimensions: computer modelling of building physics applications. PhD thesis, 1996.

[11] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.

[12] T. Bui and C. Jones. A heuristic for reducing fill-in in sparse matrix factorization. In PPSC, 1993.

[13] T. Carnes, R. Nagarajan, S. Wild, and A. Zuylen. Maximizing influence in a competitive social network: a follower's perspective. In ICEC, 2007.

[14] S. Chang, W. Han, J. Tang, G. Qi, C. Aggarwal, and T. Huang. Heterogeneous network embedding via deep architectures. In KDD, 2015.

[15] N. Chen. On the approximability of influence in social networks. In SODA, 2008.

[16] T. Chen and Y. Sun. Task-guided and path-augmented heterogeneous network embedding for author identification. CoRR, abs/1612.02814, 2016.

[17] W. Chen, A. Collins, R. Cummings, T. Ke, Z. Liu, D. Rincon, X. Sun, Y. Wang, W. Wei, and Y. Yuan. Influence maximization in social networks when negative opinions may emerge and propagate. In SDM, 2011.

[18] P. L. Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 2005.

[19] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. IJPRAI, 2004.

[20] R. Dasgupta, B. Garcia, and R. Goodman. Systemic spread of an RNA insect virus in plants expressing plant viral movement protein genes. Proceedings of the National Academy of Sciences, 2001.

[21] S. Datta, A. Majumder, and N. Shrivastava. Viral marketing for multiple products. In ICDM, 2010.

[22] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 2008.

[23] J. S. deCani and R. A. Stine. A note on deriving the information matrix for a logistic distribution. The American Statistician, 1986.

[24] A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Ontology matching: A machine learning approach. In Handbook on Ontologies, 2004.

[25] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, 2001.

[26] L. Dubins and D. Freedman. Machiavelli and the Gale-Shapley algorithm. The American Mathematical Monthly, 1981.

[27] D. Dunlavy, T. Kolda, and E. Acar. Temporal link prediction using matrix and tensor factorizations. TKDD, 2011.

[28] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.

[29] M. Eslami, A. Aleyasen, R. Moghaddam, and K. Karahalios. Friend grouping algorithms for online social networks: Preference, bias, and implications. In Social Informatics, 2014.

[30] J. Flannick, A. Novak, B. Srinivasan, H. McAdams, and S. Batzoglou. Graemlin: general and robust alignment of multiple large interaction networks. Genome Research, 2006.

[31] S. Fortin. The graph isomorphism problem. Technical report, 1996.

[32] F. Fouss, A. Pirotte, J. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 2007.

[33] R. Ghosh, K. Lerman, T. Surachawala, K. Voevodski, and S. Teng. Non-conservative diffusion and its application to social network analysis. CoRR, abs/1102.4639, 2011.

[34] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In HYPERTEXT, 1998.

[35] A. Grover and J. Leskovec. Node2vec: Scalable feature learning for networks. In KDD, 2016.

[36] H. Gui, J. Liu, F. Tao, M. Jiang, B. Norick, and J. Han. Large-scale embedding learning in heterogeneous event data. In ICDM, 2016.

[37] M. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. In SDM, 2006.

[38] M. Hasan and M. Zaki. In Social Network Data Analytics, 2011.

[39] Q. Hu, S. Xie, J. Zhang, Q. Zhu, S. Guo, and P. Yu. Heterosales: Utilizing heterogeneous social networks to identify the next enterprise customer. In WWW, 2016.

[40] S. Jin, J. Zhang, P. Yu, S. Yang, and A. Li. Synergistic partitioning in multiple large scale social networks. In BigData, 2014.

[41] M. Kalaev, V. Bafna, and R. Sharan. Fast and accurate alignment of multiple protein networks. In RECOMB, 2008.

[42] G. Karypis and V. Kumar. Analysis of multilevel graph partitioning. In Supercomputing, 1995.

[43] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. In Supercomputing, 1996.

[44] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 1998.

[45] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.

[46] E. Keeler. The value of remaining lifetime is close to estimated values of life. Journal of Health Economics, 2000.

[47] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.

[48] D. Kempe, J. Kleinberg, and E. Tardos. Influential nodes in a diffusion model for social networks. In ICALP, 2005.

[49] J. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 1999.

[50] K. Klemm and V. M. Eguíluz. Highly clustered scale-free networks. Physical Review E, 2002.

[51] X. Kong, J. Zhang, and P. Yu. Inferring anchor links across multiple heterogeneous social networks. In CIKM, 2013.

[52] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and collaborative recommendation. In SIGIR, 2009.

[53] J. Kostka, Y. Oswald, and R. Wattenhofer. Word of mouth: Rumor dissemination in social networks. In SIROCCO, 2008.

[54] D. Koutra, H. Tong, and D. Lubensky. Big-align: Fast bipartite graph alignment. In ICDM, 2013.

[55] J. Lee, W. Han, R. Kasperovics, and J. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. VLDB, 2012.

[56] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, 2010.

[57] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In CHI, 2010.

[58] J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In WWW, 2010.

[59] D. Li, Z. Xu, N. Chakraborty, A. Gupta, K. Sycara, and S. Li. Polarity related influence maximization in signed social networks. PLOS ONE, 2014.

[60] C. Liao, K. Lu, M. Baym, R. Singh, and B. Berger. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics, 2009.

[61] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 2007.

[62] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2015.

[63] B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In ICDM, 2003.

[64] L. Lu and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 2011.

[65] F. D. Malliaros and M. Vazirgiannis. Clustering and community detection in directed networks: A survey. CoRR, abs/1308.0971, 2013.

[66] F. Manne and M. Halappanavar. New effective multithreaded matching algorithms. In IPDPS, 2014.

[67] M. McPherson, L. Smith-Lovin, and J. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 2001.

[68] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, 2002.

[69] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[70] T. N. Narasimhan. Fourier's heat conduction equation: History, influence, and connections. Proceedings of the Indian Academy of Sciences - Earth and Planetary Sciences, 1999.

[71] R. Narayanam and A. Nanavati. Viral marketing for product cross-sell through social networks. In ECML/PKDD, 2012.

[72] M. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 2006.

[73] I. Nåsell. Stochastic models of some endemic infections. Mathematical Biosciences, 2002.

[74] S. Pan and Q. Yang. A survey on transfer learning. TKDE, 2010.

[75] R. Panigrahy, M. Najork, and Y. Xie. How user behavior is related to social affinity. In WSDM, 2012.

[76] D. Park, R. Singh, M. Baym, C. Liao, and B. Berger. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Research, 2011.

[77] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani. Epidemic processes in complex networks. Rev. Mod. Phys., 2015.

[78] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In KDD, 2014.

[79] P. Petersen. Linear Algebra. 2012.

[80] H. Raguet, J. Fadili, and G. Peyre. A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 2013.

[81] R. C. Baron, J. B. McCormick, and O. A. Zubeir. Ebola virus disease in southern Sudan: hospital dissemination and intrafamilial spread. Bull. World Health Organ., 1983.

[82] R. Read and D. Corneil. The graph isomorphism disease. 2006.

[83] E. Richard, P. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In ICML, 2012.

[84] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD, 2002.

[85] R. Roman. Community-based recommendations to improve intranet users' productivity. Master's thesis, 2016.

[86] D. Shah and T. Zaman. Rumors in a network: Who's the culprit? IEEE Transactions on Information Theory, 2011.

[87] W. Shao, J. Zhang, L. He, and P. Yu. Multi-source multi-view clustering via discrepancy penalty. In IJCNN, 2016.

[88] R. Sharan, S. Suthram, R. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. 2005.

[89] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 2000.

[90] Y. Shih and S. Parthasarathy. Scalable global alignment for multiple biological networks. Bioinformatics, 2012.

[91] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In MSST, 2010.

[92] E. H. Simpson. Measurement of diversity. Nature, 1949.

[93] R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB, 2007.

[94] R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 2008.

[95] A. Smalter, J. Huan, and G. Lushington. Gpm: A graph pattern matching kernel with diffusion for chemical compound classification. In IEEE BIBE, 2008.

[96] B. Sriperumbudur and G. Lanckriet. On the convergence of concave-convex procedure. In NIPS, 2009.

[97] Y. Sun, C. Aggarwal, and J. Han. Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. VLDB, 2012.

[98] Y. Sun, J. Han, X. Yan, P. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. PVLDB, 2011.

[99] Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD, 2009.

[100] J. Tang, Y. Chang, C. Aggarwal, and H. Liu. A survey of signed network mining in social media. ACM Computing Surveys, to appear; CoRR, abs/1511.07569, 2015.

[101] J. Tang, H. Gao, X. Hu, and H. Liu. Exploiting homophily effect for trust prediction. In WSDM, 2013.

[102] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In WWW, 2015.

[103] H. Tong, C. Faloutsos, and J. Pan. Fast random walk with restart and its applications. In ICDM, 2006.

[104] M. Trusov, A. Bodapati, and R. Bucklin. Determining influential users in internet social networks. Journal of Marketing Research, 2010.

[105] T. Turner, P. Qvarfordt, J. Biehl, G. Golovchinsky, and M. Back. Exploring the workplace communication ecology. In CHI, 2010.

[106] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE TPAMI, 1988.

[107] U. von Luxburg. A tutorial on spectral clustering. CoRR, 2007.

[108] X. Wang and G. Chen. Complex networks: small-world, scale-free and beyond. IEEE Circuits and Systems Magazine, 2003.

[109] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.

[110] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Technical report, Rice University, 2010.

[111] R. West, H. Paskov, J. Leskovec, and C. Potts. Exploiting social network structure for person-to-person sentiment analysis. TACL, 2014.

[112] K. Wilcox and A. T. Stephen. Are close friends the enemy? Online social networks, self-esteem, and self-control. Journal of Consumer Research, 2012.

[113] J. Yang, J. McAuley, and J. Leskovec. Community detection in networks with node attributes. In ICDM, 2013.

[114] Y. Yao, H. Tong, X. Yan, F. Xu, and J. Lu. Matri: a multi-aspect and transitive trust inference model. In WWW, 2013.

[115] J. Ye, H. Cheng, Z. Zhu, and M. Chen. Predicting positive and negative links in signed social networks by transfer learning. In WWW, 2013.

[116] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 2003.

[117] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002.

[118] R. Zafarani and H. Liu. Connecting users across social media sites: A behavioral-modeling approach. In KDD, 2013.

[119] W. Zangwill. Nonlinear Programming. Prentice-Hall, 1969.

[120] Q. Zhan, J. Zhang, X. Pan, M. Li, and P. Yu. Discover tipping users for cross network influencing. In IRI, 2016.

[121] Q. Zhan, J. Zhang, S. Wang, P. Yu, and J. Xie. Influence maximization across partially aligned heterogeneous social networks. In PAKDD, 2015.

[122] Q. Zhan, J. Zhang, P. Yu, S. Emery, and J. Xie. Inferring social influence of anti-tobacco mass media campaigns. In BIBM, 2016.

[123] H. Zhang, D. Nguyen, S. Das, H. Zhang, and M. Thai. Least cost influence maximization across multiple social networks. CoRR, abs/1606.08927, 2016.

[124] J. Zhang, C. Aggarwal, and P. Yu. Rumor initiator detection in infected signed networks. In ICDCS, 2017.

[125] J. Zhang, J. Chen, S. Zhi, Y. Chang, P. Yu, and J. Han. Link prediction across aligned networks with sparse low rank matrix estimation. In ICDE, 2017.

[126] J. Zhang, J. Chen, J. Zhu, Y. Chang, and P. Yu. Link prediction with cardinality constraints. In WSDM, 2017.

[127] J. Zhang, L. Cui, P. Yu, and Y. Lv. Bl-ecd: Broad learning based enterprise community detection via hierarchical structure fusion. In CIKM, 2017.

[128] J. Zhang, X. Kong, and P. Yu. Predicting social links for new users across aligned heterogeneous social networks. In ICDM, 2013.

[129] J. Zhang, X. Kong, and P. Yu. Transferring heterogeneous links across location-based social networks. In WSDM, 2014.

[130] J. Zhang, Y. Lv, and P. Yu. Enterprise social link prediction. In CIKM, 2015.

[131] J. Zhang, X. Pan, M. Li, and P. Yu. Bicycle-sharing system analysis and trip prediction. In MDM, 2016.

[132] J. Zhang, X. Pan, M. Li, and P. Yu. Bicycle-sharing systems expansion: Station re-deployment through crowd planning. In SIGSPATIAL, 2016.

[133] J. Zhang, W. Shao, S. Wang, X. Kong, and P. Yu. Pna: Partial network alignment with generic stable matching. In IRI, 2015.

[134] J. Zhang, S. Wang, Q. Zhan, and P. Yu. Intertwined viral marketing in social networks. In ASONAM, 2016.

[135] J. Zhang, C. Xia, C. Zhang, L. Cui, Y. Fu, and P. Yu. Bl-mne: Emerging heterogeneous social network embedding through broad learning with aligned autoencoder. In ICDM, 2017.

[136] J. Zhang and P. Yu. Link prediction across heterogeneous social networks: A survey. 2014.

[137] J. Zhang and P. Yu. Community detection for emerging networks. In SDM, 2015.

[138] J. Zhang and P. Yu. Integrated anchor and social link predictions across partially aligned social networks. In IJCAI, 2015.

[139] J. Zhang and P. Yu. Mcd: Mutual clustering across multiple social networks. In BigData Congress, 2015.

[140] J. Zhang and P. Yu. Multiple anonymized social networks alignment. In ICDM, 2015.

[141] J. Zhang and P. Yu. Pct: Partial co-alignment of social networks. In WWW, 2016.

[142] J. Zhang, P. Yu, and Y. Lv. Organizational chart inference. In KDD, 2015.

[143] J. Zhang, P. Yu, and Y. Lv. Enterprise community detection. In ICDE, 2017.

[144] J. Zhang, P. Yu, and Y. Lv. Enterprise employee training via project team formation. In WSDM, 2017.

[145] J. Zhang, P. Yu, Y. Lv, and Q. Zhan. Information diffusion at workplace. In CIKM, 2016.

[146] J. Zhang, P. Yu, and Z. Zhou. Meta-path based multi-network collective link prediction. In KDD, 2014.

[147] J. Zhang, Q. Zhan, L. He, C. Aggarwal, and P. Yu. Trust hole identification in signed networks. In ECML/PKDD, 2016.

[148] Y. Zhang and D. Yeung. Overlapping community detection via bounded nonnegative matrix tri-factorization. In KDD, 2012.

[149] Y. Zhao, E. Levina, and J. Zhu. Community extraction for social networks. Proceedings of the National Academy of Sciences, 2011.

[150] T. Zhou, L. Lu, and Y. Zhang. Predicting missing links via local information. The European Physical Journal B, 2009.

[151] J. Zhu, J. Zhang, L. He, Q. Wu, B. Zhou, C. Zhang, and P. Yu. Broad learning based multi-source collaborative recommendation. In CIKM, 2017.