Polyglot Data Management
23/05/2018 - Big Data 2018
Different types of data shapes
Data can have a variety of shapes: strings, cubes, graphs, relational tables, and trees are all examples of different data shapes.
Different types of databases
NoSQL Databases
• "NoSQL" term coined in 2009
• Interpretation: "Not Only SQL"
• Typical properties:
◦ Non-relational
◦ Open-Source
◦ Schema-less (schema-free)
◦ Optimized for distribution (clusters)
◦ Tunable consistency
• NoSQL-Databases.org: current list has over 150 NoSQL systems
NoSQL Databases
• Two main motivations:
◦ Scalability: user-generated data, request load
◦ Impedance mismatch: an order aggregate (ID, Customer, Line Item 1: …, Line Item 2: …, Payment: Credit Card, …) versus the relational tables Orders, Line Items, Customers, Payment
NoSQL Databases
Schema-free Data Modeling
RDBMS (explicit schema): SELECT Name, Age FROM Customers
NoSQL DB (implicit schema): Item[Price] - Item[Discount]
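The contrast between an explicit and an implicit schema can be sketched as follows. This is only a minimal illustration, assuming an in-memory SQLite table for the relational side and plain Python dictionaries for the schema-free side; the sample rows and the helper names `rdbms_names_and_ages` and `net_price` are hypothetical and do not come from the slides.

```python
import sqlite3


def rdbms_names_and_ages():
    """Explicit schema: columns are declared before any row is stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Customers (Name TEXT, Age INTEGER)")
    conn.executemany(
        "INSERT INTO Customers VALUES (?, ?)",
        [("Ada", 36), ("Bob", 41)],
    )
    rows = conn.execute(
        "SELECT Name, Age FROM Customers ORDER BY Name"
    ).fetchall()
    conn.close()
    return rows


def net_price(item):
    """Implicit schema: the application decides which fields it expects.

    Treating a missing Discount as 0 is an assumption of this sketch.
    """
    return item["Price"] - item.get("Discount", 0)


# Two "documents" with different sets of fields.
items = [{"Price": 10.0, "Discount": 2.5}, {"Price": 4.0}]
```

In the relational case the database rejects a row that does not fit the declared columns; in the schema-free case the knowledge about `Price` and `Discount` lives in the application code.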
NoSQL System Classification
• Two common criteria:
◦ Data model: Key-Value, Wide-Column, Document, Graph
◦ Consistency/Availability trade-off:
  AP: Available & Partition Tolerant
  CP: Consistent & Partition Tolerant
  CA: Not Partition Tolerant
Polyglot Data Management
A polyglot data management platform gives users the flexibility to choose the right tool for the job, instead of being forced into a solution that might not be optimal.
The polyglot approach to data management must consider the capabilities associated with data as it flows through the information architecture from Acquisition to the General Core to Access.
Sample Scenario: a set of data containers
[Figure: GUI, Workflow Manager and data containers 1-4; the GUI issues query Q; communication between data containers goes through a queue]
Sample Scenario: technology plethora
RDBMS (PostgreSQL)
Document Store (MongoDB)
Key-Value Store (Redis)
GraphDB (Neo4j)
Sample Scenario: query language plethora
RDBMS (SQL)
Document Store (SQL binding, Java Object)
Key-Value Store (Java Object)
GraphDB (Cypher)
USE CASE
[Figure: the GUI sends query Q to the Workflow Manager, which derives sub-queries Q1, Q2, Q3 and Q4 for data containers 1-4; the containers return result sets R1-R4]
Sample USE CASE
Query Q: select last name of customers having rented the film with title “Titanic”
Language: SQL
SELECT C.last_name
FROM Customer C, Rental R, Inventory I, Film F
WHERE C.customer_id = R.customer_id
  AND R.inventory_id = I.inventory_id
  AND I.film_id = F.film_id
  AND F.title = 'Titanic'
Sample USE CASE
Q1 (Document Store): db.inventory.find( { film: { title: "Titanic" } }, { inventory_id: 1 } )
Q2 (GraphDB, Cypher): MATCH (node1) WHERE node1.inventory_id = ? RETURN node1.customer_id
Q3 (RDBMS, SQL): SELECT Last_Name FROM Customer WHERE customer_id = ?
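The way the sub-queries feed each other can be sketched in a few lines. This is a rough illustration under stated assumptions, not the system's actual Workflow Manager: the three stores are replaced by hypothetical in-memory lists, and the function names (`find_inventory_ids`, `find_customer_ids`, `find_last_names`, `run_query`) and sample data are invented for the example. The document-store lookup yields inventory ids, the graph lookup maps each inventory id to customer ids, and the relational lookup maps each customer id to a last name.

```python
# Hypothetical in-memory stand-ins for the three data containers.
INVENTORY_DOCS = [  # document store
    {"inventory_id": 1, "film": {"title": "Titanic"}},
    {"inventory_id": 2, "film": {"title": "Alien"}},
    {"inventory_id": 3, "film": {"title": "Titanic"}},
]
RENTAL_NODES = [  # graph store
    {"inventory_id": 1, "customer_id": 10},
    {"inventory_id": 2, "customer_id": 11},
    {"inventory_id": 3, "customer_id": 12},
]
CUSTOMER_ROWS = [  # relational store
    {"customer_id": 10, "last_name": "Rossi"},
    {"customer_id": 11, "last_name": "Bianchi"},
    {"customer_id": 12, "last_name": "Verdi"},
]


def find_inventory_ids(title):
    """Document-store step: inventory ids of copies of the given film."""
    return [d["inventory_id"] for d in INVENTORY_DOCS
            if d["film"]["title"] == title]


def find_customer_ids(inventory_id):
    """Graph step: customers linked to one inventory item."""
    return [n["customer_id"] for n in RENTAL_NODES
            if n["inventory_id"] == inventory_id]


def find_last_names(customer_id):
    """Relational step: last name of one customer."""
    return [r["last_name"] for r in CUSTOMER_ROWS
            if r["customer_id"] == customer_id]


def run_query(title):
    """Chain the three steps, feeding each result set into the next step."""
    last_names = []
    for inventory_id in find_inventory_ids(title):
        for customer_id in find_customer_ids(inventory_id):
            last_names.extend(find_last_names(customer_id))
    return last_names
```

Each `?` placeholder in the sub-queries corresponds to a value taken from the previous result set, which is why the steps have to run in this order.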
Micro-Services via Docker Containers
[Figure: architecture diagram as in the USE CASE slide (GUI, Workflow Manager, data containers 1-4)]
JSON Metadata
[Figure: architecture diagram (GUI, Workflow Manager, data containers 1-4)]
SELECT A.phone
FROM Customer C, Address A, Store R
WHERE C.customer_id = R.customer_id
  AND A.address_id = R.address_id
  AND C.first_name = 'Roberto'
Proposals: #1
How to choose a database system? Many potential candidates
problem: polyglot database services lack the capability to automate, optimize, and learn the choice of the best database system among those available
case study: given different data sources, define a framework that detects (semi-automatically) the best-fitting data shape and the corresponding best technology
Sample USE CASE
Proposals: #2
Re-engineering of the existing system
problem: the current system exhibits several bugs caused by incorrect JSON parsing and handling, and it does not use any queue technology
case study: replace the JSON parser with a standard library and introduce RabbitMQ into the system
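The direction of this proposal can be sketched as follows. This is only an illustration under assumptions: Python's standard `json` module plays the role of the "standard library" parser, and the in-process `queue.Queue` is a stand-in for a RabbitMQ queue (no RabbitMQ client is used here). The message layout with `id` and `query` fields and the function names are hypothetical.

```python
import json
import queue


def encode_request(query_id, text):
    """Serialize a sub-query request as a JSON message."""
    return json.dumps({"id": query_id, "query": text})


def decode_request(message):
    """Parse a message with the standard library; return None if malformed."""
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    if "id" not in payload or "query" not in payload:
        return None
    return payload


def drain(pending):
    """Consume every pending message, skipping those that fail to parse."""
    requests = []
    while not pending.empty():
        parsed = decode_request(pending.get())
        if parsed is not None:
            requests.append(parsed)
    return requests
```

A producer would call `pending.put(encode_request("Q1", "..."))` and a consumer would call `drain(pending)`; with a real message broker the producer and consumer would run in separate containers.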
Proposals: #3
Flow: execution plan improvement algorithm
case study: chain query and star query [Figure: example query graphs, including inventory]
Proposals: #4
Query Analysis: extended analysis of query language constructs used in the system and implementation in the query plan
case study:
SQL -> OUTER JOINS, UNION, DIFFERENCE, NESTED QUERY
Cypher -> SUB-QUERY
Document Store -> NESTED COLLECTIONS
Community Profiling in Social Networks
23/05/2018 - Big Data 2018
Community in social media
Community: It is formed by individuals such that those within a group interact with each other more frequently than with those outside the group
Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given
[Figure: benchmark overview with four modules. Setup: Algorithms, Configurations, Datasets, Graph Model. Detection Framework: phases Initialization, Transformation and Construction, with steps PreCompute, FirstAllocate, Select, Fluctuate, Update, MultiLevelDraft, Extract and Collect, plus Loop Conditions. Diagnoses: Diagnosis 1, Diagnosis 2, ... Evaluation: Accuracy, Outliers, Quantity, Efficiency, Effectiveness, Overlapping, Density, Distribution, ...]
Figure 1: Benchmark for community detection
The benchmark comprises four modules: (1) Setup, including a set of algorithms (Sec. 2.2), real-world and synthetic datasets (Sec. 6.1), parameter configurations (Sec. 6.2), and a unified graph model converted from the datasets; (2) Detection Framework, a generalized detection procedure with high abstraction of the common workflow of community detection (the details of the framework are introduced in Sec. 3; the procedure mappings in Sec. 4); (3) Diagnoses, which provide targeted diagnoses on these algorithms based on our framework, leading to directions of improvement over the existing work (Sec. 5); (4) Evaluation, a comprehensive evaluation system for community detection from different aspects (Sec. 6.3–6.11).
The benchmark contains a universal framework which abstracts the key factors, phases and steps from many approaches to community detection tasks, and makes it easy to implement classical or latest algorithms for comparison. Moreover, it consists of a comprehensive suite of widely-recognized metrics for evaluation of various concerned aspects, including the efficiency evaluation on the time cost, performance evaluations on accuracy and effectiveness, sensitivity evaluations on network density and mixture degree, and additional evaluations on community distribution and the ability to avoid excessive outliers. By modularizing and separating key factors and steps, our framework allows us to study the strengths and weaknesses of each algorithm thoroughly, and make diagnoses and targeted prescriptions for improvement. In this benchmark we provide a common code base with algorithms implemented in the same environment, and thus make the comparison more fair and credible.
1.3 Contributions
We have conducted a comprehensive benchmarking study which focuses on the in-depth analysis, evaluation and comparison of the extensive work. To the best of our knowledge, this is the first work on the benchmarking study with a generalized framework on non-overlapping community detection techniques. We make the following main contributions:
• We propose a novel procedure-oriented framework by formulating a generic workflow of community detection via abstracting and modularizing the key factors and steps.
• We review the family of community detection approaches, and re-implement ten state-of-the-art representative algorithms in a common code base (using standard C++) by mapping them to the framework based on their specifics.
• We make in-depth evaluations on these approaches based on our benchmark using both real-world and synthetic datasets.
• We draw a set of interesting take-away conclusions, and provide intuitive and brief ratings on concerned algorithms.
• We also present how to make diagnoses for existing approaches, leading to significant performance improvements.
The remainder of this paper is organized as follows. We formulate the problem of community detection and sketch out existing work in Sec. 2. In Sec. 3 we propose a universal framework for benchmarking in community detection, and then in Sec. 4 we map the existing approaches to the framework. Afterwards we present how to make targeted diagnoses based on the framework in Sec. 5. We evaluate these approaches with our benchmark and report the results and findings in Sec. 6, and conclude this study in Sec. 7.
2. PRELIMINARY AND BACKGROUND
As preliminaries, we first define basic concepts and the problem of community detection, and then review the existing approaches.
2.1 Problem Definition
Social Networks. A social network with $n$ individuals and $m$ social ties can be denoted as $G(V,E)$, where $V$ is the set of nodes, $|V| = n$, and $E$ is the set of undirected relationships, $E \subseteq V \times V$, $|E| = m$. A social network is also referred to herein as a graph.
Communities. Non-overlapping communities are not confined to a graph partition, and clusters which incompletely cover the graph are usually more desirable. Here we define the communities as a list of non-empty node subsets: $Coms = \{V'_1, \cdots, V'_{cn}\}$, where $\bigcup_{i=1}^{cn} V'_i \subseteq V$, and $cn$ is the total number of communities. Please note $Coms$ should try to satisfy $V'_i \cap V'_j = \emptyset$. A community is also referred to as a cluster or a part.
Outliers. Since community detection does not force each node into a certain group, some independent nodes, which cannot be grouped into any communities, are allowed far outside the detected groups [13]. We define them as outliers: $Outs = \{v \mid v \in V, \nexists V'_i \in Coms \wedge v \in V'_i\} = V - \bigcup_{i=1}^{cn} V'_i$. It is worth mentioning that outliers can be directly identified by original algorithms or be produced by disbanding the tiny groups, whose sizes are less than the predefined threshold of minimal valid size (mvs) of communities.
PROBLEM DEFINITION 1. Generally, given a network $G(V,E)$ and an mvs, the community detection problem aims at finding the optimal community assignment $R(Coms, Outs)$ from $G$, s.t. (1) $Coms \cap Outs = \emptyset$ and (2) $Coms \cup Outs = V$. Herein the optimal assignment refers to closely connected groups of nodes ($Coms$) and a moderate number of disparate outliers ($Outs$).
[Figure: a graph with nodes v1-v15 grouped into Community1, Community2 and Community3, plus one outlier]
Figure 2: An example of community detection
An intuitive example of the results of community detection is illustrated in Fig. 2. When mvs = 2 (a general setting which means only singletons will be eliminated), there are three communities: $Coms = \{\{v_1,v_2,v_3,v_4,v_5,v_6\}, \{v_7,v_8,v_9,v_{10}\}, \{v_{11},v_{12},v_{13},v_{14}\}\}$, and one outlier: $Outs = \{v_{15}\}$.
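The two constraints of Problem Definition 1 and the role of mvs can be checked on the Fig. 2 example with a short sketch. The function names `split_outliers` and `is_valid_assignment` are hypothetical, nodes are represented by their integer indices, and the candidate groups are simply assumed as given (no detection algorithm is run here).

```python
def split_outliers(groups, nodes, mvs=2):
    """Disband groups smaller than mvs; uncovered nodes become outliers."""
    coms = [set(g) for g in groups if len(g) >= mvs]
    covered = set().union(*coms)
    outs = set(nodes) - covered
    return coms, outs


def is_valid_assignment(coms, outs, nodes):
    """Check non-empty, pairwise disjoint communities that, together
    with the outliers, cover every node exactly once."""
    covered = set()
    for community in coms:
        if not community or covered & community:
            return False  # empty or overlapping community
        covered |= community
    return not (covered & outs) and (covered | outs) == set(nodes)


# Fig. 2: fifteen nodes, three groups and one singleton.
nodes = range(1, 16)
groups = [range(1, 7), range(7, 11), range(11, 15), [15]]
coms, outs = split_outliers(groups, nodes, mvs=2)
```

With mvs = 2 only the singleton group is disbanded, so node 15 ends up as the single outlier, matching the figure.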
2.2 Detection Algorithms
Community detection has been studied unremittingly all these years, and a particularly large number of effective approaches have been proposed. In this study we focus on the fundamental problem of non-overlapping community detection, which aims at finding the definite group (community) that each node belongs to in the graph. We categorize the existing approaches according to the formation process of communities, as shown in Fig. 3 which covers most representatives from all popular approaches proposed.
The first kind of approaches starts from the original graph and decomposes the entire graph to local parts gradually, trying to separate out communities from the entire graph.
Division algorithms in hierarchy clustering methods, such as Radicchi [23] and Spectral [19], gradually separate the entire network into local parts by the edge clustering coefficient or the eigenvalue of modularity matrix.
Direct partitioning methods separate the entire network into disjoint communities. The Scalable Community Detection (SCD) algorithm [22] partitions the network by maximizing the weighted community clustering [21], a recently proposed metric of community. Maximal k-Mutual-Friends (M-KMF) [29] algorithm incrementally filters out the connections by the number of mutual friends between nodes to let the communities spontaneously emerge.
Conversely, the second kind of approaches takes a bottom-up manner from local structures to the whole graph, and the communities are formed during this process.
Label propagation methods start from local neighborhood to recognize communities automatically. The Label Propagation Algorithm (LPA) [24] adopts an asynchronous update strategy where nodes join in groups under their neighbors' choices. The HANP algorithm [17] based on Hop Attenuation and Node Preference adopts additional rules to ensure more stable and robust results.
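The asynchronous update idea behind label propagation can be sketched as follows. This is a simplified illustration, not the benchmark's C++ implementation: the graph is an adjacency dictionary, each node starts with its own label, and ties are broken at random unless the node's current label is already among the most frequent ones (a tie-breaking rule assumed for this sketch). The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def label_propagation(adj, max_iters=100, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj maps each node to a list of its neighbors.
    Returns a dict mapping each node to its final label.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # every node starts in its own group
    nodes = list(adj)
    for _ in range(max_iters):
        rng.shuffle(nodes)  # visit nodes in a random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[v] in candidates:
                continue  # current label is already a majority choice
            labels[v] = rng.choice(candidates)  # uses already-updated labels
            changed = True
        if not changed:
            break  # no node changed its label in a full pass
    return labels
```

Because each update reads labels that may already have changed in the same pass, the result depends on the visiting order; this sensitivity is the kind of instability that the additional rules mentioned above aim to reduce.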
Leadership expansion methods find communities according to local leader groups, since members always gather together around some core nodes with high centralities to form communities. The TopLeaders [12] algorithm gradually associates nodes to the nearest leaders and locally reelects new leaders during each iteration.
Clique percolation methods assume communities are constructed by multiple adjacent cliques. Based on the original approach [20], the Sequential Clique Percolation (SCP) [14] algorithm sequentially generates cliques to form connected communities.
The third kind of the approaches maintains a tree, which is a multi-level structure reorganized from the original graph, aiming at finding communities corresponding to the branches of the tree.
Agglomeration algorithms in hierarchy clustering methods usually build an explicit hierarchical tree from small clusters to large ones. Based on the Newman Fast Greedy Algorithm (NFGA) [18], Clauset et al. proposed an agglomeration algorithm CNM [5] which starts from single nodes, maintains the change of modularity, and iteratively generates the optimal level of the hierarchy structure.
Matrix Blocking technique can also be utilized in community detection by constructing a hierarchy tree to order nodes in a network. As a representative, the Matrix Blocking Dense Subgraph Extract (MB-DSGE) algorithm [4] reorders the network, and extracts dense subgraphs as communities.
Skeleton clustering methods reveal dense connected clusters based on the skeleton of the original network, which is an efficient way of finding communities. The SCOT+HintClus algorithm [3] detects the hierarchical cluster boundaries of a network to extract the meaningful cluster tree. Inspired by this idea, the Graph Skeleton Clustering (gCluSkeleton) algorithm [11] projects the network to its core-connected maximal spanning tree, and then detects the optimized core-connected clusters on it.
3. FRAMEWORK FOR BENCHMARKING
In this section, we present our procedure-oriented framework for benchmarking in community detection, which consists of two fundamental concepts abstracted from existing detection algorithms and a generalized procedure of community detection.
3.1 Fundamental Concepts
Existing algorithms usually solve the community detection problem with various methods based on different assumptions. This makes it difficult to comparatively analyze these algorithms thoroughly. For the sake of a better understanding of the underlying principles of community detection algorithms, we abstract two fundamental concepts, including the propinquity measure and the revelatory structure, which play critical roles in the community detection task and can be used to distinguish different approaches.
[Figure: three problem-solving perspectives with their categories. Entire Graph to Communities: Direct Partitioning, Hierarchy Clustering (Division). Local Structures to Communities: Label Propagation, Leadership Expansion, Clique Percolation. Constructed Tree to Communities: Hierarchy Clustering (Agglomeration), Matrix Blocking, Skeleton Clustering.]
Figure 3: Categories of community detection approaches
Definition 1. (Propinquity Measure). Given a subset M of the elements (such as nodes, relationships or other specific structures) in a graph G, the propinquity measure of M, denoted as f(M), is the measurement of the nearness of M by the inner-connections, and is the primary criterion to estimate the priority of the elements when they are transformed to make the communities emerge.
Definition 2. (Revelatory Structure). Given a graph G, the revelatory structure corresponding to G, denoted as P, is an assistant structure derived from G, provides yet another way to organize the massive graph elements and enlightens us on the community structure from the intertwined connections among them.
The two concepts lay the basis for different community detection algorithms. The former determines the tendency of grouping nodes to communities, while the latter records the gradual formation of communities and leads to a more effective detection process especially for approaches of the third category.
It should be noted that the specific definitions of the above concepts could be quite different in various approaches and thus we only give a general definition here. The propinquity measure may be the modularity [5], node centrality [12], etc., and the revelatory structure may appear as the hierarchy tree [4, 18], G* graph [14] or other particular structures. We will discuss how the two concepts are defined specifically in different algorithms in Sec. 4.
3.2 The Generalized Procedure
We formulate a generalized procedure of community detection in this framework. As illustrated in the "Detection Framework" module in Fig. 1, the procedure consists of three phases, including initialization, transformation and construction, and characterizes the generic workflow of community detection via a series of the key steps. The details of the procedure are shown in Alg. 1.
Phase 1: Initialization
In this phase, the graph elements need to get their initial propinquity values and form a primary community assignment ($R^0_{tmp}$). As shown in Alg. 1, after the propinquity measure (f) and the revelatory structure (P) are defined (Line 1), the procedure calculates initial propinquity values via PRECOMPUTE (Line 3) and allocates the primary assignment via FIRSTALLOCATE (Line 4).
Phase 2: Transformation
In this phase, the inner structures and relations underlying the network elements are transformed and clarified iteratively, resulting in a set of intermediate detection results $S_{R_{tmp}}$. We abstract three key steps for each iteration, including SELECT, FLUCTUATE and UPDATE (Line 6–8). First of all, in SELECT, the candidate elements CadT are picked out from the graph. After that, in FLUCTUATE, the revelatory structure P correlated with these elements will [...]
Community Profiling
Positive Aspect: community membership helps us better understand the network structure.
Negative Aspect: membership alone, without knowing what a community is and how it interacts with others, has only limited applications.
Community profiling: characterizing the intrinsic nature and extrinsic behavior of a community, thereby enabling useful community-level applications.
Community Profiling
[Figure: (a) Problem input (Sect. 3.1): users with documents, connected by friendship links and diffusion links. (b) Problem output after inference (Sect. 4), obtained by joint profiling and detection (Sect. 3.2): communities c1, c2, c3 with community and topic assignments; for each community, a content profile over topics z1, z2, z3 and a diffusion profile (e.g., c1->c1, c1->c2, c1->c3). (c) Applications (Sect. 6), with application prediction (Sect. 5): community-aware diffusion ("Will retweet?", diffusion on a specific topic, diffusion with topic aggregation), profile-driven community ranking (query q, e.g., "iPhone"; query topics; top K communities to diffuse q; c1 > c2 > c3), and profile-driven community visualization.]
Figure 1: The framework of joint community profiling and detection.
[...] and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).
Our goal is to infer content profile and diffusion profile for each community, and ultimately enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output: a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):
• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to more robustly model the diffusion at a community level, rather than an individual level [8, 19, 21]. E.g., we can explain that a retweet happens because one user's communities often retweet the other's on a certain topic. We acknowledge diffusion as a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: community profiles cannot account for all the diffusions; instead, we have to model different factors, to accurately estimate the profiles and the community-aware diffusion.
• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target communities, which are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target communities, which actively cite papers about its grant theme on "deep learning", so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from the traditional community recommendation, which often relies on only "community-X" properties and is unaware of diffusion [6, 12].
• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which was often overlooked before [7, 20].
We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, thus community profiling is only done once offline; 2) we build an interactive system (see footnote 1) for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.
The difficulty of community profiling is often largely underestimated; as we shall discuss next, there exist many challenges:
• Inter-dependency with community detection. A straightforward approach of community profiling is to first detect communities and then aggregate each community's user observations as the profiles. However, because this approach does not try to "best explain" the user observations as generated by the communities through their profiles, it is often suboptimal. Take content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c's content profile as p(content|c) and the likelihood of u's content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve
$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \qquad (1)$$
where p(c|u) is the probability of user u assigned to community c. Ideally, to optimize Eq. 1, we shall optimize both the profile p(content|c)'s and the community assignment p(c|u)'s. But in the straightforward approach, the detection first fixes the p(c|u)'s, then the best result this aggregation can return is the p(content|c)'s that maximize Eq. 1. It is clear that the maximal likelihood we get with fixed p(c|u)'s is suboptimal, unless the p(c|u)'s are "perfect". However, a perfect detection of p(c|u)'s also needs to maximize the likelihood in Eq. 1, which depends on the profile p(content|c)'s. In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the "weak ties" theory recognizes that the inter-community interactions may not be weak [10]. E.g., software engineering community cites more papers from machine learning community than itself on "deep learning". This means we have to separate the modeling of user connection and user diffusion. Such user link heterogeneity is largely overlooked in the previous work [27, 30], thus how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other [...]
Footnote 1: http://sociallens.adsc.com.sg/
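The coupling expressed by Eq. 1 can be made concrete with a small numeric sketch. It evaluates the logarithm of the product over users of the sum over communities of p(content|c) p(c|u) for two different community assignments. The numbers are invented, the per-user likelihoods under each community are simply assumed as given rather than inferred from a profile, and the function name `log_likelihood` is hypothetical.

```python
import math


def log_likelihood(p_content_given_c, p_c_given_u):
    """Logarithm of prod_u sum_c p(content_u | c) * p(c | u).

    p_content_given_c[u][c]: likelihood of user u's content under community c.
    p_c_given_u[u][c]: probability that user u is assigned to community c.
    """
    total = 0.0
    for u, memberships in p_c_given_u.items():
        mixture = sum(p_content_given_c[u][c] * w
                      for c, w in memberships.items())
        total += math.log(mixture)
    return total


# Assumed likelihoods of each user's content under each community profile.
content = {
    "u1": {"c1": 0.8, "c2": 0.1},
    "u2": {"c1": 0.2, "c2": 0.6},
}
# An assignment fixed in advance that places both users in c2 ...
fixed = {"u1": {"c1": 0.0, "c2": 1.0}, "u2": {"c1": 0.0, "c2": 1.0}}
# ... versus an assignment that follows the content likelihoods.
fitted = {"u1": {"c1": 1.0, "c2": 0.0}, "u2": {"c1": 0.0, "c2": 1.0}}
```

Under these assumed numbers the fitted assignment gives a likelihood of 0.8 × 0.6 and the fixed one 0.1 × 0.6, which illustrates why fixing p(c|u) before estimating the profiles can leave the objective far from its maximum.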
818
c1
c2
c3
Community Profiling
Community c1 Community c2 Community c3
Content profile of c1
Diffusion profile of c1
c1->c1 c1->c2 c1->c3
z1 z2 z3
c1 c2 c3
Diff i fil f
(a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4)
Join
t Pro
filin
g &
Det
ectio
n (S
ect.
3.2)
Friendship link Diffusion link
y 3
C t t fil f
App
licat
ion
Pred
ictio
n (S
ect.
5)
Profile-driven community ranking
Community-aware diffusion
Profile-driven community visualization
Query q (e.g., “iPhone”)
Query topics
Top K communities to diffuse q
Will retweet?
Diffusion with topic
aggregation
(c) Applications (Sect. 6)
P fil d i it i li ti
P fil d i it ki
Diffusion on a specific
topic
c1 > c2 > c3
J
Document
c1
c2
c3
communities: c1, c2 communities: c2, c3
Topic z1 Topic z2 Topic z3
Topic assignment assign
community assignment
Friendship link Diffusion link Document
Figure 1: The framework of joint community profiling and detection.
and the nonconformity of user behaviors, finding a good commu-nity profile is challenging. None of the existing work has ever iden-tified and addressed such challenges (more discussions in Sect. 2).
Our goal is to infer content profile and diffusion profile for eachcommunity, and ultimately enable new applications. In Fig. 1(a),we show the input for community profiling: a set of users, eachof whom publishes documents; users are connected by friendshiplinks, and interact with each other by diffusion links. E.g., in Twit-ter, each user posts tweets, users are connected by followershiplinks, and they retweet each other to diffuse information. In Fig. 1(b),for each community, we output: a content profile (e.g., communityc1 tends to publish topics z1 and z2) and a diffusion profile (e.g.,c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable threenew applications as follows (novelty to be discussed in Sect. 2,applications to be concretized in Sect. 5 and evaluated in Sect. 6):• Community-aware diffusion. As community profiles aggregateuser behaviors, we can use them to more robustly model the dif-fusion at a community level, rather than an individual level [8, 19,21]. E.g., we can explain a retweet happens as one user’s commu-nities often retweet the other’s on a certain topic. We acknowledgediffusion as a complex decision– beyond community profiles, thereare also nonconformity factors such as individual preference andtopic popularity. This partially explains why community profilingis challenging– we cannot account community profiles for all thediffusions; instead, we have to model different factors, to accu-rately estimate the profiles and the community-aware diffusion.• Profile-driven community ranking. We often need to target audi-ences for disseminating information on the networks. E.g., a com-pany wants to target communities, which are most likely to retweetabout its product, so as to launch a campaign. 
A funding agencywants to target communities, which actively cite papers about itsgrant theme on “deep learning”, so as to disseminate the grant call.Since we have known what content each community is interested inand how it diffuses that content with others, we can rank the com-munities. Profile-driven community ranking is different from thetraditional community recommendation, which often relies on only“community-X” properties and is unaware of diffusion [6, 12].• Profile-driven community visualization. Holistic modeling leadsto rich visualization– we can now visualize not only how communi-ties feature distinct contents (e.g., what an IT community tweets),but also how they interact (e.g., how an IT community retweetsothers) which is often overlooked before [7, 20].
We make two remarks about the above applications: 1) we com-plete one task of community profiling to support multiple applica-tions at a time, thus community profiling is only done once offline;
2) we build an interactive system1 for profile-driven community vi-sualization and ranking, which for the first time allows people tofreely browse the communities by both content and diffusion.
The difficulty of community profiling is often largely underesti-mated; as we shall discuss next, there exist many challenges:• Inter-dependency with community detection. A straightforwardapproach of community profiling is to first detect communities andthen aggregate each community’s user observations as the profiles.However, because this approach does not try to “best explain” theuser observations as generated by the communities through theirprofiles, it is often suboptimal. Take content profile as an example.Denote a user as u and a community as c. For simplicity, we denotec’s content profile as p(content|c) and the likelihood of u’s contentas p(content|u). To best explain the user content as generated bythe communities through their content profiles, we effectively solve
$$\max \prod_{u} p(\mathrm{content} \mid u) = \prod_{u} \sum_{c} p(\mathrm{content} \mid c)\, p(c \mid u), \qquad (1)$$
where p(c|u) is the probability of user u being assigned to community c. Ideally, to optimize Eq. 1, we shall optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, the detection first fixes the p(c|u)’s, so the best result the aggregation can return is the p(content|c)’s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood we get with fixed p(c|u)’s is suboptimal, unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that the inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to separate the modeling of user connection and user diffusion. Such user link heterogeneity is largely overlooked in previous work [27, 30], thus how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other
1 http://sociallens.adsc.com.sg/
[Figure 1 panels: (a) Problem input (Sect. 3.1): users with their documents, friendship links, and diffusion links. (b) Problem output after inference (Sect. 4): for communities c1, c2, c3, a content profile over topics z1, z2, z3 and a diffusion profile (c1->c1, c1->c2, c1->c3), obtained by joint profiling and detection (Sect. 3.2). (c) Applications (Sect. 6), via application prediction (Sect. 5): community-aware diffusion (will retweet?), profile-driven community ranking (query q, e.g., “iPhone”; query topics; top K communities to diffuse q, e.g., c1 > c2 > c3), and profile-driven community visualization (diffusion on a specific topic, diffusion with topic aggregation).]
Figure 1: The framework of joint community profiling and detection.
and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).
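To make the inter-dependency around Eq. 1 concrete, here is a minimal Python sketch that evaluates the log of Eq. 1 for two different community assignments, each paired with the profile obtained by aggregation. It is only an illustration under stated assumptions: user content is treated as a bag of words, p(content|c) is a product of per-word probabilities, a small smoothing constant `alpha` avoids zero probabilities, and the word counts, assignments, and function names are all made up for this example.

```python
import numpy as np


def aggregate_profiles(counts, assign, alpha=0.01):
    """Content profile p(word|c): assignment-weighted word counts per community.

    counts: (users, words) word counts; assign: (users, communities) p(c|u).
    alpha is an assumed smoothing constant, not part of Eq. 1.
    """
    weighted = assign.T @ counts + alpha
    return weighted / weighted.sum(axis=1, keepdims=True)


def log_likelihood(counts, profiles, assign):
    """Log of Eq. 1: sum over users of log sum_c p(content_u|c) p(c|u)."""
    log_p_content = counts @ np.log(profiles).T      # (users, communities)
    with np.errstate(divide="ignore"):                # log(0) for excluded communities
        joint = log_p_content + np.log(assign)
    top = joint.max(axis=1, keepdims=True)            # log-sum-exp for stability
    per_user = top + np.log(np.exp(joint - top).sum(axis=1, keepdims=True))
    return float(per_user.sum())


# Toy data: 4 users, 3 words. Users 0-1 and users 2-3 write about different things.
counts = np.array([[5, 0, 0],
                   [4, 1, 0],
                   [0, 0, 5],
                   [0, 1, 4]], dtype=float)

# Assignment fixed by a hypothetical link-only detector that mixes the two groups.
detected = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
# Assignment that agrees with the content.
content_aware = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

ll_fixed = log_likelihood(counts, aggregate_profiles(counts, detected), detected)
ll_aware = log_likelihood(counts, aggregate_profiles(counts, content_aware), content_aware)
print(ll_fixed, ll_aware)
```

On this toy data the content-aware assignment yields the higher likelihood, which is the point of the argument: aggregation can only do as well as the fixed p(c|u)'s allow, so the assignments and the profiles need to be optimized together.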
Our goal is to infer content profile and diffusion profile for eachcommunity, and ultimately enable new applications. In Fig. 1(a),we show the input for community profiling: a set of users, eachof whom publishes documents; users are connected by friendshiplinks, and interact with each other by diffusion links. E.g., in Twit-ter, each user posts tweets, users are connected by followershiplinks, and they retweet each other to diffuse information. In Fig. 1(b),for each community, we output: a content profile (e.g., communityc1 tends to publish topics z1 and z2) and a diffusion profile (e.g.,c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable threenew applications as follows (novelty to be discussed in Sect. 2,applications to be concretized in Sect. 5 and evaluated in Sect. 6):• Community-aware diffusion. As community profiles aggregateuser behaviors, we can use them to more robustly model the dif-fusion at a community level, rather than an individual level [8, 19,21]. E.g., we can explain a retweet happens as one user’s commu-nities often retweet the other’s on a certain topic. We acknowledgediffusion as a complex decision– beyond community profiles, thereare also nonconformity factors such as individual preference andtopic popularity. This partially explains why community profilingis challenging– we cannot account community profiles for all thediffusions; instead, we have to model different factors, to accu-rately estimate the profiles and the community-aware diffusion.• Profile-driven community ranking. We often need to target audi-ences for disseminating information on the networks. E.g., a com-pany wants to target communities, which are most likely to retweetabout its product, so as to launch a campaign. 
A funding agencywants to target communities, which actively cite papers about itsgrant theme on “deep learning”, so as to disseminate the grant call.Since we have known what content each community is interested inand how it diffuses that content with others, we can rank the com-munities. Profile-driven community ranking is different from thetraditional community recommendation, which often relies on only“community-X” properties and is unaware of diffusion [6, 12].• Profile-driven community visualization. Holistic modeling leadsto rich visualization– we can now visualize not only how communi-ties feature distinct contents (e.g., what an IT community tweets),but also how they interact (e.g., how an IT community retweetsothers) which is often overlooked before [7, 20].
We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, so community profiling is done only once, offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.
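The ranking application can be made concrete with a small sketch. The scoring rule used here is an assumption chosen for illustration, not the scoring defined in Sect. 5: a query is represented as hypothetical topic weights, and a community's score is the topic-weighted sum of its diffusion strengths toward all target communities. The names `rank_communities`, `query_topics`, and `diffusion_profile`, and all numbers, are made up for this example.

```python
# Illustrative sketch only: the scoring rule and all values are assumptions.

def rank_communities(query_topics, diffusion_profile, top_k=2):
    """Rank communities by how strongly they diffuse the query's topics.

    query_topics: dict topic -> weight of that topic in the query.
    diffusion_profile: dict community -> topic -> target community -> strength.
    Returns (top_k community names, dict of all scores).
    """
    scores = {}
    for community, by_topic in diffusion_profile.items():
        score = 0.0
        for topic, weight in query_topics.items():
            # Total tendency of `community` to diffuse anyone on `topic`;
            # topics the community never diffuses contribute zero.
            score += weight * sum(by_topic.get(topic, {}).values())
        scores[community] = score
    # Sort by descending score, breaking ties by name for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:top_k], scores


# Hypothetical diffusion profiles for three communities over topics z1..z3.
diffusion_profile = {
    "c1": {"z1": {"c1": 0.5, "c2": 0.3}, "z2": {"c1": 0.1}},
    "c2": {"z1": {"c2": 0.2}, "z2": {"c2": 0.4, "c3": 0.2}},
    "c3": {"z1": {"c1": 0.1}, "z3": {"c3": 0.6}},
}
# Hypothetical topic mixture of a query such as "iPhone".
query_topics = {"z1": 0.75, "z2": 0.25}

top, scores = rank_communities(query_topics, diffusion_profile, top_k=2)
print(top, scores)
```

With these made-up numbers, c1 is ranked ahead of c2, and c3 is cut off by `top_k=2`. Because the profiles are computed once offline, answering a new query only needs this cheap aggregation step.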
The difficulty of community profiling is often largely underestimated; as we discuss next, there are many challenges:
• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community's user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c's content profile as p(content|c) and the likelihood of u's content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve
$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \tag{1}$$
where p(c|u) is the probability that user u is assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, the detection first fixes the p(c|u)'s, so the best result the aggregation can return is the p(content|c)'s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood obtained with fixed p(c|u)'s is suboptimal unless the p(c|u)'s are “perfect”. However, a perfect detection of the p(c|u)'s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [27, 30], so how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other
¹ http://sociallens.adsc.com.sg/
diffusion profile
Figure 1: The framework of joint community profiling and detection. Panels: (a) Problem input (Sect. 3.1): users, documents, friendship links, and diffusion links; (b) Problem output after inference (Sect. 4): communities c1, c2, c3 with, e.g., the content profile of c1 over topics z1, z2, z3 and the diffusion profile of c1 (c1->c1, c1->c2, c1->c3); (c) Applications (Sect. 6): community-aware diffusion, profile-driven community ranking, and profile-driven community visualization. Panel (a) leads to (b) via Joint Profiling & Detection (Sect. 3.2), and (b) leads to (c) via Application Prediction (Sect. 5).
and the nonconformity of user behaviors, finding a good community profile is challenging. No existing work has identified and addressed these challenges (more discussion in Sect. 2).
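The coupling expressed by Eq. 1 can be illustrated with a toy computation. The sketch below rests on strong simplifying assumptions that are not in the paper: each user has a single observed content item, and community assignments are hard (p(c|u) is 0 or 1), in which case the "aggregate after detection" step reduces to taking empirical frequencies within each community. The users, items, and the two candidate assignments are hypothetical.

```python
import math
from collections import Counter, defaultdict


def aggregate_profiles(user_content, assignment):
    """Aggregation step: p(content|c) as empirical frequencies inside c."""
    counts = defaultdict(Counter)
    for user, item in user_content.items():
        counts[assignment[user]][item] += 1
    return {
        c: {item: n / sum(cnt.values()) for item, n in cnt.items()}
        for c, cnt in counts.items()
    }


def hard_membership(assignment):
    """Turn a hard assignment into p(c|u) with all mass on one community."""
    return {user: {c: 1.0} for user, c in assignment.items()}


def log_likelihood(user_content, profiles, membership):
    """Log of Eq. 1: sum over u of log sum over c of p(content_u|c) p(c|u)."""
    total = 0.0
    for user, item in user_content.items():
        p_u = sum(
            p_c * profiles[c].get(item, 0.0)
            for c, p_c in membership[user].items()
        )
        total += math.log(p_u)
    return total


# Hypothetical data: four users, each publishing one item (a topic label).
user_content = {"u1": "z1", "u2": "z1", "u3": "z2", "u4": "z2"}

# Two hypothetical detection results fixed before any profiling happens.
link_based = {"u1": "c1", "u3": "c1", "u2": "c2", "u4": "c2"}
content_aware = {"u1": "c1", "u2": "c1", "u3": "c2", "u4": "c2"}

ll_link = log_likelihood(
    user_content,
    aggregate_profiles(user_content, link_based),
    hard_membership(link_based),
)
ll_content = log_likelihood(
    user_content,
    aggregate_profiles(user_content, content_aware),
    hard_membership(content_aware),
)
print(ll_link, ll_content)
```

Even with the best possible aggregation for each fixed assignment, the first grouping reaches a log-likelihood of 4·log 0.5 (about -2.77), while the second reaches 0. The achievable value of Eq. 1 therefore depends on the assignment chosen upstream, which is the reason for optimizing both jointly.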
content profile
diffusion profile
Community Profiling
content profile
diffusion profile
Our goal is to infer content profile and diffusion profile for eachcommunity, and ultimately enable new applications. In Fig. 1(a),we show the input for community profiling: a set of users, eachof whom publishes documents; users are connected by friendshiplinks, and interact with each other by diffusion links. E.g., in Twit-ter, each user posts tweets, users are connected by followershiplinks, and they retweet each other to diffuse information. In Fig. 1(b),for each community, we output: a content profile (e.g., communityc1 tends to publish topics z1 and z2) and a diffusion profile (e.g.,c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable threenew applications as follows (novelty to be discussed in Sect. 2,applications to be concretized in Sect. 5 and evaluated in Sect. 6):• Community-aware diffusion. As community profiles aggregateuser behaviors, we can use them to more robustly model the dif-fusion at a community level, rather than an individual level [8, 19,21]. E.g., we can explain a retweet happens as one user’s commu-nities often retweet the other’s on a certain topic. We acknowledgediffusion as a complex decision– beyond community profiles, thereare also nonconformity factors such as individual preference andtopic popularity. This partially explains why community profilingis challenging– we cannot account community profiles for all thediffusions; instead, we have to model different factors, to accu-rately estimate the profiles and the community-aware diffusion.• Profile-driven community ranking. We often need to target audi-ences for disseminating information on the networks. E.g., a com-pany wants to target communities, which are most likely to retweetabout its product, so as to launch a campaign. 
A funding agencywants to target communities, which actively cite papers about itsgrant theme on “deep learning”, so as to disseminate the grant call.Since we have known what content each community is interested inand how it diffuses that content with others, we can rank the com-munities. Profile-driven community ranking is different from thetraditional community recommendation, which often relies on only“community-X” properties and is unaware of diffusion [6, 12].• Profile-driven community visualization. Holistic modeling leadsto rich visualization– we can now visualize not only how communi-ties feature distinct contents (e.g., what an IT community tweets),but also how they interact (e.g., how an IT community retweetsothers) which is often overlooked before [7, 20].
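The ranking idea in the second bullet can be illustrated with a small sketch. This is not the ranking formula of Sect. 5, which is not shown here. It assumes a hypothetical score that multiplies, per topic, the query's topic weight, the community's content-profile probability, and a hypothetical per-topic diffusion strength. All names (`rank_communities`, `diffusion_strength`) and all numbers are made up for illustration.

```python
def rank_communities(query_topics, content_profile, diffusion_strength, k=3):
    """Rank communities for a query under an ASSUMED scoring rule.

    score(c) = sum over topics z of
               query_topics[z] * content_profile[c][z] * diffusion_strength[c][z]
    This rule is illustrative only; it is not taken from the paper.
    """
    scores = {}
    for c, topics in content_profile.items():
        scores[c] = sum(
            weight * topics.get(z, 0.0) * diffusion_strength[c].get(z, 0.0)
            for z, weight in query_topics.items()
        )
    # Highest score first; keep the top k communities.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy inputs (invented): a query that is mostly about topic z1.
query_topics = {"z1": 0.8, "z2": 0.2}
content_profile = {
    "c1": {"z1": 0.6, "z2": 0.4},
    "c2": {"z1": 0.1, "z2": 0.2, "z3": 0.7},
    "c3": {"z1": 0.3, "z3": 0.7},
}
diffusion_strength = {
    "c1": {"z1": 0.5, "z2": 0.1},
    "c2": {"z1": 0.9, "z2": 0.3, "z3": 0.2},
    "c3": {"z1": 0.2, "z3": 0.4},
}

top = rank_communities(query_topics, content_profile, diffusion_strength)
print([c for c, _ in top])  # ['c1', 'c2', 'c3']
```

In this toy setting, c2 diffuses z1 strongly but rarely publishes it, so it ranks below c1. A ranking that used only content or only diffusion would miss this interaction.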
We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, so community profiling is done only once, offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.
The difficulty of community profiling is often largely underestimated; as we discuss next, there are many challenges:
• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community’s user observations as its profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c’s content profile as p(content|c) and the likelihood of u’s content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve
$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \qquad (1)$$
where p(c|u) is the probability that user u is assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, detection first fixes the p(c|u)’s, so the best result the aggregation can return is the p(content|c)’s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood obtained with fixed p(c|u)’s is suboptimal unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [27, 30], so how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other
¹ http://sociallens.adsc.com.sg/
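The coupling argument around Eq. 1 can be made concrete with a toy sketch. Here each user's content is a single observed word and community assignments are hard (p(c|u) is 0 or 1). Both are simplifying assumptions made only for this illustration, and all names and data are invented. Under hard assignments, aggregating word frequencies per community is the maximum-likelihood content profile, so the sketch compares the best likelihood reachable under two different fixed assignments.

```python
import math
from collections import Counter, defaultdict


def aggregate_profiles(user_words, assignment):
    """Two-stage 'aggregation': p(word|c) as the word frequency inside c.

    For hard assignments this is the maximum-likelihood profile
    given the fixed communities.
    """
    counts = defaultdict(Counter)
    for user, word in user_words.items():
        counts[assignment[user]][word] += 1
    return {
        c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
        for c, cnt in counts.items()
    }


def log_likelihood(user_words, profiles, membership):
    """Log of Eq. 1: sum_u log sum_c p(content_u|c) * p(c|u)."""
    total = 0.0
    for user, word in user_words.items():
        p_user = sum(
            p_c * profiles[c].get(word, 0.0)
            for c, p_c in membership[user].items()
        )
        total += math.log(p_user)
    return total


def hard_membership(assignment):
    """Turn a hard assignment u -> c into p(c|u) with all mass on c."""
    return {u: {c: 1.0} for u, c in assignment.items()}


# Toy data (invented): one observed word per user.
user_words = {"u1": "ai", "u2": "ai", "u3": "db", "u4": "db"}

# A detection result that ignores content (e.g., from links alone).
detected = {"u1": "c1", "u2": "c2", "u3": "c1", "u4": "c2"}
# An alternative assignment that is consistent with the content.
content_aware = {"u1": "c1", "u2": "c1", "u3": "c2", "u4": "c2"}

ll_fixed = log_likelihood(
    user_words, aggregate_profiles(user_words, detected), hard_membership(detected)
)
ll_alt = log_likelihood(
    user_words,
    aggregate_profiles(user_words, content_aware),
    hard_membership(content_aware),
)
print(ll_fixed < ll_alt)  # True
```

Even with the best possible profiles for the detected communities, the likelihood stays below what a content-consistent assignment achieves. This is the sense in which profiles and assignments must be optimized together.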
content profile
[Figure 1 panels: (a) Problem input (Sect. 3.1): users with documents, friendship links, and diffusion links; joint profiling and detection (Sect. 3.2) leads to (b) Problem output after inference (Sect. 4): topic and community assignments, plus the content profile of c1 over topics z1, z2, z3 and the diffusion profile of c1 (c1->c1, c1->c2, c1->c3); application prediction (Sect. 5) leads to (c) Applications (Sect. 6): community-aware diffusion (“Will retweet?”), profile-driven community ranking (query q, e.g., “iPhone”; top K communities to diffuse q, c1 > c2 > c3), and profile-driven community visualization.]
diffusion profile
content profile
diffusion profile
Community Profile Modeling
Figure 1: The framework of joint community profiling and detection.
and the nonconformity of user behaviors, finding a good commu-nity profile is challenging. None of the existing work has ever iden-tified and addressed such challenges (more discussions in Sect. 2).
Our goal is to infer a content profile and a diffusion profile for each community, and ultimately to enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links and interact with each other through diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):
• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at the community level rather than at the individual level [8, 19, 21]. E.g., we can explain a retweet as happening because one user’s communities often retweet the other’s on a certain topic. We acknowledge that diffusion is a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot attribute all diffusions to community profiles; instead, we have to model the different factors in order to accurately estimate the profiles and the community-aware diffusion.
• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers on its grant theme, “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking differs from traditional community recommendation, which often relies on only “community-X” properties and is unaware of diffusion [6, 12].
• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which was often overlooked before [7, 20].
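To make the ranking application concrete, the following is a minimal sketch of how communities might be ranked for a query once both profiles are known. It is an illustration only, not the formulation given in Sect. 5: the function name `rank_communities`, the array layouts, and the scoring rule (query-topic weight × content interest × total diffusion strength on that topic) are all assumptions made for this example.

```python
from typing import List, Tuple


def rank_communities(
    query_topics: List[float],                   # assumed p(z | q): one weight per topic
    content_profile: List[List[float]],          # content_profile[c][z]: how much c publishes topic z
    diffusion_profile: List[List[List[float]]],  # diffusion_profile[c][c2][z]: how strongly c diffuses c2 on z
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return the top_k (community index, score) pairs, best first.

    Hypothetical scoring rule: for each topic of the query, multiply the
    query's weight on the topic, the community's interest in the topic, and
    the community's total diffusion strength on the topic, then sum over topics.
    """
    scores = []
    for c, (content, diffusion) in enumerate(zip(content_profile, diffusion_profile)):
        score = 0.0
        for z, weight in enumerate(query_topics):
            # Aggregate c's diffusion on topic z over all source communities.
            diffused = sum(row[z] for row in diffusion)
            score += weight * content[z] * diffused
        scores.append((c, score))
    # Highest score first; keep only the top_k communities.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy data (made up): 3 communities, 3 topics, and a query entirely about topic 0.
example_query = [1.0, 0.0, 0.0]
example_content = [
    [0.6, 0.4, 0.0],
    [0.3, 0.2, 0.5],
    [0.0, 0.1, 0.9],
]
example_diffusion = [
    [[0.5, 0.1, 0.0], [0.4, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.1, 0.0, 0.0], [0.1, 0.2, 0.3], [0.0, 0.0, 0.3]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.2], [0.0, 0.1, 0.7]],
]
print(rank_communities(example_query, example_content, example_diffusion, top_k=2))
```

In this toy setting the first community both publishes the query topic most and diffuses it most, so it ranks first. A ranking based on content alone could not distinguish two communities that are equally interested in a topic but differ in how actively they spread it.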
We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, thus community profiling is only done once offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.
The difficulty of community profiling is often largely underestimated; as we shall discuss next, there exist many challenges:
• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community’s user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c’s content profile as p(content|c) and the likelihood of u’s content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve
$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \qquad (1)$$
where p(c|u) is the probability that user u is assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, the detection step first fixes the p(c|u)’s, so the best result the aggregation can return is the p(content|c)’s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood we get with fixed p(c|u)’s is suboptimal unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to separate the modeling of user connection and user diffusion. Such user link heterogeneity is largely overlooked in previous work [27, 30], thus how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other
¹ http://sociallens.adsc.com.sg/
[Figure 1, panels: (a) Problem input (Sect. 3.1): users, documents, friendship links, and diffusion links; (b) Problem output after inference (Sect. 4), obtained by joint profiling & detection (Sect. 3.2): content profile and diffusion profile of each community; (c) Applications (Sect. 6), via application prediction (Sect. 5): community-aware diffusion, profile-driven community ranking, and profile-driven community visualization.]
Figure 1: The framework of joint community profiling and detection.
and the nonconformity of user behaviors, finding a good commu-nity profile is challenging. None of the existing work has ever iden-tified and addressed such challenges (more discussions in Sect. 2).
Our goal is to infer a content profile and a diffusion profile for each community, and ultimately to enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output: a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):
• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at a community level, rather than an individual level [8, 19, 21]. E.g., we can explain a retweet by observing that one user’s communities often retweet the other’s on a certain topic. We acknowledge diffusion as a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot attribute all diffusions to community profiles; instead, we have to model the different factors in order to accurately estimate the profiles and the community-aware diffusion.
• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers about its grant theme on “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from traditional community recommendation, which often relies only on “community-X” properties and is unaware of diffusion [6, 12].
• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which has often been overlooked [7, 20].
We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, thus community profiling is only done once offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.
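To make the ranking application more concrete, here is a minimal Python sketch. It is not the ranking function of the paper (which is deferred to its Sect. 5); the scoring rule, the variable names `theta` and `eta`, and all numbers are assumptions chosen only for illustration. It scores each community by how much it publishes on the query topics and how strongly it diffuses on them.

```python
# Hypothetical inputs: 3 communities, 3 topics (z1, z2, z3).
# query[z]        : assumed topic distribution of the query q
# theta[c][z]     : assumed content profile, how much community c publishes on topic z
# eta[c][c2][z]   : assumed diffusion profile, how strongly c diffuses c2's content on z
query = [0.8, 0.2, 0.0]

theta = {
    "c1": [0.6, 0.3, 0.1],
    "c2": [0.2, 0.5, 0.3],
    "c3": [0.1, 0.1, 0.8],
}

eta = {
    "c1": {"c1": [0.5, 0.2, 0.1], "c2": [0.4, 0.1, 0.1], "c3": [0.0, 0.1, 0.1]},
    "c2": {"c1": [0.2, 0.3, 0.1], "c2": [0.1, 0.4, 0.2], "c3": [0.1, 0.1, 0.2]},
    "c3": {"c1": [0.1, 0.0, 0.1], "c2": [0.0, 0.1, 0.3], "c3": [0.1, 0.1, 0.6]},
}


def score(community):
    """Illustrative score: sum over topics of query weight * content interest
    * total diffusion strength of the community on that topic."""
    total = 0.0
    for z, weight in enumerate(query):
        diffusion = sum(source[z] for source in eta[community].values())
        total += weight * theta[community][z] * diffusion
    return total


def top_k(k):
    """Return the k communities with the highest illustrative score."""
    return sorted(theta, key=score, reverse=True)[:k]


ranking = top_k(3)
print(ranking)
```

Because the profiles are computed once offline, a query only needs this cheap scoring pass; any other scoring rule over the same two profiles could be swapped in.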
The difficulty of community profiling is often largely underestimated; as we shall discuss next, there exist many challenges:
• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community’s user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c’s content profile as p(content|c) and the likelihood of u’s content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve
$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \tag{1}$$
where p(c|u) is the probability of user u being assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, the detection first fixes the p(c|u)’s, so the best result the aggregation can return is the p(content|c)’s that maximize Eq. 1 for those fixed assignments. Clearly, the maximal likelihood we get with fixed p(c|u)’s is suboptimal, unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that the inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to separate the modeling of user connection and user diffusion. Such user link heterogeneity is largely overlooked in previous work [27, 30], thus how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other
¹ http://sociallens.adsc.com.sg/
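The coupling argument around Eq. 1 can be checked numerically. The sketch below is only an illustration under stated assumptions, not the inference algorithm of the paper: content is modeled as a bag of words with a multinomial p(word|c) per community, the "detected" assignments `pi_detected` and the toy counts are invented, and plain EM is used as the optimizer. It compares the two-stage approach (p(c|u) fixed, only profiles fitted) with joint fitting of both.

```python
import math


def user_likelihoods(counts_u, phi):
    """p(content|c) of one user's word counts, for every community c."""
    return [math.prod(p ** n for p, n in zip(phi_c, counts_u)) for phi_c in phi]


def log_likelihood(counts, pi, phi):
    """Logarithm of the objective in Eq. 1: sum_u log sum_c p(content_u|c) p(c|u)."""
    total = 0.0
    for counts_u, pi_u in zip(counts, pi):
        liks = user_likelihoods(counts_u, phi)
        total += math.log(sum(p * l for p, l in zip(pi_u, liks)))
    return total


def em(counts, pi, phi, update_pi, iters=50):
    """EM for Eq. 1. With update_pi=False the assignments p(c|u) stay fixed
    (two-stage approach); with update_pi=True they are re-estimated too."""
    n_comm, n_words = len(phi), len(phi[0])
    pi = [row[:] for row in pi]
    phi = [row[:] for row in phi]
    for _ in range(iters):
        # E-step: responsibility of each community for each user's content.
        resp = []
        for counts_u, pi_u in zip(counts, pi):
            joint = [p * l for p, l in zip(pi_u, user_likelihoods(counts_u, phi))]
            norm = sum(joint)
            resp.append([j / norm for j in joint])
        # M-step for the content profiles p(word|c).
        for c in range(n_comm):
            weighted = [sum(r[c] * cu[w] for r, cu in zip(resp, counts))
                        for w in range(n_words)]
            norm = sum(weighted)
            if norm > 0:
                phi[c] = [x / norm for x in weighted]
        # M-step for the assignments p(c|u), only in the joint setting.
        if update_pi:
            pi = resp
    return pi, phi


# Toy data (assumed): word counts of 4 users over a 3-word vocabulary.
counts = [[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]]
# Assumed output of a separate detection step; user 1 is placed in the "wrong" community.
pi_detected = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
phi_init = [[1 / 3] * 3, [1 / 3] * 3]

# Two-stage: keep p(c|u) fixed and fit only the profiles.
pi_fixed, phi_two_stage = em(counts, pi_detected, phi_init, update_pi=False)
two_stage_ll = log_likelihood(counts, pi_fixed, phi_two_stage)

# Joint: continue from the two-stage solution, now also updating p(c|u).
pi_joint, phi_joint = em(counts, pi_fixed, phi_two_stage, update_pi=True)
joint_ll = log_likelihood(counts, pi_joint, phi_joint)

print("two-stage log-likelihood:", two_stage_ll)
print("joint log-likelihood:    ", joint_ll)
```

Since EM never decreases the likelihood, starting the joint run from the two-stage solution guarantees that the joint value is at least as high; any gap is the price of fixing p(c|u) in advance.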
Case study: TWITTER (15,000 users)
[Figure 1 panels: (a) Problem input (Sect. 3.1): users, documents, friendship links and diffusion links; (b) Problem output after inference (Sect. 4): content profile and diffusion profile of each community; (c) Applications (Sect. 6): community-aware diffusion, profile-driven community ranking, profile-driven community visualization]
Case study: TWITTER (15,000 users)
userID: virgafox, userID: ciketto92
[Figure 1, repeated]
Case study: TWITTER (15,000 users)
userID: virgafox
[Figure 1, repeated]
Proposals: #1
Community Evolution Prediction: given a set of candidate communities and the corresponding community profiles, define a methodology for predicting the evolution of the actual communities (and of future communities)
Proposals: #2
Profile Modeling: represent a user profile in terms of a knowledge graph referring to a different ontology
case study:
web ontology: Schema.org
social network: Twitter (15000 userIDs)
DBLP: computer science bibliography
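As a rough illustration of Proposal #2, the sketch below represents one user profile as a small knowledge graph using Schema.org terms. The user IDs, topics and the choice of the `Person`, `identifier`, `knowsAbout` and `follows` terms are assumptions made for the example; the proposal itself does not fix a mapping.

```python
import json

SCHEMA = "https://schema.org/"


def profile_to_jsonld(user_id, topics, follows):
    """Build a JSON-LD style document for one (hypothetical) user profile."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "identifier": user_id,
        "knowsAbout": list(topics),
        "follows": [{"@type": "Person", "identifier": f} for f in follows],
    }


def to_triples(doc):
    """Flatten the document into (subject, predicate, object) edges of a graph."""
    subject = doc["identifier"]
    triples = [(subject, "rdf:type", SCHEMA + doc["@type"])]
    for topic in doc["knowsAbout"]:
        triples.append((subject, SCHEMA + "knowsAbout", topic))
    for other in doc["follows"]:
        triples.append((subject, SCHEMA + "follows", other["identifier"]))
    return triples


# Hypothetical user and data, for illustration only.
profile = profile_to_jsonld("user_a", ["deep learning", "databases"], ["user_b"])
triples = to_triples(profile)

print(json.dumps(profile, indent=2))
for triple in triples:
    print(triple)
```

The same triples could be produced from another vocabulary by changing only the predicate names, which is what makes an ontology-based profile portable across sources such as Twitter and DBLP.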