Polyglot Data Management - Roma Tre University · torlone/bigdata/A3-DeVirgilio.pdf

Polyglot Data Management 23/05/2018 - Big Data 2018


Different types of data shapes

Data can have a variety of shapes. String cubes, graphs, relational tables, and trees are all examples of different data shapes


Different types of databases

NoSQL Databases

• "NoSQL" term coined in 2009
• Interpretation: "Not Only SQL"
• Typical properties:
◦ Non-relational
◦ Open-Source
◦ Schema-less (schema-free)
◦ Optimized for distribution (clusters)
◦ Tunable consistency

NoSQL-Databases.org: Current list has over 150 NoSQL systems

NoSQL Databases

• Two main motivations:
◦ Scalability: user-generated data, request load
◦ Impedance mismatch: [Diagram: an order object (ID, Customer, Line Item 1: …, Line Item 2: …, Payment: Credit Card, …) set against the relational tables Orders, Line Items, Customers, Payment]

NoSQL Databases: Schema-free Data Modeling

RDBMS (explicit schema): SELECT Name, Age FROM Customers

NoSQL DB (implicit schema): Item[Price] - Item[Discount]
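A minimal sketch of what an implicit schema means in practice, assuming hypothetical item records held as plain Python dictionaries (the field names and values are illustrative and do not come from the slides). No table definition is enforced by the store, so the expression Item[Price] - Item[Discount] has to decide for itself what a missing field means.

```python
# Hypothetical records: a schema-free store lets each item carry different fields.
items = [
    {"name": "keyboard", "price": 50.0, "discount": 5.0},
    {"name": "mouse", "price": 20.0},  # no "discount" field was stored
]


def net_price(item):
    """Item[Price] - Item[Discount]; a missing discount is assumed to be 0."""
    return item["price"] - item.get("discount", 0.0)


net_prices = [net_price(item) for item in items]
print(net_prices)
```

The schema lives in the application code (here, in `net_price`) rather than in the database, which is the trade-off the slide contrasts with the explicit schema of an RDBMS.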

NoSQL System Classification

• Two common criteria:
◦ Data Model: Key-Value, Wide-Column, Document, Graph
◦ Consistency/Availability Trade-Off:
  AP: Available & Partition Tolerant
  CP: Consistent & Partition Tolerant
  CA: Not Partition Tolerant


Polyglot Data Management

A polyglot data management platform gives users the flexibility to choose the right tool for the job, instead of being forced into a solution that might not be optimal.

The polyglot approach to data management must consider the capabilities associated with data as it flows through the information architecture from Acquisition to the General Core to Access.


Sample Scenario: a set of data containers

Communication between data containers

[Diagram: GUI, Query Q, Workflow Manager, Queue, and data containers 1–4]

Sample Scenario: technology plethora

RDBMS (PostgreSQL)

Document Store (MongoDB)

Key-Value Store (Redis)

GraphDB (Neo4j)


Sample Scenario: query language plethora

RDBMS (SQL)

Document Store (SQL binding, Java Object)

Key-Value Store (Java Object)

GraphDB (Cypher)

USE CASE

[Diagram: the GUI sends query Q to the Workflow Manager, which issues queries Q1, Q2, Q3, Q4 to data containers 1–4 and receives result sets R1, R2, R3, R4]


Sample USE CASE

Query Q: select last name of customers having rented the film with title “Titanic”

Language: SQL

SELECT C.Last_Name
FROM Customer C, Rental R, Inventory I, Film F
WHERE C.customer_ID = R.customer_ID
  AND R.inventory_ID = I.inventory_id
  AND I.film_id = F.film_id
  AND F.title = 'Titanic'
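A minimal sketch of query Q running against a single relational store, using Python's built-in `sqlite3` module in place of PostgreSQL. The table layouts and sample rows are assumptions made for illustration; the slides only give the table and column names used in the query.

```python
import sqlite3

# In-memory database with a hypothetical, minimal version of the four tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer  (customer_ID INTEGER PRIMARY KEY, Last_Name TEXT);
    CREATE TABLE Film      (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE Rental    (customer_ID INTEGER, inventory_ID INTEGER);
    INSERT INTO Customer  VALUES (1, 'Rossi'), (2, 'Bianchi');
    INSERT INTO Film      VALUES (10, 'Titanic'), (11, 'Alien');
    INSERT INTO Inventory VALUES (100, 10), (101, 11);
    INSERT INTO Rental    VALUES (1, 100), (2, 101);
""")

# Query Q from the slide, with the film title passed as a bound parameter.
query = """
    SELECT C.Last_Name
    FROM Customer C, Rental R, Inventory I, Film F
    WHERE C.customer_ID = R.customer_ID
      AND R.inventory_ID = I.inventory_id
      AND I.film_id = F.film_id
      AND F.title = ?
"""
last_names = [row[0] for row in conn.execute(query, ("Titanic",))]
print(last_names)
```

Passing the title as a parameter instead of embedding it in the SQL string avoids quoting problems such as the typographic quotes that PDF export tends to introduce.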


Sample USE CASE

Q1 (Document Store): db.inventory.find( { film: { title: "Titanic" } }, { inventory_id: 1 } )

Q2 (GraphDB, Cypher): MATCH (node1) WHERE node1.inventory_id = ? RETURN node1.customer_id

Q3 (RDBMS, SQL): SELECT Last_Name FROM Customer WHERE customer_id = ?
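The decomposition can be read as a pipeline in which each sub-query's result feeds the `?` parameter of the next one. The sketch below illustrates that chaining with in-memory stand-ins for the three stores; it does not talk to MongoDB, Neo4j or PostgreSQL, and the function names, the execution order (document store, then graph, then relational) and the sample data are all assumptions made for illustration.

```python
# In-memory stand-ins for the three stores (all sample data is hypothetical).
inventory_docs = [  # document store collection "inventory"
    {"inventory_id": 100, "film": {"title": "Titanic"}},
    {"inventory_id": 101, "film": {"title": "Alien"}},
]
rental_nodes = [  # graph store nodes carrying inventory_id and customer_id
    {"inventory_id": 100, "customer_id": 1},
    {"inventory_id": 101, "customer_id": 2},
]
customer_rows = {1: "Rossi", 2: "Bianchi"}  # relational table: customer_id -> Last_Name


def q1(title):
    """Document-store step: inventory ids of copies of the given film."""
    return [d["inventory_id"] for d in inventory_docs if d["film"]["title"] == title]


def q2(inventory_ids):
    """Graph step: customer ids attached to the given inventory ids."""
    wanted = set(inventory_ids)
    return [n["customer_id"] for n in rental_nodes if n["inventory_id"] in wanted]


def q3(customer_ids):
    """Relational step: last names of the given customers."""
    return [customer_rows[c] for c in customer_ids if c in customer_rows]


def answer(title):
    """Workflow-manager role: run the sub-queries in order and pass results along."""
    return q3(q2(q1(title)))


print(answer("Titanic"))
```

In a real deployment each function would issue the corresponding native query to its container; the point here is only the hand-off of intermediate result sets between heterogeneous stores.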

Micro-Services via Docker Containers

[Diagram repeated from the USE CASE slide: GUI, Workflow Manager, data containers 1–4, queries Q and Q1–Q4, result sets R1–R4]

JSON Metadata

[Diagram repeated from the USE CASE slide: GUI, Workflow Manager, data containers 1–4, queries Q and Q1–Q4, result sets R1–R4]

SELECT A.phone
FROM Customer C, Address A, Store R
WHERE C.customer_ID = R.customer_ID
  AND A.address_ID = R.address_ID
  AND C.first_name = 'Roberto'


Proposals: #1

How to choose a database system? Many potential candidates

problem: Polyglot database services cannot yet automate, optimize and learn the choice of the best database system among those available

case study: given different data sources, define a framework to detect (semi-automatically) the best-fitting shape and the corresponding best technology


Proposals: #2

Re-engineering of the existing system

problem: the current system has several bugs caused by incorrect JSON parsing and handling, and it does not use any queue technology

case study: replace the JSON parser with a standard library and introduce RabbitMQ into the system
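A minimal sketch of the two ideas in this proposal, under stated assumptions: messages are serialized and parsed with Python's standard `json` module rather than a hand-written parser, and an in-process `queue.Queue` stands in for a message broker. This is not RabbitMQ and does not use any RabbitMQ client API; the message fields (`id`, `query`) and the function names are hypothetical.

```python
import json
import queue

# In-process stand-in for a broker queue between the workflow manager and a container.
message_queue = queue.Queue()


def publish(query_id, query_text):
    """Serialize a sub-query as JSON and enqueue it."""
    message_queue.put(json.dumps({"id": query_id, "query": query_text}))


def consume():
    """Dequeue one message; return the parsed dict, or None if it is not valid JSON."""
    raw = message_queue.get()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


publish("Q1", 'db.inventory.find({film: {title: "Titanic"}})')
message = consume()

message_queue.put("{not valid json")  # a malformed payload is rejected, not half-parsed
bad = consume()
print(message, bad)
```

Letting the standard library handle escaping (note the nested double quotes in the query text) removes one class of parsing bugs, and the queue decouples the producer from the consumer.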


Proposals: #3

Flow: execution plan improvement algorithm

case study: chain query, star query [diagram label: inventory]


Proposals: #4

Query Analysis: extended analysis of query language constructs used in the system and implementation in the query plan

case study:

SQL -> OUTER JOINS, UNION, DIFFERENCE, NESTED QUERY

Cypher -> SUB-QUERY

Document Store -> NESTED COLLECTIONS


Community Profiling in Social Networks

23/05/2018 - Big Data 2018


Community in social media

Community: It is formed by individuals such that those within a group interact with each other more frequently than with those outside the group

Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given

[Figure 1: Benchmark for community detection. Modules shown: Setup (Algorithms, Configurations, Datasets, Graph Model); Detection Framework, with phases Initialization, Transformation and Construction and steps PreCompute, FirstAllocate, Select, Fluctuate, Update, MultiLevelDraft, Extract, Collect; Loop Conditions; Diagnoses (Diagnosis 1, Diagnosis 2, ...); Evaluation (Accuracy, Outliers, Quantity, Efficiency, Effectiveness, Overlapping, Density, Distribution, ...)]

modules: (1) Setup, including a set of algorithms (Sec. 2.2), real-world and synthetic datasets (Sec. 6.1), parameter configurations (Sec. 6.2), and a unified graph model converted from the datasets; (2) Detection Framework, a generalized detection procedure with high abstraction of the common workflow of community detection (the details of the framework are introduced in Sec. 3; the procedure mappings in Sec. 4); (3) Diagnoses, which provide targeted diagnoses on these algorithms based on our framework, leading to directions of improvement over the existing work (Sec. 5); (4) Evaluation, a comprehensive evaluation system for community detection from different aspects (Sec. 6.3–6.11).

The benchmark contains a universal framework which abstracts the key factors, phases and steps from many approaches to community detection tasks, and makes it easy to implement classical or latest algorithms for comparison. Moreover, it consists of a comprehensive suite of widely-recognized metrics for evaluation of various concerned aspects, including the efficiency evaluation on the time cost, performance evaluations on accuracy and effectiveness, sensitivity evaluations on network density and mixture degree, and additional evaluations on community distribution and the ability to avoid excessive outliers. By modularizing and separating key factors and steps, our framework allows us to study the strength and weakness of each algorithm thoroughly, and make diagnoses and targeted prescriptions for improvement. In this benchmark we provide a common code base with algorithms implemented in the same environment, and thus make the comparison more fair and credible.

1.3 Contributions

We have conducted a comprehensive benchmarking study which focuses on the in-depth analysis, evaluation and comparison of the extensive work. To the best of our knowledge, this is the first work on the benchmarking study with a generalized framework on non-overlapping community detection techniques. We make the following main contributions:

• We propose a novel procedure-oriented framework by formulating a generic workflow of community detection via abstracting and modularizing the key factors and steps.

• We review the family of community detection approaches, and re-implement ten state-of-the-art representative algorithms in a common code base (using standard C++) by mapping them to the framework based on their specifics.

• We make in-depth evaluations on these approaches based on our benchmark using both real-world and synthetic datasets.

• We draw a set of interesting take-away conclusions, and provide intuitive and brief ratings on concerned algorithms.

• We also present how to make diagnoses for existing approaches, leading to significant performance improvements.

The remainder of this paper is organized as follows. We formulate the problem of community detection and sketch out existing work in Sec. 2. In Sec. 3 we propose a universal framework for benchmarking in community detection, and then in Sec. 4 we map the existing approaches to the framework. Afterwards we present how to make targeted diagnoses based on the framework in Sec. 5. We evaluate these approaches with our benchmark and report the results and findings in Sec. 6, and conclude this study in Sec. 7.

2. PRELIMINARY AND BACKGROUND

As preliminaries, we first define basic concepts and the problem of community detection, and then review the existing approaches.

2.1 Problem Definition

Social Networks. A social network with n individuals and m social ties can be denoted as G(V,E), where V is the set of nodes, |V| = n, and E is the set of undirected relationships, $E \subseteq V \times V$, |E| = m. A social network is also referred to herein as a graph.

Communities. Non-overlapping communities are not confined to a graph partition, and clusters which incompletely cover the graph are usually more desirable. Here we define the communities as a list of non-empty node subsets: $Coms = \{V'_1, \cdots, V'_{cn}\}$, where $\bigcup_{i=1}^{cn} V'_i \subseteq V$, and cn is the total number of communities. Please note Coms should try to satisfy $V'_i \cap V'_j = \emptyset$. A community is also referred to as a cluster or a part.

Outliers. Since community detection does not force each node into a certain group, some independent nodes, which cannot be grouped into any communities, are allowed far outside the detected groups [13]. We define them as outliers: $Outs = \{v \mid v \in V, \nexists V'_i \in Coms \wedge v \in V'_i\} = V - \bigcup_{i=1}^{cn} V'_i$. It is worth mentioning that outliers can be directly identified by original algorithms or be produced by disbanding the tiny groups, whose sizes are less than the predefined threshold of minimal valid size (mvs) of communities.

PROBLEM DEFINITION 1. Generally, given a network G(V,E), and an mvs, the community detection problem aims at finding the optimal community assignment R(Coms,Outs) from G, s.t. (1) $Coms \cap Outs = \emptyset$ and (2) $Coms \cup Outs = V$. Herein the optimal assignment refers to closely connected groups of nodes (Coms) and a moderate number of disparate outliers (Outs).

[Figure 2: An example of community detection. Nodes v1–v15 are grouped into Community 1, Community 2 and Community 3, with one node marked as an outlier.]

An intuitive example of the results of community detection is illustrated in Fig. 2. When mvs = 2 (a general setting which means only singletons will be eliminated), there are three communities: Coms = {{v1,v2,v3,v4,v5,v6}, {v7,v8,v9,v10}, {v11,v12,v13,v14}}, and one outlier: Outs = {v15}.
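The relation between detected groups, the minimal valid size (mvs) and the outlier set can be sketched in a few lines of Python. This is an illustrative helper, not code from the benchmark: the function name `assign` and the idea of passing the raw groups in as a list of sets are assumptions. Groups smaller than mvs are disbanded, and every node not covered by a remaining community becomes an outlier, which mirrors Outs = V minus the union of the communities.

```python
def assign(nodes, groups, mvs):
    """Split detected groups into communities (size >= mvs) and outliers."""
    coms = [set(g) for g in groups if len(g) >= mvs]  # tiny groups are disbanded
    covered = set().union(*coms) if coms else set()
    outs = set(nodes) - covered  # Outs = V minus the union of all communities
    return coms, outs


# The example of Fig. 2: fifteen nodes, three groups and one singleton.
nodes = {f"v{i}" for i in range(1, 16)}
groups = [
    {"v1", "v2", "v3", "v4", "v5", "v6"},
    {"v7", "v8", "v9", "v10"},
    {"v11", "v12", "v13", "v14"},
    {"v15"},
]
coms, outs = assign(nodes, groups, mvs=2)
print(len(coms), outs)
```

With mvs = 2 only the singleton {v15} is disbanded; a larger threshold such as mvs = 5 would also turn the two four-node groups into outliers.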

2.2 Detection Algorithms

Community detection has been studied unremittingly all these years, and a particularly large number of effective approaches have been proposed. In this study we focus on the fundamental problem of non-overlapping community detection, which aims at finding the definite group (community) that each node belongs to in the graph. We categorize the existing approaches according to the formation process of communities, as shown in Fig. 3 which covers most representatives from all popular approaches proposed.

Community in social media

The first kind of approaches starts from the original graph and decomposes the entire graph to local parts gradually, trying to separate out communities from the entire graph.

Division algorithms in hierarchy clustering methods, such as Radicchi [23] and Spectral [19], gradually separate the entire network into local parts by the edge clustering coefficient or the eigenvalue of modularity matrix.

Direct partitioning methods separate the entire network into disjoint communities. The Scalable Community Detection (SCD) algorithm [22] partitions the network by maximizing the weighted community clustering [21], a recently proposed metric of community. Maximal k-Mutual-Friends (M-KMF) [29] algorithm incrementally filters out the connections by the number of mutual friends between nodes to let the communities spontaneously emerge.

Conversely, the second kind of approaches takes a bottom-up manner from local structures to the whole graph, and the communities are formed during this process.

Label propagation methods start from local neighborhood to recognize communities automatically. The Label Propagation Algorithm (LPA) [24] adopts an asynchronous update strategy where nodes join in groups under their neighbors' choices. The HANP algorithm [17] based on Hop Attenuation and Node Preference adopts additional rules to ensure more stable and robust results.

Leadership expansion methods find communities according to local leader groups, since members always gather together around some core nodes with high centralities to form communities. The TopLeaders [12] algorithm gradually associates nodes to the nearest leaders and locally reelects new leaders during each iteration.

Clique percolation methods assume communities are constructed by multiple adjacent cliques. Based on the original approach [20], the Sequential Clique Percolation (SCP) [14] algorithm sequentially generates cliques to form connected communities.

The third kind of the approaches maintains a tree, which is a multi-level structure reorganized from the original graph, aiming at finding communities corresponding to the branches of the tree.

Agglomeration algorithms in hierarchy clustering methods usually build an explicit hierarchical tree from small clusters to large ones. Based on the Newman Fast Greedy Algorithm (NFGA) [18], Clauset et al. proposed an agglomeration algorithm CNM [5] which starts from single nodes, maintains the change of modularity, and iteratively generates the optimal level of the hierarchy structure.

Matrix Blocking technique can also be utilized in community detection by constructing a hierarchy tree to order nodes in a network. As a representative, the Matrix Blocking Dense Subgraph Extract (MB-DSGE) algorithm [4] reorders the network, and extracts dense subgraphs as communities.

Skeleton clustering methods reveal dense connected clusters based on the skeleton of the original network, which is an efficient way of finding communities. The SCOT+HintClus algorithm [3] detects the hierarchical cluster boundaries of a network to extract the meaningful cluster tree. Inspired by this idea, the Graph Skeleton Clustering (gCluSkeleton) algorithm [11] projects the network to its core-connected maximal spanning tree, and then detects the optimized core-connected clusters on it.
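Of the families listed above, label propagation is the simplest to illustrate. The sketch below is a simplified asynchronous variant written for this note, not the re-implementation used in the benchmark: every node starts with its own label and, in random order, adopts a label held by the largest number of its neighbours. Keeping the current label whenever it is among the most frequent ones, the fixed random seed and the sweep limit are assumptions made to keep the example small and reproducible.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_sweeps=100):
    """Simplified asynchronous label propagation on an undirected graph."""
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # every node starts alone
    nodes = list(adjacency)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # asynchronous updates in random order
        changed = False
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[n] for n in neighbours)
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)  # random tie-break
                changed = True
        if not changed:
            break  # every node already agrees with a majority of its neighbours
    return labels


# Two disconnected triangles: each one should end up with a single label.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5, 6], 5: [4, 6], 6: [4, 5]}
labels = label_propagation(graph)
print(labels)
```

Nodes that share a label at the end form one community. On graphs with weaker structure the result can depend on the update order, which is one reason variants such as HANP add further rules for stability.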

3. FRAMEWORK FOR BENCHMARKING

In this section, we present our procedure-oriented framework for benchmarking in community detection, which consists of two fundamental concepts abstracted from existing detection algorithms and a generalized procedure of community detection.

3.1 Fundamental Concepts

Existing algorithms usually solve the community detection problem with various methods based on different assumptions. This makes it difficult to comparatively analyze these algorithms thoroughly. For the sake of a better understanding of the underlying principles of community detection algorithms, we abstract two fundamental concepts, including the propinquity measure and the revelatory structure, which play critical roles in the community detection task and can be used to distinguish different approaches.

[Figure 3: Categories of community detection approaches. Problem-solving perspectives (sketches) and their categories: Entire Graph -> Communities (Direct Partitioning; Hierarchy Clustering (Division)); Local Structures -> Communities (Label Propagation; Leadership Expansion; Clique Percolation); Constructed Tree -> Communities (Hierarchy Clustering (Agglomeration); Matrix Blocking; Skeleton Clustering)]

Definition 1. (Propinquity Measure). Given a subset M of the elements (such as nodes, relationships or other specific structures) in a graph G, the propinquity measure of M, denoted as f(M), is the measurement of the nearness of M by the inner-connections, and is the primary criterion to estimate the priority of the elements when they are transformed to make the communities emerge.

Definition 2. (Revelatory Structure). Given a graph G, the revelatory structure corresponding to G, denoted as P, is an assistant structure derived from G, provides yet another way to organize the massive graph elements and enlightens us on the community structure from the intertwined connections among them.

The two concepts lay the basis for different community detection algorithms. The former determines the tendency of grouping nodes to communities, while the latter records the gradual formation of communities and leads to a more effective detection process especially for approaches of the third category.

It should be noted that the specific definitions of the above concepts could be quite different in various approaches and thus we only give a general definition here. The propinquity measure may be the modularity [5], node centrality [12], etc., and the revelatory structure may appear as the hierarchy tree [4, 18], G* graph [14] or other particular structures. We will discuss how the two concepts are defined specifically in different algorithms in Sec. 4.

3.2 The Generalized Procedure

We formulate a generalized procedure of community detection in this framework. As illustrated in the "Detection Framework" module in Fig. 1, the procedure consists of three phases, including initialization, transformation and construction, and characterizes the generic workflow of community detection via a series of the key steps. The details of the procedure are shown in Alg. 1.

Phase 1: Initialization

In this phase, the graph elements need to get their initial propinquity values and form a primary community assignment ($R^0_{tmp}$). As shown in Alg. 1, after the propinquity measure (f) and the revelatory structure (P) are defined (Line 1), the procedure calculates initial propinquity values via PRECOMPUTE (Line 3) and allocates the primary assignment via FIRSTALLOCATE (Line 4).

Phase 2: Transformation

In this phase, the inner structures and relations underlying the network elements are transformed and clarified iteratively, resulting in a set of intermediate detection results $S_{R_{tmp}}$. We abstract three key steps for each iteration, including SELECT, FLUCTUATE and UPDATE (Line 6–8). First of all, in SELECT, the candidate elements $Cad_T$ are picked out from the graph. After that, in FLUCTUATE, the revelatory structure P correlated with these elements will


Community Profiling

Positive Aspect: community membership helps us better understand the network structure.

Negative Aspect: membership alone, without knowing what a community is and how it interacts with others, has only limited applications.

Community profiling: to characterize the intrinsic nature and extrinsic behavior of a community – thereby enabling useful community-level applications.


Community Profiling

[Figure 1: The framework of joint community profiling and detection. Panels: (a) Problem input (Sect. 3.1): users with documents, friendship links and diffusion links; Joint Profiling & Detection (Sect. 3.2); (b) Problem output after inference (Sect. 4): communities c1, c2, c3 with topic and community assignments, the content profile of c1 over topics z1, z2, z3, and the diffusion profile of c1 (c1->c1, c1->c2, c1->c3); Application Prediction (Sect. 5); (c) Applications (Sect. 6): community-aware diffusion ("Will retweet?"), profile-driven community ranking (query q, e.g., "iPhone"; top K communities to diffuse q; c1 > c2 > c3), and profile-driven community visualization (diffusion with topic aggregation, diffusion on a specific topic)]

and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).

Our goal is to infer content profile and diffusion profile for each community, and ultimately enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output: a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):

• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to more robustly model the diffusion at a community level, rather than an individual level [8, 19, 21]. E.g., we can explain a retweet happens as one user's communities often retweet the other's on a certain topic. We acknowledge diffusion as a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot account community profiles for all the diffusions; instead, we have to model different factors, to accurately estimate the profiles and the community-aware diffusion.

• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target communities, which are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target communities, which actively cite papers about its grant theme on "deep learning", so as to disseminate the grant call. Since we have known what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from the traditional community recommendation, which often relies on only "community-X" properties and is unaware of diffusion [6, 12].

• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others) which is often overlooked before [7, 20].

We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, thus community profiling is only done once offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.

The difficulty of community profiling is often largely underesti-mated; as we shall discuss next, there exist many challenges:• Inter-dependency with community detection. A straightforwardapproach of community profiling is to first detect communities andthen aggregate each community’s user observations as the profiles.However, because this approach does not try to “best explain” theuser observations as generated by the communities through theirprofiles, it is often suboptimal. Take content profile as an example.Denote a user as u and a community as c. For simplicity, we denotec’s content profile as p(content|c) and the likelihood of u’s contentas p(content|u). To best explain the user content as generated bythe communities through their content profiles, we effectively solve

$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \tag{1}$$

where p(c|u) is the probability that user u is assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, detection first fixes the p(c|u)’s, so the best result the aggregation can return is the p(content|c)’s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood obtained with fixed p(c|u)’s is suboptimal unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.

• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections, whereas diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [27, 30], so how to model heterogeneous user links together remains unclear.

• Nonconformity of user behaviors. User behaviors, especially diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other

¹ http://sociallens.adsc.com.sg/
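The coupling argued around Eq. 1 can be checked numerically on a toy case. The sketch below is not the authors' inference procedure: it only evaluates the logarithm of Eq. 1 under the simplifying assumption that each user has a single observed content item, and it compares a fixed assignment followed by aggregation against a different assignment. The users, communities, topics, and function names are all hypothetical.

```python
import math
from collections import defaultdict


def aggregate_profiles(observed, assignment):
    """Build p(content|c) by aggregating user content, weighted by p(c|u)."""
    counts = defaultdict(lambda: defaultdict(float))
    for user, content in observed.items():
        for community, weight in assignment[user].items():
            counts[community][content] += weight
    profiles = {}
    for community, per_content in counts.items():
        total = sum(per_content.values())
        if total > 0:
            profiles[community] = {z: v / total for z, v in per_content.items()}
    return profiles


def log_likelihood(observed, profiles, assignment):
    """Log of Eq. 1: sum over users of log sum_c p(content|c) p(c|u)."""
    total = 0.0
    for user, content in observed.items():
        p_user = sum(
            profiles.get(community, {}).get(content, 0.0) * weight
            for community, weight in assignment[user].items()
        )
        # Assumes p_user > 0, which holds for the toy data below.
        total += math.log(p_user)
    return total


# Toy data: one observed topic per user (an assumption of this sketch).
observed = {"u1": "z1", "u2": "z1", "u3": "z2", "u4": "z2"}

# Assignment fixed by a detector that ignores content (hypothetical).
detected = {"u1": {"c1": 1.0}, "u2": {"c2": 1.0},
            "u3": {"c1": 1.0}, "u4": {"c2": 1.0}}
# An assignment that also accounts for content (hypothetical).
content_aware = {"u1": {"c1": 1.0}, "u2": {"c1": 1.0},
                 "u3": {"c2": 1.0}, "u4": {"c2": 1.0}}

two_stage = log_likelihood(observed, aggregate_profiles(observed, detected), detected)
joint = log_likelihood(observed, aggregate_profiles(observed, content_aware), content_aware)
print(two_stage, joint)
```

On this data, aggregation under the fixed assignment cannot raise any user's likelihood above 0.5, while changing the assignment together with the profiles explains every observation with probability 1. This is the sense in which profiles and detection are coupled.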



Community Profiling

Figure 1: The framework of joint community profiling and detection. Panels: (a) problem input (Sect. 3.1): users, documents, friendship links, and diffusion links; joint profiling and detection (Sect. 3.2); (b) problem output after inference (Sect. 4): the content profile and diffusion profile of each community; application prediction (Sect. 5); (c) applications (Sect. 6): community-aware diffusion, profile-driven community ranking, and profile-driven community visualization.

and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).


user profile


Community Profiling


content profile


diffusion profile


Our goal is to infer a content profile and a diffusion profile for each community, and ultimately to enable new applications. Fig. 1(a) shows the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links and interact with each other through diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. Fig. 1(b) shows the output for each community: a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). Fig. 1(c) shows three new applications (novelty discussed in Sect. 2, applications concretized in Sect. 5 and evaluated in Sect. 6):

• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at the community level, rather than at the individual level [8, 19, 21]. E.g., we can explain a retweet by the fact that one user’s communities often retweet the other’s on a certain topic. We acknowledge that diffusion is a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot attribute all diffusions to community profiles; instead, we have to model the different factors in order to accurately estimate the profiles and the community-aware diffusion.

• Profile-driven community ranking. We often need to target audiences for disseminating information on a network. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers on its grant theme, “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking differs from traditional community recommendation, which often relies only on “community-X” properties and is unaware of diffusion [6, 12].

• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which has often been overlooked [7, 20].
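The ranking idea above can be illustrated with a minimal sketch. It assumes (this is not taken from the source, which defers the actual method to Sect. 5) that a query is first mapped to a topic distribution and that each community is then scored by its topic-weighted outgoing diffusion strength. The function name `rank_communities`, the dictionary layout, and all numbers are hypothetical.

```python
def rank_communities(query_topics, diffusion, top_k=2):
    """Rank communities by how strongly they diffuse the query's topics.

    query_topics: {topic: p(topic | query)}            (assumed given)
    diffusion:    {(src, dst, topic): strength}, i.e. how much community
                  `src` diffuses content of community `dst` on `topic`.
    Scoring rule (an illustrative assumption, not the paper's formula):
        score(src) = sum over (dst, topic) of p(topic | query) * strength
    """
    scores = {}
    for (src, _dst, topic), strength in diffusion.items():
        weight = query_topics.get(topic, 0.0)
        scores[src] = scores.get(src, 0.0) + weight * strength
    # Highest score first; keep only the top_k communities.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy profiles: c1 diffuses heavily on z1, which dominates the query.
query_topics = {"z1": 0.8, "z2": 0.2}
diffusion = {
    ("c1", "c1", "z1"): 0.5,
    ("c1", "c2", "z1"): 0.3,
    ("c2", "c1", "z1"): 0.1,
    ("c2", "c3", "z2"): 0.4,
    ("c3", "c3", "z2"): 0.2,
}
ranking = rank_communities(query_topics, diffusion, top_k=3)
print(ranking)
```

Because the profiles are computed once offline, a ranking of this kind only needs a lookup and a weighted sum per query.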

We make two remarks about the above applications: 1) we complete a single community profiling task to support multiple applications at a time, so community profiling is done only once, offline; 2) we build an interactive system (see footnote 1) for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.

The difficulty of community profiling is often largely underestimated; as we discuss next, there are several challenges:

• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community’s user observations as its profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c’s content profile as p(content|c) and the likelihood of u’s content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve

$$\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \tag{1}$$

where p(c|u) is the probability that user u is assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, detection first fixes the p(c|u)’s, so the best result the aggregation can return is the set of p(content|c)’s that maximize Eq. 1 under that fixed assignment. Clearly, the maximal likelihood obtained with fixed p(c|u)’s is suboptimal unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.

• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [27, 30], so how to model heterogeneous user links together remains unclear.

• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other

Footnote 1: http://sociallens.adsc.com.sg/
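The coupling argument around Eq. 1 can be checked on a toy example. The sketch below is a simplification under stated assumptions, not the paper's model: each user's content is a single item index, and community assignments are hard (p(c|u) is 0 or 1), so aggregating counts per community gives the maximum-likelihood profile for that fixed assignment. The function names, the smoothing constant `eps`, and the data are hypothetical.

```python
import math


def aggregate_profiles(obs, hard_assign, n_comm, n_items, eps=1e-9):
    """Two-stage step: estimate p(item | c) by counting items per community.

    eps is a tiny smoothing constant (an assumption) to avoid log(0).
    """
    counts = [[eps] * n_items for _ in range(n_comm)]
    for item, c in zip(obs, hard_assign):
        counts[c][item] += 1.0
    return [[v / sum(row) for v in row] for row in counts]


def log_likelihood(obs, hard_assign, profiles):
    """Log of Eq. 1: sum_u log sum_c p(content_u | c) p(c | u)."""
    total = 0.0
    for item, c_u in zip(obs, hard_assign):
        # With hard assignments, p(c | u) is 1 for c == c_u and 0 otherwise.
        p_u = sum(
            profiles[c][item] * (1.0 if c == c_u else 0.0)
            for c in range(len(profiles))
        )
        total += math.log(p_u)
    return total


# Four users; users 0 and 1 publish item 0, users 2 and 3 publish item 1.
obs = [0, 0, 1, 1]

# Assignment from a detector that ignored content (e.g., links only).
detected = [0, 1, 0, 1]
# Assignment that also explains the content.
aligned = [0, 0, 1, 1]

ll_two_stage = log_likelihood(obs, detected, aggregate_profiles(obs, detected, 2, 2))
ll_aligned = log_likelihood(obs, aligned, aggregate_profiles(obs, aligned, 2, 2))
print(ll_two_stage, ll_aligned)
```

Even with the best profiles for the `detected` assignment, every user is explained with probability 0.5, whereas the `aligned` assignment reaches a likelihood close to 1. The profiles alone cannot repair an assignment that was fixed without looking at the content, which is the sense in which the two are coupled.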

[Figure 1 panels. (a) Problem input (Sect. 3.1): users and their documents, connected by friendship links and diffusion links. Joint Profiling & Detection (Sect. 3.2) leads to (b) Problem output after inference (Sect. 4): communities c1, c2, c3, with the content profile of c1 over topics z1, z2, z3 and the diffusion profile of c1 over c1->c1, c1->c2, c1->c3. Application Prediction (Sect. 5) leads to (c) Applications (Sect. 6): community-aware diffusion, profile-driven community ranking, and profile-driven community visualization. Other labels in panel (c): query q (e.g., “iPhone”), query topics, top K communities to diffuse q, “Will retweet?”, diffusion with topic aggregation, diffusion on a specific topic, c1 > c2 > c3.]



Community Profiling


Figure 1: The framework of joint community profiling and detection.

and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).

Our goal is to infer a content profile and a diffusion profile for each community, and ultimately enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):
• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at a community level, rather than at an individual level [8, 19, 21]. E.g., we can explain that a retweet happens because one user’s communities often retweet the other’s on a certain topic. We acknowledge diffusion as a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot attribute all diffusions to community profiles; instead, we have to model the different factors to accurately estimate the profiles and the community-aware diffusion.
• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers about its grant theme on “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from traditional community recommendation, which often relies on only “community-X” properties and is unaware of diffusion [6, 12].
• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which was often overlooked before [7, 20].
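As a rough illustration of profile-driven community ranking, the sketch below scores each community for a query and returns the top K. It is not the ranking defined in Sect. 5: the profile values are invented, and the score (query-topic weight, times the community's content interest in that topic, times its total tendency to diffuse on that topic) is an assumed stand-in. The names `content_profile`, `diffusion_profile`, `query_topics`, and `rank_communities` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy profiles; every number here is made up for illustration.
# content_profile[c][z]: how strongly community c publishes on topic z.
content_profile: Dict[str, Dict[str, float]] = {
    "c1": {"z1": 0.7, "z2": 0.3},
    "c2": {"z1": 0.4, "z2": 0.6},
    "c3": {"z1": 0.1, "z2": 0.9},
}

# diffusion_profile[c][src][z]: how strongly c diffuses content from src on z.
diffusion_profile: Dict[str, Dict[str, Dict[str, float]]] = {
    "c1": {"c1": {"z1": 0.5, "z2": 0.1}, "c2": {"z1": 0.3, "z2": 0.1}, "c3": {"z1": 0.0, "z2": 0.0}},
    "c2": {"c1": {"z1": 0.2, "z2": 0.1}, "c2": {"z1": 0.2, "z2": 0.3}, "c3": {"z1": 0.0, "z2": 0.2}},
    "c3": {"c1": {"z1": 0.1, "z2": 0.0}, "c2": {"z1": 0.0, "z2": 0.2}, "c3": {"z1": 0.1, "z2": 0.6}},
}

# Topic distribution of the query q (e.g., "iPhone"), assumed to be given.
query_topics: Dict[str, float] = {"z1": 0.9, "z2": 0.1}


def rank_communities(
    query: Dict[str, float],
    content: Dict[str, Dict[str, float]],
    diffusion: Dict[str, Dict[str, Dict[str, float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the top-k (community, score) pairs under the assumed score."""
    scores: Dict[str, float] = {}
    for c, topics in content.items():
        score = 0.0
        for z, weight in query.items():
            # Aggregate c's diffusion tendency on topic z over all sources.
            diffuse_z = sum(src.get(z, 0.0) for src in diffusion[c].values())
            score += weight * topics.get(z, 0.0) * diffuse_z
        scores[c] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


ranking = rank_communities(query_topics, content_profile, diffusion_profile, k=3)
for community, score in ranking:
    print(f"{community}: {score:.3f}")  # order with this toy data: c1, c2, c3
```

Because the profiles are computed once offline, a query only requires this cheap aggregation step; a different scoring rule can be substituted without re-running the profiling.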

We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, thus community profiling is only done once offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.

The difficulty of community profiling is often largely underestimated; as we shall discuss next, there exist many challenges:
• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community’s user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c’s content profile as p(content|c) and the likelihood of u’s content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve

$$\max \prod_{u} p(\mathrm{content} \mid u) = \prod_{u} \sum_{c} p(\mathrm{content} \mid c)\, p(c \mid u), \qquad (1)$$

where p(c|u) is the probability that user u is assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, the detection first fixes the p(c|u)’s; the best result the aggregation can then return is the p(content|c)’s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood we get with fixed p(c|u)’s is suboptimal, unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections, whereas diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers on “deep learning” from the machine learning community than from itself. This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [27, 30], so how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other
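The coupling argument around Eq. 1 can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions, not the inference procedure of Sect. 4: content is modeled as a bag of words with a per-community unigram distribution, the data and the "detected" assignments are invented, and a plain EM update stands in for joint optimization. All names (`user_words`, `detected`, `aggregate_profiles`, `em_step`, ...) are hypothetical.

```python
import math

# Hypothetical word counts per user (toy data).
user_words = {
    "u1": {"ml": 4, "db": 0},
    "u2": {"ml": 3, "db": 1},
    "u3": {"ml": 0, "db": 4},
    "u4": {"ml": 1, "db": 3},
}
communities = ["c1", "c2"]
vocab = ["ml", "db"]

# p(c|u) from an assumed stand-alone detector; u2 is deliberately mis-assigned.
detected = {
    "u1": {"c1": 0.9, "c2": 0.1},
    "u2": {"c1": 0.1, "c2": 0.9},
    "u3": {"c1": 0.1, "c2": 0.9},
    "u4": {"c1": 0.1, "c2": 0.9},
}


def aggregate_profiles(assign):
    """Aggregation step: p(w|c) proportional to sum_u p(c|u) * count(u, w)."""
    profiles = {}
    for c in communities:
        weights = {w: sum(assign[u][c] * user_words[u][w] for u in user_words) for w in vocab}
        total = sum(weights.values())
        profiles[c] = {w: weights[w] / total for w in vocab}
    return profiles


def content_likelihood(u, c, profiles):
    """p(content_u | c) under the assumed unigram model (constant factor dropped)."""
    return math.prod(profiles[c][w] ** n for w, n in user_words[u].items())


def log_likelihood(assign, profiles):
    """Log of the objective in Eq. 1: sum_u log sum_c p(content_u|c) p(c|u)."""
    return sum(
        math.log(sum(content_likelihood(u, c, profiles) * assign[u][c] for c in communities))
        for u in user_words
    )


def em_step(assign, profiles):
    """One joint update: re-estimate p(c|u), then re-aggregate p(w|c)."""
    new_assign = {}
    for u in user_words:
        joint = {c: content_likelihood(u, c, profiles) * assign[u][c] for c in communities}
        norm = sum(joint.values())
        new_assign[u] = {c: joint[c] / norm for c in communities}
    return new_assign, aggregate_profiles(new_assign)


# Two-step approach: keep the detector's p(c|u) fixed and only aggregate.
two_step_profiles = aggregate_profiles(detected)
ll_two_step = log_likelihood(detected, two_step_profiles)

# Joint approach: start from the two-step result and update both quantities.
assign, profiles = detected, two_step_profiles
for _ in range(10):
    assign, profiles = em_step(assign, profiles)
ll_joint = log_likelihood(assign, profiles)

print(f"two-step log-likelihood: {ll_two_step:.4f}")
print(f"joint    log-likelihood: {ll_joint:.4f}")
```

Each EM iteration cannot decrease the objective, so the joint result is at least as good as the two-step one; with this toy data it is strictly better, because the fixed assignments are not a stationary point of the update.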





content profile

Community c1 Community c2 Community c3

Content profile of c1

Diffusion profile of c1

c1->c1 c1->c2 c1->c3

z1 z2 z3

c1 c2 c3

Diff i fil f

(a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4)

Join

t Pro

filin

g &

Det

ectio

n (S

ect.

3.2)

Friendship link Diffusion link

y 3

C t t fil f

App

licat

ion

Pred

ictio

n (S

ect.

5)

Profile-driven community ranking

Community-aware diffusion

Profile-driven community visualization

Query q (e.g., “iPhone”)

Query topics

Top K communities to diffuse q

Will retweet?

Diffusion with topic

aggregation

(c) Applications (Sect. 6)

P fil d i it i li ti

P fil d i it ki

Diffusion on a specific

topic

c1 > c2 > c3

J

Document

c1

c2

c3

communities: c1, c2 communities: c2, c3

Topic z1 Topic z2 Topic z3

Topic assignment assign

community assignment

Friendship link Diffusion link Document

Figure 1: The framework of joint community profiling and detection.

and the nonconformity of user behaviors, finding a good commu-nity profile is challenging. None of the existing work has ever iden-tified and addressed such challenges (more discussions in Sect. 2).

Our goal is to infer content profile and diffusion profile for eachcommunity, and ultimately enable new applications. In Fig. 1(a),we show the input for community profiling: a set of users, eachof whom publishes documents; users are connected by friendshiplinks, and interact with each other by diffusion links. E.g., in Twit-ter, each user posts tweets, users are connected by followershiplinks, and they retweet each other to diffuse information. In Fig. 1(b),for each community, we output: a content profile (e.g., communityc1 tends to publish topics z1 and z2) and a diffusion profile (e.g.,c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable threenew applications as follows (novelty to be discussed in Sect. 2,applications to be concretized in Sect. 5 and evaluated in Sect. 6):• Community-aware diffusion. As community profiles aggregateuser behaviors, we can use them to more robustly model the dif-fusion at a community level, rather than an individual level [8, 19,21]. E.g., we can explain a retweet happens as one user’s commu-nities often retweet the other’s on a certain topic. We acknowledgediffusion as a complex decision– beyond community profiles, thereare also nonconformity factors such as individual preference andtopic popularity. This partially explains why community profilingis challenging– we cannot account community profiles for all thediffusions; instead, we have to model different factors, to accu-rately estimate the profiles and the community-aware diffusion.• Profile-driven community ranking. We often need to target audi-ences for disseminating information on the networks. E.g., a com-pany wants to target communities, which are most likely to retweetabout its product, so as to launch a campaign. 
A funding agencywants to target communities, which actively cite papers about itsgrant theme on “deep learning”, so as to disseminate the grant call.Since we have known what content each community is interested inand how it diffuses that content with others, we can rank the com-munities. Profile-driven community ranking is different from thetraditional community recommendation, which often relies on only“community-X” properties and is unaware of diffusion [6, 12].• Profile-driven community visualization. Holistic modeling leadsto rich visualization– we can now visualize not only how communi-ties feature distinct contents (e.g., what an IT community tweets),but also how they interact (e.g., how an IT community retweetsothers) which is often overlooked before [7, 20].

We make two remarks about the above applications: 1) we com-plete one task of community profiling to support multiple applica-tions at a time, thus community profiling is only done once offline;

2) we build an interactive system1 for profile-driven community vi-sualization and ranking, which for the first time allows people tofreely browse the communities by both content and diffusion.

The difficulty of community profiling is often largely underesti-mated; as we shall discuss next, there exist many challenges:• Inter-dependency with community detection. A straightforwardapproach of community profiling is to first detect communities andthen aggregate each community’s user observations as the profiles.However, because this approach does not try to “best explain” theuser observations as generated by the communities through theirprofiles, it is often suboptimal. Take content profile as an example.Denote a user as u and a community as c. For simplicity, we denotec’s content profile as p(content|c) and the likelihood of u’s contentas p(content|u). To best explain the user content as generated bythe communities through their content profiles, we effectively solve

max!

u p(content|u) =!

u

"c p(content|c)p(c|u), (1)

where p(c|u) is the probability of user u assigned to communityc. Ideally, to optimize Eq. 1, we shall optimize both the profilep(content|c)’s and the community assignment p(c|u)’s. But in thestraightforward approach, the detection first fixes the p(c|u)’s, thenthe best result this aggregation can return is the p(content|c)’s thatmaximize Eq. 1. It is clear that, the maximal likelihood we getwith fixed p(c|u)’s is suboptimal, unless the p(c|u)’s are “perfect”.However, a perfect detection of p(c|u)’s also needs to maximizethe likelihood in Eq. 1, which depends on the profile p(content|c)’s.In all, content profiles and community detection are coupled. Wefurther show in our technical report [3] that diffusion profiles andcommunity detection are also coupled.• Heterogeneity of social observations. Social observations, espe-cially the user links (i.e., friendship links and diffusion links), oftencarry different semantics; e.g., friendship links indicate user con-nections and diffusion links indicate user interactions. Tradition-ally, we often try to enforce user connections to be denser withineach community than across communities [14, 17]. But in diffu-sion, the “weak ties” theory recognizes that the inter-communityinteractions may not be weak [10]. E.g., software engineering com-munity cites more papers from machine learning community thanitself on “deep learning”. This means we have to separate the mod-eling of user connection and user diffusion. Such user link hetero-geneity is largely overlooked in the previous work [27, 30], thushow to model heterogeneous user links together remains unclear.• Nonconformity of user behaviors. User behaviors, especiallytheir diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other

¹ http://sociallens.adsc.com.sg/


diffusion profile

[Figure 1 panels. (a) Problem input (Sect. 3.1): users with documents, friendship links, and diffusion links. (b) Problem output after inference (Sect. 4), obtained by joint profiling & detection (Sect. 3.2): communities c1, c2, c3, with a content profile over topics z1, z2, z3 and a diffusion profile (c1->c1, c1->c2, c1->c3) shown for c1. (c) Applications (Sect. 6), obtained by application prediction (Sect. 5): community-aware diffusion (“Will retweet?”), profile-driven community ranking (query q, e.g., “iPhone”; top K communities to diffuse q, e.g., c1 > c2 > c3), and profile-driven community visualization.]

Figure 1: The framework of joint community profiling and detection.

and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).
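The coupling expressed by Eq. 1 can be illustrated with a small numerical sketch. It is not the authors' inference algorithm: it assumes that p(content|u) factors over individual tokens (a PLSA-style simplification), uses a made-up four-user corpus with a made-up "detected" soft assignment, and uses plain EM updates. All names (`counts`, `detected`, `em`, `log_likelihood`) are hypothetical. The pipeline variant keeps p(c|u) fixed and only fits the profiles; the joint variant continues from that result and updates both.

```python
import math

def log_likelihood(counts, p_c_u, p_w_c):
    """Log of Eq. 1, assuming p(content|u) factors over tokens."""
    ll = 0.0
    for u, words in counts.items():
        for w, n in words.items():
            mix = sum(p_w_c[c][w] * p_c_u[u][c] for c in range(len(p_w_c)))
            ll += n * math.log(mix)
    return ll

def em(counts, p_c_u, p_w_c, iters, update_assignments):
    """EM for the token-level mixture; optionally keeps p(c|u) fixed."""
    vocab = sorted({w for words in counts.values() for w in words})
    k = len(p_w_c)
    p_c_u = {u: list(p) for u, p in p_c_u.items()}
    p_w_c = [dict(p) for p in p_w_c]
    for _ in range(iters):
        word_mass = [{w: 0.0 for w in vocab} for _ in range(k)]
        user_mass = {u: [0.0] * k for u in counts}
        # E-step: share each token count among communities.
        for u, words in counts.items():
            for w, n in words.items():
                joint = [p_w_c[c][w] * p_c_u[u][c] for c in range(k)]
                total = sum(joint)
                for c in range(k):
                    r = n * joint[c] / total
                    word_mass[c][w] += r
                    user_mass[u][c] += r
        # M-step: profiles are always re-estimated.
        for c in range(k):
            z = sum(word_mass[c].values())
            p_w_c[c] = {w: m / z for w, m in word_mass[c].items()}
        # M-step: assignments are re-estimated only in the joint setting.
        if update_assignments:
            for u in counts:
                z = sum(user_mass[u])
                p_c_u[u] = [m / z for m in user_mass[u]]
    return p_c_u, p_w_c

# Toy corpus (hypothetical): token counts per user.
counts = {
    "u1": {"db": 2, "sql": 1},
    "u2": {"db": 1, "sql": 2},
    "u3": {"ml": 2, "gpu": 1},
    "u4": {"ml": 1, "gpu": 2},
}
# Imperfect soft output of a separate detection step (hypothetical).
detected = {"u1": [0.7, 0.3], "u2": [0.6, 0.4],
            "u3": [0.3, 0.7], "u4": [0.4, 0.6]}
uniform = [{w: 0.25 for w in ("db", "sql", "ml", "gpu")} for _ in range(2)]

# Pipeline: detect first, then fit profiles with p(c|u) held fixed.
_, pipeline_profiles = em(counts, detected, uniform, 50, False)
pipeline_ll = log_likelihood(counts, detected, pipeline_profiles)

# Joint: continue from the pipeline result, updating both factors.
joint_assign, joint_profiles = em(counts, detected, pipeline_profiles, 50, True)
joint_ll = log_likelihood(counts, joint_assign, joint_profiles)

print("pipeline log-likelihood:", pipeline_ll)
print("joint log-likelihood:   ", joint_ll)
```

Because the joint run starts from the pipeline's result and EM never decreases the likelihood, the joint value cannot be lower; on this toy data it is strictly higher, which is the suboptimality of fixed p(c|u)'s described above.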

Our goal is to infer a content profile and a diffusion profile for each community, and ultimately enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output: a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):
• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at a community level, rather than an individual level [8, 19, 21]. E.g., we can explain that a retweet happens because one user’s communities often retweet the other’s on a certain topic. We acknowledge diffusion as a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot attribute all the diffusions to community profiles; instead, we have to model different factors to accurately estimate the profiles and the community-aware diffusion.
• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers about its grant theme on “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from traditional community recommendation, which often relies only on “community-X” properties and is unaware of diffusion [6, 12].
• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which was often overlooked before [7, 20].
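As a rough illustration of profile-driven community ranking, the sketch below ranks communities for a query once content and diffusion profiles are available. The actual ranking function is concretized in Sect. 5 of the paper and is not reproduced here; the scoring rule, the profile values, and the names `content`, `diffusion`, `score`, and `rank_communities` are all assumptions made for this example. The score simply multiplies, per topic, the query's topic weight, the community's interest in that topic, and the community's total tendency to diffuse on that topic.

```python
# Topic order used by every list below: [z1, z2, z3].

# Hypothetical content profiles: how much each community publishes on each topic.
content = {
    "c1": [0.6, 0.3, 0.1],
    "c2": [0.2, 0.5, 0.3],
    "c3": [0.1, 0.2, 0.7],
}

# Hypothetical diffusion profiles: diffusion[c][target] is how strongly
# community c diffuses content from `target`, per topic.
diffusion = {
    "c1": {"c1": [0.5, 0.1, 0.1], "c2": [0.4, 0.2, 0.1], "c3": [0.1, 0.1, 0.1]},
    "c2": {"c1": [0.3, 0.2, 0.1], "c2": [0.2, 0.4, 0.2], "c3": [0.1, 0.2, 0.3]},
    "c3": {"c1": [0.1, 0.1, 0.2], "c2": [0.1, 0.2, 0.3], "c3": [0.1, 0.1, 0.6]},
}

def score(community, query_topics):
    """Assumed score: sum over topics of query weight * interest * diffusion."""
    total = 0.0
    for z, weight in enumerate(query_topics):
        diffuse = sum(per_topic[z] for per_topic in diffusion[community].values())
        total += weight * content[community][z] * diffuse
    return total

def rank_communities(query_topics, k):
    """Return the top-k community names, highest assumed score first."""
    ranked = sorted(content, key=lambda c: score(c, query_topics), reverse=True)
    return ranked[:k]

# A query such as "iPhone", assumed here to be mostly about topic z1.
query = [0.8, 0.2, 0.0]
print(rank_communities(query, k=3))
```

With these made-up numbers the ordering is c1, then c2, then c3, mirroring the "c1 > c2 > c3" example in Fig. 1(c); a different scoring rule or different profiles would of course give a different ranking.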

We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, so community profiling is done only once, offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.


Community Profile Modeling


content profile


diffusion profile


content profile

Community c1 Community c2 Community c3

Content profile of c1

Diffusion profile of c1

c1->c1 c1->c2 c1->c3

z1 z2 z3

c1 c2 c3

Diff i fil f

(a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4)

Join

t Pro

filin

g &

Det

ectio

n (S

ect.

3.2)

Friendship link Diffusion link

y 3

C t t fil f

App

licat

ion

Pred

ictio

n (S

ect.

5)

Profile-driven community ranking

Community-aware diffusion

Profile-driven community visualization

Query q (e.g., “iPhone”)

Query topics

Top K communities to diffuse q

Will retweet?

Diffusion with topic

aggregation

(c) Applications (Sect. 6)

P fil d i it i li ti

P fil d i it ki

Diffusion on a specific

topic

c1 > c2 > c3

J

Document

c1

c2

c3

communities: c1, c2 communities: c2, c3

Topic z1 Topic z2 Topic z3

Topic assignment assign

community assignment

Friendship link Diffusion link Document

Figure 1: The framework of joint community profiling and detection.

and the nonconformity of user behaviors, finding a good commu-nity profile is challenging. None of the existing work has ever iden-tified and addressed such challenges (more discussions in Sect. 2).

Our goal is to infer content profile and diffusion profile for eachcommunity, and ultimately enable new applications. In Fig. 1(a),we show the input for community profiling: a set of users, eachof whom publishes documents; users are connected by friendshiplinks, and interact with each other by diffusion links. E.g., in Twit-ter, each user posts tweets, users are connected by followershiplinks, and they retweet each other to diffuse information. In Fig. 1(b),for each community, we output: a content profile (e.g., communityc1 tends to publish topics z1 and z2) and a diffusion profile (e.g.,c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable threenew applications as follows (novelty to be discussed in Sect. 2,applications to be concretized in Sect. 5 and evaluated in Sect. 6):• Community-aware diffusion. As community profiles aggregateuser behaviors, we can use them to more robustly model the dif-fusion at a community level, rather than an individual level [8, 19,21]. E.g., we can explain a retweet happens as one user’s commu-nities often retweet the other’s on a certain topic. We acknowledgediffusion as a complex decision– beyond community profiles, thereare also nonconformity factors such as individual preference andtopic popularity. This partially explains why community profilingis challenging– we cannot account community profiles for all thediffusions; instead, we have to model different factors, to accu-rately estimate the profiles and the community-aware diffusion.• Profile-driven community ranking. We often need to target audi-ences for disseminating information on the networks. E.g., a com-pany wants to target communities, which are most likely to retweetabout its product, so as to launch a campaign. 
A funding agencywants to target communities, which actively cite papers about itsgrant theme on “deep learning”, so as to disseminate the grant call.Since we have known what content each community is interested inand how it diffuses that content with others, we can rank the com-munities. Profile-driven community ranking is different from thetraditional community recommendation, which often relies on only“community-X” properties and is unaware of diffusion [6, 12].• Profile-driven community visualization. Holistic modeling leadsto rich visualization– we can now visualize not only how communi-ties feature distinct contents (e.g., what an IT community tweets),but also how they interact (e.g., how an IT community retweetsothers) which is often overlooked before [7, 20].

We make two remarks about the above applications: 1) we com-plete one task of community profiling to support multiple applica-tions at a time, thus community profiling is only done once offline;

2) we build an interactive system1 for profile-driven community vi-sualization and ranking, which for the first time allows people tofreely browse the communities by both content and diffusion.

The difficulty of community profiling is often largely underestimated; as we discuss next, there are several challenges:

• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community’s user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c’s content profile as p(content|c) and the likelihood of u’s content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve

\[
\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \tag{1}
\]

where p(c|u) is the probability of user u being assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, detection first fixes the p(c|u)’s, so the best result the aggregation can return is the p(content|c)’s that maximize Eq. 1 under that fixed assignment. Clearly, the maximal likelihood we get with fixed p(c|u)’s is suboptimal unless the p(c|u)’s are “perfect”. However, a perfect detection of the p(c|u)’s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.
• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to model user connection and user diffusion separately. Such user link heterogeneity is largely overlooked in previous work [27, 30], so how to model heterogeneous user links together remains unclear.
• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other

¹ http://sociallens.adsc.com.sg/
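The coupling argument around Eq. 1 can be checked on a toy example. The sketch below is only an illustration under stated assumptions: each user's content is treated as a bag of word ids drawn independently from a community's word distribution, community assignments are hard, and the helper names `aggregate_profiles` and `log_likelihood` are hypothetical. It is not the inference procedure of the paper; it only shows that aggregating profiles on top of a poor, fixed p(c|u) yields a lower value of Eq. 1 than aggregating on top of a better one.

```python
import math


def aggregate_profiles(user_words, p_c_given_u, n_comm, vocab_size, alpha=0.01):
    """Two-stage baseline: with p(c|u) fixed, estimate p(word|c) from weighted counts.

    alpha is a small smoothing constant (an assumption) so that empty
    communities still get a valid distribution.
    """
    counts = [[alpha] * vocab_size for _ in range(n_comm)]
    for u, words in enumerate(user_words):
        for c in range(n_comm):
            for w in words:
                counts[c][w] += p_c_given_u[u][c]
    return [[x / sum(row) for x in row] for row in counts]


def log_likelihood(user_words, p_word_given_c, p_c_given_u):
    """Log of Eq. 1: sum over users of log sum_c p(content_u | c) * p(c | u)."""
    total = 0.0
    for u, words in enumerate(user_words):
        mixture = 0.0
        for c, p_c in enumerate(p_c_given_u[u]):
            # Bag-of-words assumption: p(content | c) is a product over words.
            p_content = math.prod(p_word_given_c[c][w] for w in words)
            mixture += p_content * p_c
        total += math.log(mixture)
    return total


# Two users with clearly different content (word 0 vs. word 1).
user_words = [[0, 0, 0], [1, 1, 1]]

# Detection result A: both users lumped into community 0.
lumped = [[1.0, 0.0], [1.0, 0.0]]
# Detection result B: the users are placed in separate communities.
separated = [[1.0, 0.0], [0.0, 1.0]]

ll_lumped = log_likelihood(
    user_words, aggregate_profiles(user_words, lumped, 2, 2), lumped)
ll_separated = log_likelihood(
    user_words, aggregate_profiles(user_words, separated, 2, 2), separated)

print(ll_lumped, ll_separated)
```

With the lumped assignment, the aggregated profile of community 0 has to cover both kinds of content, so no choice of p(content|c) can recover the likelihood that the separated assignment reaches; this is the sense in which profiling and detection have to be optimized together.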


diffusion profile

[Figure 1: The framework of joint community profiling and detection. Panels: (a) Problem input (Sect. 3.1): users, documents, friendship links, and diffusion links; (b) Problem output after inference (Sect. 4): the content profile (over topics z1, z2, z3) and the diffusion profile (c1->c1, c1->c2, c1->c3) of each community; (c) Applications (Sect. 6): community-aware diffusion ("Will retweet?"), profile-driven community ranking (query q, e.g., “iPhone”, mapped to query topics and the top K communities to diffuse q), and profile-driven community visualization. Stage labels: Joint Profiling & Detection (Sect. 3.2); Application Prediction (Sect. 5).]

and the nonconformity of user behaviors, finding a good community profile is challenging. None of the existing work has ever identified and addressed such challenges (more discussions in Sect. 2).


content profile


diffusion profile

Page 38: Polyglot Data Management - Roma Tre Universitytorlone/bigdata/A3-DeVirgilio.pdf · String cubes, graphs, ... GraphDB (Neo4J) Sample Scenario: ... and makes it easy to implement classical

Case study: TWITTER (15.000 users)


userID: virgafox userID: ciketto92

Page 39: Polyglot Data Management - Roma Tre Universitytorlone/bigdata/A3-DeVirgilio.pdf · String cubes, graphs, ... GraphDB (Neo4J) Sample Scenario: ... and makes it easy to implement classical

Case study: TWITTER (15.000 users)

userID: virgafox


Page 40: Polyglot Data Management - Roma Tre Universitytorlone/bigdata/A3-DeVirgilio.pdf · String cubes, graphs, ... GraphDB (Neo4J) Sample Scenario: ... and makes it easy to implement classical

Case study: TWITTER (15.000 users)

userID: virgafox

Community c1 Community c2 Community c3

Content profile of c1

Diffusion profile of c1

c1->c1 c1->c2 c1->c3

z1 z2 z3

c1 c2 c3

Diff i fil f

(a) Problem input (Sect. 3.1) (b) Problem output after inference (Sect. 4) Jo

int P

rofil

ing

& D

etec

tion

(Sec

t. 3.

2)

Friendship link Diffusion link

y 3

C t t fil f

App

licat

ion

Pred

ictio

n (S

ect.

5)

Profile-driven community ranking

Community-aware diffusion

Profile-driven community visualization

Query q (e.g., “iPhone”)

Query topics

Top K communities to diffuse q

Will retweet?

Diffusion with topic

aggregation

(c) Applications (Sect. 6)

P fil d i it i li ti

P fil d i it ki

Diffusion on a specific

topic

c1 > c2 > c3

J

Document

c1

c2

c3

communities: c1, c2 communities: c2, c3

Topic z1 Topic z2 Topic z3

Topic assignment assign

community assignment

Friendship link Diffusion link Document

Figure 1: The framework of joint community profiling and detection.

and the nonconformity of user behaviors, finding a good commu-nity profile is challenging. None of the existing work has ever iden-tified and addressed such challenges (more discussions in Sect. 2).

Our goal is to infer a content profile and a diffusion profile for each community, and ultimately to enable new applications. In Fig. 1(a), we show the input for community profiling: a set of users, each of whom publishes documents; users are connected by friendship links, and interact with each other by diffusion links. E.g., in Twitter, each user posts tweets, users are connected by followership links, and they retweet each other to diffuse information. In Fig. 1(b), for each community, we output: a content profile (e.g., community c1 tends to publish topics z1 and z2) and a diffusion profile (e.g., c1 tends to diffuse itself and c2 on z1). In Fig. 1(c), we enable three new applications as follows (novelty to be discussed in Sect. 2, applications to be concretized in Sect. 5 and evaluated in Sect. 6):

• Community-aware diffusion. As community profiles aggregate user behaviors, we can use them to model diffusion more robustly at the community level, rather than at the individual level [8, 19, 21]. E.g., we can explain that a retweet happens because one user's communities often retweet the other's on a certain topic. We acknowledge that diffusion is a complex decision: beyond community profiles, there are also nonconformity factors such as individual preference and topic popularity. This partially explains why community profiling is challenging: we cannot attribute all diffusions to community profiles; instead, we have to model the different factors in order to accurately estimate the profiles and the community-aware diffusion.

• Profile-driven community ranking. We often need to target audiences for disseminating information on the networks. E.g., a company wants to target the communities that are most likely to retweet about its product, so as to launch a campaign. A funding agency wants to target the communities that actively cite papers about its grant theme on “deep learning”, so as to disseminate the grant call. Since we know what content each community is interested in and how it diffuses that content with others, we can rank the communities. Profile-driven community ranking is different from traditional community recommendation, which often relies only on “community-X” properties and is unaware of diffusion [6, 12].

• Profile-driven community visualization. Holistic modeling leads to rich visualization: we can now visualize not only how communities feature distinct contents (e.g., what an IT community tweets), but also how they interact (e.g., how an IT community retweets others), which was often overlooked before [7, 20].
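To make the idea of profile-driven community ranking concrete, here is a minimal sketch. It is not the ranking method of the paper (which is deferred to Sect. 5 and not reproduced here): the scoring rule (content interest multiplied by diffusion propensity), the function name `rank_communities`, and all numbers are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical profiles for three communities (made-up numbers).
# content_profile[c][z]: how strongly community c publishes topic z.
content_profile: Dict[str, Dict[str, float]] = {
    "c1": {"z1": 0.7, "z2": 0.3},
    "c2": {"z1": 0.2, "z2": 0.8},
    "c3": {"z1": 0.5, "z2": 0.5},
}
# diffusion_strength[c][z]: how strongly community c diffuses (e.g. retweets)
# content on topic z, aggregated over all source communities.
diffusion_strength: Dict[str, Dict[str, float]] = {
    "c1": {"z1": 0.6, "z2": 0.1},
    "c2": {"z1": 0.9, "z2": 0.4},
    "c3": {"z1": 0.5, "z2": 0.2},
}


def rank_communities(
    topic: str,
    content: Dict[str, Dict[str, float]],
    diffusion: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank communities for a topic, best first.

    Assumed score: interest in the topic times propensity to diffuse it.
    """
    scores = {
        c: content[c].get(topic, 0.0) * diffusion[c].get(topic, 0.0)
        for c in content
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for community, score in rank_communities("z1", content_profile, diffusion_strength):
        print(f"{community}: {score:.2f}")
```

The point of the sketch is only that a ranking needs both profiles: community c2 diffuses topic z1 most actively, yet it ranks last here because it rarely publishes on that topic.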

We make two remarks about the above applications: 1) we complete one task of community profiling to support multiple applications at a time, thus community profiling is only done once offline; 2) we build an interactive system¹ for profile-driven community visualization and ranking, which for the first time allows people to freely browse the communities by both content and diffusion.

The difficulty of community profiling is often largely underestimated; as we shall discuss next, there exist many challenges:

• Inter-dependency with community detection. A straightforward approach to community profiling is to first detect communities and then aggregate each community's user observations as the profiles. However, because this approach does not try to “best explain” the user observations as generated by the communities through their profiles, it is often suboptimal. Take the content profile as an example. Denote a user as u and a community as c. For simplicity, we denote c's content profile as p(content|c) and the likelihood of u's content as p(content|u). To best explain the user content as generated by the communities through their content profiles, we effectively solve

\[
\max \prod_{u} p(\text{content} \mid u) = \prod_{u} \sum_{c} p(\text{content} \mid c)\, p(c \mid u), \tag{1}
\]

where p(c|u) is the probability of user u being assigned to community c. Ideally, to optimize Eq. 1, we should optimize both the profiles p(content|c) and the community assignments p(c|u). But in the straightforward approach, the detection first fixes the p(c|u)'s, so the best result the aggregation can return is the p(content|c)'s that maximize Eq. 1 under those fixed assignments. Clearly, the maximal likelihood we get with fixed p(c|u)'s is suboptimal, unless the p(c|u)'s are “perfect”. However, a perfect detection of the p(c|u)'s also needs to maximize the likelihood in Eq. 1, which depends on the profiles p(content|c). In all, content profiles and community detection are coupled. We further show in our technical report [3] that diffusion profiles and community detection are also coupled.

• Heterogeneity of social observations. Social observations, especially the user links (i.e., friendship links and diffusion links), often carry different semantics; e.g., friendship links indicate user connections and diffusion links indicate user interactions. Traditionally, we often try to enforce user connections to be denser within each community than across communities [14, 17]. But in diffusion, the “weak ties” theory recognizes that the inter-community interactions may not be weak [10]. E.g., the software engineering community cites more papers from the machine learning community than from itself on “deep learning”. This means we have to separate the modeling of user connection and user diffusion. Such user link heterogeneity is largely overlooked in previous work [27, 30], thus how to model heterogeneous user links together remains unclear.

• Nonconformity of user behaviors. User behaviors, especially their diffusion decisions, can happen for many reasons. Community-level conformity is just one reason, thus we have to consider other

¹ http://sociallens.adsc.com.sg/
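The coupling argument around Eq. 1 can be illustrated numerically. The sketch below is not the model of the paper; it assumes that a user's content is a bag of words, so that the likelihood in Eq. 1 becomes the standard mixture likelihood \(\sum_u \sum_w n_{uw} \log \sum_c p(w \mid c)\, p(c \mid u)\), and it uses plain EM updates. The word counts, the deliberately imperfect fixed assignments, and the function names are all made up. Stage 1 mimics "detect first, then aggregate" by keeping p(c|u) fixed; stage 2 continues from that result while also updating p(c|u).

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def log_likelihood(counts: Matrix, p_w_c: Matrix, p_c_u: Matrix) -> float:
    """Log of Eq. 1 under the bag-of-words assumption."""
    ll = 0.0
    for u, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p_w_u = sum(p_w_c[c][w] * p_c_u[u][c] for c in range(len(p_w_c)))
                ll += n * math.log(p_w_u)
    return ll


def em_step(counts: Matrix, p_w_c: Matrix, p_c_u: Matrix,
            update_assignments: bool) -> Tuple[Matrix, Matrix]:
    """One EM step; p(c|u) is re-estimated only if update_assignments is True."""
    n_users, n_words, n_comms = len(counts), len(counts[0]), len(p_w_c)
    new_w_c = [[0.0] * n_words for _ in range(n_comms)]
    new_c_u = [[0.0] * n_comms for _ in range(n_users)]
    for u in range(n_users):
        for w in range(n_words):
            n = counts[u][w]
            if n == 0:
                continue
            joint = [p_w_c[c][w] * p_c_u[u][c] for c in range(n_comms)]
            total = sum(joint)
            for c in range(n_comms):
                resp = n * joint[c] / total  # expected count credited to c
                new_w_c[c][w] += resp
                new_c_u[u][c] += resp
    p_w_c = [[x / sum(row) for x in row] for row in new_w_c]
    if update_assignments:
        p_c_u = [[x / sum(row) for x in row] for row in new_c_u]
    return p_w_c, p_c_u


def two_stage_then_joint(iterations: int = 30) -> Tuple[float, float]:
    # Made-up word counts: 4 users x 4 words, two visible groups of users.
    counts = [[5, 4, 0, 0], [4, 5, 1, 0], [0, 1, 5, 4], [0, 0, 4, 5]]
    # Imperfect community assignments, as if fixed by a prior detection step.
    p_c_u = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]
    p_w_c = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
    for _ in range(iterations):  # stage 1: profiles only, p(c|u) fixed
        p_w_c, p_c_u = em_step(counts, p_w_c, p_c_u, update_assignments=False)
    ll_fixed = log_likelihood(counts, p_w_c, p_c_u)
    for _ in range(iterations):  # stage 2: profiles and assignments together
        p_w_c, p_c_u = em_step(counts, p_w_c, p_c_u, update_assignments=True)
    ll_joint = log_likelihood(counts, p_w_c, p_c_u)
    return ll_fixed, ll_joint


if __name__ == "__main__":
    ll_fixed, ll_joint = two_stage_then_joint()
    print(f"fixed p(c|u): {ll_fixed:.3f}   joint: {ll_joint:.3f}")
```

Because each EM step cannot decrease the likelihood, the joint stage ends at a value at least as high as the fixed-assignment stage, and typically higher when the fixed assignments are imperfect, which is the sense in which profiling and detection are coupled.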


Page 41

Proposals: #1

Community Evolution Prediction: given a set of candidate communities and their corresponding community profiles, devise a methodology for predicting the evolution of current communities (and future communities)

Page 42

Proposals: #2

Profile Modeling: represent a user profile in terms of a knowledge graph referring to a different ontology

case study:

web ontology: Schema.org

social network: Twitter (15000 userIDs)

DBLP: computer science bibliography
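As a starting point for the profile modeling proposal, a user profile can be written down as a small set of subject-predicate-object triples. The sketch below is only one possible encoding under stated assumptions: the choice of the Schema.org terms `Person`, `identifier`, `follows`, `knowsAbout`, `ScholarlyArticle` and `author`, the `example.org` URIs, the function name `user_profile_triples`, and the sample data are all hypothetical and not prescribed by the proposal.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]
SCHEMA = "https://schema.org/"


def user_profile_triples(
    user_uri: str,
    twitter_id: str,
    follows: Iterable[str],
    topics: Iterable[str],
    papers: Iterable[str],
) -> List[Triple]:
    """Encode one user profile as (subject, predicate, object) triples."""
    triples: List[Triple] = [
        (user_uri, "rdf:type", SCHEMA + "Person"),
        (user_uri, SCHEMA + "identifier", twitter_id),
    ]
    # Social network side (e.g. Twitter followership links).
    triples += [(user_uri, SCHEMA + "follows", other) for other in follows]
    # Interests, e.g. topics taken from a content profile.
    triples += [(user_uri, SCHEMA + "knowsAbout", topic) for topic in topics]
    # Bibliographic side (e.g. DBLP entries authored by the user).
    for paper in papers:
        triples.append((paper, "rdf:type", SCHEMA + "ScholarlyArticle"))
        triples.append((paper, SCHEMA + "author", user_uri))
    return triples


if __name__ == "__main__":
    # Hypothetical user and resources, for illustration only.
    profile = user_profile_triples(
        user_uri="http://example.org/user/42",
        twitter_id="42",
        follows=["http://example.org/user/7"],
        topics=["deep learning", "graph databases"],
        papers=["http://example.org/paper/1"],
    )
    for triple in profile:
        print(triple)
```

Plain tuples keep the sketch dependency-free; the same triples could be loaded into an RDF store or a graph database if the case study requires querying.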