
Community Detection Based on Social Interactions in a Social Network

Yen-Liang Chen, Ching-Hao Chuang, and Yu-Ting Chiu
Department of Information Management, National Central University, Chung-Li 320, Taoyuan, Taiwan, R.O.C.
E-mail: [email protected], 994203005, [email protected]

Recent research has involved identifying communities in networks. Traditional methods of community detection usually assume that the network’s structural information is fully known, which is not the case in many practical networks. Moreover, most previous community detection algorithms do not differentiate multiple relationships between objects or persons in the real world. In this article, we propose a new approach that utilizes social interaction data (e.g., users’ posts on Facebook) to address the community detection problem in Facebook and to find the multiple social groups of a Facebook user. Some advantages to our approach are (a) it does not depend on structural information, (b) it differentiates the various relationships that exist among friends, and (c) it can discover a target user’s multiple communities. In the experiment, we detect the community distribution of Facebook users using the proposed method. The experiment shows that our method achieves average Total-Community-Purity and Total-Cluster-Purity scores of approximately 0.8.

Introduction

In recent years, many real-world networks, such as the World Wide Web (Albert, Jeong, & Barabási, 1999), social networks (Li, Foo, Tew, & Ng, 2009; Wasserman & Faust, 1994), biological networks (Li, Foo, & Ng, 2007; Li, Tan, Foo, & Ng, 2005; Li, Wu, Kwoh, & Ng, 2010; Palla, Derényi, Farkas, & Vicsek, 2005; Steinhaeuser & Chawla, 2009; Wu, Li, Kwoh, & Ng, 2009), citation networks (Redner, 1998), and communication networks (Nisheeth, Anirban, & Rastogi, 2008) have become available for data mining. A key task in mining these networks is to find the underlying communities, where a community is a group of people or objects that share some common interests. Therefore, finding and identifying subgroups in a heterogeneous social network is defined as a “community detection problem.” Community detection in complex social networks has recently attracted a significant amount of attention.

Community detection can reveal important functional information about real-world networks. For example, communities in biological networks usually correspond to functional modules or biological pathways that are useful in understanding the causes of various diseases (Steinhaeuser & Chawla, 2009). In social networks, knowledge about the underlying community substructures can be used in searching for potential collaborators, devising strategies to optimize social relationships, or identifying key persons in the various communities.

There are many algorithms for community detection. They are mostly structure-based methods and make the following assumptions: First, they assume that the structure of the network is known; in other words, the relationships within the community are fully known, without any missing information. Second, most previous algorithms for community detection only consider a single type of relationship. They do not differentiate the various relationships that may exist between objects or persons in the real world.

Problems occur when we attempt to identify hidden communities in a social network with these assumptions. In a social network, many relationships are missing or not recorded. On Facebook, for example, not all members of a community are fully connected with each other. A member usually connects with only a few others he/she likes, creating many missing links within the community. In addition, it is common for a person to belong to different groups simultaneously, due to different backgrounds and interests. Most existing methods, however, do not take into account different relationships, so they cannot accurately identify different relations between people.

We can use a Facebook friends’ network as a more specific example. Assume 10 people are in the same college, and another eight people are in the same high school, with three people simultaneously belonging to both groups. Their Facebook relationship network is shown in Figure 1.

Received September 25, 2012; revised April 3, 2013; accepted April 4, 2013

© 2014 ASIS&T • Published online 7 January 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.22986

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 65(3):539–550, 2014


Due to the low number of connections (represented by the dotted lines) between some members, not every person in this community would put everyone on her/his friend list. Furthermore, due to the single relation arc, these two types of relations (i.e., college and high school) are treated as the same. This would create unclear boundaries between communities if someone belonged to both groups (represented by the nodes inside the red circle).

In light of the above weaknesses, our research attempts to detect communities more accurately under the Facebook constraints of missing link information and no differentiation in types of arcs (one relation). These two constraints reflect the reality of current social networks, such as Facebook, LinkedIn, and others.

This article proposes a new, social interaction–based approach to address the Facebook community detection problem, where social interaction refers to users communicating and interacting with others, such as posting, reading, liking, and replying. Our approach is based on the following speculation: Members of the same community are more likely to interact with each other via social media than with those in different communities. In other words, people in the same community have a higher chance of being involved in the same post and interacting with each other than with those in different communities. We use Facebook in our discussion of how to identify communities. Our approach extends the existing structure-based community detection method with social interactions among users to detect real-world communities. Our method considers more features than classic community detection methods do. The additional features include post information, personal information, and interpersonal relations. We therefore operate on a much broader domain of input data than classic community detection does. Our algorithm contains two phases. The first phase generates groups as initial communities using frequent pattern mining. The second is a merging phase, which considers whether two groups should be merged into a bigger one using the designed similarity measures.

Background and Related Work

Many community detection methods have been developed. According to the recent classification schemes proposed by Newman (2004), Fortunato (2010), and Papadopoulos, Kompatsiaris, Vakali, and Spyridonos (2012), these methods can be classified into six categories: spectral and clustering methods, divisive algorithms, modularity-based methods, model-based methods, local community detection methods, and feature-based assisted methods.

Spectral and Clustering Methods

The spectral and clustering methods are classified as the methods for cohesive subgraph discovery and vertex clustering in the research of Papadopoulos et al. (2012). The methods in this category are similar to traditional methods. In social network analysis (SNA) research, a community is often considered a group of cohesive substructures (Scott, 2002; Wasserman & Faust, 1994), such as cliques (Bron & Kerbosch, 1973; Du, Wu, & Wang, 2006), n-cliques, n-clans, n-plexes, k-core (Wu & Pei, 2007), and quasi-cliques (Abello, Resende, & Sudarsky, 2002; Pei, Jiang, & Zhang, 2006; Zeng, Wang, & Karypis, 2006). An n-clique means the n nodes connect to each other, whereas a quasi-clique means the number of each vertex’s neighbors should exceed a proportional threshold. Spectral and clustering methods are used to find substructures within the entire network. These substructures are usually small but numerous, which hides the global organization of a given network. The clustering methods originate from traditional data clustering methodology, a widely used technique that clusters similar vertices into larger communities in SNA (Han & Kamber, 2006). In hierarchical clustering, clusters are merged or split using specified criteria, such as agglomerative methods based on structural similarity metrics and divisive methods based on betweenness metrics. Donetti and Munoz (2004) proposed a method that treated the Laplacian eigenvectors of a graph as a similarity measurement among vertices and used the agglomerative process for community detection.

Divisive Algorithms

Divisive algorithms identify and remove edges or vertices between communities via measures such as betweenness. Girvan and Newman (2002) introduced the GN algorithm, which is one of the most important algorithms in community detection. The GN algorithm repeatedly computes betweenness for all edges and removes the edge with the highest score. Girvan and Newman (2004) also introduced a divisive approach that removed edges depending on their betweenness values; they iteratively cut the edge with the highest betweenness value and then used Network Modularity Q to obtain an optimized division of the network. Radicchi, Castellano, Cecconi, Loreto, and Parisi (2004) proposed a method similar to GN, but used the edge-clustering coefficient as the new metric. Another clustering algorithm proposed by Pons and Latapy (2005) utilized the random walk method to measure the similarity between vertices. It also used Network Modularity Q to determine when to stop the agglomerative process.

FIG. 1. Example of friends’ networks on Facebook. Ten people are in the same college, and another eight people are in the same high school, with three people who simultaneously belong to both groups. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Modularity-Based Methods

The modularity-based methods are classified as the methods for community quality optimization in the research of Papadopoulos et al. (2012). The methods in this category design optimal graph-based measures to detect community quality. Clauset, Newman, and Moore (2004) proposed a fast clustering algorithm on a sparse graph that uses a greedy strategy to get a maximal ΔQ by merging pairs of nodes iteratively until it becomes negative. In the work of Ress and Gallagher (2012), the proposed algorithm first builds an egonet as a friendship group for each node. It then compares all the egonets to find overlaps and uses the overlaps to form a whole picture of the network with identified communities.

Model-Based Methods

The model-based methods are classified as the methods of dynamic algorithms and the methods based on statistical inference in the research of Fortunato (2010). Qiu and Lin (2011) provided a tree learning algorithm combining PageRank and Random Walk to explore constantly evolving structures in an organization. They detected static community trees in each period of time to distinguish transiting relations (including splitting, merging, evolving, and emerging) in an organization’s dynamic status over time. Hastings (2006) viewed the community detection problem as a statistical inference problem, and applied belief propagation and mean-field theory to solve the problem.

Local Community Detection Methods

Besides the above methods, there are still other methods (Newman, 2004). Wu and Huberman (2003) introduced a method using the idea of the electrical circuit. A unit resistor can be viewed as a link connecting different communities in a network. This method can also be used to detect a particular community for a specified vertex without searching all the community structure within that network in advance. Local community detection is an issue related to community detection. In local community detection, a query node is the starting point to find its local community (or communities) through adding adjacent nodes to the community gradually (Branting, 2012; Chen, Zaiane, & Goebel, 2009). Our research belongs to this category because our method discovers the communities from the perspective of a target person.

Feature-Based Assisted Methods

The feature-based assisted methods, as implied in the name, are a type of method using additional features to assist the job of community detection. The methods use special features of objects to recognize hidden relations among them and discover the community structure of these objects. According to Asratian, Denley, and Häggkvist (1998), a bipartite graph can be built, as shown in Figure 2a, where no two vertices in the same subset are adjacent (i.e., having a link). From the bipartite graph, a hidden connection can be recognized according to Menger’s Theorem. Using Figure 2b as an example, node X1 and node X3 have a hidden connection (dotted line) because both of them link to node X2.

According to Wasserman and Faust (1994), the bipartite graph of an affiliation network is similar to the concept mentioned in the work of Asratian et al. (1998). In this affiliation network, the value of co-membership can be computed to represent how close two actors are. The value is set to the number of times or the frequency that two actors join the same event. Using Figure 3 as an example, the value of co-membership(Allison, Drew) equals 0 and the value of co-membership(Allison, Eliot) equals 1.

Because we intend to detect community structure through the behaviors of posts and responses on Facebook, we adopt the concept of co-membership to analyze the relations of posting behaviors among users and thereby recognize hidden relations. The strengths of hidden relations among users help detect hidden communities for a target user. The details are described in the Method section, where we find interaction transactions and calculate their support values.

Summary of Related Work

Although previous algorithms have been successfully applied to community detection in different backgrounds and applications, these approaches work based on the assumption that structural information in the network is fully known. Unfortunately, not all link relationship information is recorded in a social network. Additionally, past work paid little attention to the issue of identifying multirelation links. In reality, however, it is very common for people to have multiple relationships within their social networks. Without addressing this situation, communities cannot be identified accurately.

FIG. 2. Illustration of a bipartite graph. It is an excerpt from the work of Asratian et al. (1998). In this figure, two subsets of nodes which are not adjacent construct a bipartite graph. Nodes x1, x3, and x5 belong to one subset and nodes x2, x4, and x6 belong to another subset. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

To summarize, there are five interesting characteristics in this research: we (a) solved the problem of missing links, (b) considered the multiple relationships among users, (c) applied the feature-based assisted method to find hidden communities, (d) designed a method based on agglomerative clustering methods, and (e) conducted the local community detection for a target user. Therefore, we utilize social interactions among entities (i.e., users) to solve the community detection problem. To our knowledge, this is the first attempt to solve the community detection problem in Facebook using a social interaction approach.

Method

As mentioned previously, we use the data from a target user’s frequent Facebook interactions to identify the target user’s multiple hidden communities. The strategy is to exploit these frequent interaction clusters to help detect Facebook communities. This section is divided into two parts: problem definition and the algorithm.

Problem Definition

We assume that the target user has $N$ friends on Facebook belonging to $K$ different communities (relations) in the real world. Let $U = \{u_1, u_2, \ldots, u_N\}$ be the set of friends of the target user on Facebook. On Facebook, a user can share a variety of content with friends, such as text, links, videos, or images. In our research, however, we only look at text, which is called a “status” on Facebook. For simplicity, we refer to this as a “post” rather than a “status” in this article. Let $P = \{p_1, p_2, \ldots, p_m\}$ be the set of post IDs of the target user, where $p_i$ is a post of the user.

Suppose the target user shares $m$ posts. Our objective is to find $K$ communities among all of the target user’s friends. Therefore, the goal is to detect $K$ communities: $C_1, C_2, \ldots, C_K$. The more similar the detected communities are to the target user’s $K$ predefined communities, the more accurate the community detection is.

In addition, the target user’s friends can reply to a post, forming a conversation. Because people are more likely to participate in a conversation with someone they know or are acquainted with, a conversation is probably formed by those in the same group. These conversations are defined as interactions. An interaction sequence $R_i$ is a list of users replying to post $p_i$. An interaction transaction $TR_i = \{u_j \mid u_j \text{ appears in } R_i\}$ is derived by including every distinct user in $R_i$.
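The derivation of interaction transactions can be sketched in a few lines of Python. This is a minimal illustration only: the function name `interaction_transactions` and the input layout (a dictionary from post ID to the list of replier IDs) are assumptions made for the example, not part of the original description.

```python
def interaction_transactions(reply_sequences):
    """Turn each interaction sequence R_i into an interaction transaction TR_i.

    reply_sequences: dict mapping a post ID to the list of user IDs that
    replied to it, in reply order (hypothetical input layout).
    """
    # A frozenset keeps every distinct replier exactly once.
    return {post: frozenset(seq) for post, seq in reply_sequences.items()}


# Toy data: user "u2" replies twice to post "p1".
sequences = {"p1": ["u2", "u5", "u2"], "p2": ["u5", "u7"]}
transactions = interaction_transactions(sequences)
print(sorted(transactions["p1"]))  # ['u2', 'u5']
```

Using sets discards both the reply order and the number of replies per user, which matches the definition above: only participation in a post matters.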

From the interaction transactions, we find frequent itemsets as our initial groups. An itemset is a set of items (i.e., users), and the length of an itemset is the number of items in it. An itemset with length $k$ is denoted a $k$-itemset. The support of itemset $x$ is the number of interaction transactions containing itemset $x$. Let min_sup be the specified minimum support level, and $sup(x)$ be the support value of itemset $x$. If $x$ is a $k$-itemset and $sup(x)$ is not less than min_sup, $x$ is a frequent $k$-itemset. The set of all frequent $k$-itemsets is denoted $LS_k = \{x \mid sup(x) \ge \mathit{min\_sup} \text{ and } x \text{ has length } k\}$. For example, $LS_1$ represents the set of frequent 1-itemsets. After generating all frequent $k$-itemsets, we get $LS = \{ls_1, ls_2, \ldots\}$, which contains all frequent itemsets.

Furthermore, a frequent itemset is defined as maximal if it has no frequent superset. For example, suppose there are five items, $a, b, c, d, e$, and $\{a, b, c\}$ is a frequent itemset. If $\{a, b, c, d\}$, $\{a, b, c, e\}$, and $\{a, b, c, d, e\}$ are not frequent itemsets, $\{a, b, c\}$ is a maximal frequent itemset. Therefore, we can define a set of maximal itemsets as $ML = \{ml_1, ml_2, ml_3, \ldots\}$, where $ml_i$ is the $i$-th maximal itemset.

Afterward, we repeatedly merge smaller groups to form larger groups until the number of groups reaches $K$. To describe the merging process, we need the following definitions.

Definition 1: Overlapping of the two itemsets. Let $X$ and $Y$ be two user (item) sets in the $ML$ set. $(X \setminus Y)$ represents the users that are in $X$ but not in $Y$, $(Y \setminus X)$ represents the users that are in $Y$ but not in $X$, and $(X * Y)$ is the set of all pairs of users whose first member comes from $X$ and the second member from $Y$.

To merge smaller groups to form larger groups, four indices are used to evaluate which pair of itemsets should be merged.

First, a speculation inspired by networking phenomena is the following: If two groups of people often participate in the same posts, these two groups may belong to the same group. Accordingly, we have the first index $M_1$, which is the proportion of pairs of users in $(X \setminus Y) * (Y \setminus X)$ belonging to $LS_2$.

FIG. 3. Example of an affiliation network. It is an excerpt from the work of Wasserman and Faust (1994). An affiliation network is used to describe the relations between actors and events. Because Drew joins party 2, there is a link between Drew and party 2. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


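Definition 2: Interaction index $M_1$. $M_1$ is defined as:

$$M_1 = \frac{\left|\{(i, j) \mid (i, j) \in LS_2 \text{ and } i \in (X \setminus Y),\ j \in (Y \setminus X)\}\right|}{N(X \setminus Y) \times N(Y \setminus X)}$$

where $N(X \setminus Y)$ is the number of users in $(X \setminus Y)$, $N(Y \setminus X)$ is the number of users in $(Y \setminus X)$, and $N(X \setminus Y) \times N(Y \setminus X)$ is the number of pairs of users in $(X \setminus Y) * (Y \setminus X)$.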

When the value of $M_1$ is large, it means that the users in $(X \setminus Y)$ and the users in $(Y \setminus X)$ frequently interact with each other in the same posts. Due to their frequent interaction, we consider merging them into a single group.
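As a rough illustration of the interaction index, the sketch below counts the cross pairs that appear among the frequent 2-itemsets. The function name `interaction_index`, the representation of frequent pairs as frozensets, and the convention of returning 0 when one of the difference sets is empty are assumptions made for this example; the source does not specify the empty case.

```python
from itertools import product


def interaction_index(X, Y, frequent_pairs):
    """M1: share of cross pairs (i, j), with i only in X and j only in Y,
    that are frequent 2-itemsets."""
    only_x, only_y = set(X) - set(Y), set(Y) - set(X)
    total = len(only_x) * len(only_y)
    if total == 0:
        # Assumption: treat the index as 0 when no cross pair exists.
        return 0.0
    hits = sum(
        1 for i, j in product(only_x, only_y)
        if frozenset((i, j)) in frequent_pairs
    )
    return hits / total


X = {"a", "b", "c"}
Y = {"c", "d"}
LS2 = {frozenset(("a", "d"))}
# Cross pairs are (a, d) and (b, d); only (a, d) is frequent.
print(interaction_index(X, Y, LS2))  # 0.5
```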

Next, a speculation inspired by networking phenomena is the following: If many pairs of people between two groups are friends with each other, these two groups may belong to the same group. Accordingly, we have the index $M_2$, which is the proportion of pairs of users in $(X \setminus Y) * (Y \setminus X)$ who are friends with each other.

Definition 3: Friendship index $M_2$. $M_2$ is defined as:

$$M_2 = \frac{\left|\{(i, j) \mid friends(i, j) \text{ is True and } i \in (X \setminus Y),\ j \in (Y \setminus X)\}\right|}{N(X \setminus Y) \times N(Y \setminus X)}$$

where $friends(i, j)$ is the function that returns true if users $u_i$ and $u_j$ are friends.

When the value of $M_2$ is large, it means that many users in $(X \setminus Y)$ and many users in $(Y \setminus X)$ are friends with each other. Therefore, we consider merging them into a single group.

Further, a speculation inspired by networking phenomena is the following: If many pairs of people between two groups have many mutual friends, these two groups may belong to the same group. Accordingly, we have the index $M_3$, which is the proportion of pairs of users in $(X \setminus Y) * (Y \setminus X)$ with more than $q$ mutual friends.

Definition 4: Social index $M_3$. $M_3$ is defined as:

$$M_3 = \frac{\left|\{(i, j) \mid mutual(i, j) > q \text{ and } i \in (X \setminus Y),\ j \in (Y \setminus X)\}\right|}{N(X \setminus Y) \times N(Y \setminus X)}$$

where $mutual(i, j)$ is the number of mutual friends users $u_i$ and $u_j$ have.

When the value of $M_3$ is large, it means the users in $(X \setminus Y)$ and the users in $(Y \setminus X)$ have many common friends. Due to the large number of mutual friends, we consider merging them into a single group.
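The friendship and social indices share one shape: the fraction of cross pairs that satisfy a condition. The following sketch makes that explicit. The helper `pair_proportion`, the toy `friend_lists` dictionary, and the zero value for empty difference sets are illustrative assumptions rather than details given in the source.

```python
from itertools import product


def pair_proportion(X, Y, predicate):
    """Fraction of pairs (i, j), i only in X and j only in Y, satisfying predicate."""
    only_x, only_y = set(X) - set(Y), set(Y) - set(X)
    total = len(only_x) * len(only_y)
    if total == 0:
        return 0.0  # assumption: no cross pairs means an index of 0
    return sum(1 for i, j in product(only_x, only_y) if predicate(i, j)) / total


# Hypothetical friend lists for a handful of users.
friend_lists = {"a": {"d", "e"}, "b": {"e"}, "d": {"a", "e"}, "e": {"a", "b", "d"}}


def friends(i, j):
    return j in friend_lists.get(i, set())


def mutual(i, j):
    return len(friend_lists.get(i, set()) & friend_lists.get(j, set()))


X, Y = {"a", "b", "c"}, {"c", "d"}
q = 0  # mutual-friend threshold, chosen small for the toy data
M2 = pair_proportion(X, Y, friends)
M3 = pair_proportion(X, Y, lambda i, j: mutual(i, j) > q)
print(M2, M3)  # 0.5 1.0
```

Here the cross pairs are (a, d) and (b, d). Only a and d are friends, so the friendship index is 0.5, while both pairs share the mutual friend e, so the social index is 1.0 for a threshold of 0.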

The last index computes similarities in users’ profiles. We adopted a profile-based measure proposed by Akcora, Carminati, and Ferrari (2011). They chose the occurrence frequency (OF) similarity proposed by Jones (1993) for profile similarity. This index is based on homophily theory (McPherson, Smith-Lovin, & Cook, 2001) and assumes that friends are in some way similar to each other.

On Facebook, some fields have more than one value. For example, a user may have multiple records in the work and education fields because of different jobs and attendance at different schools.

Let $f$ be a profile field with a set of subfields $Sb$. Let $f_i$ and $f_j$ be the field values in the profiles of users $u_i$ and $u_j$, respectively. The value of the $w$-th subfield for $f_i$ is expressed as $f_{iw}$. The single value similarity is given by the following occurrence frequency:

$$OF(f_i, f_j) = \frac{1}{|Sb|} \times \sum_{w \in Sb} \begin{cases} 1 & \text{if } f_{iw} = f_{jw} \\ (1 + A \times B)^{-1} & \text{if } f_{iw} \neq f_{jw} \end{cases}$$

where $A = \log\left(\frac{Nof}{r(f_{iw})}\right)$, $B = \log\left(\frac{Nof}{r(f_{jw})}\right)$, $r(x)$ is the number of records with field value $x$, and $Nof$ is the total number of field values in the data set.
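A small sketch of the occurrence frequency similarity may help in reading the formula. It assumes that both users provide a value for every subfield (so the two dictionaries share the same keys), that the logarithm is natural (the source does not state the base), and that the names `of_similarity`, `record_count`, and `n_values` are illustrative stand-ins for $OF$, $r(\cdot)$, and $Nof$.

```python
import math


def of_similarity(fi, fj, record_count, n_values):
    """Occurrence frequency similarity between two field values.

    fi, fj: dicts mapping each subfield name to its value for users i and j
    (assumed to share the same, non-empty set of keys).
    record_count: dict mapping a value x to r(x).
    n_values: total number of field values in the data set (Nof).
    """
    total = 0.0
    for w in fi:
        if fi[w] == fj[w]:
            total += 1.0
        else:
            # Mismatches between rare values are penalized more heavily.
            a = math.log(n_values / record_count[fi[w]])
            b = math.log(n_values / record_count[fj[w]])
            total += 1.0 / (1.0 + a * b)
    return total / len(fi)


# Hypothetical education records: same degree, different schools.
fi = {"school": "NCU", "degree": "MS"}
fj = {"school": "NTU", "degree": "MS"}
counts = {"NCU": 10, "NTU": 10, "MS": 50}
score = of_similarity(fi, fj, counts, n_values=100)
print(round(score, 4))
```

Identical values contribute 1 to the average, while a mismatch contributes less the rarer the two differing values are.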

OF is adopted to compute the similarity between fields in two profiles. Let $V(f_i)$ and $V(f_j)$ be the sets of values for field $f$ in the profiles of users $u_i$ and $u_j$, with $V(f_i) = \{fv_{i1}, fv_{i2}, \ldots, fv_{is}\}$ and $V(f_j) = \{fv_{j1}, fv_{j2}, \ldots, fv_{jt}\}$. The lengths of $V(f_i)$ and $V(f_j)$ are $|s|$ and $|t|$, respectively. AllSimSet is the set of single-value field similarities computed between all possible pairs in $V(f_i) * V(f_j)$, where $AllSimSet = \{of \mid of = OF(fv_{is}, fv_{jt}),\ \forall fv_{is} \in V(f_i),\ \forall fv_{jt} \in V(f_j)\}$ and its length is $|s| * |t|$. MaxSimSet is the set of $s$ elements with the largest OF values selected from the AllSimSet set, where

$$MaxSimSet = \left\{mof_1, mof_2, \ldots, mof_s \;\middle|\; mof_r = \max_{fv_{jt} \in V(f_j)} OF(fv_{ir}, fv_{jt}),\ r = 1 \text{ to } s\right\}$$

Hence, field similarity $S_f$ is defined as:

$$S_f(i, j) = \frac{1}{s} \times \sum_{mof_r \in MaxSimSet} mof_r$$

The whole profile field similarity, PFS, between users $u_i$ and $u_j$ can be obtained through the $S_f$ value:

$$PFS(i, j) = \frac{1}{|F|} \times \sum_{f \in F} \beta_f \times S_f(i, j)$$

where $\beta_f$ is the predefined importance coefficient of field $f$, $S_f(i, j)$ is the similarity of field $f$, and $F$ is the user profile’s field set.
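The two-level aggregation (best match per value, then a weighted average over fields) can be sketched as follows. The single-value similarity is passed in as a function so that the example stays self-contained; the stand-in `toy_of`, the function names, and the profile layout are assumptions made for illustration, not the authors' implementation.

```python
def field_similarity(values_i, values_j, of):
    """S_f(i, j): mean, over user i's values, of the best match among user j's values."""
    best = [max(of(vi, vj) for vj in values_j) for vi in values_i]
    return sum(best) / len(best)


def profile_similarity(profile_i, profile_j, beta, of):
    """PFS(i, j): average over fields of beta_f * S_f(i, j)."""
    total = sum(
        beta[f] * field_similarity(profile_i[f], profile_j[f], of)
        for f in beta
    )
    return total / len(beta)


def toy_of(a, b):
    # Stand-in for the OF similarity: 1 for equal values, 0.25 otherwise.
    return 1.0 if a == b else 0.25


profile_i = {"work": ["A", "B"], "edu": ["X"]}
profile_j = {"work": ["B"], "edu": ["X", "Y"]}
beta = {"work": 1.0, "edu": 1.0}  # hypothetical importance coefficients
pfs = profile_similarity(profile_i, profile_j, beta, toy_of)
print(pfs)  # 0.8125
```

For the work field, value A finds no exact match (0.25) and value B does (1.0), giving 0.625; the edu field scores 1.0; the average over the two fields is 0.8125.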

Accordingly, we have the index $M_4$, which is the proportion of pairs of users in $(X \setminus Y) * (Y \setminus X)$ with profile similarities greater than the specified threshold $t$.

Definition 5: Profile similarity index $M_4$. $M_4$ is defined as:

$$M_4 = \frac{\left|\{(i, j) \mid PFS(i, j) > t \text{ and } i \in (X \setminus Y),\ j \in (Y \setminus X)\}\right|}{N(X \setminus Y) \times N(Y \setminus X)}$$

When the value of $M_4$ is large, it means the users in $(X \setminus Y)$ and the users in $(Y \setminus X)$ have similar profiles. Therefore, we consider merging them into a single group.

To compute an overall score for a merging decision, the four merging indices are combined into a single index: final merging score (FMS).

Definition 6: Final merging score. The final merging score of communities $X$ and $Y$ is defined as follows.

$$FMS(X, Y) = (M_1 * W_1) + (M_2 * W_2) + (M_3 * W_3) + M_4 * (1 - W_1 - W_2 - W_3)$$


Here, weights $W_1$, $W_2$, and $W_3$ are user specified and must satisfy the relations $0 \le W_1, W_2, W_3 \le 1$ and $(W_1 + W_2 + W_3) \le 1$.
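The final merging score is a weighted sum in which the fourth weight is implied by the other three. A minimal sketch, with the function name `final_merging_score` and the small floating-point tolerance as assumptions of this example:

```python
def final_merging_score(m1, m2, m3, m4, w1, w2, w3):
    """FMS = M1*W1 + M2*W2 + M3*W3 + M4*(1 - W1 - W2 - W3)."""
    eps = 1e-9  # tolerance for floating-point round-off in the weight check
    if not all(0 <= w <= 1 for w in (w1, w2, w3)) or w1 + w2 + w3 > 1 + eps:
        raise ValueError("weights must lie in [0, 1] and sum to at most 1")
    w4 = 1 - w1 - w2 - w3
    return m1 * w1 + m2 * w2 + m3 * w3 + m4 * w4


# Hypothetical index values combined with weights (0.2, 0.2, 0.25), so W4 = 0.35.
score = final_merging_score(0.5, 0.5, 1.0, 0.0, 0.2, 0.2, 0.25)
print(round(score, 2))  # 0.45
```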

Algorithm

The proposed method is shown in Figure 4 and adopts a bottom-up approach to detecting communities in Facebook. The first phase (steps 1 to 3) uses association rules mining to find all frequent patterns of post participants as initial communities. In the second phase (steps 4 and 5), the communities are iteratively combined to construct the final $K$ communities. In each iteration, the pair of clusters to merge is chosen according to the final merging score (FMS).

Step 1. Interaction transaction $TR_i$ generation. First, the users who reply to post $p_i$ are retrieved as an interaction sequence $R_i$. The elements in $R_i$ are repliers’ IDs from post $p_i$. The interaction transaction $TR_i$ is then generated from interaction sequence $R_i$ by removing duplicate elements in $R_i$.

Step 2. Large $k$-itemset generation. In this step, 1-itemset candidates are generated by including all users in $TR$. Then we select items that have supports of no less than min_sup to form $LS_1$. Similarly, $LS_k$ ($k > 1$) can be created by joining the patterns in $LS_{k-1}$. The selected candidates should have supports of no less than min_sup.

Step 3. Maximal itemset generation. To generate maximal itemsets, each large itemset $ls_i$ is checked and removed if it is contained in another, larger itemset $ls_j$ within the $LS$ set. After pruning all the patterns that are contained in longer patterns, the maximal patterns $ML = \{ml_i\}$ ($i = 1, 2, \ldots, n$) are obtained and are considered initial communities.
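Steps 2 and 3 can be sketched as a simple level-wise search followed by a pruning pass. This is an unoptimized illustration under stated assumptions: transactions are given as a list of frozensets, support is counted by scanning every transaction, and the function names are hypothetical. It is not the authors' implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Level-wise search for every itemset with support of at least min_sup."""
    def support(itemset):
        return sum(1 for t in transactions if itemset.issubset(t))

    items = {u for t in transactions for u in t}
    level = {frozenset([u]) for u in items if support(frozenset([u])) >= min_sup}
    frequent = set(level)
    k = 1
    while level:
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that contain k items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        level = {c for c in candidates if support(c) >= min_sup}
        frequent |= level
    return frequent


def maximal_itemsets(frequent):
    """Keep only itemsets that are not contained in a longer frequent itemset."""
    return {
        x for x in frequent
        if not any(x != y and x.issubset(y) for y in frequent)
    }


# Toy transactions: users a, b, c often reply together; d replies once alone.
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("d")]
LS = frequent_itemsets(transactions, min_sup=2)
ML = maximal_itemsets(LS)
print([sorted(x) for x in ML])  # [['a', 'b', 'c']]
```

With a minimum support of 2, seven itemsets are frequent (three singletons, three pairs, and the triple), and only the triple survives the maximality pruning.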

Step 4. Computing merging indices. In this step, the FMS scores (including the four merging indices) with proper weights for all possible pairs of communities are calculated. The four merging indices measure different aspects of the extent to which each pair of communities needs to be united.

Step 5. Merging process. Finally, the two communities $X$ and $Y$ with the highest FMS value are merged into a bigger community $Z$, which is put back into the $ML$ set. Then the maximal itemsets that are completely included in $Z$ are removed from $ML$. Steps 4 and 5 are repeated until the stop criterion is reached, that is, when the number of communities equals $K$.
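The loop formed by steps 4 and 5 is a greedy agglomeration. The sketch below takes the scoring function as a parameter; the Jaccard overlap used in the toy run is only a stand-in for the FMS, and the function names are assumptions for illustration. Note that a merge may absorb additional groups contained in the merged community, so the loop stops once at most K groups remain.

```python
from itertools import combinations


def merge_communities(initial, score, k):
    """Repeatedly merge the best-scoring pair until at most k communities remain."""
    groups = set(initial)
    while len(groups) > k:
        # Step 4: score every pair and pick the highest-scoring one.
        x, y = max(combinations(groups, 2), key=lambda p: score(p[0], p[1]))
        # Step 5: merge, then drop every group fully contained in the result.
        z = x | y
        groups = {g for g in groups if not g.issubset(z)}
        groups.add(z)
    return groups


def jaccard(a, b):
    # Stand-in score for the toy run; the method itself uses the FMS here.
    return len(a & b) / len(a | b)


initial = {frozenset("ab"), frozenset("bc"), frozenset("xy")}
result = merge_communities(initial, jaccard, k=2)
print(sorted(sorted(g) for g in result))  # [['a', 'b', 'c'], ['x', 'y']]
```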

Experiments

This section presents the experiment design, results, and evaluation. We first describe the experiment process and then discuss the results to demonstrate the proposed method’s performance.

Experiment Design

To evaluate the performance of the proposed method, 10 target users who shared more than 50 posts on Facebook were picked as our 10 data sets for the experiment. Each target user was asked to analyze the friends who participated in the posts and classify them into different communities that could overlap. This can be viewed as the user-defined answer for evaluation. Our approach was applied to each data set to mine communities for each target user. Table 1 shows the details of each data set, including the number of communities (Num_Of_C), the number of friends (Num_Of_F), and the number of posts (Num_Of_P).

To determine whether our approach can consistently detect community distributions for every user, the indices “community purity” and “cluster purity” are utilized. They evaluate the correlation between the communities detected using our approach and the clusters previously defined by the target user. Because the algorithm is based on user interactions, the evaluation indices compute community detection performance only for those members who have participated in at least one post.

FIG. 4. Steps of the proposed algorithm. First, the interaction sequences are retrieved from Facebook and transformed into interaction transactions. Second, the large itemsets are found from the interaction transactions. Third, maximal itemsets are found from the large itemsets to generate initial groups. The algorithm then uses the merging indices to decide which pair of groups should be merged. The merging process is repeated until exactly K communities are obtained.

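TABLE 1. Attributes of all data sets.

| Data set | Num_Of_C | Num_Of_F | Num_Of_P |
|---|---|---|---|
| 1 | 6 | 433 | 61 |
| 2 | 9 | 655 | 112 |
| 3 | 12 | 395 | 238 |
| 4 | 5 | 322 | 100 |
| 5 | 12 | 433 | 123 |
| 6 | 13 | 483 | 158 |
| 7 | 12 | 403 | 119 |
| 8 | 4 | 232 | 51 |
| 9 | 5 | 372 | 66 |
| 10 | 7 | 510 | 100 |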


Community purity. Community purity measures the extent to which a detected community contains objects from a predefined cluster. This is similar to the concept of precision in traditional information retrieval research. For each detected community $i$, the probability $p_{ij}$ that a member of detected community $i$ belongs to predefined cluster $j$ is computed as $p_{ij} = m_{ij}/m_i$, where $m_i$ is the number of objects in detected community $i$ and $m_{ij}$ is the number of objects in both predefined cluster $j$ and detected community $i$. The community purity of detected community $i$ is $p_i^* = \max_j p_{ij}$.

Cluster purity. Cluster purity measures the extent to which a user-defined cluster contains objects from a detected community. This is similar to the concept of recall in traditional information retrieval research. For each user-defined cluster $j$, the probability $p_{ji}$ that a member of user-defined cluster $j$ belongs to detected community $i$ is computed as $p_{ji} = m_{ji}/m_j$, where $m_j$ is the number of objects in user-defined cluster $j$ and $m_{ji}$ is the number of objects in both detected community $i$ and user-defined cluster $j$. The cluster purity of user-defined cluster $j$ is $p_j' = \max_i p_{ji}$.
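Both purity measures reduce to a best-overlap ratio, computed in one direction or the other. A minimal sketch, assuming communities and clusters are given as lists of non-empty Python sets (the function name `purities` is illustrative):

```python
def purities(detected, predefined):
    """Return per-community purities and per-cluster purities as two lists."""
    # Community purity: best overlap with a predefined cluster, relative to
    # the size of the detected community (precision-like).
    community = [max(len(c & g) for g in predefined) / len(c) for c in detected]
    # Cluster purity: best overlap with a detected community, relative to
    # the size of the user-defined cluster (recall-like).
    cluster = [max(len(g & c) for c in detected) / len(g) for g in predefined]
    return community, cluster


detected = [{1, 2, 3, 4}, {5, 6}]
predefined = [{1, 2, 3}, {4, 5, 6}]
community_purity, cluster_purity = purities(detected, predefined)
print(community_purity)  # [0.75, 1.0]
```

In the toy data, the first detected community holds three members of the first predefined cluster plus one outsider, so its purity is 0.75.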

Total-community-purity and total-cluster-purity. In addition, Total-Community-Purity (TCMP) and Total-Cluster-Purity (TCRP) are proposed to evaluate overall performance. The former evaluates the extent to which detected communities can correctly map to user-defined clusters.

$$TCMP = \sum_{i=1}^{k} \frac{m_i}{m} p_i^*$$

The latter measures the extent to which the predefined clusters are correctly identified by detected communities.

$$TCRP = \sum_{j=1}^{k} \frac{m_j}{m} p_j'$$
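The two totals are size-weighted averages of the individual purities. The source does not spell out $m$; the sketch below assumes it is the sum of the group sizes being averaged over, so that the weights add up to one. That choice, like the function name `total_purity`, is an assumption of this example.

```python
def total_purity(groups, references):
    """Size-weighted average of each group's best-overlap purity.

    Assumption: m is taken as the sum of the sizes of `groups`.
    """
    m = sum(len(g) for g in groups)
    total = 0.0
    for g in groups:
        purity = max(len(g & r) for r in references) / len(g)
        total += len(g) / m * purity
    return total


detected = [{1, 2, 3, 4}, {5, 6}]
predefined = [{1, 2, 3}, {4, 5, 6}]
tcmp = total_purity(detected, predefined)   # detected communities vs. clusters
tcrp = total_purity(predefined, detected)   # clusters vs. detected communities
print(round(tcmp, 4), round(tcrp, 4))  # 0.8333 0.8333
```

Swapping the two arguments switches between TCMP and TCRP, which mirrors the symmetry between the two definitions.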

Experiment Results and Evaluation

The first part of the experiment was parameter analysis to determine the optimal parameter combinations for the entire algorithm. We decided to show the parameter testing results for only one data set, data set 10, because the results for the other data sets are similar. The seven parameters of our algorithm were the control variables within the analysis. They are the weights of merging indices $W_1$, $W_2$, $W_3$, and $W_4$, profile similarity threshold $t$, number of mutual friends threshold $q$, and minimum support value min_sup. The optimal combination of parameters was applied in the final test.

During the test, we tried various combinations of weights to find the optimal solution. Because the sum of the four merging weights must equal one, we cannot vary a weight without affecting the others. Therefore, we followed a systematic procedure to examine all combinations. The value of each parameter (i.e., $W_1$, $W_2$, $W_3$, and $W_4$) was varied from 0.1 to 0.5 in increments of 0.05. First, we varied $W_2$ by fixing the value of $W_1$, and then varied $W_3$ and $W_4$ by fixing $W_1$ and $W_2$. $W_3$ and $W_4$ are passive and are determined automatically after the first two weights are set. There are in total 434 possible combinations, and all the combinations of the four weights are tested systematically.

To save space, Table 2 shows some records extracted from the entire results of the parameter test. The optimal combination of the four indices with the highest TCMP value is (0.2, 0.2, 0.25, 0.35). This combination of weights was used to test other parameters.

TABLE 2. Results of control variable: weights of merging index.

Rid   W1     W2     W3     W4*    t     q    min_sup   Total-Community-Purity
1     0.25   0.1    0.1    0.55   0.5   50   3         0.70048
2     0.25   0.15   0.15   0.45   0.5   50   3         0.80048
3     0.25   0.2    0.2    0.35   0.5   50   3         0.80330
4     0.25   0.25   0.25   0.25   0.5   50   3         0.87516
5     0.25   0.3    0.3    0.15   0.5   50   3         0.68141
6     0.25   0.35   0.35   0.05   0.5   50   3         0.83111
7     0.1    0.25   0.1    0.55   0.5   50   3         0.70786
8     0.15   0.25   0.15   0.45   0.5   50   3         0.73111
9     0.2    0.25   0.2    0.35   0.5   50   3         0.84000
10    0.3    0.25   0.3    0.15   0.5   50   3         0.85640
11    0.35   0.25   0.35   0.05   0.5   50   3         0.74640
12    0.1    0.1    0.25   0.55   0.5   50   3         0.71050
13    0.15   0.15   0.25   0.45   0.5   50   3         0.82540
14    0.2    0.2    0.25   0.35   0.5   50   3         0.90741
15    0.3    0.3    0.25   0.15   0.5   50   3         0.78143
16    0.35   0.35   0.25   0.05   0.5   50   3         0.73143

*W4 = 1-W1-W2-W3.

Next, we fixed the weights of the four merging indices and tested the optimal value of the profile similarity threshold t, which is the threshold value used in the profile similarity index M4. Table 3 shows the results of varying t from 0.1 to 0.9. The resulting TCMP value is worse if the value of t is very low or very high, and the reasons are fairly simple. If t is set too high, there may be few or no similar pairs of friends, and we end up with an extremely low value for the profile similarity index M4. On the other hand, if t is set too low, the value of M4 becomes unreasonably high, which distorts the final results. We found that the best TCMP value is obtained when t equals 0.5.

When computing the social index M3, we need to calculate the proportion of pair-of-friends who have more than q mutual friends. Because most of our data sets have approximately 200~600 friends and 4~13 communities, 30~90 is a reasonable range for testing control variable q. From Table 4 we see that 50 is the best solution for control variable q.
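The proportion step mentioned here can be sketched as follows. This covers only the share of friend pairs with more than q mutual friends, not the full definition of M3; the function name, the dictionary of friend sets, and the toy data are assumptions for illustration.

```python
def share_above_q(pairs, friends_of, q):
    """Proportion of pairs whose number of mutual friends is greater than q.

    pairs      -- list of (user_a, user_b) tuples to examine
    friends_of -- dict mapping each user to the set of that user's friends
    q          -- mutual-friends threshold
    """
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if len(friends_of[a] & friends_of[b]) > q)
    return hits / len(pairs)


# Toy data (hypothetical): friend ids are plain integers.
friends_of = {"ann": {1, 2, 3, 4}, "bob": {2, 3, 4, 5}, "cat": {4, 9}}
pairs = [("ann", "bob"), ("ann", "cat"), ("bob", "cat")]
print(share_above_q(pairs, friends_of, 0))  # 1.0
```

Raising q makes the criterion stricter: with q = 2 only the pair sharing three mutual friends still counts.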

A minimal support threshold is used while generating initial communities. If this threshold is set too high, even users in the same community cannot constitute frequent itemsets. On the other hand, if this threshold is set too low, even users in different communities may constitute frequent itemsets. To find the appropriate value, we varied min_sup from 2 to 5. As indicated in Table 5, 3 is the best answer for the control variable min_sup.
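The effect of the support threshold can be seen in a small sketch that counts how often pairs of users appear together. It is a simplification under stated assumptions: each transaction is taken to be the set of users who interacted on one post, only itemsets of size two are counted, an itemset is treated as frequent when its count is at least min_sup (the usual convention), and all names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_sup):
    """Return the user pairs that co-occur in at least min_sup transactions."""
    counts = Counter()
    for users in transactions:
        # Sort so that each unordered pair always maps to the same key.
        for pair in combinations(sorted(set(users)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_sup}


# Toy data (hypothetical): users who interacted on four posts.
posts = [
    {"ann", "bob", "cat"},
    {"ann", "bob"},
    {"ann", "bob", "dan"},
    {"cat", "dan"},
]
print(frequent_pairs(posts, 3))  # {('ann', 'bob'): 3}
```

With min_sup = 1 every co-occurring pair qualifies, including pairs that met on a single post, while min_sup = 4 leaves no pair at all.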

After conducting this parameter analysis, we determined the optimal parameter combination to be applied to the final test in our experiment, as shown in Table 6.

In the final test, we applied the algorithm with the optimal combination of parameters to all 10 data sets. Because there is no other community detection approach based on Facebook social interactions, we simply discuss our results and performance without comparison to structure-based algorithms.

TABLE 3. Results of control variable: profile similarity threshold t.

Rid   W1    W2    W3     W4*    t     q    min_sup   Total-Community-Purity
1     0.2   0.2   0.25   0.35   0.1   50   3         0.69841
2     0.2   0.2   0.25   0.35   0.2   50   3         0.73014
3     0.2   0.2   0.25   0.35   0.3   50   3         0.73059
4     0.2   0.2   0.25   0.35   0.4   50   3         0.83056
5     0.2   0.2   0.25   0.35   0.5   50   3         0.90741
6     0.2   0.2   0.25   0.35   0.6   50   3         0.78843
7     0.2   0.2   0.25   0.35   0.7   50   3         0.73056
8     0.2   0.2   0.25   0.35   0.8   50   3         0.72114
9     0.2   0.2   0.25   0.35   0.9   50   3         0.71578

*W4 = 1-W1-W2-W3.

TABLE 4. Results of control variable: number of mutual friends threshold q.

Rid   W1    W2    W3     W4*    t     q    min_sup   Total-Community-Purity
1     0.2   0.2   0.25   0.35   0.5   30   3         0.63582
2     0.2   0.2   0.25   0.35   0.5   40   3         0.66493
3     0.2   0.2   0.25   0.35   0.5   50   3         0.90741
4     0.2   0.2   0.25   0.35   0.5   60   3         0.7699
5     0.2   0.2   0.25   0.35   0.5   70   3         0.75622
6     0.2   0.2   0.25   0.35   0.5   80   3         0.71621
7     0.2   0.2   0.25   0.35   0.5   90   3         0.6699

*W4 = 1-W1-W2-W3.

TABLE 5. Results of control variable: minimal supports min_sup.

Rid   W1    W2    W3     W4*    t     q    min_sup   Total-Community-Purity
1     0.2   0.2   0.25   0.35   0.5   50   2         0.78741
2     0.2   0.2   0.25   0.35   0.5   50   3         0.90741
3     0.2   0.2   0.25   0.35   0.5   50   4         0.86990
4     0.2   0.2   0.25   0.35   0.5   50   5         0.61622

*W4 = 1-W1-W2-W3.

TABLE 6. Best combination of parameters in analysis.

W1    W2    W3     W4*    t     q    min_sup   Total-Community-Purity
0.2   0.2   0.25   0.35   0.5   50   3         0.90741

*W4 = 1-W1-W2-W3.

From Table 7, we can see that we achieved high performance, that is, high Total-Community-Purity and high Total-Cluster-Purity, for most of our data sets. When the Total-Community-Purity is high, the detected communities are mostly correct. When the Total-Cluster-Purity is high, the predefined clusters are mostly discovered. When both values are high, the detected communities and the user-predefined groups are highly similar. In other words, our detected communities are highly similar to the true communities in the real world. The results indicate that we were successful in finding users' social communities by employing social interactions to solve the community detection problem. In addition, we show the detailed performance data for each detected community and each user-defined cluster for data set 1 and data set 10 in Tables 8 and 9, respectively. To save space, we do not show the detailed performance data for the other data sets.

We also included two visual results (Visualization of community detection) for data set 1 and data set 10. The social graph visualization software Gephi (Gephi Consortium, 2012) was used to generate the social network graph for each data set's target user. Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, along with dynamic and hierarchical graphs (Gephi Consortium, 2012). Through the visualization of the social network graph, we can easily comprehend the distribution of communities detected using our approach.

In the following, we discuss the mining results from identifying different communities among a user's friends from two data sets.

In Figure 5, it can be seen that we found seven different communities, shown as the network structure within the social graph. They are labeled 0~6 at the community edges. The user predefined groups are shown as different-colored circles, with the group name listed in the black rounded rectangle. There are classmates from elementary school, junior high school, senior high school, college "NCU" (National Central University), graduate school "NCCU" (National ChengChi University), family members, and others. In this way we can easily see the differences between them. We can also quickly identify where different communities overlap, such as the two red nodes in the figure that simultaneously belong to both "NCU" and "NCCU."

As another example, Figure 6 provides a complete picture of a user's community distribution. Similarly, we see that this user has six different communities that correspond to his/her real-world social communities.

Conclusion

This article investigated the problem of identifying user communities in a social network by taking advantage of interactions on Facebook. Most existing methods in previous research detect communities based on the structural information of the entire social network. Unfortunately, in reality it is difficult to obtain complete network structure information for online social networks because of personal privacy issues, users' free choice in connecting with others, difficulty of accessing data, and so on. These factors affect the accuracy and effectiveness of community detection. Therefore, we wanted to resolve the deficiencies of the structure-based approach by proposing an interaction-based approach. Our algorithm used the interaction data to form interaction transactions and found frequent interactive patterns as initial communities. Next, we defined four indices of similarity for each pair of communities, which we used to select the pair of communities with the highest FMS score to merge. We continued merging the communities until the number of communities equaled K.
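The merging loop summarized above can be sketched as a greedy procedure. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, merging is modeled as a set union, and a Jaccard overlap stands in for the FMS score, which in the article combines four weighted indices.

```python
from itertools import combinations


def merge_until_k(communities, score, k):
    """Repeatedly merge the highest-scoring pair until k communities remain.

    communities -- iterable of sets of member ids (initial communities)
    score       -- function(set, set) -> float; stand-in for the FMS score
    k           -- target number of communities (K in the text)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    communities = [set(c) for c in communities]
    while len(communities) > k:
        # Pick the index pair whose communities score highest.
        a, b = max(
            combinations(range(len(communities)), 2),
            key=lambda ij: score(communities[ij[0]], communities[ij[1]]),
        )
        merged = communities[a] | communities[b]
        communities = [c for idx, c in enumerate(communities) if idx not in (a, b)]
        communities.append(merged)
    return communities


def jaccard(x, y):
    """Hypothetical stand-in score: shared members over all members."""
    return len(x & y) / len(x | y)


# Toy data (hypothetical): four initial communities reduced to two.
initial = [{"ann", "bob"}, {"bob", "cat"}, {"dan", "eve"}, {"eve", "fay"}]
result = merge_until_k(initial, jaccard, 2)
print(sorted(sorted(c) for c in result))
```

Because the score function is a parameter, a weighted combination of several indices can be substituted without changing the loop.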

We collected data sets from Facebook as experiment resources. Our results show high average scores of TCMP and TCRP (both approximately 0.8) when finding users' communities on a social network. The results indicate that the communities detected by our algorithm are very similar to the clusters defined by users. From the community visualizations, we realized that a community is composed of a group of people sharing some common characteristics, such as friends, family, classmates, work, common interests, and so on.

TABLE 7. Performance of all data sets.

Data set   TCMP       TCRP
D1         0.849315   0.849315
D2         0.683871   0.711268
D3         0.964912   0.964912
D4         0.705357   0.705357
D5         0.959184   0.989130
D6         0.785700   0.785700
D7         0.600000   0.729730
D8         0.755102   0.707317
D9         0.724490   0.724490
D10        0.907407   0.888889
Avg.       0.793534   0.805611

TABLE 8. Detailed performance of data set 1.

Detected community   Community-Purity   User-defined cluster   Cluster-Purity
C1                   0.7857             G1                     1.0000
C2                   1.0000             G2                     0.8333
C3                   0.8571             G3                     0.9167
C4                   0.9167             G4                     0.6957
C5                   0.7143             G5                     1.0000
C6                   0.8889             G6                     0.8571

TABLE 9. Detailed performance of data set 10.

Detected community   Community-Purity   User-defined cluster   Cluster-Purity
C1                   1.0000             G1                     0.7000
C2                   0.9091             G2                     0.8333
C3                   1.0000             G3                     1.0000
C4                   0.8571             G4                     1.0000
C5                   1.0000             G5                     0.7500
C6                   1.0000             G6                     1.0000
C7                   0.8333             G7                     1.0000

FIG. 5. Visualization of communities in data set 10. The friends for this target user are divided into seven groups. Some of the users (i.e., friends) belong to more than one community and each of them is represented as an overlapping node. The green node in the figure is a user who is a high school classmate as well as an NCU (college) classmate of the target user. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

FIG. 6. Visualization of communities in data set 1. The friends for this target user are divided into six groups. One of the users (i.e., friends) belongs to more than one community. It is represented as an overlapping node (i.e., the green node in the figure). This node is a user who is an NCU (college) classmate as well as an AUO (company) colleague of the target user. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

The experimental results showed that the average scores of TCMP and TCRP for the proposed method are both approximately 0.8. A value of 0.8 may seem low compared with the accuracy reported by some previous community detection research that also used accuracy as the performance metric. For example, the work of Cai, Shao, He, Yan, and Han (2005) reached an accuracy score of more than 0.9. By that comparison, the performance in our experiment is not impressive. However, a direct comparison between the proposed method and other community detection methods may not be appropriate because the design of our method is different. Most research on community detection uses clustering-like methods and the density of links between nodes to detect the structure of a network. The proposed method detects not only the community structure but also the multiple labels of a user. Moreover, the proposed method also deals with the problem of missing structure in a network. Having multiple labels and lacking a fully known structure better reflect actual relationships among people in real life. In this context, our proposed algorithm provides a reasonable clustering result. Future work can use our findings as a basis to develop new methods or to further enhance performance.

The following are some issues that can be addressed in the future. First, in computing the profile similarity for a pair of users, we had difficulty in determining whether the values of the same field of the two users were the same because a user profile may contain synonyms, missing values, incomplete information, and fake values. The algorithm can be extended with an ontology or a dictionary, or by using the Google search engine, to resolve the synonym problem. A second issue is to extend our algorithm so that even communities containing friends who are inactive or idle in social interactions can be detected correctly. Our method effectively detects communities in social networks when friends frequently interact with each other on posts. However, in reality there are always some who are inactive in social interactions. A possible approach is to develop a classification model that assigns inactive users to the correct communities using structure information or profile content information. A third issue is to design an optimization methodology that decides the right number of communities (i.e., K) and detects communities automatically without user intervention. Finally, the last issue is to develop an interaction-based method to detect communities in dynamic and evolving networks.

There are several implications from the experimental results. Detecting hidden communities in the network of a user is clearly a popular and highly interesting topic in social network analysis. For marketers or businesses, seeing all of a target user's different communities is a key to customizing marketing strategies. For instance, a game company can promote its new games using the information on community structures. A user may be more willing to join or buy the new game when his/her friends in the same community are playing the game or when the promotion information is recommended by his/her friends in the same community. This is conceptually similar to the idea of word-of-mouth promotion. The possible relevance to other research is that this research provides a good starting point for considering additional features as a complement in detecting the community distribution of a user. Because a user on Facebook has a variety of features, future work may consider features other than those used in this research, such as check-in information, logs, links, images, and videos, to find hidden communities.

References

Abello, J., Resende, M., & Sudarsky, S. (2002). Massive quasi-clique detection. Proceedings of the 5th Latin American Symposium on Theoretical Informatics, 598–612.

Akcora, C.G., Carminati, B., & Ferrari, E. (2011). Network and profile based measures for user similarities on social networks. IEEE International Conference on Information Reuse and Integration (IRI) (pp. 292–298).

Albert, R., Jeong, H., & Barabási, A.L. (1999). Diameter of the world-wide web. Nature, 401, 130–131.

Asratian, A.S., Denley, T.M.J., & Häggkvist, R. (1998). Bipartite graphs and their applications. New York: Cambridge University Press.

Branting, L.K. (2012). Context-sensitive detection of local community structure. Social Network Analysis and Mining, 2(3), 279–289.

Bron, C., & Kerbosch, J. (1973). Finding all cliques of an undirected graph. Communications of the ACM, 16, 575–577.

Cai, D., Shao, Z., He, X., Yan, X., & Han, J. (2005). Mining hidden community in heterogeneous social networks. Proceedings of the 3rd International Workshop on Link Discovery (LinkKDD '05) (pp. 58–65).

Chen, J., Zaiane, O., & Goebel, R. (2009). Local community identification in social networks. Proceedings of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Athens, Greece, 20–22 July.

Clauset, A., Newman, M., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111-1–066111-6.

Donetti, L., & Munoz, M.A. (2004). Detecting network communities: A new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment, P10012.

Du, N., Wu, B., & Wang, B. (2006). A parallel algorithm for enumerating all maximal cliques in complex networks. The 6th ICDM2006 Mining Complex Data Workshop (pp. 320–324).

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486, 75–174.

Gephi Consortium. (2012). The Open Graph Viz Platform. Retrieved from https://gephi.org/

Girvan, M., & Newman, M. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences (PNAS), 99(12), 7821–7826.

Girvan, M., & Newman, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113-1–026113-15.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques, 2nd ed. San Mateo, CA: Morgan Kaufmann Publishers.

Hastings, M.B. (2006). Community detection as an inference problem. Physical Review E, 74(3), 035102.

Jones, K. (1993). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21.

Li, X.-L., Foo, C.-S., & Ng, S.-K. (2007). Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. IEEE Computer Society Bioinformatics Conference (pp. 157–168).

Li, X.-L., Foo, C.-S., Tew, K.L., & Ng, S.-K. (2009). Searching for rising stars in bibliography networks. In X. Zhou, H. Yokota, K. Deng, & Q. Liu (Eds.), Proceedings of the 14th International Conference on Database Systems for Advanced Applications (pp. 288–292). New York: Springer.

Li, X.-L., Tan, S.-H., Foo, C.-S., & Ng, S.-K. (2005). Interaction graph mining for protein complexes using local clique merging. Genome Informatics, 16(2), 260–269.

Li, X., Wu, M., Kwoh, C.-K., & Ng, S.-K. (2010). Computational approaches for detecting protein complexes from protein interaction networks: A survey. BMC Genomics, 11(Suppl 1), S3–19.

McPherson, M., Smith-Lovin, L., & Cook, J. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27, 415–444.

Newman, M.E.J. (2004). Detecting community structure in networks. The European Physical Journal B, 38, 321–330.

Nisheeth, S., Anirban, M., & Rastogi, R. (2008). Mining (social) network graphs to detect random link attacks. In IEEE 24th International Conference on Data Engineering (ICDE 2008) (pp. 486–495).

Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (2005). Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435, 814–818.

Papadopoulos, S., Kompatsiaris, Y., Vakali, A., & Spyridonos, P. (2012). Community detection in social media performance and application considerations. Data Mining and Knowledge Discovery, 24, 515–554.

Pei, J., Jiang, D., & Zhang, A. (2006). On mining cross-graph quasi-cliques. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 228–238).

Pons, P., & Latapy, M. (2005). Computing communities in large networks using random walks. Computer and Information Sciences, 284–293.

Qiu, J., & Lin, Z. (2011). A framework for exploring organizational structure in dynamic social networks. Decision Support Systems, 51, 760–771.

Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., & Parisi, D. (2004). Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101(9), 2658–2663.

Redner, S. (1998). How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B - Condensed Matter and Complex Systems, 4(2), 131–134.

Rees, B.S., & Gallagher, K.B. (2012). Overlapping community detection using a community optimized graph swarm. Social Network Analysis and Mining (in press).

Scott, J. (2000). Social network analysis: A handbook. London: Sage Publications.

Steinhaeuser, K., & Chawla, N. (2009). A network-based approach to understanding and predicting diseases. New York: Springer.

Wasserman, S., & Faust, K. (1994). Social network analysis. Cambridge, UK: Cambridge University Press.

Wu, B., & Pei, X. (2007). A parallel algorithm for enumerating all the maximal k-plexes. In Proceedings of the 2007 International Conference on Emerging Technologies in Knowledge Discovery and Data Mining (pp. 476–483).

Wu, F., & Huberman, B.A. (2003). Finding communities in linear time: A physics approach. CoRR cond-mat/0310600.

Wu, M., Li, X., Kwoh, C.-K., & Ng, S.-K. (2009). A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics, 10(169).

Zeng, Z., Wang, J., & Karypis, G. (2006). Coherent closed quasi-clique discovery from large dense graph databases. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 797–802).

