identifying and evaluating the internet opinion leader community based on k-clique clustering

ORIGINAL ARTICLE

Identifying and evaluating the internet opinion leader communitybased on k-clique clustering

Weizhe Zhang • Hui He • Boran Cao

Received: 15 February 2013 / Accepted: 4 December 2013

� Springer-Verlag London 2013

Abstract With the rapid development of the Internet, the

Internet has become an important place for producing and

spreading public sentiment. Opinion leaders play an

important role in leading the public opinion. In this paper,

we extract the communities by analyzing the replies of

each post in the bulletin board system. Then, an opinion

leader community mining method is proposed based on the

level structure. Thus, the communities have better overlaps

and multiple relations. Also, we analyze the revolution of

the opinion leader communities and put forward a time-

dividing method. With this method, we divided whole

communities into different pieces based on the character of

the post and the duration of the time. And we come up with

suitable parameters to receive the evolution results of the

communities. Finally, our experiments prove the efficiency

of the opinion leader community mining method, and we

summarize the properties of the opinion leader community

in revolution.

Keywords Public sentiment � Opinion leader �Discovery of the community � Revolution of the

community

1 Introduction

With the rapid cross-discipline between modern computing

technology and social sciences, social computing has

become more and more important. Social computing is an

area of research at the crossroads of social sciences and

computer science, facing the social activities, structures,

and functions. On one side, it focuses on the computer

applications and information technology in the society. On

the other hand, it is based on social science knowledge,

theory and approaches. It uses the computing and infor-

mation technology to research and enhance the efficiency

of human social activities.

A bulletin board system (BBS) has been a popular

communication platform where people reflect the phe-

nomenon and status of current society by showing trends in

the development of the society. People called opinion

leaders, who have the ability of leadership, are in a central

role of the community. The opinion leaders are activists that

often provide information, opinions, comments, and influ-

ence to others in interpersonal communication networks;

they serve as the intermediaries for the formation of the

mass effect or filter links. By analyzing their comments, we

can understand the general trends of each post in advance.

Therefore, it is important for us to identify the community’s

opinion leaders according to public sentiment.

Besides, it is helpful for people to understand the

dynamic law of the whole network by finding the evolution

of the opinion leader community in the public sentiment.

Therefore, we can achieve optimal network structure,

resources, search, and resources, with the basis for such

recommendation. However, existing community discovery

methods often just extract the parallel relationship between

community structures. These communities are either

independent of each other or overlap. They do not consider

the level relations that may exist between the communities.

What is more, it is not reasonable to partition the types of

the evolution of the communities. Since community change

is often complex, it is difficult to describe a community in

an accurate way only with increasing, reducing, combining

and so on [1, 2].

W. Zhang (&) � H. He � B. Cao

School of Computer Science and Technology, Harbin Institute of

Technology, Harbin 150001, China

e-mail: [email protected]

123

Neural Comput & Applic

DOI 10.1007/s00521-013-1529-1

This paper is organized as follows. In Sect. 2, the related

works are introduced. In Sect. 3, the method of mining the

opinion leader community and how to evaluate the evolu-

tion of the community are presented. The experimental

results and the analysis in this paper are also presented in

Sect. 4. Finally, our work of this paper is summarized in

the last section.

2 Related works

Kemighan and Lin [3] used Java Forum Q&A network as

the objects of researches, HITS and other algorithms was

used to evaluate the authority of the users. Matsumura et al.

[4] studied emotional tendencies opinions network and

identified the opinion leaders based on the sentiment analysis

of operator Law. Zhang et al. [5] studied the BBS network

leaders to identify problems. Based on the right to forward

network assumptions, they proposed an interest-based page

rank algorithm, and measured Tsinghua Forum data.

Among all the methods of the mining community, the

Kernighan–Lin method [6] is a greedy algorithm that divides

the net into two known pieces. It requires knowing the size of

the community in advance. The complexity of the algorithm is

based on the Laplace matrix eigenvalue [7], while the tradi-

tional method is based on a spectral split resistor network

voltage spectrum. The complexity of the Huberman algorithm

is low, but it requires the knowledge of the number of com-

munity networks. However, in the actual network, the number

of communities is unknown. Some of the parameters in the

methods are not very clear, so they are not convenient to use.

The GN (abbreviation for Girvan and Newman who

proposed the algorithm) algorithm [8] is representative of a

splitting algorithm. The idea is to constantly derive the

largest referral number from the network side. The GN

algorithm has two main shortcomings, namely the lack of

definition in the amount of community structures and the

second is the need for double counting. Newman et al. [9]

proposed a fast algorithm based on greedy algorithm ideo-

logical cohesion, which applies to large-scale complex net-

works. The defect in Newman’s algorithm is that when using

either a split or clustering algorithm, the network can be

divided into separate communities. However, the networks

of a community are often interrelated and overlapping. And

the methods are not community-based.

The existing deterministic methods are used for large

networks find separated communities. But most of the actual

networks are made of highly overlapping cohesive groups of

nodes. Palla et al. [10] introduced an approach to analyze the

main statistical features of the interwoven sets of overlap-

ping communities making a step toward the uncovering of

the modular structure of complex systems. After defining a

set of new characteristic quantities for the statistics of

communities, they apply an efficient technique to explore

overlapping communities on a large scale and find that

overlaps are significant, and the distributions we introduce

reveal universal features of networks.

In order to extract a reasonable structure from the com-

munity, the community discovery method introduces time

properties. Toyoda studied the evolution of the Web com-

munity [11]. A Web community is constituted by a number

of Web pages. The Web pages are connected by hyperlinks

which connect to the mass. These pages are usually on the

same subject; the changes of the community directly

respond to changes in the Web topic. Using the respective

Web pages from 1999, 2000, 2001, and 2002, we built the

network and extracted the communities from the archives,

creating community maps, analyzing the evolution of Web

communities, and the emergence of the phenomenon. In

analyzing the evolution of the community, the introduction

of a series of metrics, such as growth, stability, novelty,

disappearance rate, fusion rate, and cleavage rate, is used to

describe the complex changes in the community [12].

Therefore, we have adopted a community-based level

discovery method and introduced a number of algorithms

to find opinion leaders in the extraction of the community,

analyzing the evolution of the network community

according to the different nature of the forum posts using

different length of time division and proposed measure-

ment mechanisms.

3 Parallel proving algorithm based on semi-extension

rule

In this section, we use a level-based community extraction

method when extracting an opinion leader of a community.

Moreover, we come up with some approaches based on the

longest common sequences (LCS) and depth-first-search

(DFS) algorithm in mining the opinion leaders.

Community definition is based on the observation that a

typical node (person) in a community is linked to many

other nodes (persons), but not necessarily to all other nodes

in the community. In other words, a community can be

interpreted as a union of smaller complete (fully con-

nected) sub-graphs that share nodes. We believe that a

community can be seen as a collection of connected groups

in some sense where a group is defined as a fully connected

network diagram. Groups consisting of k nodes are called

the k-cliques. And the value of k is greater than or equals to

2. Here, we do not consider 2-clique which only contains 2

members. Each node presents an individual in a clique or

post. We get the relationship between the individuals from

the graph which consists of the nodes above. If there are

k - 1 common nodes between two groups, we called them

adjacent. If a k-clique can reach another k-clique through a


123

number of adjacent k-cliques, then we called the two k-

cliques as connected. K-clique communities in the network

can be seen as a collection consisting of mutually con-

nected k-cliques. We can define a k-clique chain as the

union of adjacent k-cliques and introduce the k-clique

connectivity: two k-cliques are connected if they are a part

of k-clique chain.

Clustering all of the k-cliques is a necessary condition to

find a k-clique community. At first, we should estimate the

maximum k from the network according to the degree of

each node; then, find the k-clique which contains the cer-

tain node; delete the edge related with or neighbored with

the node in last step, and repeat the second step until

finding all of the k-cliques. At last, let k be k - 1 to repeat

the same thing above to find the whole community [13].

The method to cluster the k-clique community is as

follows: We define a certain node v, two sets A and B. Set

A contains all of the nodes related to the node v. Set

B consists of all the neighbor nodes of set A. For a certain

S, the largest possible clique size is determined by the max

degree in the community.

(1) Initialize set A and set B;

(2) Move a node of B to set A, and delete the nodes that

are not connected with the nodes in set A;

(3) If |A| \ S and |B| = 0, or set A and B are the subset of

clique whose size is smaller than S, stop computing

and return to the second step;

(4) If |A| = S, record the clique, and return to the second

step.

First, the algorithm mines all the values of k to get to the

largest complete sub-graph. The largest complete sub-graph

is called as a clique. The distinction between k-clique and

the clique is that k-clique can be a subset of a larger com-

plete sub-graph. Once the clique has been identified, we can

get the clique overlap matrix. Each row or column of the

matrix corresponds to a group. Each element in the matrix

presents the number of public nodes between the corre-

sponding cliques. We think the k-clique community equals

the connected component of the k-clique. We can find a k-

clique community using the overlap matrix of the clique.

The element in the matrix which has a value less than k - 1

is set to 0 and the remaining elements are set to 1, resulting

in the k-clique connection matrix. Each connected part

constitutes a k-clique community. In this way, we can get a

k-clique community from node v with certain size. The time

complexity of the above method is O(CNDlnN) in where

C and D are fitting parameters.

One of the advantages of this method is: the group–

group overlap matrix contains all the necessary information

for an arbitrary value of k communities. Therefore, once

the overlap matrix is constructed, any k-group community

can be very fast. Without using this method, a simple

k-clique searching method to find each of different k-clique

has to search by starting over.

In addition, when we find opinion leaders in the com-

munity we have found above, we should analyze the nodes

in the cliques.

There are two types of nodes in the community, which

are in no clique or at least in one clique. We should analyze

them in both.

1. For the nodes that are in the cliques, the red node, for

instance, is an intersection of two cliques and public

nodes present their important positions in a clique or in

a community. Public nodes have more chances to

communicate with others. Only in this way, they could

be important nodes or leaders. However, the blue

nodes are not which are less important. In this way, we

put forward the method based on the LCS algorithm.

2. For the nodes that are not in the cliques, the yellow

node, for example, connects the three cliques, but does

not belongs to any of the clique. If we ignore this node,

these three cliques may lose connection with each

other and become isolated. In addition, the cut point is

very important in network graph. It can be regard as an

opinion leader. Therefore, we come up with an

approach based on DFS algorithm.

For two sequences, Xm = x1 x2 x3 … xm and Yn = y1 y2

y3 … yn, the algorithm is as follows:

If in a connected undirected graph an arbitrary vertex is

deleted, the remaining graph is still connected, then such

non-connected graph is called two-pass connected.


123

If a node in a bi-connected graph is compared to a

computer, then, if a computer is faulty and cannot run, the

network e-mail will not be affected, but if the network

graph is not connected, then removing a certain point will

cause the failure of the entire network, which means it

cannot function properly. We find that cut points and

bridges have obvious military significance and economic

significance. Therefore, we need to find the community

network cut point or bridge.

Finding a common cut-point algorithm is based on the

DFS algorithm; record the depth (dep) of each node in the

DFS algorithm and the shallowest depth its descendants

can achieve. For each of node m: (1) If node m is the root

node and has two or more sons, then m is a cut point; (2) If

m is not a root and there is a son v, and low[v] ] dep[m],

then m is the cut point in which low[v] presents the shal-

lowest depth node v can arrive, and dep[m] presents the

visit order of node m in DFS algorithm.

We can know the importance and the position of each

node by finding the cut point in the community. And we

can find the importance in the connection of the node in the

certain community.

When we analyze the revolution of the community

which we have found above, we introduce time properties.

A reasonably selected time period has a great impact on

the evaluation of the community revolution; one needs to

consider the reality of the characteristics of the system

itself on how to determine the length of the time period. If

the divided time period is too long, it is difficult to find

the emergence of the network phenomenon. If the divided

time period is too short, the number of unstable network

nodes and edges may be greatly increased; the results of

the community revolution are more susceptible to noise

interference, and it will increase computational cost [14,

15].

Then, we put forward some parameters to measure the

community revolution. We define ci and cj that represent

the number of the nodes (persons) in the community i and

j.

Definition 1. Stability Stability is the ratio of the static

members in the revolution of the community. The higher

the stability is, the more of the stable members in the

community. If the stability is one, then the whole com-

munity keeps unchanged.

Rstabilityðci; cjÞ ¼jci \ cjjjcij

ð1Þ

Definition 2. Disappearance Disappearance is the

number of the members that disappeared which is in

proportion to the members that have disappeared and

inversely proportional to the size of the community itself.

Rdisappearðci; cjÞ ¼jci � cjjjcij

ð2Þ

Definition 3. Growth Growth is the ratio of the members

that increased in the community revolution which is in

proportion to the members that are new born and inversely

proportional to the size of the community itself.

Rgrowðci; cjÞ ¼jcj � cijjcij

ð3Þ

Definition 4. Alteration Alteration reflects the extension

of the changes of the members in the community

revolution.

Ralterationðci; cjÞ ¼jcj � cijjcjj

ð4Þ

Definition 5. Correlation Correlation shows the

relationship between two communities. It is defined as

follows:

CRðci; cjÞ ¼jci � cjjjci � cjj

ð5Þ

If ci and cj are the same structure of one community in

an adjacent time status, then the relevance between them is

called self-correlation. It is obvious that the range of the

correlation is [0, 1]. If CR(ci,cj) = 1, the members in the

community keep unchanged, if CR(ci,cj) = 0, the members

in the community are totally different.


123

4 Experiments

In this section, the data are collected from the TianYa

forum and the start time of the posts is from December,

2010. We chose posts that contained between 10,000 and

20,000 replies, with long time spans of about 1–2 years.

We use the methods that have been described above to find

opinion leaders and evaluate the community revolution.1

The environment of the experiments is shown as

follows:

• CPU: Intel(R) Core(TM)2 CPU 1.67 GHz

• Cache size: 512 KB

• Memory: 2074492 KB

• Operation System: Windows7, Linux9.04

For instance, if the time span of the post is 2 months,

and the time division is 1 month for only 2 months, then

we cannot get the desired results. If the time is divided into

day segments, then we will require larger amounts of data

analysis. The results are messy and difficult to clearly

reflect the results of the evolution of the community. In

each network diagram extracted, we count the number of

nodes, edges, average degrees of nodes, and other features.

Based on the level extraction method and the LCS

algorithm, we get several public nodes in the posts and

analyze that during the 8 weeks, there are 81 individuals

who are active speakers. The 81 nodes in the community

structure are rather stable. These people are in a more

central location during the 8 weeks.

Later, among the 81 individuals, we find that 40 out of

81 nodes can be a cut point with the DFS-based method.

Thus, we get the connectivity of the certain cut point in the

community graph. Therefore, we can say that these 40

individuals found within the 81 total individuals are more

important.

Through this extraction method and the discovery of the

opinion leaders approach, we can effectively find the

opinion leaders in the post.

Table 1 shows the result of the method proposed in the

previous section with definition (1)–(5). From Table 1, we

notice the node is changing rapidly. Sometimes the

increasing speed is high and sometimes low. The disap-

pearance is at a high speed between 0.6 and 0.9. Therefore,

the change is also high, resulting in the correlation between

0.1 and 0.2.

Table 2 shows the statistics of a single post, including

the new nodes, the disappearance of nodes, and the public

nodes. We can draw from the table an increase of 1,920

new born nodes in the post. However, the disappearance of

2,026 nodes is more than the number of new born nodes.

The number of participants decreases very fast after

4 weeks. This explains that as time goes by, the posts’

popularity gradually decreases and the number of people

who respond reduce.

Table 3 shows the total number of communities, the

number of large communities, the number of small com-

munities, and the number of cliques in the single post.

According to the post kinds, the communities’ size, and the

number of the communities, we define a community as a

large community if the number of nodes in the community

is greater than or equal to 4. Small communities are defined

by having a number of nodes fewer than 4. And we justify

the results for all kinds of posts. From this table we have a

small number, eleven, of large communities, while the

number of small communities is 42. Therefore, the

emergence number of small communities is more than

large communities and small communities change faster

(Fig. 1).

To get the number of nodes in the post over time, Fig. 2

shows the number of nodes in the distribution of the single

post in each week. A trend can be seen from the figure that

the number of nodes is decreasing overall. But in the third

week there is a small increase. We get the number distri-

bution of the responses between people (see Fig. 3) in a

single post. In the third week it reaches the maximum

Table 1 The changes of the parameters in measuring the

communities

Time Stability Disappearance Growth Alteration Correlation

Week 1 – – – – –

Week 2 0.24 0.76 0.90 1.66 0.13

Week 3 0.37 0.63 0.97 1.70 0.19

Week 4 0.28 0.72 0.49 1.21 0.22

Week 5 0.01 0.99 0.01 1.00 0.01

Week 6 0.14 0.86 0.71 1.57 0.08

Week 7 0.08 0.92 0.17 1.19 0.07

Week 8 0.33 0.67 1.13 1.80 0.14

Table 2 The initial information of the communities and the changes

of the nodes

Time Increased

number

Disappeared

number

Public nodes

number

Week 1 – – –

Week 2 708 592 188

Week 3 867 562 334

Week 4 322 869 332

Week 5 7 587 7

Week 6 10 12 2

Week 7 2 2 1

Week 8 4 2 1

1 http://www.tianya.cn/publicforum/content/funinfo/1/2193160.shtml.


123

http://www.tianya.cn/publicforum/content/funinfo/1/2193160.shtml

number of responses, and then the number of responses is

decreasing.

Figure 4 shows the distribution of the number of com-

munities in a single post. In general, the curve shows a

downward trend, but there is a small upward trend during

the period. It proves that the popularity of the post reduces,

but the number of people involved will not soon be reduced

as time goes by. Instead, some new people come to par-

ticipate in a forum reply.

The analysis of the evolution of multi-posts in the Tia-

nYa Forum is similar to the single post. The main proce-

dure is as follows. First, we compare the number of nodes,

replies, the number of communities generated and the

number of groups between each post at the same time

period with similar characteristics. Then, find nodes which

are in an important position in each post. Finally, we

analyze the characteristics of the evolution of different

kinds of posts.

Figure 5 shows the number of responses between people

in sever posts during the same period of time. We can see

that in the No 2 post during the 2008.6–2009.3 periods, the

number of responses increased dramatically in the previous

2 months, while nearly no replies appear after the second

month. In the No 1 and No 7 posts, the number of

responses balances, but the duration of the No 7 post is

longer. The No 3 post has the longest duration and first

increases then reduces wave changes. The reason is that, at

the beginning the replies gradually increased, and then the

number of participants started to decrease. After that there

will be a new group of people participating in the discus-

sion of the posts.

Figure 6 shows the distribution of the number of nodes

in seven posts during the same period of time. We can see

that the number of nodes in the basic graph is similar to the

changes of the number of responses. We can simply infer

that when the number of nodes increases, the number of

responses will also increase.

Figure 7 shows the changes of the number of cliques

that are created during the same period of time in seven

posts. It shows that the number of cliques in the group

generated by each post is proportional to the number of the

nodes and the number of edges. But the relationships,

whether close or not, determine the number of generated

groups to a certain extent.

Table 3 The number of the communities and the cliques in the posts

Time Community

number

Large

community

number

Small

community

number

Clique

number

Week 1 16 3 13 1343

Week 2 6 2 4 3191

Week 3 11 3 8 4756

Week 4 10 1 9 1581

Week 5 2 1 1 11

Week 6 5 1 4 5

Week 7 1 0 1 1

Week 8 2 0 2 2

Fig. 1 Diagram of analysis of the nodes

Fig. 2 Distribution of the nodes in communities

Fig. 3 Distribution of the edges in communities


123

Through the analysis of multi-posts, the evolution of

different posts is different. Boring and unattractive posts,

such as news, can get only few responses during the early

time and a small number of communities are generated.

The number of important roles is also rather small. For

those attractive posts which can cause attention, such as

emotional life, entertainment, politics, can last for a long

time. The number of communities is large.

Finally, we took more than 1 year (from December 2010

to May 2012) to add a strong evaluation–human validation,

whose main idea is to compare the leaders found by our k-

clique clustering with the ones given by communities

through polls. We initiated a vote, which is ‘‘Who are the

leaders of public opinion in the TianYa forum; please elect

the top ten of them’’. In total, 32,279 persons volunteered

to vote. We collected and ranked the top 40 leaders

according to the number of the polls. Table 4 shows that

among 40 leaders found by our k-clique clustering, about

31 ones are also given through polls. The precision rate is

77.5 %.

5 Conclusions

Because the traditional extraction methods are not adaptive

for community social networks, we use the level-based

method to extract the opinion leader community. The data

from the TianYa Forum are analyzed. The nodes with high

betweenness can be cut points, which have more chances to

communicate with others. This paper analyzes the evolu-

tion of the community according to different network

characteristics by dividing it into different timepieces.

Then, we introduce a measuring method and k-clustering

algorithms for mining opinion leaders and evaluating the

community revolution. In the experiment, the level-based

Fig. 4 Distribution of the number in communities

Fig. 5 The number of the edges in each of the post

Fig. 6 The number of the nodes in each of the post

Fig. 7 The number of the cliques in each of the post

Table 4 The human validation results by comparing the leaders

found by our k-clique clustering with those given through polls

Leaders found by k-clique

clustering

Leaders given

through polls

Rate of

success (%)

40 31 77.5


123

community extraction method and the opinion leaders

finding algorithm are effective. The association between

communities on the TianYa Forum is around 0.3 in gen-

eral. There are a few large communities, which are stable.

However, there are many small communities, which are not

stable, easy to burst and disappear. Overall, the results are

more in accord with the real scenes.

Acknowledgments This work was supported by the National Basic

Research Program of China under Grant No. G2011CB302605, the

National Natural Science Foundation of China (NSFC) under Grant

No. 61173145, and the National High Technology Research and

Development Program of China under Grant No. 2011AA010705.

References

1. Changai H, Zhu L (2010) The analysis and the evaluation of

complicated network software. Digit Libr Forum 5:33–39

2. Wang Y, Xia H, Yan R (2008) The analysis of the social network

and the study of the application cases of NetDraw. Mod Edu

Technol 18(4):85–89

3. Kemighan BW, Lin S (1970) An efficient heuristic procedure for

partitioning graphs. Bell Syst Tech 49(2):212–230

4. Matsumura N, Ohsawa Y, Ishizuka M (2002) Influence diffusion

model in text-based communication. Trans Jpn Soc Artif Intell

17:259–267

5. Zhang J, Ackerman MS, Adamic L (2007) Expertise networks in

online communities: structure and algorithms. In: Proceedings of

the 16th international conference on World Wide Web. ACM

6. Zhou H, Daniel Z, Changli Z (2009) Finding leaders from opinion

networks. In: IEEE international conference on intelligence and

security informatics, 2009. ISI’09. IEEE

7. Pothen A, Simon HD, Liou K-P (1990) Partitioning sparse

matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl

11(3):430–452

8. Michelle G, Newman MEJ (2002) Community structure in social

and biological networks. Proc Natl Acad Sci USA

99(12):7821–7826

9. Newman MEJ, Michelle G (2004) Finding and evaluating com-

munity structure in networks. Phys Rev E 69(2):026113

10. Palla G et al (2005) Uncovering the overlapping community

structure of complex networks in nature and society. Nature

435(7043):814–818

11. Toyoda M, Kitsuregawa M (2003) Extracting evolution of web

communities from a series of web archives. In: Proceedings of the

fourteenth ACM conference on hypertext and hypermedia. ACM

12. Capocci A et al (2005) Detecting communities in large networks.

Phys A Stat Mech Appl 352(2):669–676

13. Palla G, Derenyi I, Vicsek T (2007) The critical point of k-Clique

percolation in the Erd}os–Renyi graph. J Stat Phys

128(1–2):219–227

14. Palla G, Barabasi A-L, Vicsek T (2007) Community dynamics in

social networks. Fluctuat Noise Lett 7(03):L273–L287

15. Palla G, Barabasi A-L, Vicsek T (2007) Quantifying social group

evolution. Nature 446(7136):664–667


123

identifying and evaluating the internet opinion leader community based on k-clique clustering

Documents