clique-based network clustering
TRANSCRIPT
![Page 1: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/1.jpg)
Guang Ouyang
Advisor: Dipak Dey
1
![Page 2: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/2.jpg)
Facebook, LinkedIn, Internet, Instagram, Tweets, Google+, Quora, WeChat, Stack Overflow, ResearchGate
2
![Page 3: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/3.jpg)
Small World: everyone and everything is six or fewer steps away, by way of introduction, from any other person in the world.
Power Law: the degree distribution follows a long-tailed power law.
Community Structure: community groups based on common location, interests, occupations, etc. are quite common in real networks.
3
![Page 4: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/4.jpg)
Detect community structure in large and complex networks.
Communities can be viewed as a summary of the whole network, which is therefore easier to visualize and analyze.
Communities provide important information for applications such as market segmentation and building recommender systems.
4
![Page 5: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/5.jpg)
Network data is a graph structure made up of ‘nodes’ and ‘links’ that connect them.
Network data tends to have a ‘discrete’ similarity matrix.
Most clustering algorithms work on a ‘continuous’ distance or similarity matrix.
Real-world networks are usually very large. Even is unbearable in terms of time or space.
5
![Page 6: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/6.jpg)
Edge list:[(1,2),(1,3),(3,4),(4,5),(5,3), (3,6),(6,1), (7,4), (6,7)].
Adjacency matrix:
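The matrix itself did not survive in this transcript. As a minimal sketch, assuming the links are undirected and the nodes are labeled 1 to 7, it can be rebuilt from the edge list as follows (the helper name `adjacency_matrix` is illustrative, not from the slides):

```python
# Edge list from the slide; nodes are labeled 1..7.
edges = [(1, 2), (1, 3), (3, 4), (4, 5), (5, 3), (3, 6), (6, 1), (7, 4), (6, 7)]

def adjacency_matrix(edges, n):
    """Return the symmetric 0/1 adjacency matrix of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1  # shift 1-based labels to 0-based indices
        A[v - 1][u - 1] = 1  # undirected link: mirror the entry
    return A

A = adjacency_matrix(edges, 7)
degrees = [sum(row) for row in A]  # row sums are the node degrees
for row in A:
    print(" ".join(str(x) for x in row))
print("degrees:", degrees)
```

Each link contributes two symmetric entries, so the row sums add up to twice the number of links.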
6
![Page 7: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/7.jpg)
No statistically precise definition so far.
Generally speaking, a community is a set of nodes densely connected internally.
Nodes between two communities are loosely connected.
7
![Page 8: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/8.jpg)
A random network without real clustering structure should not be split (type 1 error of over-splitting).
Two weakly connected communities should not be merged (type 2 error of under-splitting).
Modern network data is usually huge, so space- and time-efficient clustering is needed.
8
![Page 9: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/9.jpg)
• Minimum-cut method (spectral clustering)
• Hierarchical clustering
• Girvan-Newman algorithm (betweenness)
• Modularity maximization
• Stochastic block model as well as variants, including the mixed membership model
• Finding maximal cliques
9
![Page 10: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/10.jpg)
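Modularity is a measure of the strength of division of a network into clusters or communities:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (1)$$
where i and j denote nodes, $c_i$ is the cluster of node i, $\delta$ is the Kronecker delta, $A_{ij}$ is the (i,j) entry of the adjacency matrix A, $k_i$ is the degree of node i, and m is the total number of links in the network.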
10
![Page 11: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/11.jpg)
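11
Degrees of the 7 nodes are: $k = (3, 3, 3, 4, 3, 2, 2)$
Total degree: $2m = 20$
The modularity matrix below has (i, j) entry: $B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$

| i \ j | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 | -0.45 | 0.55 | 0.55 | 0.4 | -0.45 | -0.3 | -0.3 |
| 2 | 0.55 | -0.45 | 0.55 | 0.4 | -0.45 | -0.3 | -0.3 |
| 3 | 0.55 | 0.55 | -0.45 | 0.4 | -0.45 | -0.3 | -0.3 |
| 4 | 0.4 | 0.4 | 0.4 | -0.8 | 0.4 | -0.4 | -0.4 |
| 5 | -0.45 | -0.45 | -0.45 | 0.4 | -0.45 | 0.7 | 0.7 |
| 6 | -0.3 | -0.3 | -0.3 | -0.4 | 0.7 | -0.2 | 0.8 |
| 7 | -0.3 | -0.3 | -0.3 | -0.4 | 0.7 | 0.8 | -0.2 |

Nodes 1, 2, 3, 4 tend to form one community and nodes 5, 6, 7 another. The modularity Q of this division is the sum of the within-community cells of the modularity matrix (shaded green on the slide) divided by 2m: Q = 7.1 / 20 = 0.355.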
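The worked example can be checked numerically. The sketch below assumes the ten links implied by the positive off-diagonal cells of the modularity matrix (an inference from the table, since the slide's graph drawing is not available) and uses the standard modularity matrix entry A_ij - k_i k_j / (2m):

```python
# Links read off the positive off-diagonal cells of the modularity matrix
# (inferred from the table, not listed explicitly on the slide).
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
n = 7
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = A[v - 1][u - 1] = 1

k = [sum(row) for row in A]   # node degrees
two_m = sum(k)                # total degree, equal to 2m

# Modularity matrix: B_ij = A_ij - k_i * k_j / (2m)
B = [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]

def modularity(B, labels, two_m):
    """Sum B_ij over ordered pairs in the same community, divided by 2m."""
    size = len(labels)
    total = sum(B[i][j] for i in range(size) for j in range(size)
                if labels[i] == labels[j])
    return total / two_m

labels = [0, 0, 0, 0, 1, 1, 1]   # communities {1,2,3,4} and {5,6,7}
Q = modularity(B, labels, two_m)
print(k, two_m, round(Q, 3))
```

The diagonal cells are included in the sum, which is why each community block contributes 3.55 rather than only its off-diagonal total.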
![Page 12: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/12.jpg)
High modularity implies dense connections inside communities and sparse connections between communities.
Approximate maximization algorithms:
• Greedy algorithms
• Simulated annealing
• Leading eigenvector
• Louvain method
• Ensemble learning (currently fastest)
12
![Page 13: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/13.jpg)
Benchmark model to simulate stochastic block network 1 with built-in cluster structures.
where
Each cluster has 40 nodes.
Modularity-based clustering on a random network from the stochastic block model.
The modularity maximization approach works well if clusters have similar sizes.
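A benchmark network of this kind can be sampled as sketched below. The link probability matrix is not recoverable from the transcript, so the values 0.2 (within a block) and 0.01 (between blocks) are hypothetical; with three blocks of 40 nodes they give an expected link density of (468 + 48) / 7140 ≈ 0.0723. The function name `sample_sbm` is also illustrative.

```python
import numpy as np

def sample_sbm(sizes, P, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic block model.

    sizes[a] is the number of nodes in block a; P[a][b] is the link
    probability between a node in block a and a node in block b.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    probs = np.asarray(P)[labels[:, None], labels[None, :]]  # n x n link probabilities
    upper = np.triu(rng.random(probs.shape) < probs, k=1)    # draw each pair once
    A = (upper | upper.T).astype(int)                        # mirror; no self-links
    return A, labels

# Hypothetical probabilities: 0.2 within a block, 0.01 between blocks.
P = [[0.2, 0.01, 0.01],
     [0.01, 0.2, 0.01],
     [0.01, 0.01, 0.2]]
A, labels = sample_sbm([40, 40, 40], P)
density = A.sum() / (len(A) * (len(A) - 1))   # observed link density
print(A.shape, round(float(density), 4))
```

Drawing only the strict upper triangle and mirroring it keeps the matrix symmetric with an empty diagonal.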
13
![Page 14: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/14.jpg)
A random network without cluster structure may be split (Erdős-Rényi network).
Small clusters in a large network may be merged (resolution limit).
Multi-resolution method may not reduce both types of error simultaneously.
A bottleneck of many other network clustering algorithms.
14
![Page 15: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/15.jpg)
Erdos Renyi network of 40 nodes, density 0.1
Modularity Maximized Clustering: Q=0.37
15
![Page 16: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/16.jpg)
Stochastic Block Model 2 with
The two small clusters have 20 nodes each, and the largest cluster has 100 nodes.
The largest cluster is split.
Modularity maximization algorithms tend to fail in networks with clusters of very different sizes.
Modularity maximized clustering with Q=0.429
16
![Page 17: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/17.jpg)
Stochastic Block Model 3 with link probability
Cluster sizes: [800, 400, 50, 20]
Modularity method clustering results:
• 7 nodes in cluster 3 are merged with cluster 1
• All 20 nodes in cluster 4 are merged with cluster 1
17
![Page 18: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/18.jpg)
Algorithm 1
◦ Global algorithm
◦ Cluster internal link density above a user-defined threshold is guaranteed
Algorithm 2
◦ Local algorithm
◦ Risk of splitting a cluster is quantified and under user control
◦ Risk of merging clusters is minimized
18
![Page 19: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/19.jpg)
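Objective Function:
$$Q_p(c) = \frac{1}{2m} \sum_{i \neq j} \left( A_{ij} - p \right) \left( 2\,\delta(c_i, c_j) - 1 \right) \qquad (2)$$
where p is a user-defined parameter in [0,1], δ is the Kronecker delta, A is the adjacency matrix, c is the community membership vector, and m is the total link count.
Reward table:

| | Connected pair of nodes | Disconnected pair of nodes |
|---|---|---|
| Pair of nodes in the same cluster | 1-p | -p |
| Pair of nodes in different clusters | -1+p | p |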
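The reward table can be read as a per-pair score: (A_ij - p) when the two nodes share a cluster, and its negative otherwise. The sketch below sums these rewards over ordered pairs and divides by 2m; that normalization is an assumption based on m appearing in the definitions, and the toy graph (two triangles joined by one link) is not from the slides.

```python
def reward(connected, same_cluster, p):
    """Reward for one node pair, following the reward table."""
    base = (1.0 if connected else 0.0) - p   # 1-p if linked, -p if not
    return base if same_cluster else -base   # sign flips across clusters

def objective(A, c, p):
    """Sum of rewards over ordered pairs i != j, divided by 2m (assumed)."""
    n = len(A)
    two_m = sum(sum(row) for row in A)
    total = sum(reward(A[i][j] == 1, c[i] == c[j], p)
                for i in range(n) for j in range(n) if i != j)
    return total / two_m

# Toy graph: two triangles joined by a single link.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
p = 0.5
merged = objective(A, [0, 0, 0, 0, 0, 0], p)   # one big cluster
split = objective(A, [0, 0, 0, 1, 1, 1], p)    # one cluster per triangle
print(merged, split)
```

With p = 0.5 the split scores 13/14 and the single cluster scores -1/14, so the objective prefers separating the two triangles.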
19
![Page 20: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/20.jpg)
It is guaranteed that every found community has internal link density higher than the user-defined threshold p.
◦ If p=1, every found community is a clique.
◦ If p=25%, every community has internal link density higher than 25%.
◦ Communities with link density “significantly” higher than p will not be split.
◦ Communities with link density lower than p will definitely be split into smaller communities.
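The guarantee can be checked on any clustering result by computing each community's internal link density. This is a small sketch with illustrative helper names. It uses "at least p" rather than "strictly higher" so that the p=1 clique case passes, and it treats a single-node community as having density 1; both are conventions chosen here, not taken from the slides.

```python
def internal_link_density(A, nodes):
    """Fraction of node pairs inside `nodes` that are linked."""
    nodes = list(nodes)
    pairs = len(nodes) * (len(nodes) - 1) / 2
    if pairs == 0:
        return 1.0  # convention chosen here for single-node communities
    links = sum(A[u][v] for i, u in enumerate(nodes) for v in nodes[i + 1:])
    return links / pairs

def satisfies_threshold(A, communities, p):
    """True if every community has internal link density of at least p."""
    return all(internal_link_density(A, c) >= p for c in communities)

# Toy graph: two triangles joined by a single link.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(internal_link_density(A, range(6)))                   # whole graph: 7/15
print(satisfies_threshold(A, [[0, 1, 2], [3, 4, 5]], 1.0))  # both are cliques
```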
20
![Page 21: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/21.jpg)
Maximize objective function (2):
$$\max_{s \in \{-1, 1\}^n} \; s^{T} \left[ A - p\,(J - I) \right] s \qquad (3)$$
where s is an n by 1 vector of community membership with binary entries 1 or -1, A is the adjacency matrix, J is the all-ones matrix, and I is the identity matrix.
Searching over all possible divisions is NP-hard.
Approximate spectral method:
◦ Find the largest eigenvalue w of the p-clique matrix:
$$M_p = A - p\,(J - I) \qquad (4)$$
◦ Choose a corresponding eigenvector v of w.
◦ Use the signs of the entries of v to split the network of n nodes.
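A dense-matrix sketch of the spectral step is shown below. It assumes the p-clique matrix has the form A - p(J - I), which matches the reward table and the symbols A, J and I named on the slide. The function names and the toy graph are illustrative.

```python
import numpy as np

def p_clique_matrix(A, p):
    """M = A - p (J - I), with J the all-ones matrix and I the identity."""
    n = A.shape[0]
    return A - p * (np.ones((n, n)) - np.eye(n))

def spectral_bipartition(A, p):
    """Split nodes by the sign of the leading eigenvector of the p-clique matrix."""
    M = p_clique_matrix(A, p)
    eigenvalues, eigenvectors = np.linalg.eigh(M)  # ascending order for symmetric M
    w = eigenvalues[-1]                            # largest eigenvalue
    v = eigenvectors[:, -1]                        # a corresponding eigenvector
    s = np.where(v >= 0, 1, -1)                    # membership vector in {-1, +1}
    return w, s

# Toy graph: two triangles joined by a single link.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
w, s = spectral_bipartition(A, p=0.5)
score = float(s @ p_clique_matrix(A, 0.5) @ s)   # value of s^T M s for this split
print(s, score)
```

The overall sign of an eigenvector is arbitrary, so which triangle gets +1 may vary; the split itself does not.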
21
![Page 22: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/22.jpg)
is the best approximate solution to (3).
If , division by v will be executed.
If , but , division by v will still be executed.
If , and , division by v will be cancelled.
22
![Page 23: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/23.jpg)
Python SciPy wrapper of the ARPACK software.
Iterative matrix-vector products find eigenvalues of large sparse or structured matrices.
The p-clique matrix is dense but structured.
The matrix-vector product requires much less than the usual $O(n^2)$ operations:
◦ The adjacency matrix is usually sparse.
◦ $Av$ requires only $O(m)$ operations.
◦ $(J - I)v$ requires only $O(n)$ operations.
◦ Time complexity: $O(m + n)$ per iteration.
◦ Space complexity: $O(m + n)$ (applicable to huge graphs).
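The structured product can be handed to ARPACK through SciPy's `LinearOperator`, so the dense matrix is never formed. This is a sketch under the assumption that the matrix is A - p(J - I); the helper name `leading_eigenpair` and the toy graph are illustrative, not the author's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_eigenpair(A_sparse, p):
    """Largest eigenpair of A - p (J - I) without forming the dense matrix."""
    n = A_sparse.shape[0]

    def matvec(x):
        x = np.ravel(x)
        # A x costs O(m); (J - I) x = sum(x) * 1 - x costs O(n).
        return A_sparse @ x - p * (x.sum() - x)

    op = LinearOperator((n, n), matvec=matvec, dtype=float)
    w, v = eigsh(op, k=1, which="LA")   # ARPACK: largest algebraic eigenvalue
    return w[0], v[:, 0]

# Toy graph: two triangles joined by a single link, stored sparsely.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))
w, v = leading_eigenpair(A, p=0.5)
print(round(float(w), 4), np.sign(v))
```

Only the sparse adjacency matrix and a few length-n vectors are stored, which is what makes the approach usable on large graphs.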
23
![Page 24: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/24.jpg)
Usually it is hard to tell how many communities there are in a large network.
First split the network into two parts, then divide these two parts, and so forth.
Use the Bipartition Criteria in slide 21 as the stopping criteria of this recursive dividing procedure.
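The recursive procedure can be sketched as follows. The exact bipartition criteria are not recoverable from this transcript, so the stopping rule here is a stand-in assumption: a split is kept only if it increases s^T [A - p(J - I)] s, which is equivalent to the link density between the two parts being below p.

```python
import numpy as np

def recursive_bipartition(A, p, nodes=None):
    """Recursively split `nodes` by the sign of the leading eigenvector.

    Stopping rule (an assumption of this sketch): keep a split only if
    the link density between the two parts is below p.
    """
    if nodes is None:
        nodes = np.arange(A.shape[0])
    n = len(nodes)
    if n < 2:
        return [nodes.tolist()]
    sub = A[np.ix_(nodes, nodes)]
    M = sub - p * (np.ones((n, n)) - np.eye(n))
    v = np.linalg.eigh(M)[1][:, -1]           # leading eigenvector
    left, right = nodes[v >= 0], nodes[v < 0]
    if len(left) == 0 or len(right) == 0:
        return [nodes.tolist()]               # no sign change: nothing to split
    cross_density = A[np.ix_(left, right)].mean()
    if cross_density >= p:
        return [nodes.tolist()]               # split would not improve the objective
    return recursive_bipartition(A, p, left) + recursive_bipartition(A, p, right)

# Toy graph: two triangles joined by a single link.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
fine = sorted(recursive_bipartition(A, p=0.5))    # zoomed in
coarse = recursive_bipartition(A, p=0.1)          # zoomed out
print(fine, coarse)
```

At p = 0.5 the two triangles are separated and then left intact, while at p = 0.1 the leading eigenvector has no sign change and the graph stays in one piece.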
24
![Page 25: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/25.jpg)
Panels: p=0.1, p=0.05, p=0.02
Stochastic Block Network 2 with
The two small clusters have 20 nodes each, and the largest cluster has 100 nodes.
Expected link density: 0.1125
25
![Page 26: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/26.jpg)
Karate Club member data (34 people). Link density: 0.139
Panels: p = 0.1, p = 0.15
26
![Page 27: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/27.jpg)
Doubtful Sound dolphins (62 dolphins). Link density: 0.084
Panels: p = 0.03, p = 0.2
27
![Page 28: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/28.jpg)
Increasing p: zoom in
◦ Smaller communities are found.
◦ Risk of merging clusters (type 2 error) is lower.
◦ Risk of splitting a cluster/Erdős-Rényi sub-network (type 1 error) is higher.
Decreasing p: zoom out
◦ Larger communities are found.
◦ Risk of merging clusters (type 2 error) is higher.
◦ Risk of splitting a cluster/Erdős-Rényi sub-network (type 1 error) is lower.
28
![Page 29: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/29.jpg)
Objective: choose parameter p such that at most 2.5% of nodes in an Erdős-Rényi sub-network will be trimmed off.
Cause of Type 1 Error:
◦ Due to random fluctuation in link formation, 2.5% of nodes have fewer than 0.975 np links to the remaining 97.5% of nodes.
◦ Threshold p is higher than the link density between the 2.5% group and the 97.5% group of nodes.
Strategy:
◦ Choose p to be significantly smaller than the observed total link density.
29
![Page 30: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/30.jpg)
Solution:
Intuition:
◦ Use a truncated normal distribution to approximate the distribution of link density between the 2.5% group and the 97.5% group.
Experiment results:
◦ In 100 SBM networks, the type 1 error is bounded by 5% (mostly 3.5%).
◦ In SBM networks with average degree less than 5, type 1 error is less than 2%.
(5)
30
![Page 31: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/31.jpg)
When will two clusters of size and , with link probability , be merged?
where is the observed link density.
The risk of type 2 error will be bounded by 2.5% if
(6)
31
![Page 32: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/32.jpg)
Challenge:
◦ In splitting a sub-network, we usually don’t know the link density or between two clusters.
◦ In theory, there may be cases where inequalities (5) and (6) are in conflict.
Solution:
◦ Choose p to be the upper bound in (5).
◦ Develop a more flexible algorithm which allows p to vary from one sub-network to another. This may reduce the chance of a conflict between inequalities (5) and (6).
32
![Page 33: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/33.jpg)
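Normalized mutual information (NMI) is a measure of consistency between the found communities R and the real communities F:
$$\mathrm{NMI}(R, F) = \frac{2\, I(R, F)}{H(R) + H(F)}, \qquad I(R, F) = \sum_{i=1}^{c_F} \sum_{j=1}^{c_R} \frac{N_{ij}}{n} \log \frac{n\, N_{ij}}{N_{i\cdot}\, N_{\cdot j}} \qquad (7)$$
where I is the mutual information (a Kullback-Leibler divergence), H is entropy, N is the confusion matrix whose entry $N_{ij}$ counts the nodes shared by real community i and found community j, n is the number of nodes, and $c_F$ and $c_R$ are the numbers of real and found communities.
33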
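A minimal sketch of an NMI computation is given below. It assumes the common normalization 2I / (H_real + H_found), with I the mutual information computed from the confusion-matrix counts. The slide's exact formula is not recoverable from the transcript, so this form is an assumption.

```python
import math
from collections import Counter

def nmi(real, found):
    """Normalized mutual information 2 I / (H_real + H_found) of two labelings."""
    n = len(real)
    joint = Counter(zip(real, found))   # confusion-matrix counts N_ij
    row = Counter(real)                 # sizes of the real communities
    col = Counter(found)                # sizes of the found communities
    mutual = sum(c / n * math.log(c * n / (row[a] * col[b]))
                 for (a, b), c in joint.items())
    h_real = -sum(c / n * math.log(c / n) for c in row.values())
    h_found = -sum(c / n * math.log(c / n) for c in col.values())
    if h_real + h_found == 0:
        return 1.0  # both labelings put every node in a single community
    return 2 * mutual / (h_real + h_found)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, labels renamed
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # found labels carry no information
print(same, independent)
```

NMI depends only on how nodes are grouped, so renaming the community labels leaves the score unchanged.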
![Page 34: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/34.jpg)
Review of Stochastic Block networks 1 through 3 using NMI

| | Size | Link density | Auto-chosen p | Average NMI | s.e. | Number of simulations |
|---|---|---|---|---|---|---|
| SBM 1 | 120 | 0.0723 | 0.0239 | 0.8484 | 0.0195 | 100 |
| SBM 2 | 140 | 0.1125 | 0.0579 | 0.9483 | 0.0078 | 100 |
| SBM 3 | 1270 | 0.0722 | 0.0574 | 0.9993 | 0.0001 | 100 |

Results:
◦ Type 1 error is overly controlled for small and sparse networks such as SBM 1.
34
![Page 35: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/35.jpg)
35
Panels: Modularity; p = 0.0888
Stochastic Block model 4 with
Cluster sizes: (100, 20, 20)
Expected link density: 0.1507
Auto-chosen parameter p from (5): 0.0888
Using the auto-chosen parameter p ends up merging small clusters 2 and 3.
Clusters 2 and 3 will be divided only if we zoom in more by increasing p.
The modularity method not only merged clusters 2 and 3, but also split cluster 1.
![Page 36: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/36.jpg)
36
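[Figure: binary tree of recursive bipartitions with sub-networks S0, S1, S2, S3 and leaf communities C1 to C5; the two branches under each sub-network S are labeled with its local threshold p(S).]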
are observed link density and node count in sub-network S
![Page 37: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/37.jpg)
Maximize localized clique-index
where T is the binary tree representing the hierarchical clustering process, p(S) is the automatically chosen local threshold parameter p for sub-network S, and is the indicator of whether nodes i and j are divided in the bipartition of S
37
(8)
![Page 38: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/38.jpg)
Every bipartition in sub-network S will bring contribution:
The best bipartition is obtained from the signs of the leading eigenvector of the matrix:
The bipartition of S will be cancelled if the contribution .
38
(9)
(10)
![Page 39: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/39.jpg)
Each matrix-vector product takes time O(m).
Finding the leading eigenvector takes O(n) matrix-vector products.
On average, the height of the binary tree representing the hierarchical clustering procedure is O(log(n)).
For both the global and the localized algorithm, the time complexity is O(mn log(n)), or O(n² log(n)) for sparse networks.
39
![Page 40: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/40.jpg)
Stochastic Block model 4 with
Cluster sizes: (100, 20, 20)
Average NMI among 100 simulations is 0.9717.
The localized clustering algorithm is able to detect the built-in community structure.
40
![Page 41: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/41.jpg)
Stochastic Block Model 5 with 7000 nodes and 10 built-in clusters
Cluster sizes with internal link density: [(3000, 0.08), (2000, 0.09), (1000, 0.1), (400, 0.15), (200, 0.2), (100, 0.25), (100, 0.25), (100, 0.25), (80, 0.3), (20, 0.7)]
Link density between different clusters: 0.005
Average NMI among 20 simulations: 0.9895
Average running time: 1.66 seconds
41
![Page 42: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/42.jpg)
Stochastic Block Model with 20000 nodes and 25 clusters
Cluster sizes with internal link density: [(3350, 0.045), (3000, 0.05), (2000, 0.07), (2000, 0.07), (2000, 0.07), (1000, 0.09), (1000, 0.09), (1000, 0.09), (1000, 0.09), (500, 0.12), (500, 0.12), (400, 0.14), (400, 0.14), (400, 0.14), (400, 0.14), (200, 0.30), (200, 0.30), (200, 0.30), (100, 0.40), (100, 0.40), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80), (50, 0.80)]
Link density between clusters: 0.0001
Average NMI among 10 simulations: 0.8960
Average running time: 12.6 seconds
42
![Page 43: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/43.jpg)
Review of SBM networks 1 through 6:

| | Built-in clusters | Size | Link density | Average NMI | s.e. | Number of simulations |
|---|---|---|---|---|---|---|
| SBM1 | 3 | 120 | 0.0723 | 0.8972 | 0.0195 | 100 |
| SBM2 | 3 | 140 | 0.1125 | 0.9476 | 0.0051 | 100 |
| SBM3 | 4 | 1200 | 0.0722 | 0.9687 | 0.0028 | 100 |
| SBM4 | 3 | 140 | 0.0888 | 0.9717 | 0.0033 | 100 |
| SBM5 | 10 | 7000 | 0.0285 | 0.9895 | 0.0022 | 20 |
| SBM6 | 25 | 20000 | 0.005 | 0.8960 | 0.0029 | 10 |

Clustering quality is high for large networks or networks with high link density.
43
![Page 44: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/44.jpg)
Global Algorithm:
◦ Good for applications with specific requirements on the internal link density of every found community.
Localized Algorithm:
◦ Good for finding statistically significant communities.
◦ Type 1 error seems to be overly controlled for sparse networks.
◦ The conflict between type 1 error and type 2 error is effectively avoided in the sample simulated networks.
44
![Page 45: Clique-based Network Clustering](https://reader036.vdocuments.net/reader036/viewer/2022062308/55c2c1f6bb61ebd7178b4574/html5/thumbnails/45.jpg)
The Erdős-Rényi model may not serve as a good null model of a random network without built-in community structure. Statistically significant communities under other null models need consideration.
Extend the algorithm to directed networks, networks with numerical values in the adjacency matrix, and networks with additional profile information at each node.
Develop a close-to-linear-time clustering algorithm.
45