community structure detection - cs

80
Community Structure Detection Amar Chandole Ameya Kabre Atishay Aggarwal

Upload: others

Post on 01-Dec-2021

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Community Structure Detection - CS

Community Structure Detection

Amar ChandoleAmeya KabreAtishay Aggarwal

Page 2: Community Structure Detection - CS

What is a network?● Group or system of interconnected

people or things

● Ways to represent a network:

○ Matrices

○ Sets

○ Sequences

○ Time series

○ Graphs2

Page 3: Community Structure Detection - CS

Networks as Graphs● Entities represented as nodes● Connections between entities

represented as edges● Example

○ Social Network■ People: Nodes■ Friendship: Edges

○ Citation Network■ Documents: Nodes■ Citation: Edges

3

Page 4: Community Structure Detection - CS

Community Structures● Group of vertices within which connections

are dense but between which they are

sparser

● Can be found in the Internet, the world wide

web, citation networks, transportation

networks, email networks, food webs, social

and biochemical networks

4

Page 5: Community Structure Detection - CS

Need to study Community Structures● Networks of interest found to naturally

divide into communities or modules

● Communities generally have similar

behavior, decision making characteristics

● Properties of communities may be quite

different from average properties of the

network

5

Page 6: Community Structure Detection - CS

Detection of Community Structures● Computer Science Approach

○ Graph Partitioning

● Social Science Approach

○ Block Modeling or Hierarchical

Clustering or Community Structure

Detection

6

Page 7: Community Structure Detection - CS

Graph Partitioning● Goal: Find division of network

irrespective of existence of a good

division

● Condition: Number and size of groups is

generally known in advance

● Example: Division of tasks between

different cores of a parallel computer7

Page 8: Community Structure Detection - CS

Graph Partitioning

8

Page 9: Community Structure Detection - CS

Community Structure Detection● Goal: Find a good division of the network

provided that it exists

● Condition: Number and size of groups

are determined by the network itself

● Example: Find most popular messaging

application in each country9

Page 10: Community Structure Detection - CS

Community Structure DetectionMost Popular Messaging App in Every Country Political Orientation Across Blogs on the Internet

10

Page 11: Community Structure Detection - CS

Girvan and Newman (GN) Algorithm● Edge betweenness

○ Number of shortest paths that run

between a pair of nodes

● Each edge assigned a betweenness score

● Edges with high betweenness scores

removed

● Betweenness recalculated

11

Page 12: Community Structure Detection - CS

Performance of GN Algorithm● Computationally very expensive

● For m edges and n nodes:○ Complexity of O(m2n) for regular graphs○ O(n3) for sparse graphs

12

Page 13: Community Structure Detection - CS

What do the papers that we reviewed propose?

13

Page 14: Community Structure Detection - CS

Paper 1 : Fast algorithm for detecting community structure in networks (arXiv’03)

Author: M. E. J. Newman

University of Michigan, Ann Arbor 14

Page 15: Community Structure Detection - CS

Drawbacks of graph partitioning and GN algorithm● Computation effort required is immense

● Feasible even upto thousand nodes

● Infeasible for present day’s interactive

networks containing millions and billions

nodes

15

Page 16: Community Structure Detection - CS

Overcoming drawbacks● Use of modularity

● Optimizing modularity

16

Page 17: Community Structure Detection - CS

1. Modularity Number of edges falling within groups minus the expected number in an equivalent network with edges placed at random

➔ PositiveMagnitude indicates the probability of a community structure

➔ ZeroEdges are simply a random distribution

➔ NegativeIndicates that a community structure must be absent

17

Page 18: Community Structure Detection - CS

18

Example scenario for ModularitySuppose we have 3 vertices and 6 edges.

Then by random distribution of edges, the

network is expected to look like:

EXPECTED:

Page 19: Community Structure Detection - CS

19

Example scenario for ModularitySuppose we have 3 vertices and 6 edges.

Then by random distribution of edges, the

network is expected to look like:

ACTUAL:

Page 20: Community Structure Detection - CS

20

Example scenario for Modularity

Modularity will find out this deviation from expected value using different expressions (explained in later slides) and help us divide network into communities.

ACTUAL:

EXPECTED:

Too many connections than exppected. Something is fishy here!

Page 21: Community Structure Detection - CS

➔ Simulated Annealing

➔ Genetic Algorithms

➔ Greedy Algorithms

➔ Spectral Techniques

2. Ways to Optimize Modularity

21

Page 22: Community Structure Detection - CS

What purpose does Modularity serve in this paper?● Values of Q indicative of presence of

community structure

● Q = 0 indicates random presence

● Q >= 0.3 indicates strong presence

22

Page 23: Community Structure Detection - CS

Mathematical Definitions● Modularity Q is defined as:

Q = ∑ (eii - a

i2) ∀ i where,

- eij is the fraction of edges in network that

connect vertices in group i to those in group j. - a

i = ∑ e

ij ∀ j

● Δ Q = eij + e

ji - 2 a

ia

j = 2 ( e

ij - a

ia

j)

23

Page 24: Community Structure Detection - CS

Optimizing modularity● Ways to divide n vertices into g

non-empty groups is Sn

(g)

● Number of distinct community divisions:

∑Sn

(g) for g=1 to n

● Maximum 20 - 30 vertices

24

Page 25: Community Structure Detection - CS

Greedy Algorithm● Each vertex is considered as a community

● Join 2 vertices such that it results in the

greatest increase or smallest decrease in the

value of modularity

● Process can be represented as a dendrogram

and cuts through it give communities

25

Page 26: Community Structure Detection - CS

Steps to follow1. Calculate eij matrix for the graph

2. Calculate Δ Q for every possible

combination of 2 nodes

3. Join vertices with highest value of Δ Q

4. Calculate Q after the join and update

dendrogram

5. Recompute eij matrix

26

Page 27: Community Structure Detection - CS

Community Structures1. We have the dendrogram with values of

Q after each join

2. Starting from the highest value of Q, if it

is greater than 0.3, cut through it

3. We obtain the final community structure

when no value >0.3 left to be cut

27

Page 28: Community Structure Detection - CS

Example

28

Page 29: Community Structure Detection - CS

The eij matrix :

0 0.3 0.1 0 0

0.3 0 0.2 0 0

0.1 0.2 0 0.1 0

0 0 0.1 0 0.3

0 0 0 0.3 0

29

Page 30: Community Structure Detection - CS

All combinationsΔ Q = 2 ( e

ij - a

ia

j)

(1,2) 2 ( 0.3 - 0.4 * 0.5 ) = 0.2

(2,3) 2 ( 0.2 - 0.4 * 0.5 ) = 0.0

(1,3) 2 ( 0.0 - 0.4 * 0.4 ) = -0.32

(3,4) 2 ( 0.1 - 0.4 * 0.4 ) = -0.12

(4,5) 2 ( 0.3 - 0.3 * 0.4 ) = 0.36

Max Δ Q = 0.36

Thus, join 4 & 5 and calculate Q = -0.28

The eij matrix :

0 0.3 0.1 0 0

0.3 0 0.2 0 0

0.1 0.2 0 0.1 0

0 0 0.1 0 0.3

0 0 0 0.3 0

30

Page 31: Community Structure Detection - CS

The updated eij matrix :

The original eij matrix :

0 0.3 0.1 0 0

0.3 0 0.2 0 0

0.1 0.2 0 0.1 0

0 0 0.1 0 0.3

0 0 0 0.3 0

0 0.3 0.1 0

0.3 0 0.2 0

0.1 0.2 0 0.1

0 0 0.1 0.3

31

Page 32: Community Structure Detection - CS

All combinationsΔ Q = 2 ( e

ij - a

ia

j)

(1,2) 2 ( 0.3 - 0.4 * 0.5 ) = 0.2

(2,3) 2 ( 0.2 - 0.4 * 0.5 ) = 0.0

(1,3) 2 ( 0.0 - 0.4 * 0.4 ) = -0.32

(3, 4-5) 2 ( 0.1 - 0.4 * 0.4 ) = -0.12

Max Δ Q = 0.2

Thus, join 1 and 2 and calculate Q = -0.41

The updated eij matrix :

0 0.3 0.1 0

0.3 0 0.2 0

0.1 0.2 0 0.1

0 0 0.1 0.3

32

Page 33: Community Structure Detection - CS

The updated eij matrix :

The previous eij matrix :

0 0.3 0.1 0

0.3 0 0.2 0

0.1 0.2 0 0.1

0 0 0.1 0.3

0.3 0.3 0

0.3 0 0.1

0 0.1 0.3

33

Page 34: Community Structure Detection - CS

All combinationsΔ Q = 2 ( e

ij - a

ia

j)

(1-2, 3) 2 ( 0.3 - 0.4 * 0.6 ) = 0.12

(3, 4-5) 2 ( 0.1 - 0.4 * 0.4 ) = -0.12

Max Δ Q = 0.12

Thus, join 1-2 with 3 and calculate Q = 0.05

The updated eij matrix :

0.3 0.3 0

0.3 0 0.1

0 0.1 0.3

34

Page 35: Community Structure Detection - CS

The updated eij matrix :

The previous eij matrix :

0.3 0.3 0

0.3 0 0.1

0 0.1 0.3

0.6 0.1

0.1 0.3

35

Page 36: Community Structure Detection - CS

All combinationsJoin the 2 communities:

1-2-3 and 4-5

As there is just one single component now, we stop.

Q = 0.88

The updated eij matrix :

0.6 0.1

0.1 0.3

36

Page 37: Community Structure Detection - CS

1. Now, we observe that the maximum value of Q is 0.88

2. As, this is >=0.3 , we cut through it

3. We obtain 2 communities - (1-2-3) and (4-5)

4. No other value of Q>=0.3

5. The final community structure: (1-2-3) and (4-5)

Splitting into communities

37

Page 38: Community Structure Detection - CS

The dendrogram

38

Page 39: Community Structure Detection - CS

Resulting community structure division

Community 1

Community 2

39

Page 40: Community Structure Detection - CS

Performance of Greedy Algorithm● Computationally less expensive

● For m edges and n nodes

○ Complexity of O(m+n) for each join

operation

○ O(n(m+n)) for n-1 join operations

40

Page 41: Community Structure Detection - CS

Computer generated graph

● n=128 vertices. 4 groups of 32.

● Zin + Zout = 16

41

Page 42: Community Structure Detection - CS

Observations: ● 90% correct for Zout<=6

● For Zout=5, Greedy algo : 97.4% verticesGN algo : 98.9% vertices.

● For greater Zout, Greedy performs better.

42

Page 43: Community Structure Detection - CS

Karate Club Network

● Friendships between 34 members of a club at a US university

over a 2-year period.

● The club naturally split into 2 due to some disputes.

43

Page 44: Community Structure Detection - CS

Observations: ● Peak modularity Q = 0.381

● Vertex 10 classified wrongly.

● GN algorithm classified vertex 3 wrongly.

44

Page 45: Community Structure Detection - CS

Games Network

● Schedule of games between American college football teams in a

single season.

● Teams divided into groups or conferences.

● Intra-conference games more frequent than inter-conference

games.

45

Page 46: Community Structure Detection - CS

Observations:

46

Page 47: Community Structure Detection - CS

How much faster does a greedy algorithm perform in comparison to theGirvan and Newmanalgorithm?

47

Page 48: Community Structure Detection - CS

Some comparisons:

● For the Games Network:

Greedy algorithm : Less than 100th of a second.

GN algorithm: Little over a second

● 1275-node Jazz musicians network:Greedy algorithm : 1 secondGN algorithm : 3 hours.

48

Page 49: Community Structure Detection - CS

42 minutes! vs. 3-5 years!

49

Page 50: Community Structure Detection - CS

Physicists network

● Network of collaborations between physicists as documented by papers posted on the widely-used Physics E-print Archive at arxiv.org

● 56,276 scientists in all branches of physics

● Scientists are considered connected if they have coauthored one or more papers posted on the archive

50

Page 51: Community Structure Detection - CS

Observations:

● 600 communities

● Peak modularity of Q = 0.713

● 4 large communities - 77% vertices

● Astrophysics, High-energy physics, 2 of condensed matter physics

51

Page 52: Community Structure Detection - CS

52

Page 53: Community Structure Detection - CS

Greedy algorithm conclusions: ● The author uses a bottom-up approach

● The algorithm can be recursively applied to identify intricate groups and communities.

● The algorithm is computationally feasible

● It is possible to study larger networks now.

53

Page 54: Community Structure Detection - CS

Paper 2 : Modularity and community structure in networks (PNAS’06)

Author: M. E. J. Newman

University of Michigan, Ann Arbor

54

Page 55: Community Structure Detection - CS

What is a good division for community detection?A good division is not merely the one that has fewer edges

between communities;

A good division is the one that has fewer than expected edges

between communities!

55

Page 56: Community Structure Detection - CS

What purpose does Modularity serve here?Modularity quantifies the idea that, a true community structure in a network corresponds to a statistically surprising arrangement of edges

The paper reformulates modularity in terms of the spectral properties of the network.

56

Page 57: Community Structure Detection - CS

Spectral properties?

Properties of a graph in relationship to the eigenvalues and

eigenvectors of matrices associated to the graph, such as its

adjacency matrix.

57

Page 58: Community Structure Detection - CS

Eigenvalues and Eigenvectors

● If A is a matrix, then the eigenvalues are given by

the solution of λ such that

|A - λI| = 0 ___(1)where I is identity matrix.

● For each value of λ obtained from eq. 1, the

following equation gives an eigenvector:

Av = λv ___(2)

58

Page 59: Community Structure Detection - CS

Spectral Algorithm

● Start with the adjacency matrix A of the network

● Convert A into modularity matrix B

● Find all eigenvalues and eigenvectors of B

● Consider the eigenvector corresponding to the

highest eigenvalue and use the signs of elements in

this vector to decide the group.59

Page 60: Community Structure Detection - CS

Steps to follow1. Calculate the eij matrix for the graph

2. Calculate Δ Q for every possible combination

of 2 nodes.

3. Join the vertices with highest value of Δ Q

4. Calculate Q after the join and update the

dendrogram

5. Recompute the eij matrix

60

Page 61: Community Structure Detection - CS

Example:

61

Page 62: Community Structure Detection - CS

Example:

0 3 1 0 0

3 0 2 0 0

1 2 0 1 0

0 0 1 0 3

0 0 0 3 0

Adjacency matrix

A =

62

Page 63: Community Structure Detection - CS

Calculating Modularity MatrixModularity matrix B is a real symmetric

matrix with elements

Bij = A

ij - k

ik

j/2m where,

- Aij is the adjacency matrix

- ki and k

j are the degrees of vertices

- m is ½ ∑ik

i

63

Page 64: Community Structure Detection - CS

Calculating Modularity Matrix

B12

= A12

- k1

k2

/2m

= 3 - 4*5/2*10

= 3 - 1

= 2

64

Page 65: Community Structure Detection - CS

Calculating Modularity MatrixDoing the same calculation for all B

ij, we get

-0.8 2 0.2 -0.8 -0.6

2 -1.25 1 -1 -0.75

0.2 1 -0.8 0.2 -0.6

-0.8 -1 0.2 -0.8 2.4

-0.6 -0.75 -0.6 2.4 -0.45

B =

65

Page 66: Community Structure Detection - CS

Eigenvalues and Eigenvectors for B

Eigenvalues Corresponding eigenvectors

λ1≅ 3.95139 v1≅ (0.442855, 0.499515, 0.199318, -0.511499, -0.502997)

λ2≅ -2.61989 v2≅ (0.144319, -0.295586, 0.292573, -0.654337, 0.614854)

λ3≅ -2.09078 v3≅ (0.629671, -0.664442, 0.152521, 0.237745, -0.286785)

λ4≅ 0.82013 v4≅ (0.387237, 0.451396, 0.436317, 0.501464, 0.452162)

λ5≅ 0 v5≅ (1, 1, 1, 1, 1)

Leadingeigenvector

Alwayspresent!

66

Page 67: Community Structure Detection - CS

Leading eigenvector is the one with highest eigenvalue

Splitting into communities

λ1≅ 3.95139 v1≅ (0.442855, 0.499515, 0.199318, -0.511499, -0.502997)

Node number: 1 2 3 4 5

+ve +ve +ve -ve -ve

Community: A A A B B

67

Page 68: Community Structure Detection - CS

Resulting community structure division

Community 1

Community 2

68

Page 69: Community Structure Detection - CS

What else do the eigenvectors suggest?

● The elements of the leading eigenvector measure how firmly

each vertex belongs to its assigned community.

● Large vector elements strong central members of their

communities.

● Small vector elements more ambivalent

69

Page 70: Community Structure Detection - CS

Special case!● There is always an eigenvector (1, 1, 1, …, 1) with eigenvalue 0 for every

modularity matrix B

● This is a property of matrices having sum of rows and columns 0

● If all the eigenvectors have negative eigenvalues, then this special eigenvector acts as the leading eigenvector

● The network is said to be INDIVISIBLE in this case

● Big highlight point of this algorithm as many algorithms divide into communities even when it is not needed, based on sheer number of edges between nodes (if they are less)

70

Page 71: Community Structure Detection - CS

More than two communities?

● Possible - because this algorithm knows when to stop dividing!

● Start the algorithm with original network

● Run the same algorithm over the newly formed communities

● Continue unless all the communities obtained are indivisible

71

Page 72: Community Structure Detection - CS

Example verified in the paper● A network of books about politics, compiled by V. Krebs.

● Vertices represent 105 recent books on American politics bought

from the on-line bookseller Amazon.com

● Edges join pairs of books that are frequently purchased by the

same buyer.

● Books were divided (by author) according to their stated or

apparent political alignment, liberal or conservative, except for a

small number of books that were explicitly bipartisan or centrist,

or had no clear affiliation.

72

Page 73: Community Structure Detection - CS

Result & Interpretation

● Two strong communities are formed - one consists almost entirely of conservative books and other liberal.

● Centrist books belong to their own communities and are not, in most cases, as expected

● Indicates that political moderates form their own purchasing community.

● Books appear to form communities of copurchasing that align closely with political views Fig. Dashed lines divide the four communities

found by the algorithm, and shapes represent the political alignment of the books.

Liberal Conservative Centrist

73

Page 74: Community Structure Detection - CS

How better does leading eigenvector method do in comparison to theGirvan and Newmanalgorithm?

Context

Collaboration network of 27000 vertices

74

Page 75: Community Structure Detection - CS

20 minutes! vs. 2-3 years!

75

Page 76: Community Structure Detection - CS

Performance● Fast and accurate● Most time consuming part is evaluation of leading

eigenvector of modularity matrix● For n nodes

○ Complexity of O(n2 logn)○ Better than all except Greedy - O(nlog2 n)○ But quality is better than Greedy, especially

in large networks● Takes ~20 min to process network with ~27,000

vertices on a standard PC (circa 2006)

76

Page 77: Community Structure Detection - CS

Milestones in Community Structure Detection

Girvan, NewmanOn the topic Community

Structure Detection

NewmanFast algorithm for detecting community structure in networks

Clauset, Newman, MooreFinding community structure in very large networks

Over 5000 papers*

Modularity and Community

Structure Detection

77

Community structure in social and biological networks

2002 2003 2004 2006 2007-2017

Newman

* As found on Google Scholar

Page 78: Community Structure Detection - CS

ConclusionBoth greedy and spectral algorithm have their own advantages and shortcomings, but both of them have been strong foundational algorithms in the starting days of Community Structure Detection back in early 2000s.

While spectral has more accurate results, greedy is faster. Both make use of the concept of Modularity successfully to get meaningful & insightful results.

78

Page 79: Community Structure Detection - CS

References

Modularity and community structure in networks. (PNAS’06)

Fast algorithm for detecting community structure in networks (arxiv’03)

Community structure in social and biological networks (PNAS’ 02)

Finding community structure in very large networks(arxiv’04)

Wikipedia

79

Page 80: Community Structure Detection - CS

Thank you!

We are so tired! :/

Questions? :)

80