Overcoming Resolution Limits in MDL Community Detection

L. Karl Branting, The MITRE Corporation

2nd SNA-KDD Workshop, 24 Aug 2008


Outline

Utility functions in community detection
Resolution limits
MDL-based community detection
  – Previous: RB and AP
  – New: SGE
Experimental Evaluation
Lessons

Utility functions in community detection

Two components of community detection algorithms
  – Utility function: quality criterion to be optimized
  – Search strategy: procedure for finding the optimal partition
Examples
  – Girvan & Newman (2003)
      Utility function: modularity
      Search strategy: greedy divisive hierarchical clustering (iteratively remove the highest-betweenness edge)
  – Newman (2003)
      Utility function: modularity
      Search strategy: greedy agglomerative hierarchical clustering (iteratively choose the highest-modularity merge)
  – Tasgin & Bingol (2006)
      Utility function: modularity
      Search strategy: genetic algorithm

Utility functions in community detection

Other search strategies used with modularity
  – Rattigan, Maier, Jensen (2007)
      Utility function: modularity
      Search strategy: greedy divisive hierarchical clustering using a Network Structured Index to approximate edge betweenness
  – Donetti & Munoz (2004)
      Utility function: modularity
      Search strategy: greedy agglomerative hierarchical clustering with spectral division

Utility functions in community detection

Statistical approaches
  – Zhang, Qiu, Giles, Foley, & Yen (2007)
      Utility function: log-likelihood (LDA parameters)
      Search strategy: fixed-point iteration
Compression-based approaches
  – Rosvall & Bergstrom (2007)
      Utility function: Minimum Description Length
      Search strategy: simulated annealing
  – Chakrabarti (2004)
      Utility function: Minimum Description Length
      Search strategy: exhaustive search for k, hill-climbing given k
Utility function implicit in search strategy
  – Raghavan, Albert, & Kumara (2007): marker passing
  – Cliques, cores, etc.

Modularity

$Q = \sum_{i=1}^{m} \left[ \frac{w(D_{ii})}{l} - \left( \frac{l_i}{2l} \right)^2 \right]$

  – w(D_ii) = number of edges internal to group i
  – l_i = number of edges incident to vertices in group i
  – l = total number of edges
  – m = number of groups
Intuitive: expresses the intuition that the ratio of internal to external edges is greater for groups than for non-groups
Popular
Imperfect
  – Fortunato & Barthelemy (2007)
      Resolution limit: groups are conflated if they fall below a size threshold
  – Rosvall & Bergstrom (2007)
      Biased towards same-sized groups
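The modularity formula on this slide can be evaluated directly from an edge list and a group assignment. The following is a minimal sketch, not code from the talk. It assumes an undirected simple graph and reads l_i as the total degree of the vertices in group i (an interpretation, since the slide only says "edges incident to vertices in group i"). The function name `modularity` and the two-triangle example are illustrative.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Q = sum_i [ w(D_ii)/l - (l_i / 2l)^2 ] for an undirected simple graph.

    edges    : list of (u, v) pairs, each undirected edge listed once
    group_of : dict mapping vertex -> group label
    Assumption: l_i is the summed degree of the vertices in group i.
    """
    l = len(edges)
    internal = defaultdict(int)    # w(D_ii): edges with both ends in group i
    degree_sum = defaultdict(int)  # l_i: edge ends that fall in group i
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        degree_sum[gu] += 1
        degree_sum[gv] += 1
        if gu == gv:
            internal[gu] += 1
    return sum(internal[g] / l - (degree_sum[g] / (2 * l)) ** 2
               for g in list(degree_sum))


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "all" for v in range(6)}

q_split = modularity(edges, split)    # 5/14, one group per triangle
q_merged = modularity(edges, merged)  # 0.0, everything in one group
```

On this toy graph the two-triangle partition scores higher than the single-group partition, which matches the intuition stated on the slide.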

Resolution Limit

Ring graph R15,4
  – 15 communities
  – 4 nodes per community
Community structure that maximizes modularity conflates groups
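The conflation on R15,4 can be checked with a short calculation. The sketch below assumes the usual ring-of-cliques construction, which the slide does not spell out: every community is a clique on c nodes, and neighbouring cliques are joined by exactly one edge. It also assumes l_i is the summed degree of group i. The function name `ring_modularity` is illustrative.

```python
def ring_modularity(m, c, merge_pairs=False):
    """Modularity of a partition of a ring of m cliques with c nodes each.

    Assumptions (not stated on the slide): each community is a c-clique and
    adjacent cliques share exactly one connecting edge; m >= 3.
    merge_pairs=False scores the intended partition (one group per clique).
    merge_pairs=True scores the partition that fuses adjacent cliques two
    at a time, leaving one clique on its own when m is odd.
    """
    clique_edges = c * (c - 1) // 2
    l = m * clique_edges + m  # clique edges plus the m ring edges
    # Each group is described as (internal edges, summed degree).
    single = (clique_edges, 2 * clique_edges + 2)
    pair = (2 * clique_edges + 1, 4 * clique_edges + 4)
    if merge_pairs:
        groups = [pair] * (m // 2) + [single] * (m % 2)
    else:
        groups = [single] * m
    return sum(w / l - (d / (2 * l)) ** 2 for w, d in groups)


q_intended = ring_modularity(15, 4)                   # about 0.790
q_conflated = ring_modularity(15, 4, merge_pairs=True)  # about 0.795
```

Under these assumptions the pairwise-merged partition of R15,4 scores slightly higher than the intended one, so a modularity maximizer prefers the conflated structure. For a small ring such as R4,4 the intended partition still wins.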

Approaches to modularity’s resolution limit

Apply recursively to large communities (Ruan & Zhang 2007)
Apply locally (Clauset 2005)
Choose a different utility function

Description Length

Utility of a community structure is the sum of the bits needed to represent
  – the community structure, plus
  – the graph given the community structure
Search strategy attempts to minimize description length
There is no unique bit count
  – Kolmogorov complexity is uncomputable
Previous approaches
  – Rosvall & Bergstrom (2007): RB
      Handles group size skew better than modularity
  – Chakrabarti (2004): AP
  – Comparison
      Similar breakdown of bits
      Different calculation

Components of Description

Components (details in paper)
  1. Bits to represent number of nodes in graph
       ignored because not specific to community structure
  2. Bits to represent number of groups
  3. Bits to represent mapping between nodes and groups
  4. Bits needed for number of group-to-group edges
  5. Bits needed for adjacencies between nodes
Purpose
  – 2, 3, 4: represent group structure
  – 1, 5: represent graph as a whole
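To make the five-term breakdown concrete, here is a rough sketch of a two-part description length with one plausible bit count per term. The specific counts are assumptions chosen for illustration and are not the RB, AP, or SGE formulas, whose details are in the paper. Term 1 is omitted, as on the slide. The function name `description_length` and the example graph are hypothetical.

```python
from collections import Counter
from math import comb, log2


def description_length(n_nodes, edges, group_of):
    """Illustrative bit counts for terms 2-5 (term 1 is ignored).

    Assumes an undirected simple graph and mutually comparable group labels.
    None of these counts is taken from RB, AP, or SGE.
    """
    groups = sorted(set(group_of.values()))
    m = len(groups)
    l = len(edges)
    size = Counter(group_of.values())
    between = Counter()  # edge count for each unordered pair of groups
    for u, v in edges:
        a, b = sorted((group_of[u], group_of[v]))
        between[(a, b)] += 1

    term2 = log2(n_nodes)                     # number of groups, from 1 to n
    term3 = n_nodes * log2(m)                 # one group label per node
    term4 = (m * (m + 1) // 2) * log2(l + 1)  # an edge count for every group pair
    term5 = 0.0                               # which node pairs carry those edges
    for i, a in enumerate(groups):
        for b in groups[i:]:
            if a == b:
                slots = size[a] * (size[a] - 1) // 2
            else:
                slots = size[a] * size[b]
            term5 += log2(comb(slots, between[(a, b)]))
    total = term2 + term3 + term4 + term5
    return {"term2": term2, "term3": term3, "term4": term4,
            "term5": term5, "total": total}


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "all" for v in range(6)}

dl_split = description_length(6, edges, split)
dl_merged = description_length(6, edges, merged)
```

With these assumed counts, the single-group description of this tiny graph comes out shorter than the two-group one: the group-structure terms (2, 3, 4) cost more than they save in term 5. The balance depends entirely on how each term is counted.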

Surprising Experimental Result

RB, AP, and modularity compared as utility functions
  – Applied to ring graphs Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9
  – Search strategy: greedy divisive hierarchical clustering (iteratively remove the highest-betweenness edge)
Unsurprising result: modularity led to conflated groups for
  – m > 8 and c = 3
  – m > 10 and c = 4
  – m > 11 and c = 5
  – m > 13 and c = 6, 7
Surprising result
  – Both RB and AP conflated at least one pair of groups in every Rm,c!

Hypothesis

Both RB and AP require at least one bit per pair of groups in term 4
Perhaps this estimation causes group conflation
  – Term 4 grows as the square of the number of groups
  – If the graph is sparse, conflating groups may save more in term 4 reduction than it costs in term 5 increase
Components
  1. Bits to represent number of nodes in graph
       ignored because not specific to community structure
  2. Bits to represent number of groups
  3. Bits to represent mapping between nodes and groups
  4. Bits needed for number of group-to-group edges
  5. Bits needed for adjacencies between nodes

SGE (Sparse Graph Encoding)

Components
  1. Bits to represent number of nodes in graph
       Ignored, as in RB and AP
  2. Bits to represent number of groups
       Follows RB
  3. Bits to represent mapping between nodes and groups
       Similar to AP
  4. Bits needed for number of group-to-group edges
       Split into 2 terms
         Which pairs of groups are connected (much less than one bit per pair if pairs sparsely or densely connected)
         Number of edges between connected groups
           Grows as number of connected pairs, not total number of pairs
  5. Bits needed for adjacencies between nodes
       Follows RB
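The effect of splitting term 4 can be illustrated by comparing two ways of counting its bits. This is a sketch under stated assumptions, not the SGE formula from the paper: the dense variant spends a fixed number of bits on every pair of distinct groups, while the sparse variant first says which pairs are connected (the count of connected pairs plus the index of that subset) and then spends bits only on the connected pairs. The function names and the `bits_per_count` parameter are hypothetical.

```python
from math import comb, log2


def term4_dense(m, bits_per_count):
    """One edge count for every unordered pair of distinct groups."""
    return (m * (m - 1) // 2) * bits_per_count


def term4_sparse(m, connected_pairs, bits_per_count):
    """Name the connected pairs, then give counts only for those pairs.

    Assumed encoding: log2(pairs + 1) bits for how many pairs are connected,
    log2(C(pairs, connected_pairs)) bits for which ones, and bits_per_count
    bits for each connected pair.
    """
    pairs = m * (m - 1) // 2
    which = log2(pairs + 1) + log2(comb(pairs, connected_pairs))
    return which + connected_pairs * bits_per_count


# A ring of 16 groups has 120 group pairs, of which only 16 are connected.
dense_bits = term4_dense(16, bits_per_count=1)
sparse_bits = term4_sparse(16, connected_pairs=16, bits_per_count=1)
which_bits_per_pair = term4_sparse(16, 16, 0) / 120
```

For the 16-group ring, identifying the connected pairs costs well under one bit per pair in this encoding, and the sparse total grows with the number of connected pairs instead of with the square of the number of groups. For very small m the overhead of naming the connected pairs can exceed the dense cost.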

Performance of SGE on Ring Graphs

Correct community structure found for every Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9 except
  – R4,3
  – R13,3
Results confirm the hypothesis that the resolution limit in RB and AP results from over-counting term 4: the bits needed for group-to-group edges
Significance
  – Ring graphs are rare in the real world
  – How does SGE compare on more realistic graphs?

Uniform random graph

Similar to graphs in Rosvall & Bergstrom (2007)
Test set
  – 32 vertices
  – 4 groups
  – average degree 6
  – size ratio {1.0, 1.25, 1.5, 1.75, 2.0}
  – proportion internal edges {0.6, 0.75, 0.9}
Example
  – 32 vertices
  – 4 groups
  – average degree 6
  – size ratio 1.25
  – proportion internal edges 0.67
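A test graph with these parameters could be generated roughly as follows. This is a sketch, not the generator used in the experiments: it fixes the total edge count from the average degree, then samples the requested share of edges uniformly from same-group vertex pairs and the rest from different-group pairs. Group sizes are passed in explicitly because the slide does not say how the size ratio maps to sizes. The function name `planted_partition` and the `seed` parameter are hypothetical.

```python
import random
from itertools import combinations


def planted_partition(group_sizes, avg_degree, p_internal, seed=0):
    """Random graph with a fixed share of edges inside groups.

    group_sizes : list of group sizes, e.g. [8, 8, 8, 8]
    avg_degree  : target average vertex degree
    p_internal  : proportion of edges whose ends share a group
    Returns (edges, group_of). Sampling scheme is an assumption.
    """
    rng = random.Random(seed)
    group_of = {}
    for g, size in enumerate(group_sizes):
        for _ in range(size):
            group_of[len(group_of)] = g
    n = len(group_of)
    n_edges = round(n * avg_degree / 2)
    n_internal = round(p_internal * n_edges)

    inside, outside = [], []
    for u, v in combinations(range(n), 2):
        (inside if group_of[u] == group_of[v] else outside).append((u, v))

    # Sample without replacement so the result is a simple graph.
    edges = rng.sample(inside, n_internal)
    edges += rng.sample(outside, n_edges - n_internal)
    return edges, group_of


edges, group_of = planted_partition([8, 8, 8, 8], avg_degree=6, p_internal=0.75)
n_internal = sum(group_of[u] == group_of[v] for u, v in edges)
```

With 32 vertices and average degree 6 this yields 96 edges, 72 of them internal when the proportion is 0.75.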

Embedded Barabasi-Albert Graphs

Test set
  – 4 communities separately generated by preferential attachment
  – In each community
      4 initial vertices
      2-4 edges added per time step
      20 time steps
Example
  – 4 communities
  – 4 initial vertices
  – 3 edges added per time step
  – 20 time steps
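One community of this kind could be grown by preferential attachment along the following lines. This sketch makes assumptions the slide does not state: the initial vertices start as a clique, and each time step adds one new vertex whose edges go to distinct existing vertices chosen with probability proportional to degree. How the four communities are then linked to each other is not described on the slide and is not covered here. The function name `preferential_attachment` is hypothetical.

```python
import random


def preferential_attachment(initial=4, edges_per_step=3, steps=20, seed=0):
    """Grow one community; returns a list of undirected edges (u, v), u < v.

    Assumptions: the initial vertices form a clique, one vertex is added per
    time step, and edges_per_step <= initial so enough targets exist.
    """
    rng = random.Random(seed)
    edges = [(u, v) for u in range(initial) for v in range(u + 1, initial)]
    # Each vertex appears in `ends` once per incident edge, so a uniform draw
    # from `ends` selects a vertex with probability proportional to its degree.
    ends = [x for edge in edges for x in edge]
    for step in range(steps):
        new = initial + step
        targets = set()
        while len(targets) < edges_per_step:
            targets.add(rng.choice(ends))
        for t in sorted(targets):
            edges.append((t, new))
            ends.extend((t, new))
    return edges


community = preferential_attachment(initial=4, edges_per_step=3, steps=20)
vertices = {x for edge in community for x in edge}
```

With the example parameters from the slide, each community ends up with 24 vertices. Under the clique assumption it has 66 edges: 6 initial plus 3 per time step.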

Evaluation Criteria

Rand index (Rand 1971)
Adjusted Rand index (Hubert & Arabie 1985)
F-measure, based on same-cluster pairs
  – $\text{Recall} = \frac{|\text{proposedPairs} \cap \text{actualPairs}|}{|\text{actualPairs}|}$
  – $\text{Precision} = \frac{|\text{proposedPairs} \cap \text{actualPairs}|}{|\text{proposedPairs}|}$
  – $\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
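The pair-based measures above can be computed from two label vectors. The following is a minimal sketch: `same_cluster_pairs` and `pair_scores` are illustrative names, and the (unadjusted) Rand index is included as the fraction of vertex pairs on which the two partitions agree. The adjusted Rand index is left out.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pair_scores(actual, proposed):
    """Recall, precision, F-measure, and Rand index over same-cluster pairs.

    actual, proposed : equal-length sequences of cluster labels (length >= 2)
    """
    actual_pairs = same_cluster_pairs(actual)
    proposed_pairs = same_cluster_pairs(proposed)
    hits = len(actual_pairs & proposed_pairs)
    recall = hits / len(actual_pairs) if actual_pairs else 0.0
    precision = hits / len(proposed_pairs) if proposed_pairs else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if hits else 0.0
    total = len(actual) * (len(actual) - 1) // 2
    # Agreement: pairs together in both partitions or apart in both.
    apart_in_both = total - len(actual_pairs | proposed_pairs)
    rand = (hits + apart_in_both) / total
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "rand": rand}


scores = pair_scores(actual=[0, 0, 1, 1], proposed=[0, 0, 0, 1])
```

In this example only one of the two actual same-cluster pairs is recovered (recall 1/2), and only one of the three proposed pairs is correct (precision 1/3), giving an F-measure of 0.4 and a Rand index of 0.5.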

Results: Uniform random graph [three figure slides]

Results: Embedded Barabasi-Albert [figure slide]

Summary of Evaluation

Random graphs
  – Community structure is weak
      Group sizes are balanced: modularity is best
      Group sizes are imbalanced: RB is best (as per Rosvall & Bergstrom 2007)
  – Community structure is strong
      Group sizes are balanced: not much difference
      Group sizes are imbalanced: modularity is particularly bad (as per Rosvall & Bergstrom 2007), SGE slightly better than RB and AP
EBA graphs
  – Sparse: AP and SGE weaker than modularity and RB
  – Dense: essentially identical accuracy

Conclusion

Narrow
  – Conflation of groups by MDL in sparse graphs (e.g., ring graphs) can be avoided by adjusting group-to-group edge counts.
  – This change doesn’t hurt performance in more common types of graphs.
  – Compression-based clustering works well, but requires tinkering.
  – Modularity detects weak structure well when the graph is not too big and the groups are not too imbalanced.
Broad
  – Still unclear what utility function is best overall
  – Needed: theory relating graph typology to utility functions