community detection and sampling on networks

34
Community Detection and Sampling on Networks Yilin Zhang Facebook Jan 30 2020 1

Upload: others

Post on 28-Apr-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Community Detection and Sampling on Networks

Community Detection and Sampling on Networks

Yilin Zhang

Facebook

Jan 30 2020

1

Page 2: Community Detection and Sampling on Networks

What are networks / graphs?

2

Page 3: Community Detection and Sampling on Networks

There are many network-related problems.

How to find communities?

How to simultaneously analyze different sources of data (e.g. graph and text)?

How to sample on a network? How to estimate on the connected samples?

Rohe 20153

Page 4: Community Detection and Sampling on Networks

My research discussed some Network-related problems.

Network related work

How to find communities?

Regularized Spectral Clustering

Y. Zhang, K. Rohe. Understanding Regularized Spectral Clustering via Graph Conductance. NeurIPS, 2018.

How to simultaneously analyze different sources of data (e.g. graph and text)?

Graph Contextualization with PairGraphText

Y. Zhang, M. Berthe, C. Wells, K. Michalska, K. Rohe. Discovering Political Topics in Social Network Discussion threads with Graph Contextualization. The Annals of Applied Statistics, 12(2), 1096-1123, 2018.

How to sample on a network?

Respondent Driven Sampling

Y. Zhang, K. Rohe, S. Roch. Reducing Seed Bias in Respondent-Driven Sampling by Estimating Block Transition Probabilities. under revision at The Annals of Statistics.

Other Work

Domain AdaptationH. Zhou, Y. Zhang, V. Ithapu, S. Johnson, G. Wahba, V. Singh. When can Multi-Site Datasets be Pooled for Regression: Hypothesis Tests, l2-consistency and Neuroscience Applications. ICML, 2017.

Election Polls Y. Zhang, Q. Li, F. Charles, K. Rohe. Direct Evidence for Null Volatility in Election Polling. in preparation.

Gravitational Waves

X. Zhu, L. Wen, G. Hobbs, Y. Zhang, et al. Detection and localization of single-source gravitational waves with pulsar timing arrays. MNRAS, 449:16501663, 2015.

4

Page 5: Community Detection and Sampling on Networks

Understanding Regularized Spectral Clustering via Graph Conductance

Yilin Zhang & Karl Rohe

University of Wisconsin-Madison

NeurIPS 2018

5

Page 6: Community Detection and Sampling on Networks

Given a graph, one key problem is to find communities.

6

Page 7: Community Detection and Sampling on Networks

Spectral Clustering is one popular approach.

Adjacency Matrix A 2 {0, 1}N⇥N , where Aij =

(1 if i ⇠ j

0 o.w.

Graph Laplacian L = I �D�1/2AD�1/2, where Dii =X

j

Aij

Spectral Clustering partitions the graph based on top eigenvectors of L.

Graph G = (V,E)

is degree of node i.

7

Page 8: Community Detection and Sampling on Networks

an example

Two nodes in the same block? Yes No

connect (edge) probability 0.8 0.2

A simulated network with N = 10000 nodesK = 4 equal-size blocks

top (smallest) 10 eigenvalues

top 2 - 4 eigenvectors

8

Page 9: Community Detection and Sampling on Networks

However, in practice, Spectral Clustering always fails.

In practice, eigenvectors alwayslocalize on just several nodes.

This leads to highly unbalanced clusters.

top eigenvectors

Social network [Y. Zhang, et al. 2018]

9

Page 10: Community Detection and Sampling on Networks

Regularization solves the problem.

In practice, regularizationdelocalizes eigenvectors.

This leads to morebalanced clusters.

Social network [Y. Zhang, et al. 2018]

[T. Qin et al. 2013, A. Amini et al. 2013, K. Chaudhuri et al. 2012,]

Regularization adds a tiny edge on every pair of nodes. i.e. replace G by G⌧ .

Why?

Adjacency Matrix A⌧ = A+⌧

NJ , where J is an all-one matrix.Regularized

Graph Laplacian L⌧ = I �D�1/2⌧ A⌧D

�1/2⌧ , where D⌧ = D + ⌧I.Regularized

10

Page 11: Community Detection and Sampling on Networks

Why Spectral Clustering fails?graph conductance & noises

11

Page 12: Community Detection and Sampling on Networks

Given a node set S with vol(S) vol(Sc), its graph conductance is:

Spectral Clustering likes sets with small conductance!

minS

�(S) relaxes to Spectral Clustering. [U. Luxburg, 2007]

�(S) =cut(S)

vol(S)=

num of edges cut

sum of node degrees

Then, what kinds of sets have small conductance?

12

Page 13: Community Detection and Sampling on Networks

g-dangling sets have small conductance!

Fact 1: g-dangling set has small conductance 1/(2g � 1).

a 6-dangling set

S is g-dangling ()(1) |S| = g, (2) S is a tree, (3) cut(S) = 1.

13

Page 14: Community Detection and Sampling on Networks

There are many g-dangling sets!

14

inhomogeneous random graph model: independent edges (e.g. erdos-Reny, SBM)

peripheral node: O(1) expected degree

(edge probability with any node is < b/N for some constant b)

Page 15: Community Detection and Sampling on Networks

Graphs with many g-dangling sets have many small eigenvalues.

15

Page 16: Community Detection and Sampling on Networks

an example (recall)

Two nodes in the same block? Yes No

connect (edge) probability 0.8 0.2

A simulated network with N = 10000 nodesK = 4 equal-size blocks

top (smallest) 10 eigenvalues

top 2 - 4 eigenvectors

16

Page 17: Community Detection and Sampling on Networks

the updated example

17

Two core nodes in the same block? Yes No

connect prob 0.8 0.2

A simulated network with 10000 core nodes4 equal-size core blocks100 peripheral nodes

peripheral nodes connect prob 0.01

peripheral - coreconnect prob 2.5E-05

top 10 eigenvalues

Page 18: Community Detection and Sampling on Networks

the updated example

18

Two core nodes in the same block? Yes No

connect prob 0.8 0.2

A simulated network with 10000 core nodes4 equal-size core blocks100 peripheral nodes

peripheral nodes connect prob 0.01

peripheral - coreconnect prob 2.5E-05

top 10 eigenvalues

Page 19: Community Detection and Sampling on Networks

the updated example

19

Two core nodes in the same block? Yes No

connect prob 0.8 0.2

A simulated network with 10000 core nodes4 equal-size core blocks100 peripheral nodes

peripheral nodes connect prob 0.01

peripheral - coreconnect prob 2.5E-05

top 10 eigenvalues

Page 20: Community Detection and Sampling on Networks

Recall: Why Spectral Clustering fails?

20

• Spectral Clustering likes sets with small conductance.• Dangling sets have small conductance.• There are lots of dangling sets, which lead to lots of small eigenvalues.

This conceals true clusters even with large k.

Page 21: Community Detection and Sampling on Networks

Why does regularization fix this?Regularization changes the graph conductance.

21

Page 22: Community Detection and Sampling on Networks

Regularized SC likes sets with small CoreCut!

CoreCut: (conductance on the regularized graph)

CoreCut⌧ (S) =cut(S) + ⌧

N |S||Sc|vol(S) + ⌧ |S|

22

Page 23: Community Detection and Sampling on Networks

Conductance finds peripheral sets.

Conductance finds peripheral sets. CoreCut finds core sets.

23

Page 24: Community Detection and Sampling on Networks

Conductance finds peripheral sets.

Conductance finds peripheral sets. CoreCut finds core sets.

24

CoreCut finds core sets.

Page 25: Community Detection and Sampling on Networks

CoreCut ignores peripheral sets and focuses on the core!

Conductance finds peripheral sets.

Conductance finds peripheral sets. CoreCut finds core sets.

25

Regularization increases conductance for peripheral sets significantly, but does not affect core sets that much.

CoreCut finds core sets.

Page 26: Community Detection and Sampling on Networks

CoreCut succeeds when the regularizer overwhelms peripheries but is negligible to the core.Assumptions:

For a graph G and subsets S, S✏, there exists ✏, ↵, s.t.

1. |S✏| < ✏|V | and vol(S✏) < ✏vol(V ),

2. d(S✏) <1� ✏

2(1 + ↵)d(S),

3. �(S) <↵(1� ✏)

1 + ↵.

Propositions:

Under Assumption 1, if we choose ⌧ � ↵d(S✏), then CoreCut⌧ (S✏) >↵(1� ✏)

1 + ↵.

If we choose ⌧ < �d(S) for some � > 0, then CoreCut⌧ (S) < �(S) + �.

S✏ is small enough.

S is dense enough.

S is a good cut.

26

Corollary:

Under Assumptions 1-3, if we choose ⌧ s.t.

↵d(S✏) ⌧ �d(S),

where � = ↵(1� ✏)/(1 + ↵)� �(S), then

CoreCut⌧ (S) < CoreCut⌧ (S✏).

My suggestion in practice:Set ⌧ to be (square root of)

average degree of G.

Page 27: Community Detection and Sampling on Networks

Real data examples.

37 networks from http://snap.stanford.edu/data

10

100

1000

1e+03 1e+05regularized

vani

lla

N5e+05

1e+06

Balance vs balance. Regularization increases balance.

1

100

0.1 1.0 10.0regularized

vani

lla

N5e+05

1e+06

Running time in seconds. Regularized runs ~8x faster.

more balanced faster

27

Page 28: Community Detection and Sampling on Networks

Regularized Spectral Clustering is more balanced & faster!

28

Summary

Why Spectral Clustering fails?Spectral Clustering likes sets with small conductance.Noises such as g-dangling sets have small conductance.

Why Regularization fixes?Regularized SC likes sets with small CoreCut.CoreCut focuses on the core!

Page 29: Community Detection and Sampling on Networks

Thank you!Any questions?

29

Page 30: Community Detection and Sampling on Networks

Appendix slides…

30

Page 31: Community Detection and Sampling on Networks

31

conductance relaxes to Vanilla SC

[Luxburg et al. 2007]

Page 32: Community Detection and Sampling on Networks

Spectral Clustering is fooled by randomness!

Think regression: What do you say if the model perfectly interpolates the data (MSE = 0)? Overfitting!

Spectral clustering overfits to conductance!

Don’t be fooled by randomness!

32

Page 33: Community Detection and Sampling on Networks

Vanilla SC overfits!

33

Think regression: What do you say if the model perfectly interpolates the data (MSE = 0)? Overfitting!

Vanilla SC overfits to conductance. It’s fooled by randomness.

Page 34: Community Detection and Sampling on Networks

In statistical learning theory, there are tools to address overfitting.

34

Regularize to prevent overfitting.

Cross Validation to • measure overfitting • assess the regularization

We can also do this in graphs!