structural measures of clustering quality on graph...

17
Structural Measures of Clustering Quality on Graph Samples Jianpeng Zhang, Yulong Pei George Fletcher, Mykola Pechenizkiy

Upload: others

Post on 26-Jun-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Structural Measures of Clustering

Quality on Graph Samples

Jianpeng Zhang, Yulong Pei

George Fletcher, Mykola Pechenizkiy

Page 2: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Motivation

New Structural Measures

Main Issues of Evaluation Metrics

Experiment and Analysis

Conclusion & Future Work

Outline

Page 3: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

1. Motivation

• Most graphs in practice are large-scale and/or

streaming. They are too large and can not be

clustered unless we sample a representative

subgraph.

• Challenge: how to evaluate the clustering quality

in the samples of the graph?

3

a) Co-authorship network of

the Los Alamos Condensed

Matter archive[1]

b) Facebook100 graph

of UNC[2] 3

Page 4: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Key Question

Fig. 1. Problem setting. Let S be a sampled subgraph of a

graph G and π(G) be a valid ground-truth clustering. Given

clustering π(S) of S induced by process P, what is the quality

of π(S) with respect to π(G)?

4

Evaluationmetrics

Sampling

G

π(G)

π(S)Clustering process

Are results

representative?

Seeking and defining ground-truth

Page 5: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

2、 Main Issues of Evaluation Metrics

How to assess the clustering quality by using ground-truth

clusters (i.e., common external properties that the members of

given clusters share ) is an essential procedure.

Unsupervised quality metrics (e.g., cut-size) are used as metrics

of clustering quality, but we are not sure whether the quality

metric gives the expected answers compared with the ground-

truth clusters.

Little attention has been paid to evaluation measures for

clustering quality on samples of graphs.

(1) Ground-truth clusters

5

Page 6: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Several classic quality metrics were proposed in the literatures.

However, there is no consensus on their quality and how well

they perform on different kinds of graphs.

These metrics try to identify good clusters by quantifying the

value of metrics which may not be the most meaningful

interpretation for what a good cluster is.

(2) The validity of quality metrics

Proposed solution: defining two novel quality

metrics called σ-precision and σ-recall based on

ground-truth clusters.

2、 Main Issues of Evaluation Metrics

6

Page 7: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

The goal: the clusters in sampling graph S are good

representations of the clusters of the original graph G.

• b is a cluster-set of nodes in π(G)

• bs is a cluster-set of nodes in π(S)

• δ is a predefined match threshold, e.g., 90%.

1) Firstly, we proposed a basic metric based on the

set-matching, i.e., δ-coverage.

3、 New Structural Measures

7

Page 8: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

The coverage of the clustering π(G) is given as the number of

clusters in π(G) which are represented by clusters in π(S).

Higher values of coverage mean the clusters in π(S) are more

consistent with and reflective of the ground-truth clusters in π(G).

6

5

7

3 2

1

0

8 6

5

4

7

3 2

1

0

8

G S

{1}, {0,2,3} ∈ {0,1,2,3}

{5,6,7,8} ∈ {4,5,6,7,8} δ-coverage =2

An illustration example

(δ=1)

8

Page 9: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Assume A and i belong to one partition

block (in red color) while B and j belong

to another block (in blue color).

After sampling and clustering on the

sampled graph, we have two clusters:

(1) A’ and j; (2) B’ and i, where A’∈A,

B’∈B.

Intuitively, the cluster blocks in the

sampled graph can represent the

original graph well

A good example why

we need parameter δ?

The main effect of δ-coverage is that the

measure is more lenient.

Based on δ-coverage, we design our new

metrics.

3、 New Structural Measures

9

A B

i

j

ij

A` B`

G

S

Page 10: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

2) δ-precision and δ-recall

Higher values of δ-precision mean that the clusters in

S are more precisely representative of the clusters in G.

Higher values of δ-recall indicate that clusters in S

more successfully cover clusters in G.

3、 New Structural Measures

10

Page 11: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

6

5

7

3 2

1

0

8 6

5

4

7

3 2

1

0

8

G S

{1}, {0,2,3} ∈ {0,1,2,3}

{5,6,7,8} ∈ {4,5,6,7,8}

δ-coverage =2 δ-precision =2/3

δ-recall =2/2

δ=1 Infer whether results

are representative?

11

An illustration example

Page 12: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

• Case study: streaming graph(a sequence of consecutively

arriving nodes/edges)

• Clustering algorithm: Structural-sampler[3], EAC[4] and METIS [5].

a) Quality test: we compare our new metrics compared with

the classic metrics ,i.e., cut-size and NMI. We run each

algorithm ten times, and then compare the average value of

those metrics.

b) Tuning test: The task is to evaluate the impact of the

random sampling threshold p on the clustering quality in

sampled graph S. We sample and set all edges with a

random value as the same manner as Structural-sampler.

4、Experiments and Analysis

12

Page 13: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Figure shows that when the degree ratio μ changes from 0 to 0.5,

the value of the new metrics follows a downward trend.

The δ-precision and δ-recall metrics are more insightful

expression of clusters structure than the supervised metric NMI.

The unsupervised metric cut-size just estimates the edge-cut

value between the clusters.

4、Experiments and Analysis

a) Test quality on the benchmark datasets

13

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-pre

cis

ion(

=0.9

)

METIS

EAC

Structural-sampler

μ

δ-precision(δ=0.9)

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

-rec

all(

=0.9

)

METIS

EAC

Structural-sampler

δ-recall(δ=0.9)

μ0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

NM

I

METIS

EAC

Structural-sampler

NMI

μ0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

0

100

200

300

400

500

600

700

800

900

1000

Cut

-siz

e

METIS

EAC

Structural-sampler

Cut-size

μ

Page 14: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

b) Tuning test on the benchmark graphs which have

relatively distinct cluster structure (μ = 0.2)

For the online algorithms, the moderate sampling on the graph

with distinct cluster structure makes the cluster structure more

clear and obtain higher value of the δ-precision and δ-recall.

4、Experiments and Analysis

The cut-size decreases gradually while NMI metric just has a

slight increase when the sampling threshold p becomes smaller.

They can not capture the structure change appropriately. 14

δ-precision(δ=0.9)

0.4 0.5 0.6 0.7 0.8 0.9 1

Sample threshold p

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

METIS

EAC

Structural-sampler

δ-p

reci

son

(δ=0

.9)

δ-recall(δ=0.9)

0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

METIS

EAC

Structural-sampler

δ-r

eca

ll(δ

=0.9

)

Sample threshold p

NMI

0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

METIS

EAC

Structural-sampler

NM

I

Sample threshold p

Cut-size

0.4 0.5 0.6 0.7 0.8 0.9 10

100

200

300

400

500

600

700

800

900

METIS

EAC

Structural-sampler

Cu

t-si

ze

Sample threshold p

Page 15: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

We proposed two new structural measures, and they are

effectively reflect the match quality of the clusters in the

sampled graph with respect to the ground-truth clusters

in the original graph.

The experimental results indicate that classic metrics do

not share a common view of what a true clustering should

look like.

Our new metrics have a more insightful results and that

could be helpful when used as a complementary standard

measures.

Conclusion & Future Work

15

Page 16: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Conclusion & Future Work

Our future work

We want to generalize our evaluation framework

and give concrete advise on different sampling

strategies and clustering approaches,

We also plan to extend our study of the fidelity of

the evaluation measures.

16

Page 17: Structural Measures of Clustering Quality on Graph Samplesinformatique.umons.ac.be/dbdbd2016/slides/JianpengZhangTUE.pdf · Structural Measures of Clustering Quality on Graph Samples

Dimension

reduciton

Classification Clustering

Big data

analytics

Streaming

data

Deep learning Manfold

learning

Predictive

analytics

Change

detection

Object

Recognition

Thanks for your attention!!!

Jianpeng Zhang

[email protected]