18th international conference on database and expert systems applications journey to the centre of...

29
18th International Conference on Database and Expert Systems 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star Clustering Tok Wee Hyong Derry Tanti Wijaya Stéphane Bressan

Upload: shanna-walker

Post on 23-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

18th International Conference on Database and Expert Systems Applications

Journey to the Centre of the Star:Various Ways of Finding Star Centers in

Star Clustering

Tok Wee HyongDerry Tanti WijayaStéphane Bressan

Page 2: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Vector Space Clustering

• Naturally translates into a graph clustering problem for a dense graph

Vectors

Weight is cosine of

corresponding vectors

Page 3: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Star Clustering for Graph [1]

• Computes vertex cover by a simple computation of star-shaped dense sub-graphs

1. Lower weight edges are pruned

2. Vertices with higher degree (that are not satellites) are chosen in turn as Star centers

3. Vertices connected to a center become satellites

4. Algorithm terminates when every vertex is either a center or a satellite

5. Each center and its satellites form a cluster

Page 4: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Star Clustering

• Does not require the indication of an a priori number of clusters

• Allows clusters to overlap• Analytically guarantees a lower bound on the

similarity between objects in each cluster• Computes more accurate clusters than either the

single or average link hierarchical clustering

Page 5: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Star Clustering

• Two critical elements:

• Threshold for pruning edges (σ)• Metrics for selecting Star centers

• Aslam et al. [1] derived the theoretical lower bound on the expected similarity between two satellites in a cluster

• Empirically shown to be a good estimate of the actual similarity

• Current metrics for selecting Star centers does not leverage this finding

Our focus is on the metrics for selecting Star centers

Page 6: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Extended Star Clustering

• Choose Star centers using complement degree of vertices

• Allow Star centers to be adjacent to one another• Has two versions: unrestricted and restricted

Page 7: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Our proposal

• Degree may not be the best metrics• We propose metrics that considers weights of

edges in order to maximize intra-cluster similarity:• Markov Stationary Distribution• Lower Bound• Average• Sum

Page 8: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Markov Stationary Distribution

• Similar to the idea of Google’s Page Rank algorithm [2]

Method:• Similarity graph is normalized into a symmetric

Markov matrix• Compute the stationary distribution of the matrix

A* = (I – A) -1

• Vertices are sorted by their stationary values and chosen in turn as Star centers

Page 9: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Lower Bound

• Theoretical lower bound on expected similarity between satellite vertices:

cos(γi,j) ≥ cos(αi) cos(αj)+ (σ / σ + 1) sin(αi) sin(αj)

• Can be used to estimate the average intra-cluster similarity

• Lower bound metric is the estimated average intra-cluster similarity when v is a Star center and v.adj are its satelliteslb (v) = ((Σvi v.adj cos(αi)) 2 + (σ / σ + 1) (Σvi v.adj sin(αi)) 2) / n2

• Computed on the pruned graph

Page 10: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Average and Sum

• Approximations of the lower bound metric• Computed on the pruned graph• For each vertex v,

ave (v) = Σvi v.adj∈ cos(αi) / degree(v)

sum (v) = Σvi v.adj∈ cos(αi) • Average metric is the square root of the first

term in the lower bound metric

Page 11: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Markov, Lower Bound, Average, Sum Metrics

• We integrate our proposed metrics in the Star algorithm and its variants to produce:• Star-lb• Star-sum• Star-ave• Star-markov• Star-extended-sum-(r)• Star-extended-ave-(r) • Star-extended-sum-(u) • Star-extended-ave-(u) • Star-online-sum• Star-online-ave

Page 12: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Experiments

• Compare performance with off-line and on-line Star clustering and restricted and unrestricted Extended Star clustering

• Use data from Reuters-21578, Tipster-AP, and our original collection: Google

• Measure effectiveness: recall, precision, F1• Measure efficiency: running time • Measure sensitivity to σ

Page 13: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Off-line Algorithms

• Star-lb and Star-ave are most effective but Star-ave is much more efficient

• Star-random performs comparably to original Star when threshold σ is the average similarity

Page 14: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Off-line Algorithms

Effectiveness comparison

0

0.2

0.4

0.6

0.8

1

1.2

star

star-

ave

star-

sum

star-

mark

ov

star-

random

star-

lb

star

star-

ave

star-

sum

star-

mark

ov

star-

random

star-

lb

star

star-

ave

star-

sum

star-

mark

ov

star-

random

star-

lb

reuters tipster-ap google

PrecisionRecallF1

Page 15: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Off-line Algorithms

Efficiency comparison

0

50000

100000

150000

200000

250000

300000

350000

400000st

ar

star-

ave

star-

sum

star-

ma

rkov

star-

rand

om

star-

lb

star

star-

ave

star-

sum

star-

ma

rkov

star-

rand

om

star-

lb

star

star-

ave

star-

sum

star-

ma

rkov

star-

rand

om

star-

lb

reuters tipster-ap google

Tim

e (

ms)

Page 16: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Order of Stars

• We empirically demonstrate that Star-ave indeed approximates Star-lb better than other algorithms by a similar choice of Star centers

Page 17: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Order of Stars (on Tipster-AP)

0

200

400

600

800

1000

1200

1400

1600

1800

2000

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52

iteration

exp

ecte

d s

imil

arit

y ra

nk

starstar-sumstar-markovstar-avestar-lb

Page 18: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Sensitivity to σ

As compared to the original Star:• Star-ave and Star-markov converge to a

maximum F1 at a lower threshold • The maximum F1 of Star-ave is higher • F1 gradient of Star-ave and Star-markov is

smaller

Page 19: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Sensitivity to σ (F1 on Reuters)

0

0.2

0.4

0.6

0.8

1

1.20

0.2

0.6

1 (m

ean) 1.2

1.6 2

2.5 3

3.5 4

4.5 5

5.5 6

6.5 7

7.5 8

8.5 9

9.5 10 20

s

F1

starstar-avestar-markovstar-sumstar-lb

σ

Page 20: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Sensitivity to σ (F1 gradient on Reuters)

0

0.1

0.2

0.3

0.4

0.5

0.60

0.2

0.6

1 (

me

an)

1.2

1.6 2

2.5 3

3.5 4

4.5 5

5.5 6

6.5 7

7.5 8

8.5 9

9.5 10

20

s

starstar-avestar-markovstar-sumstar-lb

σ

Page 21: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Extended Star

• Star-ave is more effective and efficient than Star-extended-(r)

• Star-extended-ave-(r) improves the effectiveness of Star-extended-(r)

• Similar findings are observed with Star-extended-(u)

Page 22: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Extended Star

Effectiveness comparison

0

0.2

0.4

0.6

0.8

1

1.2

sta

r-ave

sta

r-exte

nd

ed-(

r)

sta

r-exte

nded-a

ve-(

r)

sta

r-exte

nded-s

um

-(r)

sta

r-ave

sta

r-exte

nd

ed-(

r)

sta

r-exte

nded-a

ve-(

r)

sta

r-exte

nded-s

um

-(r)

sta

r-ave

sta

r-exte

nd

ed-(

r)

sta

r-exte

nded-a

ve-(

r)

sta

r-exte

nded-s

um

-(r)

reuters tipster-ap2 google

PrecisionRecallF1

Page 23: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Extended Star

Efficiency comparison

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000st

ar-a

ve

star

-ext

ende

d-(r

)

star

-ext

ende

d-av

e-(r

)

star

-ext

ende

d-su

m-(

r)

star

-ave

star

-ext

ende

d-(r

)

star

-ext

ende

d-av

e-(r

)

star

-ext

ende

d-su

m-(

r)

star

-ave

star

-ext

ende

d-(r

)

star

-ext

ende

d-av

e-(r

)

star

-ext

ende

d-su

m-(

r)

reuters tipster-ap2 google

Tim

e (m

s)

Page 24: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

On-line Algorithms

• Star-online-ave is more effective and efficient than the original Star on-line algorithm

Page 25: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

On-line Algorithms

0

0.2

0.4

0.6

0.8

1

1.2st

ar-o

nlin

e

star

-onl

ine-

ave

star

-onl

ine-

sum

star

-onl

ine-

rand

om

star

-onl

ine

star

-onl

ine-

ave

star

-onl

ine-

sum

star

-onl

ine-

rand

om

star

-onl

ine

star

-onl

ine-

ave

star

-onl

ine-

sum

star

-onl

ine-

rand

om

reuters tipster-ap google

PrecisionRecallF1

Effectiveness comparison

Page 26: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

On-line Algorithms

Efficiency comparison

0

50000

100000

150000

200000

250000

300000

350000

400000st

ar-

on

line

sta

r-o

nlin

e-

ave

sta

r-o

nlin

e-

sum

sta

r-o

nlin

e-

ran

do

m

sta

r-o

nlin

e

sta

r-o

nlin

e-

ave

sta

r-o

nlin

e-

sum

sta

r-o

nlin

e-

ran

do

m

sta

r-o

nlin

e

sta

r-o

nlin

e-

ave

sta

r-o

nlin

e-

sum

sta

r-o

nlin

e-

ran

do

m

reuters tipster-ap google

Tim

e (

ms

)

Page 27: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Conclusion

• Current metrics for selecting Star centers is not optimal

• We propose various new metrics for selecting Star centers that maximize intra-cluster similarity

• Average metrics is a fast and good approximation of lower bound metrics

• Since intra-cluster similarity is maximized, it is precision that is mostly improved

• Our proposed average metrics yield up to 19.1% improvement on precision for off-line algorithms, 20.9% improvement on precision for on-line algorithms, and 102% improvement on precision for extended star algorithm

Page 28: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

References

1. Aslam, J., Pelekhov, K., Rus, D.: The Star Clustering Algorithm. In Journal of Graph Algorithms and Applications, 8(1) 95–129 (2004)

2. Brin Sergey, Page Lawrence: The anatomy of a large-scale hypertextual Web search engine. Proceedings of the seventh international conference on World Wide Web 7, 107-117 (1998)

Page 29: 18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star

18th International Conference on Database and Expert Systems Applications

Credits

This work was funded

by the National University of Singapore ARG project R-252-000-285-112,

"Mind Your Language: Corpora and Algorithms

for Fundamental Natural Language Processing

Tasks in Information Retrieval

and Extraction for the Indonesian

and Malay languages"

Copyright © 2007 by Stéphane Bressan