Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Jeremy Tantrum, Department of Statistics, University of Washington. Joint work with Alejandro Murua (Insightful Corporation) & Werner Stuetzle (University of Washington). This work has been supported by NSA grant 62-1942.


Page 1: Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Jeremy Tantrum, Department of Statistics,

University of Washington

joint work with

Alejandro Murua (Insightful Corporation) & Werner Stuetzle (University of Washington)

This work has been supported by NSA grant 62-1942

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Page 2

Motivating Example

• Consider clustering documents
• Topic Detection and Tracking corpus

• 15,863 news stories for one year from Reuters and CNN
• 25,000 unique words
• Possibly many topics

• Large numbers of observations
• High dimensions
• Many groups

Page 3

Goal of Clustering

[Figure: scatterplot of example data; horizontal axis ticks from 40 to 55, vertical axis ticks from 74 to 84]

Detect that there are 5 or 6 groups
Assign observations to groups

Page 4

NonParametric Clustering

Premise:
• Observations are sampled from a density p(x)
• Groups correspond to modes of p(x)

[Figure: one-dimensional example with a rug plot of the observations and a density curve; horizontal axis from -10 to 10]

Page 5

NonParametric Clustering

Fitting: Estimate p(x) nonparametrically and find significant modes of the estimate

[Figure: the same one-dimensional example with a density estimate; horizontal axis from -10 to 10]

Page 6

Model Based Clustering

Premise:
• Observations are sampled from a mixture density $p(x) = \sum_g \pi_g \, p_g(x)$
• Groups correspond to mixture components

[Figure: the same one-dimensional example; horizontal axis from -10 to 10]

Page 7

Model Based Clustering

Fitting: Estimate $\pi_g$ and the parameters of $p_g(x)$

[Figure: the same one-dimensional example; horizontal axis from -10 to 10]

Page 8

Model Based Clustering

Fitting a Mixture of Gaussians
• Use the EM algorithm to maximize the log likelihood
– Estimates the probabilities of each observation belonging to each group
– Maximizes likelihood given these probabilities
– Requires a good starting point
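As a rough illustration of the two EM steps listed above, the following Python sketch fits a one-dimensional Gaussian mixture. It is not the speaker's implementation: the function names, the toy data, the variance floor and the fixed iteration count are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Run EM for a 1-D Gaussian mixture from the given starting values."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group.
        resp = []
        for x in data:
            dens = [weights[g] * normal_pdf(x, means[g], variances[g]) for g in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: maximize the likelihood given these probabilities.
        for g in range(k):
            n_g = sum(r[g] for r in resp)
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var, var_floor)  # assumed guard against collapse
            weights[g] = n_g / len(data)
    return means, variances, weights


def log_likelihood(data, means, variances, weights):
    """Mixture log likelihood of the data."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for m, v, w in zip(means, variances, weights)))
        for x in data
    )


# Toy data (assumed): two well-separated groups.
random.seed(0)
data = [random.gauss(-4.0, 1.0) for _ in range(200)] + [random.gauss(3.0, 1.0) for _ in range(200)]
start = ([-1.0, 1.0], [4.0, 4.0], [0.5, 0.5])
fitted = em_gaussian_mixture(data, *start)
```

The starting values are passed in explicitly because, as the slide notes, the result depends on them; here a crude guess is enough only because the toy groups are far apart.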

Page 9

Model Based Clustering

Hierarchical Clustering
• Provides a good starting point for the EM algorithm
• Start with every point being its own cluster
• Merge the two closest clusters
– Measured by the decrease in likelihood when those two clusters are merged
– Uses the Classification Likelihood – not the Mixture Likelihood
• Algorithm is quadratic in the number of observations

Page 10

Likelihood Distance

[Figure: two one-dimensional examples, each showing the component densities p1(x) and p2(x) and the merged density p(x) over a rug plot of the observations]
• First example: merge gives small decrease in likelihood
• Second example: merge gives big decrease in likelihood
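The likelihood distance can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, assuming each cluster is modelled by a single Gaussian with maximum-likelihood mean and variance; the function names and the variance floor (needed because a single point has zero variance) are assumptions, not part of the talk.

```python
import math


def classification_loglik(points, var_floor=1e-6):
    """Gaussian log likelihood of one 1-D cluster at its own mean and variance."""
    n = len(points)
    mean = sum(points) / n
    ss = sum((x - mean) ** 2 for x in points)
    var = max(ss / n, var_floor)  # assumed floor for tiny clusters
    return -0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * ss / var


def merge_cost(cluster_a, cluster_b):
    """Decrease in classification log likelihood when two clusters are merged."""
    separate = classification_loglik(cluster_a) + classification_loglik(cluster_b)
    merged = classification_loglik(cluster_a + cluster_b)
    return separate - merged


a = [0.0, 0.2, 0.4, 0.6]
b = [0.1, 0.3, 0.5, 0.7]      # overlaps with a
c = [10.0, 10.2, 10.4, 10.6]  # far from a
near_cost = merge_cost(a, b)
far_cost = merge_cost(a, c)
```

Merging the overlapping pair costs little likelihood, while merging the distant pair costs a lot, so a hierarchical procedure that always merges the cheapest pair would join `a` and `b` first.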

Page 11

Bayesian Information Criterion

• Choose number of clusters by maximizing the Bayesian Information Criterion
– r is the number of parameters
– n is the number of observations
• Log likelihood penalized for complexity
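The formula on this slide did not survive in the transcript. One widely used convention in model-based clustering that matches the description above (a criterion to be maximized, with r parameters and n observations) is given here as an assumption, not as a quotation from the slide:

```latex
\mathrm{BIC} = 2 \log L(\hat{\theta}) - r \log n
```

Here $L(\hat{\theta})$ denotes the maximized likelihood of the fitted mixture. The first term rewards fit, and the second grows with the number of parameters, which is the complexity penalty the slide refers to.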

Page 12

Fractionation

Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.

M is the largest number of observations for which a hierarchical $O(M^2)$ algorithm is computationally feasible.

• Original data – size n
• Split into n/M fractions of size M
• Partition each fraction into $\alpha M$ clusters, $\alpha < 1$
• This gives $\alpha n$ clusters (meta-observations, $\mu_i$)
• If $\alpha n > M$, treat the meta-observations as the data and repeat

Page 13

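Fractionation

With reduction factor $\alpha < 1$:
– $\alpha n$ meta-observations after the first round
– $\alpha^2 n$ meta-observations after the second round
– $\alpha^i n$ meta-observations after the ith round

• For the ith pass, we have $\alpha^{i-1} n/M$ fractions taking $O(M^2)$ operations each
• Total number of operations is: $\sum_{i \ge 1} \alpha^{i-1} \frac{n}{M} \, O(M^2) = O\left(\frac{nM}{1-\alpha}\right)$
• Total running time is linear in n!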
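The fractionation loop described above can be sketched as follows. This is a toy one-dimensional version under several assumptions: `alpha` stands for the reduction factor (each fraction of size M is reduced to about alpha times M clusters), fractions are formed by random shuffling, and a naive closest-means agglomeration stands in for the hierarchical algorithm. The naive step is slower than quadratic and is used only to keep the example short; none of the names come from the talk.

```python
import random


def agglomerate(items, k):
    """Merge (size, mean) items pairwise, closest means first, until k remain."""
    clusters = list(items)
    while len(clusters) > k:
        # Scan all pairs for the two closest cluster means.
        _, i, j = min(
            (abs(clusters[a][1] - clusters[b][1]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        (n_i, m_i), (n_j, m_j) = clusters[i], clusters[j]
        merged = (n_i + n_j, (n_i * m_i + n_j * m_j) / (n_i + n_j))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


def fractionate(data, M, alpha, seed=0):
    """Reduce 1-D data to at most M meta-observations, one fraction at a time."""
    assert M >= 2 and 0.0 < alpha < 1.0
    rng = random.Random(seed)
    items = [(1, x) for x in data]  # every point starts as its own cluster
    per_fraction = max(1, int(alpha * M))
    while len(items) > M:
        rng.shuffle(items)
        fractions = [items[s:s + M] for s in range(0, len(items), M)]
        items = [c for frac in fractions for c in agglomerate(frac, per_fraction)]
    return items


# Toy data (assumed): two groups of 100 points each.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(20.0, 1.0) for _ in range(100)]
metas = fractionate(data, M=20, alpha=0.25)
final = sorted(agglomerate(metas, 2), key=lambda c: c[1])
```

Each pass only ever clusters at most M items at once, which is what keeps the total cost proportional to n; the final call clusters the remaining meta-observations directly.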

Page 14

Model Based Fractionation

• Use model based clustering
• Meta-observations contain all sufficient statistics – $(n_i, \mu_i, \Sigma_i)$
– $n_i$ is the number of observations – size
– $\mu_i$ is the mean – location
– $\Sigma_i$ is the covariance matrix – shape and volume
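Because a meta-observation carries its count, mean and covariance, two meta-observations can be merged without revisiting the raw data. The sketch below shows the standard pooling identity for maximum-likelihood (divide-by-n) covariances; the tuple layout and function names are assumptions for illustration.

```python
def summarize(points):
    """Return (n, mean, covariance) for a list of d-dimensional points (MLE covariance)."""
    n = len(points)
    d = len(points[0])
    mu = [sum(p[r] for p in points) / n for r in range(d)]
    cov = [
        [sum((p[r] - mu[r]) * (p[c] - mu[c]) for p in points) / n for c in range(d)]
        for r in range(d)
    ]
    return n, mu, cov


def merge_meta(a, b):
    """Combine two meta-observations (n, mean, cov) into one, without the raw data."""
    n_a, mu_a, cov_a = a
    n_b, mu_b, cov_b = b
    n = n_a + n_b
    d = len(mu_a)
    mu = [(n_a * mu_a[r] + n_b * mu_b[r]) / n for r in range(d)]
    diff = [mu_a[r] - mu_b[r] for r in range(d)]
    cov = [
        [
            # Pooled within-cluster part plus the between-means part.
            (n_a * cov_a[r][c] + n_b * cov_b[r][c]) / n
            + (n_a * n_b / n ** 2) * diff[r] * diff[c]
            for c in range(d)
        ]
        for r in range(d)
    ]
    return n, mu, cov


a_pts = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]
b_pts = [[5.0, 5.0], [6.0, 8.0]]
merged = merge_meta(summarize(a_pts), summarize(b_pts))
direct = summarize(a_pts + b_pts)
```

The merged statistics agree with those computed directly from the pooled points, which is why later passes can work on meta-observations alone.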

Page 15

Model Based Fractionation

An example, 400 observations in 4 groups

[Figure sequence on common axes (horizontal ticks from 1.0 to 2.0, vertical ticks from 0.5 to 2.5):]
• Observations in the first fraction
• 10 meta-observations from the first fraction
• 10 meta-observations from the second fraction
• 10 meta-observations from the third fraction
• 10 meta-observations from the fourth fraction
• The 40 meta-observations
• The final clusters chosen by BIC

Success!

Page 16

Example 2

The data – 400 observations in 25 groups

[Figure sequence on common axes (ticks from 1 to 5 on both axes):]
• Observations in fraction 1
• 10 meta-observations from the first fraction
• 10 meta-observations from the second fraction
• 10 meta-observations from the third fraction
• 10 meta-observations from the fourth fraction
• The 40 meta-observations
• The clusters chosen by BIC

Fractionation fails!

Page 17

Refractionation

Problem:
• If the number of meta-observations generated from a fraction is less than the number of groups in that fraction then two or more groups will be merged.
• Once observations from two groups are merged they can never be split again.

Solution:
• Apply fractionation repeatedly.
• Use meta-observations from the previous pass of fractionation to create “better” fractions.
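The slide does not spell out how the better fractions are built. One plausible reading, consistent with the worked example in this talk (meta-observations are clustered, and each cluster of meta-observations defines a new fraction), is sketched below. The data layout and the names `obs_to_meta` and `meta_to_cluster` are hypothetical.

```python
from collections import defaultdict


def build_new_fractions(obs_to_meta, meta_to_cluster):
    """Group observation indices by the cluster of the meta-observation they belong to."""
    fractions = defaultdict(list)
    for obs_index, meta_id in enumerate(obs_to_meta):
        fractions[meta_to_cluster[meta_id]].append(obs_index)
    return [fractions[c] for c in sorted(fractions)]


# Six observations summarized by four meta-observations (assumed toy assignment).
obs_to_meta = [0, 0, 1, 2, 2, 3]
# The four meta-observations clustered into two groups, one per new fraction.
meta_to_cluster = {0: 0, 1: 1, 2: 0, 3: 1}
new_fractions = build_new_fractions(obs_to_meta, meta_to_cluster)
```

Under this reading, observations that were summarized by nearby meta-observations end up in the same fraction, so each new fraction should contain fewer distinct groups than a random one.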

Page 18

Example 2 Continued

[Figure sequence on common axes (ticks from 1 to 5 on both axes):]
• The 40 meta-observations
• 4 new clusters
• 4 new fractions

Page 19

Example 2 – Pass 2

[Figure sequence on common axes (ticks from 1 to 5 on both axes):]
• Observations in the new fraction 1
• Clusters from the first fraction
• Clusters from the second fraction
• Clusters from the third fraction
• Clusters from the fourth fraction
• The 40 meta-observations
• Clusters chosen by BIC

Page 20

Example 2 – Pass 3

[Figure sequence on common axes (ticks from 1 to 5 on both axes):]
• The 40 meta-observations of pass 2 of fractionation
• 4 new clusters
• 4 new fractions
• Observations in the new fraction 1
• Clusters from the first fraction
• Clusters from the second fraction
• Clusters from the third fraction
• Clusters from the fourth fraction
• The 40 meta-observations
• Clusters chosen by BIC

Refractionation Succeeds

Page 21

Realistic Example
• 1100 documents from the TDT corpus partitioned by people into 19 topics
– Transformed into 50 dimensional space using Latent Semantic Indexing

[Figure: projection of the data onto a plane; colors represent topics]

Page 22

Realistic Example
Want to create a dataset with more observations and more groups
Idea: Replace each group with a scaled and transformed version of the entire data set.


Page 24

Realistic Example
To measure similarity of clusters to groups: Fowlkes-Mallows index
• Geometric average of:

– Probability of 2 randomly chosen observations from the same cluster being in the same group

– Probability of 2 randomly chosen observations from the same group being in the same cluster

• Fowlkes–Mallows index near 1 means clusters are good estimates of the groups

• Clustering the 1100 documents gives a Fowlkes–Mallows index of 0.76 – our “gold standard”
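The index described above can be computed from two label vectors by counting pairs. The sketch below is a plain-Python illustration; the function names and the toy labels are assumptions, and it presumes that at least one cluster and at least one group contain two or more observations (otherwise the denominators are zero).

```python
import math
from collections import Counter


def pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2


def fowlkes_mallows(groups, clusters):
    """Geometric mean of Pr(same group | same cluster) and Pr(same cluster | same group)."""
    both = sum(pairs(c) for c in Counter(zip(groups, clusters)).values())
    same_group = sum(pairs(c) for c in Counter(groups).values())
    same_cluster = sum(pairs(c) for c in Counter(clusters).values())
    p_group_given_cluster = both / same_cluster
    p_cluster_given_group = both / same_group
    return math.sqrt(p_group_given_cluster * p_cluster_given_group)


groups = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
score = fowlkes_mallows(groups, clusters)
perfect = fowlkes_mallows(groups, groups)
```

In the toy example 4 pairs share both a group and a cluster, 6 pairs share a group and 7 pairs share a cluster, so the index is 4 divided by the square root of 42; identical labelings give exactly 1.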

Page 25

Realistic Example
• 19 × 19 = 361 clusters, 19 × 1100 = 20900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation choosing 361 clusters

Distribution of the number of groups per fraction (nf = number of fractions):

Pass   Min   Median   Max   nf
1      270   289      296   20
2      18    88       150   18
3      18    19       60    17
4      19    19       58    16

Page 26

Realistic Example
• 19 × 19 = 361 clusters, 19 × 1100 = 20900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation choosing 361 clusters

Pass   Fowlkes-Mallows   Purity of the clusters
1      0.325             1729
2      0.554             908
3      0.616             671
4      0.613             651

Purity of the clusters is the sum of the number of groups represented in each cluster:
• 361 is perfect
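The purity measure on this slide, the sum over clusters of the number of distinct groups each cluster contains, is simple to compute from two label vectors. The following sketch uses an assumed function name and toy labels.

```python
from collections import defaultdict


def groups_per_cluster_total(groups, clusters):
    """Sum, over clusters, of how many distinct true groups appear in each cluster."""
    seen = defaultdict(set)
    for group, cluster in zip(groups, clusters):
        seen[cluster].add(group)
    return sum(len(members) for members in seen.values())


groups = [0, 0, 1, 1, 2, 2]
pure = groups_per_cluster_total(groups, [0, 0, 1, 1, 2, 2])   # one group per cluster
mixed = groups_per_cluster_total(groups, [0, 0, 0, 1, 1, 1])  # group 1 split over two clusters
```

With as many clusters as groups, the smallest possible total equals the number of clusters, which is why 361 is the perfect score for 361 clusters.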

Page 27

Realistic Example
• 19 × 19 = 361 clusters, 19 × 1100 = 20900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation choosing 361 clusters

Refractionation:
• Purifies fractions
• Successfully deals with the case where the number of groups is greater than $\alpha M$, the number of meta-observations per fraction

Page 28

Contributions

• Model Based Fractionation
– Extended fractionation idea to parametric setting
• Incorporates information about size, shape and volume of clusters
• Chooses number of clusters
– Still linear in n
• Model Based ReFractionation
– Extended fractionation to handle larger number of groups

Page 29

Extensions

• Extend to 100,000s of observations – 1000s of groups
– Currently the number of groups must be less than M
• Extend to a more flexible class of models
– With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full covariance model
– Mixture of Factor Analyzers

Page 30

Page 31

Fowlkes-Mallows Index

Pr(2 documents in same group | they are in the same cluster)

Pr(2 documents in same cluster | they are in the same group)

Contingency table of true groups (rows) against clusters (columns):

Group   1     2     …   I     Total
1       n11   n12   …   n1I   n1·
2       n21   n22   …   n2I   n2·
…       …     …     …   …     …
J       nJ1   nJ2   …   nJI   nJ·
Total   n·1   n·2   …   n·I   n