hierarchical model-based clustering of large datasets through fractionation and refractionation...

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation. Advisor: Dr. Hsu. Graduate: You-Cheng Chen. Authors: Jeremy Tantrum, Alejandro Murua, Werner Stuetzle.

Post on 18-Jan-2016

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Advisor: Dr. Hsu
Graduate: You-Cheng Chen
Authors: Jeremy Tantrum, Alejandro Murua, Werner Stuetzle

Outline

Motivation
Objective
Introduction
Model-based Fractionation
Model-based ReFractionation
Example
Conclusions
Personal Opinion

Motivation

Propose an extended method that improves the performance of model-based clustering and makes it applicable to large datasets.

Objective

Apply Fractionation and Refractionation to model-based clustering.

Introduction

Model-based clustering in a nutshell

Sample: $x_1, x_2, \ldots, x_n$

$p_g(x)$ is the density modeling group $g$

$\pi_g$ is the prior probability that a randomly chosen observation belongs to group $g$

$$p(x) = \sum_{g=1}^{G} \pi_g \, p_g(x)$$

Introduction

Model-based clustering in a nutshell

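We can use the Approximate Weight of Evidence (AWE) to estimate the number of groups.

$$\hat{G} = \arg\max_{G} \left( 2L(G) - 2r\left(\tfrac{3}{2} + \log n\right) \right)$$

where

$$L(G) = \sum_{i=1}^{n} \log \sum_{g=1}^{G} \pi_g \, \phi(x_i; \mu_g, \Sigma_g)$$

is the log-likelihood with Gaussian group densities $p_g(x) = \phi(x; \mu_g, \Sigma_g)$, $r$ is the number of model parameters, and $n$ is the sample size.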
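As a rough illustration of the AWE rule, the sketch below evaluates the mixture log-likelihood and picks the number of groups with the largest score. It assumes univariate Gaussian groups for brevity; the function names (`mixture_loglik`, `awe`, `choose_G`) and the dictionary layout are hypothetical and not taken from the slides.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution (assumed group model)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_loglik(xs, weights, means, variances):
    """L(G) = sum_i log sum_g pi_g * phi(x_i; mu_g, sigma_g^2)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def awe(loglik, r, n):
    """AWE score in the form used here: 2 L(G) - 2 r (3/2 + log n)."""
    return 2 * loglik - 2 * r * (1.5 + math.log(n))

def choose_G(logliks, n_params, n):
    """Return the candidate G with the largest AWE score.

    logliks and n_params are dicts keyed by candidate G (hypothetical layout).
    """
    return max(logliks, key=lambda G: awe(logliks[G], n_params[G], n))
```

In use, one would fit a mixture for each candidate G, record its log-likelihood and parameter count, and pass both to `choose_G`; the penalty term grows with the number of parameters, so a larger G wins only if it raises the likelihood enough.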

Introduction

Previous work on model-based clustering for large datasets

The Scalable EM (SEM) algorithm can be used to fit mixture models to large datasets, but it cannot estimate the number of groups.

The simplest and potentially fastest approach is to draw a sample of the data.

2. Fractionation

Original Fractionation algorithm

1 Split the data into fractions of size M.

2 Cluster each fraction into a fixed number αM of clusters, where α < 1. Summarize each cluster by its mean. We refer to these cluster means as meta-observations.

3 If the total number of meta-observations is greater than M, return to step 1, with the meta-observations taking the place of the data.

4 Cluster the meta-observations into G clusters.

5 Assign each individual observation to the cluster with the closest mean.
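The five steps can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the slides do not say which clustering method is applied inside each fraction, so a simple centroid-linkage agglomeration (`agglomerate`, a hypothetical helper) stands in for it, meta-observations are treated as plain unweighted points, and fractions are taken in the order the data arrive.

```python
import math

def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def agglomerate(points, k):
    """Stand-in clusterer (assumption): repeatedly merge the two clusters
    with the closest centroids until at most k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        pairs = ((math.dist(cents[i], cents[j]), i, j)
                 for i in range(len(cents)) for j in range(i + 1, len(cents)))
        _, i, j = min(pairs)
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

def fractionation(data, M, alpha, G):
    """Return (cluster means, label per observation)."""
    if M < 2 or not 0 < alpha < 1:
        raise ValueError("need M >= 2 and 0 < alpha < 1")
    k = max(1, int(alpha * M))  # clusters produced per fraction
    current = list(data)
    while True:
        # Step 1: split into fractions of size M (the last one may be smaller).
        fractions = [current[s:s + M] for s in range(0, len(current), M)]
        # Step 2: cluster each fraction and keep the cluster means.
        current = [centroid(c) for f in fractions for c in agglomerate(f, k)]
        # Step 3: repeat while there are more than M meta-observations.
        if len(current) <= M:
            break
    # Step 4: cluster the meta-observations into G clusters.
    centers = [centroid(c) for c in agglomerate(current, G)]
    # Step 5: assign every observation to the closest cluster mean.
    labels = [min(range(len(centers)), key=lambda g: math.dist(x, centers[g]))
              for x in data]
    return centers, labels
```

Because each full fraction of M points is reduced to fewer than M means, the number of meta-observations shrinks on every pass and the loop terminates.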

2.1 Model-based Fractionation

Main differences:

In model-based Fractionation, each cluster is represented by its sufficient statistics: the mean, the covariance, and the number of observations.

AWE is used to determine the number of clusters in Step 4.
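One reason the mean, covariance, and count suffice as a summary is that two clusters can be merged from these statistics alone, without revisiting the raw observations. The sketch below illustrates this with the standard pooling identities; the function names are hypothetical, and the covariance is assumed to be the maximum-likelihood estimate (dividing by n).

```python
def suff_stats(points):
    """Return (n, mean, cov) for a list of equal-length tuples.
    cov is the maximum-likelihood estimate (divides by n)."""
    n = len(points)
    d = len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(d)] for a in range(d)]
    return n, mean, cov

def merge_stats(s1, s2):
    """Combine the sufficient statistics of two clusters.

    Total scatter = n1*C1 + n2*C2 + (n1*n2/n) * (m1 - m2)(m1 - m2)^T.
    """
    n1, m1, c1 = s1
    n2, m2, c2 = s2
    n = n1 + n2
    d = len(m1)
    mean = [(n1 * m1[a] + n2 * m2[a]) / n for a in range(d)]
    diff = [m1[a] - m2[a] for a in range(d)]
    cov = [[(n1 * c1[a][b] + n2 * c2[a][b]
             + (n1 * n2 / n) * diff[a] * diff[b]) / n
            for b in range(d)] for a in range(d)]
    return n, mean, cov
```

Merging the statistics of two clusters gives the same result as computing the statistics of their union directly, which is what lets meta-observations stand in for the observations they summarize.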

3. Model-based ReFractionation

Step 4 of the Fractionation algorithm is replaced by steps 4a and 4b.

4a Cluster the meta-observations into G clusters, where G is determined by the AWE criterion.

4b Define the fractions for the i-th pass.

3.1 Illustration

M = 100, fractions = 4, meta-observations = 40

3.1 Illustration

Step 4a: use AWE to find G = 25

Step 4b

3.1 Illustration

Second pass

3.1 Illustration

2nd pass, 3rd pass

3.2 Scope of (Re)Fractionation

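Let
$n_g$ be the number of groups in the data,
$n_f$ be the number of fractions,
$n_c$ be the number of clusters generated from each fraction in Step 2.

If $n_g > n_c$, Step 2 can lead to impure clusters: a fraction containing observations from more than $n_c$ groups cannot be split into $n_c$ pure clusters.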

4. Example

4.1 Measuring the agreement between groups and clusters

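$$\text{Fowlkes-Mallows index} = \frac{\sum_{g,j} \binom{n_{gj}}{2}}{\sqrt{\sum_{g} \binom{n_{g\cdot}}{2} \, \sum_{j} \binom{n_{\cdot j}}{2}}}$$

where $n_{gj}$ is the number of observations from group $g$ assigned to cluster $j$, and a dot denotes summation over that index.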
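The Fowlkes-Mallows index can be computed directly from two label lists by counting pairs of observations. The following is a small sketch of the standard definition; the function name and the convention of returning 0.0 when no within-group or within-cluster pairs exist are assumptions made for illustration.

```python
from collections import Counter
from math import comb, sqrt

def fowlkes_mallows(groups, clusters):
    """Fowlkes-Mallows index between true groups and cluster labels."""
    def pairs(counts):
        # Number of unordered pairs inside each cell, summed over cells.
        return sum(comb(c, 2) for c in counts)

    n_gj = Counter(zip(groups, clusters))  # contingency table cells
    n_g = Counter(groups)                  # row totals (group sizes)
    n_j = Counter(clusters)                # column totals (cluster sizes)
    denom = sqrt(pairs(n_g.values()) * pairs(n_j.values()))
    # Assumed convention: 0.0 if either partition consists of singletons only.
    return pairs(n_gj.values()) / denom if denom > 0 else 0.0
```

The index equals 1 when the clusters reproduce the groups exactly, regardless of how the labels are named, and moves toward 0 as fewer same-group pairs end up in the same cluster.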

4.3 Example 1

Groups = 19, n = 22000, M = 1000, clusters = 100

4.3 Example 3

Groups = 361, n = 20900, M = 1045, clusters = 100

Conclusions

We can study the performance of the AWE criterion for estimating the number of groups in a mixture of factor analyzers model.

Personal Opinion

We can borrow the strengths of other clustering methods to remedy the weaknesses of our own.