sawtooth 2012 what's in a label

What’s in a Label? Business value of “soft” vs “hard” cluster ensembles

solutions-2

Nicole Huyghe & Anita Prinzie





Answers the who and the why

[Diagram: Theme 1, Theme 2, Theme 3, ..., Theme 9, Theme 10; Cluster Ensemble]

HARD OR SOFT CLUSTER ENSEMBLE

Stability, Integrity, Accuracy, Size

Stability

Similarity Index (Lange et al., 2004): the percentage of pairs of observations that belong to the same cluster in both clustering C and clustering C’.
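A minimal sketch of the pair-based reading of the similarity index given above. It assumes the denominator is the number of all observation pairs, which the slide does not state, and the function name `pair_similarity_index` is hypothetical. It is not the full stability procedure of Lange et al. (2004).

```python
from itertools import combinations


def pair_similarity_index(labels_c, labels_c_prime):
    """Share of observation pairs placed in the same cluster by both clusterings.

    Assumption: the share is taken over all pairs of observations.
    """
    if len(labels_c) != len(labels_c_prime):
        raise ValueError("both clusterings must label the same observations")
    pairs = list(combinations(range(len(labels_c)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    together_in_both = sum(
        1
        for i, j in pairs
        if labels_c[i] == labels_c[j] and labels_c_prime[i] == labels_c_prime[j]
    )
    return together_in_both / len(pairs)


# Same partition with swapped label names: 2 of the 6 pairs are together in both.
example = pair_similarity_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because only pair co-membership is compared, the label names used by the two clusterings do not need to match.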

Cluster Integrity – Heterogeneity

Total separation of clusters: based on the distance between cluster centers
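A sketch of one way to compute a total separation from cluster centers. It assumes the measure is the total-separation term of the SD validity index in Halkidi et al. (2000), which the slide does not spell out; the function name `total_separation` is hypothetical.

```python
import math


def total_separation(centers):
    """(D_max / D_min) * sum over centers k of 1 / (sum of distances from k to all centers).

    Assumption: Euclidean distance between cluster centers.
    """
    k = len(centers)
    if k < 2:
        raise ValueError("need at least two cluster centers")
    # Full distance matrix between centers (diagonal entries are zero).
    dist = [[math.dist(a, b) for b in centers] for a in centers]
    off_diagonal = [dist[i][j] for i in range(k) for j in range(k) if i != j]
    d_max, d_min = max(off_diagonal), min(off_diagonal)
    if d_min == 0:
        raise ValueError("two cluster centers coincide")
    return (d_max / d_min) * sum(1.0 / sum(row) for row in dist)


# Two centers at distance 5: ratio 5/5 = 1, sum = 1/5 + 1/5.
example = total_separation([(0.0, 0.0), (3.0, 4.0)])
```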

Cluster Integrity – Homogeneity

Scatter (compactness): average ratio of the cluster variance to the variance of the dataset.
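The scatter definition above can be sketched as follows. The sketch assumes the form used in the SD validity index of Halkidi et al. (2000): per-dimension population variances, compared through their Euclidean norm. The helper names `variance_vector` and `scatter` are hypothetical.

```python
import math
from collections import defaultdict


def variance_vector(points):
    """Per-dimension population variance of a list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dims)]


def scatter(points, labels):
    """Average over clusters of ||variance(cluster)|| / ||variance(dataset)||."""
    dataset_norm = math.hypot(*variance_vector(points))
    if dataset_norm == 0:
        raise ValueError("dataset has zero variance")
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    ratios = [
        math.hypot(*variance_vector(members)) / dataset_norm
        for members in clusters.values()
    ]
    return sum(ratios) / len(ratios)


# Two tight, well-separated clusters: each has variance 1, the dataset has variance 26.
example = scatter([[0.0], [2.0], [10.0], [12.0]], [0, 0, 1, 1])
```

Under this reading, compact clusters give a value close to 0.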

Accuracy

Adjusted Rand Index (Hubert and Arabie, 1985): level of agreement between the predicted segment and the real segment correcting for the expected level of agreement.
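As an illustration, the Adjusted Rand Index can be computed from pair counts in the contingency table of real versus predicted segments. This is a standard-library sketch of the Hubert and Arabie (1985) formula; the function and variable names are illustrative, not taken from the slides.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(real, predicted):
    """Agreement between two partitions, corrected for the expected agreement."""
    n = len(real)
    if n != len(predicted) or n < 2:
        raise ValueError("need two labelings of the same (at least two) observations")
    # Pairs that share a cell, a row, or a column of the contingency table.
    pairs_cells = sum(comb(c, 2) for c in Counter(zip(real, predicted)).values())
    pairs_real = sum(comb(c, 2) for c in Counter(real).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_real * pairs_pred / comb(n, 2)
    maximum = (pairs_real + pairs_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one segment.
        return 1.0
    return (pairs_cells - expected) / (maximum - expected)


# Identical segments under different label names give perfect agreement (1.0).
example = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

A value of 1 means perfect agreement, and values around 0 mean agreement no better than expected by chance.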

[Figure: segment membership of observations 1 to 9, Reality vs. Prediction]

Size

Uniformity deviation: average deviation of each segment's size from the uniform segment size (1/number of segments).
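A small sketch of the uniformity deviation as described above. It assumes segment sizes are expressed as shares of all observations and that "deviation" means absolute deviation; neither is stated on the slide. The function name `uniformity_deviation` and the optional `n_segments` parameter are hypothetical.

```python
from collections import Counter


def uniformity_deviation(labels, n_segments=None):
    """Mean absolute deviation of segment shares from the uniform share 1/k.

    Assumption: segments that received no observation count as share 0 when
    n_segments is given explicitly.
    """
    counts = Counter(labels)
    k = n_segments if n_segments is not None else len(counts)
    if k < len(counts) or k == 0 or not labels:
        raise ValueError("invalid number of segments or empty labeling")
    uniform_share = 1.0 / k
    shares = [count / len(labels) for count in counts.values()]
    shares += [0.0] * (k - len(counts))  # empty segments, if any
    return sum(abs(share - uniform_share) for share in shares) / k


# Shares 0.75 and 0.25 against a uniform share of 0.5: deviation 0.25.
example = uniformity_deviation([0, 0, 0, 1])
```

Equally sized segments give 0, and the value grows as the segment sizes become more unbalanced.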

Datasets: Rheumatism, Osteoporosis, Software journey

[Figure: hard (H) vs. soft (S) ensembles per dataset on Stability, Heterogeneity, Accuracy and Homogeneity; cells marked H>S or S>H]

LC gives smaller segments

[Figure: Soft CCEA, Soft LC, Hard LC and Hard CCEA compared for Rheumatism, Osteoporosis and Software journey]

MIXED EVIDENCE

Fixed Factors

[Figure: High confidence vs. Low confidence conditions; labels "x 10" and "100 100 100 100"]

Stability: SOFT is better

Sim. Index soft > hard; Sim. Index hard > soft

[Figure axes: Strong similarity vs. Weak similarity; High confidence vs. Low confidence]

Homogeneity: SOFT is better

Scatter hard > soft

[Figure axes: Strong similarity vs. Weak similarity; High confidence vs. Low confidence]

Heterogeneity: Hard is better

Tot. Sep. soft > hard

[Figure axes: Strong similarity vs. Weak similarity; High confidence vs. Low confidence]

Size: Hard is better

Uni. dev. soft > hard

[Figure axes: Strong similarity vs. Weak similarity; High confidence vs. Low confidence]

HARD ENSEMBLES GIVE BETTER BUSINESS SEGMENTS

Do we cause rising questions?

Anita Prinzie, Nicole Huyghe

www.solutions2.be

References

• Fred, A. and Jain, A.K. (2005), Combining Multiple Clusterings Using Evidence Accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.

• Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004), Stability-Based Validation of Clustering Solutions, Neural Computation, 16, 1299-1323.

• Halkidi, M., Vazirgiannis, M. and Batistakis, Y. (2000), Quality Scheme Assessment in the Clustering Process, Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, 265-276.

• Hubert, L. and Arabie, P. (1985), Comparing Partitions, Journal of Classification, 193-218.

• Nieweglowski, L. (2007), CLV package, R software.

• Martin, A., Quinn, K.M. and Park, J.H. (2003-2012), Markov Chain Monte Carlo Package (MCMCpack), R software.