Parameter-free Hierarchical Co-Clustering by n-Ary Splits

Dino Ienco, Ruggero G. Pensa and Rosa Meo
{ienco,pensa,meo}@di.unito.it
University of Turin, Italy
Department of Computer Science
ECML-PKDD 2009 – Bled (Slovenia)


Page 1: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

Parameter-free Hierarchical Co-Clustering by n-Ary Splits

Dino Ienco, Ruggero G. Pensa and Rosa Meo
{ienco,pensa,meo}@di.unito.it

University of Turin, Italy
Department of Computer Science

ECML-PKDD 2009 – Bled (Slovenia)

Page 2: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

Outline

- Motivations
- Our Idea
- Co-Clustering and Background
- Hierarchical Co-Clustering
- Results
- Conclusions

Page 3: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

MOTIVATIONS

Co-Clustering:
- An effective approach that obtains interesting results
- Commonly applied to high-dimensional data
- Partitions rows and columns simultaneously

Page 4: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

MOTIVATIONS

Many co-clustering algorithms:
- Spectral approach (Dhillon et al. KDD01)
- Information-theoretic approach (Dhillon et al. KDD03)
- Minimum Sum-Squared Residue approach (Cho et al. SDM04)
- Bayesian approach (Shan et al. ICDM08)

Page 5: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

MOTIVATIONS

All previous techniques:
- require the number of row/column clusters as a parameter
- produce flat partitions, without any structural information

Page 6: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

MOTIVATIONS

In general:
- parameters are difficult to set
- structured outputs (such as hierarchies) help the user understand the data

Hierarchical structures are useful to:
- index and visualize data
- explore parent-child relationships
- derive generalization/specialization concepts

Page 7: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

OUR IDEA

PROPOSED APPROACH:
- Extend a previous flat co-clustering algorithm (Robardet02) to the hierarchical setting

CO-CLUSTERING + HIERARCHICAL APPROACH allows us to build two hierarchies, one on each dimension, SIMULTANEOUSLY

Page 8: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

CO-CLUSTERING: Background

τ-CoClust (Robardet02):
- Co-clustering for count or frequency data
- The number of row/column clusters is not needed
- Maximizes a statistical measure, the Goodman and Kruskal τ, between the row and column partitions

Page 9: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

CO-CLUSTERING

Goodman and Kruskal τ:
- Measures the proportional reduction in the prediction error of a dependent variable given an independent variable

Data matrix (objects O1..O4, features F1..F3):

        F1    F2    F3
O1      d11   d12   d13
O2      d21   d22   d23
O3      d31   d32   d33
O4      d41   d42   d43

Co-cluster table, with row totals TO and column totals TF:

        CF1   CF2
CO1     t11   t12   TO1
CO2     t21   t22   TO2
        TF1   TF2

CO1 = {O1, O2}, CO2 = {O3, O4}
CF1 = {F2}, CF2 = {F1, F3}
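The step from the data matrix to the co-cluster table can be sketched in a few lines of Python. This is only an illustration: it assumes each entry t is the sum of the d values falling in the corresponding (row cluster, column cluster) block, and the function name `contingency_table` and the toy counts are made up for the example.

```python
def contingency_table(data, row_labels, col_labels):
    """Sum the entries of `data` inside each (row cluster, column cluster) block."""
    n_row_clusters = max(row_labels) + 1
    n_col_clusters = max(col_labels) + 1
    table = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            table[row_labels[i]][col_labels[j]] += value
    return table


# Toy 4x3 count matrix (values invented for illustration only).
data = [
    [1, 5, 0],  # O1
    [2, 4, 1],  # O2
    [6, 0, 3],  # O3
    [4, 1, 5],  # O4
]
row_labels = [0, 0, 1, 1]  # CO1 = {O1, O2}, CO2 = {O3, O4}
col_labels = [1, 0, 1]     # CF1 = {F2},     CF2 = {F1, F3}

t = contingency_table(data, row_labels, col_labels)
row_totals = [sum(row) for row in t]        # TO1, TO2
col_totals = [sum(col) for col in zip(*t)]  # TF1, TF2

print(t)           # [[9, 4], [1, 18]]
print(row_totals)  # [13, 19]
print(col_totals)  # [10, 22]
```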

Page 10: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

CO-CLUSTERING

Goodman and Kruskal τ:
- Measures the proportional reduction in the prediction error of a dependent variable given an independent variable

- ECO: prediction error on CO without knowledge of the CF partition
- ECO|CF: prediction error on CO with knowledge of the CF partition
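A small Python sketch may make the two error terms concrete. It uses the textbook definition of the Goodman and Kruskal τ as a proportional reduction in error, τCO|CF = (ECO - ECO|CF) / ECO, computed from a co-cluster table whose rows are the CO clusters and whose columns are the CF clusters. The slide's own formula is not in the transcript, so treat this as the standard form rather than the authors' exact notation; the function name is hypothetical.

```python
def goodman_kruskal_tau(table):
    """tau for predicting the row cluster (CO) from the column cluster (CF)."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # E_CO: error when guessing the row cluster from the row marginals alone.
    e_co = 1.0 - sum((r / total) ** 2 for r in row_totals)
    if e_co == 0.0:
        return 0.0  # a single row cluster: there is nothing to predict

    # E_CO|CF: error when the column cluster is known.
    e_co_given_cf = 1.0 - sum(
        t_ij ** 2 / (total * col_totals[j])
        for row in table
        for j, t_ij in enumerate(row)
        if col_totals[j] > 0
    )
    return (e_co - e_co_given_cf) / e_co


table = [[9, 4], [1, 18]]
transposed = [list(col) for col in zip(*table)]

tau_co_given_cf = goodman_kruskal_tau(table)       # predict CO from CF
tau_cf_given_co = goodman_kruskal_tau(transposed)  # predict CF from CO
```

Transposing the table swaps the roles of the two partitions, which is how the same function yields both τCO|CF and τCF|CO.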

Page 11: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

CO-CLUSTERING

Optimization strategy:

- τ is asymmetric; for this reason the algorithm alternates the optimization of two functions, τCO|CF and τCF|CO

- Stochastic optimization (example on rows):

    # Start with an initial partition on rows
    for i in 1..n_times
        # Augment the current partition with an empty cluster
        # Move one element at random from one cluster to another
        # If the objective function improves, keep the solution; otherwise undo the move
        # If there is an empty cluster, remove it
    end

- This optimization allows the number of clusters to grow or shrink

- (Robardet02) introduced an efficient way to update the objective function incrementally
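The row-side loop can be sketched in Python as follows. This is a simplified illustration, not the τ-CoClust implementation: it recomputes τCO|CF from scratch after every move instead of using the incremental update from (Robardet02), and all function names, the seed, and the toy data are assumptions made for the example.

```python
import random


def aggregate(data, row_labels, col_labels):
    """Sum data entries inside each (row cluster, column cluster) block."""
    table = [[0.0] * (max(col_labels) + 1) for _ in range(max(row_labels) + 1)]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            table[row_labels[i]][col_labels[j]] += value
    return table


def tau_rows_given_cols(table):
    """Goodman-Kruskal tau for predicting the row cluster from the column cluster."""
    total = sum(map(sum, table))
    col_totals = [sum(col) for col in zip(*table)]
    e_rows = 1.0 - sum((sum(row) / total) ** 2 for row in table)
    if e_rows == 0.0:  # a single row cluster: nothing to predict
        return 0.0
    e_rows_given_cols = 1.0 - sum(
        t * t / (total * col_totals[j])
        for row in table for j, t in enumerate(row) if col_totals[j] > 0)
    return (e_rows - e_rows_given_cols) / e_rows


def compact(labels):
    """Renumber labels as 0..k-1, which drops any cluster left empty."""
    mapping = {}
    return [mapping.setdefault(label, len(mapping)) for label in labels]


def optimize_rows(data, row_labels, col_labels, n_times=500, seed=0):
    """Random single-row moves with a fixed column partition; keep improvements only."""
    rng = random.Random(seed)
    labels = compact(row_labels)
    best = tau_rows_given_cols(aggregate(data, labels, col_labels))
    for _ in range(n_times):
        k = max(labels) + 1  # label k plays the role of the added empty cluster
        i = rng.randrange(len(labels))
        target = rng.choice([c for c in range(k + 1) if c != labels[i]])
        candidate = list(labels)
        candidate[i] = target
        candidate = compact(candidate)  # removes a cluster emptied by the move
        score = tau_rows_given_cols(aggregate(data, candidate, col_labels))
        if score > best:  # otherwise the move is simply discarded (undo)
            labels, best = candidate, score
    return labels, best


# Toy data with two obvious row blocks (values invented for illustration).
data = [[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 7, 3], [0, 0, 5, 5]]
labels, best = optimize_rows(data, [0, 0, 0, 0], [0, 0, 1, 1])
```

Because the candidate cluster `k` is always available and emptied clusters are dropped, the number of row clusters is free to grow or shrink during the search.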

Page 12: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

HIERARCHICAL CO-CLUSTERING

HiCC:
- Hierarchical co-clustering algorithm that extends τ-CoClust
- Divisive approach
- No parameter settings needed
- No predefined number of splits for each node of the hierarchy

Page 13: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

HIERARCHICAL CO-CLUSTERING

HiCC:

At the beginning, run τ-CoClust
repeat
    - Start from the current row/column partitions
    - Fix the column partition
    - For each cluster in the row partition: re-cluster with τ-CoClust, optimizing the objective function τCO|CF
    - Update the row hierarchy
    - Fix the new row partition
    - For each cluster in the column partition: re-cluster with τ-CoClust, optimizing the objective function τCF|newCO
    - Update the column hierarchy
until (TERMINATION)
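The alternating structure of the loop can be sketched as a Python skeleton. Several things here are assumptions, not the authors' design: the splitting step is a pluggable function standing in for τ-CoClust, the hierarchy is stored as a list of successively finer partitions, the initial τ-CoClust run is folded into the first iteration, and the loop stops when no cluster splits any further (the slides do not spell out the termination condition).

```python
def refine(partition, split):
    """Re-cluster every cluster of `partition`; a cluster may yield any number of parts."""
    new_partition = []
    for cluster in partition:
        parts = [p for p in split(cluster) if p]
        if len(parts) < 2:
            parts = [cluster]  # the cluster could not be split further
        new_partition.extend(parts)
    return new_partition


def hicc_skeleton(rows, cols, split_rows, split_cols, max_levels=10):
    """Alternate row and column refinement, recording one partition per level."""
    row_part, col_part = [list(rows)], [list(cols)]
    row_levels, col_levels = [row_part], [col_part]
    for _ in range(max_levels):
        # Column partition fixed: refine each row cluster.
        new_rows = refine(row_part, lambda c: split_rows(c, col_part))
        # New row partition fixed: refine each column cluster.
        new_cols = refine(col_part, lambda c: split_cols(c, new_rows))
        if new_rows == row_part and new_cols == col_part:
            break  # assumed termination: nothing changed at this level
        row_part, col_part = new_rows, new_cols
        row_levels.append(row_part)
        col_levels.append(col_part)
    return row_levels, col_levels


def halve(cluster, other_partition):
    """Stand-in splitter (NOT tau-CoClust): cut a cluster into two halves."""
    if len(cluster) < 2:
        return [cluster]
    mid = len(cluster) // 2
    return [cluster[:mid], cluster[mid:]]


row_levels, col_levels = hicc_skeleton(range(4), range(4), halve, halve)
for depth, level in enumerate(row_levels):
    print(depth, level)
```

A real splitter would use the other dimension's partition (the second argument) to optimize τ and could return more than two parts, which is what makes the splits n-ary.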

Page 14: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

HIERARCHICAL CO-CLUSTERING

A SIMPLE EXAMPLE

The procedure goes on … until the termination condition is satisfied

Page 15: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

RESULTS

Experimentation:
- No previous hierarchical co-clustering algorithm exists
- We compare against a flat co-clustering algorithm, run with the same number of clusters obtained by our approach at each level
- We choose the information-theoretic approach (KDD03); for each level we perform 50 runs and compute the average
- We use document-word datasets to validate our approach:
    * OHSUMED (collection of PubMed abstracts) {oh0, oh15}
    * REUTERS-21578 (collected and labeled by Carnegie Group) {re0, re1}
    * TREC (Text Retrieval Conference) {tr11, tr21}

Page 16: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

RESULTS

An example of row hierarchy on OHSUMED

[Figure: row hierarchy; node labels include Enzyme-Activation, Cell-Movement, Adenosine-Diphosphate, Staphylococcal-Infection, Uremia and Memory]

We label each cluster with its majority class

Page 17: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

RESULTS

An example of column hierarchy on REUTERS

oil, compani, opec, gold, ga, barrel, strike, mine, lt, explor

tonne, wheate, sugar, corn, mln, crop, grain, agricultur, usda, soybean

coffee, buffer, cocoa, deleg, consum, ico, stock, quota, icco, produc

oil, opec, tax, price, dlr, crude, bank, industri, energi, saudi

compani, gold, mine, barrel, strike, ga, lt, ounce, ship, explor

tonne, wheate, sugar, corn, grain, crop, agricultur, usda, soybean, soviet

mln, export, farm, ec, import, market, total, sale, trader, trade

quota, stock, produc, meet, intern, talk, bag, agreem, negoti, brazil

coffee, deleg, buffer, cocoa, consum, ico, icco, pact, council, rubber

We label each cluster with its top 10 words ranked by mutual information

Page 18: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

RESULTS

External validation indices:
- Purity
- Normalized Mutual Information (NMI)
- Adjusted Rand Index

Hierarchical setting: we combine the per-level results with a weighted formula (formula not preserved in this transcript), in which:
- one term is one of the external validation indices
- αi is a weight for hierarchy level i; in our case αi = 1/i
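As a rough illustration of this evaluation, the Python sketch below computes purity for one level and then combines per-level scores with weights αi = 1/i. The combining formula itself is missing from the transcript, so the normalized weighted average used here is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(class_labels)


def combine_levels(index_per_level):
    """Weighted average over hierarchy levels 1..N with alpha_i = 1/i (assumed form)."""
    weights = [1.0 / i for i in range(1, len(index_per_level) + 1)]
    return sum(w * v for w, v in zip(weights, index_per_level)) / sum(weights)


# Toy example: two hierarchy levels over four documents (labels invented).
classes = ["a", "a", "a", "b"]
level_1 = [0, 0, 1, 1]
level_2 = [0, 0, 1, 2]

scores = [purity(level_1, classes), purity(level_2, classes)]
overall = combine_levels(scores)
```

With αi = 1/i, the upper levels of the hierarchy weigh more than the deeper ones in the combined score.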

Page 19: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

RESULTS

Performance Results

Page 20: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

RESULTS

Performance Results on re1 dataset

Page 21: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

CONCLUSIONS

We propose:
- A new approach to hierarchical co-clustering
- Parameter-free
- No a priori fixed number of splits (n-ary splits)
- Obtains good results
- Builds hierarchies on both dimensions simultaneously
- Improves the exploration of co-clustering results

Page 22: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

CONCLUSIONS

Future work:
- Parallelize the algorithm to improve time performance
- Push constraints into the algorithm to exploit background knowledge
- Extend the framework to handle continuous data

Page 23: Parameter-free Hierarchical Co-Clustering by n-Ary Splits

Any questions?

Thank you for your attention