Parameter-free Hierarchical Co-Clustering by n-Ary Splits
Dino Ienco, Ruggero G. Pensa and Rosa Meo
{ienco,pensa,meo}@di.unito.it
University of Turin, Italy, Department of Computer Science
ECML-PKDD 2009 – Bled (Slovenia)
Outline
- Motivations
- Our Idea
- Co-Clustering and Background
- Hierarchical Co-Clustering
- Results
- Conclusions
MOTIVATIONS
Co-Clustering:
- Effective approach that obtains interesting results
- Commonly applied to high-dimensional data
- Partitions rows and columns simultaneously
Many co-clustering algorithms:
- Spectral approach (Dhillon et al. KDD01)
- Information theoretic approach (Dhillon et al. KDD03)
- Minimum Sum-Squared Residue approach (Cho et al. SDM04)
- Bayesian approach (Shan et al. ICDM08)
All previous techniques:
- require the number of row/column clusters as a parameter
- produce flat partitions, without any structural information
In general:
- parameters are difficult to set
- structured output (like hierarchies) helps the user to understand data
Hierarchical structures are useful to:
- index and visualize data
- explore the parent-child relationships
- derive generalization/specialization concepts
OUR IDEA
PROPOSED APPROACH:
- Extend a previous flat co-clustering algorithm (Robardet02) to the hierarchical setting
- CO-CLUSTERING + HIERARCHICAL APPROACH allows building two hierarchies on both dimensions SIMULTANEOUSLY
CO-CLUSTERING
Background
τ-CoClust (Robardet02):
- Co-clustering for count or frequency data
- No number of row/column clusters needed
- Maximizes a statistical measure, Goodman and Kruskal's τ, between row and column partitions
Goodman and Kruskal τ:
- Measures the proportional reduction in the prediction error of a dependent variable given an independent variable

Data matrix (objects O, features F):
      F1    F2    F3
O1    d11   d12   d13
O2    d21   d22   d23
O3    d31   d32   d33
O4    d41   d42   d43

Co-cluster contingency table (with row totals TO and column totals TF):
       CF1   CF2
CO1    t11   t12   TO1
CO2    t21   t22   TO2
       TF1   TF2

CO1 = {O1,O2}, CO2 = {O3,O4}
CF1 = {F2}, CF2 = {F1,F3}
Goodman and Kruskal τ as a proportional reduction in prediction error:
τCO|CF = (ECO − ECO|CF) / ECO
- ECO: prediction error on CO without knowledge about the CF partition
- ECO|CF: prediction error on CO with knowledge about the CF partition
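The proportional-reduction-in-error definition can be made concrete with a small sketch. It assumes the standard Goodman and Kruskal formulation, in which ECO = 1 − Σi (TOi/T)² and ECO|CF = 1 − Σij tij² / (T · TFj), with T the grand total of the contingency table. The function name `goodman_kruskal_tau` is illustrative and not taken from the slides.

```python
def goodman_kruskal_tau(table):
    """tau for predicting the row cluster (CO) given the column cluster (CF).

    table[i][j] is the count t_ij shared by row cluster i and column cluster j.
    Assumes the table has a positive grand total.
    """
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]          # TO_i
    col_tot = [sum(col) for col in zip(*table)]    # TF_j
    # E_CO: error when the row cluster is guessed from its marginal only
    e_co = 1.0 - sum((r / total) ** 2 for r in row_tot)
    # E_CO|CF: error when the column cluster is known
    e_co_given_cf = 1.0 - sum(
        t * t / (total * col_tot[j])
        for row in table
        for j, t in enumerate(row)
        if col_tot[j] > 0
    )
    if e_co == 0.0:
        return 0.0  # a single non-empty row cluster: nothing to predict
    return (e_co - e_co_given_cf) / e_co


def transpose(table):
    """Swap the roles of rows and columns, to obtain tau_CF|CO."""
    return [list(col) for col in zip(*table)]
```

Calling `goodman_kruskal_tau(transpose(table))` gives the measure in the other direction, which in general differs from the first one.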
Optimization strategy:
- τ is asymmetric; for this reason the algorithm alternates the optimization of two functions, τCO|CF and τCF|CO
- Stochastic optimization (example on rows):
    # Start with an initial partition on rows
    for i in 1..n_times
        # Augment the current partition with an empty cluster
        # Move one element at random from one cluster to another
        # If the objective function improves, keep the solution; else undo the operation
        # If there is an empty cluster, remove it
    end
- This optimization allows the number of clusters to grow or decrease
- In (Robardet02) an efficient way to incrementally update the objective function was introduced
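The row-side stochastic optimization can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it recomputes τ from scratch after every move instead of using the incremental update of (Robardet02), and the names `tau_rows_given_cols`, `compact`, and `optimize_rows` are hypothetical.

```python
import random


def tau_rows_given_cols(data, row_labels, col_labels):
    """Goodman-Kruskal tau of the row partition given the column partition."""
    n_r, n_c = max(row_labels) + 1, max(col_labels) + 1
    t = [[0.0] * n_c for _ in range(n_r)]
    for i, row in enumerate(data):
        for j, d in enumerate(row):
            t[row_labels[i]][col_labels[j]] += d
    total = sum(map(sum, t))
    row_tot = [sum(r) for r in t]
    col_tot = [sum(c) for c in zip(*t)]
    e_co = 1.0 - sum((x / total) ** 2 for x in row_tot)
    if e_co == 0.0:
        return 0.0
    e_cond = 1.0 - sum(t[a][b] ** 2 / (total * col_tot[b])
                       for a in range(n_r) for b in range(n_c) if col_tot[b] > 0)
    return (e_co - e_cond) / e_co


def compact(labels):
    """Renumber cluster ids as 0..k-1, which drops any empty cluster."""
    remap = {c: k for k, c in enumerate(sorted(set(labels)))}
    return [remap[c] for c in labels]


def optimize_rows(data, row_labels, col_labels, n_times=1000, seed=0):
    rng = random.Random(seed)
    labels = compact(row_labels)
    best = tau_rows_given_cols(data, labels, col_labels)
    for _ in range(n_times):
        k = max(labels) + 1                 # id k is the extra, initially empty cluster
        i = rng.randrange(len(labels))      # pick one row at random
        target = rng.choice([c for c in range(k + 1) if c != labels[i]])
        candidate = list(labels)
        candidate[i] = target
        candidate = compact(candidate)      # remove a cluster left empty by the move
        score = tau_rows_given_cols(data, candidate, col_labels)
        if score > best:                    # keep only improving moves; otherwise undo
            labels, best = candidate, score
    return labels, best
```

The column side is symmetric: transpose `data` and swap the two label lists to optimize τCF|CO.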
HIERARCHICAL CO-CLUSTERING
HiCC:
- Hierarchical Co-Clustering algorithm that extends τ-CoClust
- Divisive Approach
- No parameter settings needed
- No predefined number of splits for each node of the hierarchy
HiCC:
At the beginning use τ-CoClust
repeat
    - From the current Row/Column partitions
    - Fix the Column partition
    - For each cluster in the Row partition: re-cluster with τ-CoClust and optimize the obj. func. τCO|CF
    - Update the Row hierarchy
    - Fix the new Row partition
    - For each cluster in the Column partition: re-cluster with τ-CoClust and optimize the obj. func. τCF|newCO
    - Update the Column hierarchy
until (TERMINATION)
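The control flow of the loop above can be sketched as a skeleton with pluggable steps. Everything named here is hypothetical: `initial`, `split_rows`, and `split_cols` stand in for the τ-CoClust calls, and the termination test (stop when a full pass splits no cluster, or after `max_levels` passes) is an assumption, since the slides leave TERMINATION unspecified.

```python
def hicc_sketch(rows, cols, initial, split_rows, split_cols, max_levels=10):
    """Divisive loop alternating row refinement and column refinement.

    initial(rows, cols)              -> (row_partition, col_partition)
    split_rows(cluster, col_part)    -> list of sub-clusters of `cluster`
    split_cols(cluster, row_part)    -> list of sub-clusters of `cluster`
    Each partition is a list of lists. A splitter may return any number of
    sub-clusters (n-ary split), or [cluster] to leave the cluster unsplit.
    Returns the row and column hierarchies as lists of levels.
    """
    row_part, col_part = initial(rows, cols)
    row_hier, col_hier = [row_part], [col_part]
    for _ in range(max_levels):
        # Column partition fixed: re-cluster every row cluster
        new_rows = [sub for c in row_part for sub in split_rows(c, col_part)]
        # New row partition fixed: re-cluster every column cluster
        new_cols = [sub for c in col_part for sub in split_cols(c, new_rows)]
        if len(new_rows) == len(row_part) and len(new_cols) == len(col_part):
            break  # assumed termination: nothing was split in this pass
        row_part, col_part = new_rows, new_cols
        row_hier.append(row_part)
        col_hier.append(col_part)
    return row_hier, col_hier


def halve(cluster, other_partition):
    """Toy stand-in splitter (NOT tau-CoClust): cut a cluster in two halves."""
    if len(cluster) <= 1:
        return [cluster]
    mid = len(cluster) // 2
    return [cluster[:mid], cluster[mid:]]


def one_cluster_each(rows, cols):
    """Toy stand-in for the initial co-clustering: one cluster per dimension."""
    return [list(rows)], [list(cols)]
```

With the toy splitters, `hicc_sketch([0, 1, 2, 3], [0, 1], one_cluster_each, halve, halve)` produces one partition per level; each level refines the previous one, so parent-child links can be read off by set inclusion.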
A SIMPLE EXAMPLE
Goes on … until the termination condition is satisfied
RESULTS
Experimentation:
- No previous hierarchical co-clustering algorithm exists
- As a baseline, we use a flat co-clustering algorithm with the same number of clusters obtained by our approach at each level
- We chose the information-theoretic approach (KDD03); for each level we perform 50 runs and then compute the average
- We use document-word datasets to validate our approach:
  * OHSUMED (collection of PubMed abstracts) {oh0, oh15}
  * REUTERS-21578 (collected and labeled by Carnegie Group) {re0, re1}
  * TREC (Text Retrieval Conference) {tr11, tr21}
An example of row hierarchy on OHSUMED
[Figure: row hierarchy tree; node labels include Enzyme-Activation, Cell-Movement, Adenosine-Diphosphate, Staphylococcal-Infection, Uremia, Memory]
We label each cluster with the majority class
An example of column hierarchy on REUTERS
oil, compani, opec, gold, ga, barrel, strike, mine, lt, explor
tonne, wheate, sugar, corn, mln, crop, grain, agricultur, usda, soybean
coffee, buffer, cocoa, deleg, consum, ico, stock, quota, icco, produc
oil, opec, tax, price, dlr, crude, bank, industri, energi, saudi
compani, gold, mine, barrel, strike, ga, lt, ounce, ship, explor
tonne, wheate, sugar, corn, grain, crop, agricultur, usda, soybean, soviet
mln, export, farm, ec, import, market, total, sale, trader, trade
quota, stock, produc, meet, intern, talk, bag, agreem, negoti, brazil
coffee, deleg, buffer, cocoa, consum, ico, icco, pact, council, rubber
We label each cluster with top 10 words ranked by mutual information
External validation indices:
- Purity
- Normalized Mutual Information (NMI)
- Adjusted Rand Index
Hierarchical setting:
We combine the per-level results into a single score with a weighted formula [formula not preserved in the transcript], where:
- the per-level term is one of the external validation indices
- αi is a weight for the hierarchy level i; in our case αi is equal to 1/i
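A small sketch of how a per-level index and the level weights could be combined. Purity is computed in its usual form (the fraction of items that belong to the majority class of their cluster). The combination itself is an assumption: the formula on the slide is not preserved, so this sketch uses a weighted average normalized by the sum of the weights, with αi = 1/i as stated. The names `purity` and `combine_levels` are illustrative.

```python
from collections import Counter


def purity(clusters, labels):
    """clusters: list of lists of item ids; labels: dict item id -> class."""
    n = sum(len(c) for c in clusters)
    majority = sum(
        Counter(labels[x] for x in c).most_common(1)[0][1]
        for c in clusters
        if c  # skip empty clusters
    )
    return majority / n


def combine_levels(scores):
    """Weighted average of per-level scores, level i weighted by 1/i (i from 1).

    Dividing by the sum of the weights is an assumption of this sketch.
    """
    weights = [1.0 / i for i in range(1, len(scores) + 1)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

With weights 1/i, the upper levels of the hierarchy count more than the deeper ones; any other external index (NMI, Adjusted Rand Index) could be passed to `combine_levels` in place of purity.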
Performance Results
Performance Results on re1 dataset
CONCLUSIONS
We propose:
- A new approach to hierarchical co-clustering
- Parameter-free
- No a priori fixed number of splits (n-ary splits)
- Obtains good results
- Simultaneously builds hierarchies on both dimensions
- Improves the exploration of co-clustering results
Future work:
- Parallelize the algorithm to improve time performance
- Push constraints into the algorithm to exploit background knowledge
- Extend the framework to manage continuous data
Any questions?
Thank you for your attention