A Two-Way Visualization Method for Clustered Data




Page 1: A Two-Way Visualization Method for Clustered Data

Intelligent Database Systems Lab

國立雲林科技大學 (National Yunlin University of Science and Technology)

Advisor: Dr. Hsu

Presenter: Keng-Wei Chang

Authors: Yehuda Koren and David Harel

A Two-Way Visualization Method for Clustered Data

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Page 2: A Two-Way Visualization Method for Clustered Data

Outline

Motivation

Objective

Introduction

Basic Notions

Computing the x-Coordinates

Computing the y-Coordinates

Result

Related Work

Conclusions

Personal Opinion

Page 3: A Two-Way Visualization Method for Clustered Data

Motivation

A number of technological developments have led to an explosion of raw data that has to be analyzed

We are especially interested in two families of tools in this domain:

Clustering algorithms and data visualization methods

Page 4: A Two-Way Visualization Method for Clustered Data

Objective

In this paper, we integrate the two approaches:

Hierarchical clustering depicted as a dendrogram

Low-dimensional embedding

Page 5: A Two-Way Visualization Method for Clustered Data

Introduction

A number of technological developments have led to an explosion of raw data that has to be analyzed

We are especially interested in two families of tools in this domain:

Clustering algorithms and data visualization methods

Clustering methods can be broadly classified as hierarchical or partitional

Page 6: A Two-Way Visualization Method for Clustered Data

Introduction

Our main interest here is hierarchical clustering

The clustering hierarchy is often visualized as a dendrogram

A full binary tree

The dendrogram has a significant disadvantage: it does not provide an exploratory visual representation of the data itself

Another issue is that of cluster validity

Page 7: A Two-Way Visualization Method for Clustered Data

Introduction

We are particularly interested in methods for achieving a low-dimensional embedding of data:

Principal component analysis (PCA)

Multidimensional scaling (MDS)

Force-directed placement

These methods solve some limitations of the dendrogram, but they cannot utilize external clustering information

Page 8: A Two-Way Visualization Method for Clustered Data

Introduction

A demonstration of the relative merits of the two approaches:

A dendrogram vs. a low-dimensional embedding

Page 9: A Two-Way Visualization Method for Clustered Data

Introduction

In this paper, we integrate the two approaches:

Hierarchical clustering depicted as a dendrogram

Low-dimensional embedding

Page 10: A Two-Way Visualization Method for Clustered Data

Basic Notions

We are given data about n elements {1,…,n}

Relationships between pairs of elements are described by distances $d_{ij} \ge 0$ or similarities $w_{ij} \ge 0$

A 2-dimensional embedding of the data is defined by two vectors $x, y \in \mathbb{R}^n$

The coordinates of element i are $(x_i, y_i)$

Page 11: A Two-Way Visualization Method for Clustered Data

Computing The x-Coordinates

The embedding must place each element exactly below its corresponding leaf in the dendrogram

This means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram

We therefore face the problem of computing the x-coordinates of the dendrogram leaves

in a way that preserves the relationships among the data as much as possible

Page 12: A Two-Way Visualization Method for Clustered Data

Computing The x-Coordinates

We exhaust all the existing methods, opting for a twofold process:

Find the best orientation of the dendrogram; this step determines the ordering of the leaves

Decide on the exact gaps between consecutive leaves in the ordering

Page 13: A Two-Way Visualization Method for Clustered Data

Dendrogram orientation

A dendrogram with n leaves has $2^{n-1}$ different orientations, since each of its n-1 internal nodes can be drawn as-is or flipped

Example (figure):
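The count can be checked on small trees with a short recursion. This is only an illustrative sketch: the nested-pair representation of a dendrogram (an `int` for a leaf, a 2-tuple for an internal node) and the function name `count_orientations` are assumptions made for this example, not notation from the slides.

```python
def count_orientations(tree):
    """Number of ways to draw a binary dendrogram by flipping internal nodes.

    A tree is an int (a leaf) or a pair (left, right) of subtrees.
    Each internal node doubles the count: children as drawn, or swapped.
    """
    if isinstance(tree, int):
        return 1
    left, right = tree
    return 2 * count_orientations(left) * count_orientations(right)

balanced = ((0, 1), (2, 3))      # hypothetical 4-leaf dendrogram
chain = (((0, 1), 2), 3)         # another 4-leaf shape

print(count_orientations(balanced))  # 8 == 2 ** (4 - 1)
print(count_orientations(chain))     # 8 == 2 ** (4 - 1)
```

Both shapes have three internal nodes, so the result depends only on the number of leaves, not on how balanced the tree is.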

Page 14: A Two-Way Visualization Method for Clustered Data

Dendrogram orientation

One way of formally defining what should be considered a “good” ordering:

Associate a cost function with the dendrogram, such that finding the best ordering is equivalent to optimizing this function

This can be the classical minimum linear arrangement problem, which minimizes:

$$\mathrm{LA}^{\mathrm{sim}}(x) \overset{\mathrm{def}}{=} \sum_{i,j} w_{ij}\,|x_i - x_j|$$
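As a rough illustration of this cost, the sketch below evaluates the similarity-based linear arrangement cost for a given left-to-right ordering. The function name `la_sim`, the toy matrix `w`, and the choice to count each unordered pair once are assumptions made for this example.

```python
def la_sim(w, order):
    """Linear-arrangement cost of an ordering under similarities w.

    w is a symmetric matrix (list of lists) over elements 0..n-1; order lists
    the elements from left to right, so element order[k] gets position k + 1.
    Assumption: each unordered pair {i, j} is counted once.
    """
    pos = {elem: k + 1 for k, elem in enumerate(order)}
    n = len(order)
    return sum(
        w[i][j] * abs(pos[i] - pos[j])
        for i in range(n)
        for j in range(i + 1, n)
    )

# Toy similarities: elements 0 and 1 are strongly related, 1 and 2 weakly.
w = [
    [0, 5, 0],
    [5, 0, 1],
    [0, 1, 0],
]

print(la_sim(w, [0, 1, 2]))  # 5*1 + 1*1 = 6
print(la_sim(w, [0, 2, 1]))  # 5*2 + 1*1 = 11
```

Keeping the strongly related pair adjacent gives the lower cost, which is the behaviour the minimization is meant to reward.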

Page 15: A Two-Way Visualization Method for Clustered Data

Dendrogram orientation

In our particular problem we are also faced with an ordering task: finding a permutation of {1, …, n}

However, here we should not consider all possible permutations, but only those that agree with the dendrogram’s structure

This reduces the number of candidate orderings from $n!$ to $2^{n-1}$

Using dynamic programming, the running time is exponential in the dendrogram’s height, not in its size
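To make the restricted search space concrete, here is an exhaustive baseline that enumerates only the leaf orders compatible with a dendrogram and picks the cheapest one. It is not the dynamic-programming algorithm mentioned on the slide, just a brute-force sketch for tiny inputs; the nested-pair tree representation, the helper names `leaf_orders` and `cost`, and the toy matrix are all assumptions.

```python
from itertools import product

def leaf_orders(tree):
    """Yield every leaf order that agrees with the dendrogram.

    A tree is an int (leaf) or a pair (left, right); each internal node may
    be drawn as-is or flipped.
    """
    if isinstance(tree, int):
        yield (tree,)
        return
    left, right = tree
    for a, b in product(list(leaf_orders(left)), list(leaf_orders(right))):
        yield a + b  # children as drawn
        yield b + a  # children flipped

def cost(w, order):
    """Sum of w[i][j] * |x_i - x_j| over unordered pairs (positions 1..n)."""
    pos = {leaf: k + 1 for k, leaf in enumerate(order)}
    return sum(
        w[i][j] * abs(pos[i] - pos[j])
        for i in pos for j in pos if i < j
    )

tree = ((0, 1), (2, 3))   # hypothetical dendrogram
w = [                     # leaves 0 and 3 are strongly similar
    [0, 1, 0, 5],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [5, 0, 1, 0],
]

orders = list(leaf_orders(tree))
best = min(orders, key=lambda o: cost(w, o))

print(len(orders))            # 8 candidates instead of 4! = 24
print(cost(w, (0, 1, 2, 3)))  # 17 for the order as drawn
print(cost(w, best))          # 7 once leaves 0 and 3 are adjacent
```

Enumeration grows as $2^{n-1}$, so it is only useful for checking a faster method on small trees.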

Page 16: A Two-Way Visualization Method for Clustered Data

Dendrogram orientation

Introduce an additional form of the cost function, based on distances, which the best ordering maximizes:

$$\mathrm{LA}^{\mathrm{dist}}(x) \overset{\mathrm{def}}{=} \sum_{i,j} d_{ij}\,|x_i - x_j|$$

Page 17: A Two-Way Visualization Method for Clustered Data

Dendrogram orientation

Given an ordered dendrogram T and a node v

Leaves(v): the set of leaves in the subtree rooted at v

Let x be the ordering on the leaves

Let S be Leaves(v), L be the set of leaves to the left of S, and R be the set of leaves to the right of S

If |L| = l and |S| = s, we have x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}

Page 18: A Two-Way Visualization Method for Clustered Data

Dendrogram orientation

A key concept of the algorithm is the local arrangement cost, defined as:

$$\mathrm{LocalLA}^T(v) \overset{\mathrm{def}}{=} \sum_{i,j \in S} w_{ij}\,|x_i - x_j| + \sum_{i \in S,\, j \in L} w_{ij}\,(x_i - l) + \sum_{i \in S,\, j \in R} w_{ij}\,(l + s - x_i) + \sum_{i \in L,\, j \in R} w_{ij} \cdot s$$

If |L| = l and |S| = s, we have x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}

Page 19: A Two-Way Visualization Method for Clustered Data

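Dendrogram orientation

Two additional related terms will be used:

$$\mathrm{LeftCut}^T(v) \overset{\mathrm{def}}{=} \sum_{i \in S,\, j \in L} w_{ij}, \qquad \mathrm{RightCut}^T(v) \overset{\mathrm{def}}{=} \sum_{i \in S,\, j \in R} w_{ij}$$

Another term that will be used in the algorithm:

$$\mathrm{InnerCut}(v) = \sum_{i \in \mathrm{Leaves}(v.\mathrm{left}),\ j \in \mathrm{Leaves}(v.\mathrm{right})} w_{ij}$$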

Page 20: A Two-Way Visualization Method for Clustered Data


Page 21: A Two-Way Visualization Method for Clustered Data

Determining coordinates of the leaves

Computing the exact gaps between each two consecutive leaves

Example (figure):

Page 22: A Two-Way Visualization Method for Clustered Data

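Determining coordinates of the leaves

A better approach is to take a weighted average over all influenced leaf pairs

(Equation: the gap between two consecutive leaves, expressed as a weighted average of the distances $d_{jk}$ over the influenced leaf pairs; the exact form is not recoverable from this transcript.)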
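A minimal sketch of the idea of averaging over all leaf pairs that straddle a gap. The weighting used here is an assumption chosen for illustration and should not be read as the slide's formula: every pair of positions (j, k) with j at or left of the gap and k right of it estimates the gap as its distance divided by the number of gaps it spans, k - j, and that estimate is weighted by 1/(k - j) so that nearby pairs count more. The names `leaf_gaps` and `x_coordinates` are hypothetical.

```python
def leaf_gaps(d, order):
    """Gap between each pair of consecutive leaves in `order`.

    Assumed weighting (not from the slides): each straddling pair (j, k)
    votes d / (k - j) for the gap, with weight 1 / (k - j).
    """
    n = len(order)
    gaps = []
    for i in range(n - 1):                 # gap between positions i and i + 1
        num = den = 0.0
        for j in range(i + 1):             # positions at or left of i
            for k in range(i + 1, n):      # positions right of i
                span = k - j
                weight = 1.0 / span
                num += weight * d[order[j]][order[k]] / span
                den += weight
        gaps.append(num / den)
    return gaps

def x_coordinates(d, order):
    """Place the first leaf at 0 and accumulate the gaps."""
    xs = [0.0]
    for g in leaf_gaps(d, order):
        xs.append(xs[-1] + g)
    return dict(zip(order, xs))

points = [0.0, 2.0, 4.0, 6.0]              # hypothetical evenly spaced 1-D data
d = [[abs(p - q) for q in points] for p in points]

print(leaf_gaps(d, [0, 1, 2, 3]))          # each gap is close to 2
print(x_coordinates(d, [0, 1, 2, 3]))
```

For evenly spaced data every pair votes for the same gap, so the recovered x-coordinates match the original positions up to floating-point error; for uneven data the averaging smooths local gaps toward longer-range distances.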

Page 23: A Two-Way Visualization Method for Clustered Data

Computing The y-Coordinates

Principal component analysis

Classical multidimensional scaling

Eigen-projection

Stress minimization

Page 24: A Two-Way Visualization Method for Clustered Data

Result

Odors dataset: consists of 30 volatile odorous pure chemicals

Contains 262 elements; natural clusters: 30

UPGMA agglomerative clustering is used to construct the dendrogram
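For readers unfamiliar with UPGMA, here is a naive sketch of average-linkage agglomerative clustering that returns a dendrogram as nested pairs. It is an illustration under stated assumptions, not the code used for the experiments: the function name `upgma`, the nested-pair output, and the toy 1-D data are hypothetical, and the quadratic pair search at every step makes it suitable only for small inputs.

```python
def upgma(d):
    """Build a dendrogram from a distance matrix by average linkage.

    At each step the two clusters with the smallest mean pairwise distance
    between their members are merged. Leaves are ints; internal nodes are
    (left, right) pairs.
    """
    clusters = [(i, [i]) for i in range(len(d))]   # (subtree, member leaves)

    def avg(a, b):
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        p, q = min(
            ((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
            key=lambda pq: avg(clusters[pq[0]][1], clusters[pq[1]][1]),
        )
        (ta, ma), (tb, mb) = clusters[p], clusters[q]
        merged = ((ta, tb), ma + mb)
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0][0]

points = [0.0, 1.0, 5.0, 6.5]                      # hypothetical 1-D data
d = [[abs(p - q) for q in points] for p in points]

print(upgma(d))  # ((0, 1), (2, 3))
```

The two close pairs are merged first and joined at the root, which is the hierarchy a dendrogram of this data would show.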

Page 25: A Two-Way Visualization Method for Clustered Data

Result

Iris dataset: an example of discriminant analysis

Contains 150 elements; natural clusters: 3

Page 26: A Two-Way Visualization Method for Clustered Data

Result

Gene expression data: CDC15-synchronized cell cycle

A much larger dataset of gene-expression data

Contains 6113 elements

Page 27: A Two-Way Visualization Method for Clustered Data

Related Work

TreeView: a dendrogram over a color-coded matrix

Page 28: A Two-Way Visualization Method for Clustered Data

Discussion

Success in integrating two key methods in exploratory data analysis:

cluster analysis and low-dimensional embedding

Two unique properties:

Guaranteed separation between any kind of given clusters

The ability to deal with a predefined hierarchical clustering

Page 29: A Two-Way Visualization Method for Clustered Data

Personal Opinion

Advantages

- Successfully integrates the two approaches (clustering and low-dimensional embedding)

- Makes analysis more intuitive

Application

- Clustering and analyzing real data

- May solve the problem of lacking clustering information

Limitations

- Cannot show the real shape of clusters