RadViz Extensions with Applications

Dissertation Defense

John Sharko

October 26, 2009

Committee

• Prof. Georges Grinstein (Advisor)
• Prof. Kenneth Marx
• Prof. Haim Levkowitz
• Dr. Patrick Hoffman
• Dr. Alex Gee

Outline

• Introduction
  – RadViz
  – Cluster Ensembles
  – Fuzzy Clusters
• Methodology
• Contributions
• Recommendations

RadViz Example

Description of Traditional RadViz

Each dimension in a dataset is represented by a point, called an anchor point, on the circumference of a circle.

Each record in the dataset is positioned as if it were pulled by a spring attached to each anchor point, where the strength of each spring is proportional to the record’s value for the dimension associated with that anchor point.

RadViz Example: All Coordinate Values Equal

RadViz Example: Two Coordinate Values Equal

RadViz Example: Range of Coordinate Values

Terminology

• Dimensional Anchor (Anchor Point)
  – point on the circle representing a dimension
• Point
  – representation of record(s) within the circle

RadViz Mathematical Formulation

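$$x_i = \frac{\sum_{j=1}^{d} a_{i,j}\cos\theta_j}{\sum_{j=1}^{d} a_{i,j}}, \qquad i = 1, \ldots, n$$

$$y_i = \frac{\sum_{j=1}^{d} a_{i,j}\sin\theta_j}{\sum_{j=1}^{d} a_{i,j}}, \qquad i = 1, \ldots, n$$

where:
• x_i and y_i are the resulting transformed coordinates for record i
• θ_j is the angular position on the circle corresponding to dimension j
• a_{i,j} is the value for dimension j for record i
• d is the number of dimensions and n is the number of records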
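The formulation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `radviz_project`, the evenly spaced default anchor angles, and the convention of placing an all-zero record at the center are assumptions made for the example.

```python
import math

def radviz_project(records, angles=None):
    """Map each record (a sequence of non-negative values) to an (x, y) point."""
    d = len(records[0])
    if angles is None:
        # Assumption: anchors are evenly spaced around the unit circle.
        angles = [2 * math.pi * j / d for j in range(d)]
    points = []
    for row in records:
        total = sum(row)
        if total == 0:
            # Assumption: an all-zero record is drawn at the center.
            points.append((0.0, 0.0))
            continue
        # Weighted average of the anchor positions, weights a_ij.
        x = sum(a * math.cos(t) for a, t in zip(row, angles)) / total
        y = sum(a * math.sin(t) for a, t in zip(row, angles)) / total
        points.append((x, y))
    return points
```

With four evenly spaced anchors, a record whose values are all equal lands at the center, and the record (1, 0, 1, 0) also lands at the center because its two non-zero dimensions sit opposite each other. Passing `angles=[0, math.pi, math.pi / 2, 3 * math.pi / 2]`, which exchanges the second and third anchors, moves that same record away from the center.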

Impact of Exchanging Dimensional Anchors

(1, 0, 1, 0)

Example of Repositioning Anchor Points Using Layout Algorithm

Before repositioning After repositioning

Multiple Clustered Datasets

• Clustering algorithms are heuristic, not optimal

• Different clustering algorithms tend to generate different clusters

Sample Multiple Clustered Dataset

Record   Algorithm A   Algorithm B   Algorithm C
1        1             2             3
2        1             2             3
3        1             2             1
4        2             1             1
5        2             3             2
6        2             3             4

Stable Group of Records

Record   Algorithm A   Algorithm B   Algorithm C
1        1             2             3
2        1             2             3
3        1             2             1
4        2             1             1
5        2             3             2
6        2             3             4
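One way to find such groups programmatically is sketched below. It assumes that a "stable group" means a set of records that every algorithm places in the same cluster. The slide does not spell out the definition, and the function name `stable_groups` is illustrative.

```python
from collections import defaultdict

def stable_groups(assignments):
    """Group record ids whose cluster labels agree in every clustering."""
    groups = defaultdict(list)
    for record_id, labels in assignments.items():
        # Records with identical label tuples were co-clustered by all algorithms.
        groups[tuple(labels)].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Cluster labels from Algorithms A, B, and C for the six sample records.
ensemble = {
    1: (1, 2, 3),
    2: (1, 2, 3),
    3: (1, 2, 1),
    4: (2, 1, 1),
    5: (2, 3, 2),
    6: (2, 3, 4),
}
print(stable_groups(ensemble))  # [[1, 2]]
```

Under this assumption, records 1 and 2 form the only stable group in the sample table: record 3 joins them under Algorithms A and B but not under Algorithm C.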

Uniquely Clustered Record

Record   Algorithm A   Algorithm B   Algorithm C
1        1             2             3
2        1             2             3
3        1             2             1
4        2             1             1
5        2             3             2
6        2             3             4

Fuzzy Clusters

• A record belongs to multiple clusters
• Varying strengths of association

Record   Cluster 1   Cluster 2   Cluster 3   Cluster 4
1        .8          .1          .1          0
2        .5          .4          0           .1
3        .3          .2          .3          .2
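As a rough sketch of how membership strengths like these could be placed in a RadViz display, each cluster can be treated as a dimensional anchor and each membership as the spring strength toward it. The anchor placement (evenly spaced, Cluster 1 at angle 0) and the name `fuzzy_to_radviz` are assumptions for illustration; the slides do not give the exact procedure used in the dissertation.

```python
import math

# Membership strengths from the table above (each row sums to 1).
memberships = {
    1: (0.8, 0.1, 0.1, 0.0),
    2: (0.5, 0.4, 0.0, 0.1),
    3: (0.3, 0.2, 0.3, 0.2),
}

def fuzzy_to_radviz(u):
    """Place a record at the membership-weighted average of the cluster anchors."""
    k = len(u)
    total = sum(u)  # equals 1 for normalized memberships
    x = sum(w * math.cos(2 * math.pi * j / k) for j, w in enumerate(u)) / total
    y = sum(w * math.sin(2 * math.pi * j / k) for j, w in enumerate(u)) / total
    return x, y

positions = {rec: fuzzy_to_radviz(u) for rec, u in memberships.items()}
```

In this layout record 1, which is strongly associated with Cluster 1, is pulled toward that anchor, while record 3, whose memberships balance across opposite anchors, falls at the center.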

Cluster Ensemble vs. Fuzzy Clustering

Cluster ensemble:
• Multiple instances of hard clustering
• Each record is assigned to one cluster in each instance

Fuzzy cluster:
• A record belongs to multiple clusters
• Varying strengths of association

Using RadViz to Analyze Multiple Clustered Datasets

• RadViz typically deals with real numbers

• Cluster numbers do not work as coordinate values: they are arbitrary labels, not magnitudes

• How do you produce a meaningful RadViz visualization?

Flattening of Categorical Data

• Break up each original dimension into multiple dimensions

• Each new dimension represents a value of the original dimension

Flattening a Dimension

Original

Manufacturer

Flattened

Manufacturer

Sporty

Original Record: (Cadillac, Deville, Large, 33)

Flattened Record: (Cadillac, Deville, 0, 1, 0, 0, 33)

Flattening Multi Cluster Dataset

Original Dimensions: Algorithm A, Algorithm B, Algorithm C

Flattened Dimensions: one indicator dimension per cluster, grouped by algorithm (A, B, C)

Sample Record: (2, 1, 4) → (0, 1, 1, 0, 0, 0, 0, 0, 1)
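The sample record on this slide can be reproduced with a short sketch. It assumes Algorithms A, B, and C produced 2, 3, and 4 clusters respectively, which is inferred from the nine flattened values; `flatten_record` and `cluster_counts` are illustrative names.

```python
def flatten_record(record, cluster_counts):
    """Replace each cluster number with a block of 0/1 indicator values."""
    flat = []
    for label, k in zip(record, cluster_counts):
        indicator = [0] * k
        indicator[label - 1] = 1  # cluster numbers are 1-based
        flat.extend(indicator)
    return tuple(flat)

# Record assigned to cluster 2 by A, cluster 1 by B, and cluster 4 by C.
print(flatten_record((2, 1, 4), (2, 3, 4)))  # (0, 1, 1, 0, 0, 0, 0, 0, 1)
```

Each flattened dimension now holds a 0 or 1, so it can act as a spring strength in RadViz, which the raw cluster numbers could not.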

Simple Example

• Iris dataset
• Three cluster sets
  – KM1: K-means clustering with 1000 iterations
  – KM2: K-means clustering with 100,000 iterations
  – HC: hierarchical clustering
• Ten clusters per cluster set

Flattened Multi-cluster Iris Dataset

[Figure: points colored by KM1 color scale]

Flattened Multi-cluster Iris Dataset - Jittered

[Figure: points colored by KM1 color scale]

Flattened Multi-cluster Iris Dataset

[Figure: points colored by KM1 color scale]

Repositioning Dimensional Anchors

• Move points away from the center

• Separate points

• Increase displayed information content

Class Discrimination Layout Algorithm

• Select a dimension that classifies the records
• Assign each dimension to the class with the highest values with respect to the other classes
• Move the dimensional anchors assigned to the same class next to each other to form a classification sector
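The three steps can be sketched as follows. This is a simplified illustration under a stated assumption: "highest values" is approximated here by the highest class mean for each dimension, which may differ from the statistic used in the dissertation. The function name `class_discrimination_order` is hypothetical.

```python
from collections import defaultdict

def class_discrimination_order(records, classes):
    """Return an anchor ordering that groups dimensions into class sectors."""
    d = len(records[0])
    sums = defaultdict(lambda: [0.0] * d)
    counts = defaultdict(int)
    for row, c in zip(records, classes):
        counts[c] += 1
        for j, v in enumerate(row):
            sums[c][j] += v
    labels = sorted(counts)
    # Assumption: a dimension belongs to the class with the highest mean value.
    owner = [max(labels, key=lambda c: sums[c][j] / counts[c]) for j in range(d)]
    # Dimensions owned by the same class become adjacent: a classification sector.
    sectors = {c: [j for j in range(d) if owner[j] == c] for c in labels}
    order = [j for c in labels for j in sectors[c]]
    return order, sectors

records = [[5, 1, 4, 0], [4, 0, 5, 1], [0, 6, 1, 5], [1, 5, 0, 4]]
classes = ["a", "a", "b", "b"]
order, sectors = class_discrimination_order(records, classes)
print(order)    # [0, 2, 1, 3]
print(sectors)  # {'a': [0, 2], 'b': [1, 3]}
```

In the toy data, dimensions 0 and 2 are high for class "a" and dimensions 1 and 3 are high for class "b", so the anchors are reordered to place each pair side by side.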

Example of Class Discrimination Layout Algorithm

[Figure: Before and After panels; labels: Class, Classification Sector 1, Classification Sector 2]

[Figure legend: KM1 Cluster Size, with point sizes for 30, 20, 10, and 5 records]

After Repositioning Dimensional Anchors

[Figure sequence, each view with the same KM1 Cluster Size legend; one view is labeled KM1-2]

Moving Similar Classification Sectors Close to Each Other

• Dimensions have been grouped together into classification sectors
• Determine which record classes are most similar to each other using Euclidean distances
• Move those dimension sectors closer to each other using a greedy algorithm
• Records will tend to be moved away from the center
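A minimal sketch of one possible greedy ordering follows. The slide does not specify the greedy rule, so this example assumes a nearest-neighbor chain: start from one class and repeatedly append the remaining class whose centroid is closest in Euclidean distance. The name `order_sectors` and the use of class centroids are assumptions.

```python
import math

def order_sectors(centroids):
    """Order classes so that each sector is followed by its nearest remaining class."""
    remaining = sorted(centroids)
    order = [remaining.pop(0)]  # assumption: start from the first class label
    while remaining:
        last = centroids[order[-1]]
        # Greedy step: pick the unplaced class with the smallest Euclidean distance.
        nearest = min(remaining, key=lambda c: math.dist(last, centroids[c]))
        remaining.remove(nearest)
        order.append(nearest)
    return order

centroids = {"A": (0, 0), "B": (10, 0), "C": (1, 0), "D": (9, 0)}
print(order_sectors(centroids))  # ['A', 'C', 'D', 'B']
```

A greedy chain like this does not guarantee the best overall arrangement; it only ensures that each sector sits next to a similar one. The optimum position of the sectors is listed under Recommendations as an open question.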

[Figure: points colored by KM1 color scale; anchor labels: HC-8, HC-7, KM2-3, KM2-8, KM2-1, HC-10, KM2-9, KM1-6, HC-2, KM2-4, KM2-6, HC-1, KM1-9, KM2-7, HC-4, KM1-10, HC-5, KM2-10, KM1-3, KM2-5, KM1-7]

Repositioning Classification Sectors

Interpreting Vectorized RadViz

[Figure labels: Sepal length, Petal length; classes: Setosa, Versicolor, Virginica]

Interpreting VRV

[Figure sequence: four further views, each labeled Sepal length and Petal length]

Salamander Gene Expression Levels

Salamander Class 9 Genes

[Figure labels: Nvg00226, Nvg00155, Nvg00111, Nvg00091]

Salamander Class 9 Genes

• Nvg00111
  – “Key” gene
  – CXC chemokine, ligand 10
• Nvg00226
  – No homology
• Nvg00155
  – Keratin type II cytoskeletal
• Nvg00091
  – Annexin

Fuzzy Clusters

Description of Fuzzy Clusters

• K-means clustering algorithm used
• Four clusters
• Applied to Iris dataset

[Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4; Setosa, Versicolor, Virginica; Outlier; Area of Versicolor and Virginica overlap]

RadViz Visualization of Fuzzy Clusters

Scatterplot Visualization of Iris Dataset

[Figure labels recoverable from these two slides: Sepal Length; Setosa, Versicolor, Virginica, Outlier; Cluster 1, Cluster 2, Cluster 3, Cluster 4]

Comparing Visualizations of Fuzzy Clusters

[Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4; Setosa, Versicolor, Virginica; Virginica outlier, Overlap, Central]

RadViz Visualization of Fuzzy Clusters

[Figure labels: Setosa, Versicolor, Virginica; Virginica outlier, Overlap, Central]

Key to dimension labeling: Cluster Set-Cluster Number, e.g. KM1-3 is K-means set 1, cluster number 3.

Vectorized RadViz Visualization of Iris Cluster Ensemble

[Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4]

Comparison of RadViz Visualizations

[Figure: two panels, Fuzzy Clusters and Cluster Ensemble - VRV; labels: Virginica outlier; Cluster 1, Cluster 2, Cluster 3, Cluster 4]

Comparison of RadViz Visualizations

[Figure: two panels, Fuzzy Clusters and Cluster Ensemble - VRV; label: Central]

RV Visualization of Fuzzy Clusters: Newt Microarray Dataset

[Figure labels: Group A, Group B, Group C]

VRV Visualization of Cluster Ensemble: Newt Microarray Dataset

[Figure labels: Group A, Group B, Group C]

Decision Trees

Decision to Play Tennis

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
1     Sunny      Hot           High       Weak     No
2     Sunny      Hot           High       Strong   No
3     Overcast   Hot           High       Weak     Yes
4     Rain       Mild          High       Weak     Yes
5     Rain       Cool          Normal     Weak     Yes
6     Rain       Cool          Normal     Strong   No
7     Overcast   Cool          Normal     Strong   Yes
8     Sunny      Mild          High       Weak     No
9     Sunny      Cool          Normal     Weak     Yes
10    Rain       Mild          Normal     Weak     Yes
11    Sunny      Mild          Normal     Strong   Yes
12    Overcast   Mild          High       Strong   Yes
13    Overcast   Hot           Normal     Weak     Yes
14    Rain       Mild          High       Strong   No

VRV Applied to an Ordered Numerical Dataset

Adult Income Dataset

Income category (<$50,000, >$50,000) as a function of:

• Age
• Work class
• Education
• Marital status
• Occupation
• Relationship
• Race
• Gender
• Capital gain
• Capital loss
• Hours per week
• Native country

VRV Applied to the Adult Dataset

< $50,000

> $50,000

VRV as a Classifier

• < $50,000: 48% correct
• > $50,000: 89% correct

Records Predicted as High Income: Moderate Case

Records Predicted as High Income: Extreme Case

Summary of Results of VRV Classification of Adult Dataset

[Bar chart: Percent Correct (axis 0 to 90) by Income Category (<=50K, >50K, Total); series: Split in half, Increased low income, Extreme low income]

Summary of Results of VRV Classification of Adult Dataset Compared to J48 Algorithm

[Bar chart: Percent Correct (axis 0 to 90) by Income Category (<=50K, >50K, Total); series: Split in half, Increased low income, Extreme low income]

J48 Classification Algorithm

Problems Binning Quantitative Data

Source: Iris dataset

Contributions

1. Vectorized RadViz
   1. Application to cluster ensembles provides the capability to visually simulate the identification of stable and unstable clusters.
   2. Identified several methods to evaluate the stability of clusters using characteristics of a VRV visualization.
   3. Improved the dimensional anchor layout algorithm by moving classification sectors.
   4. Used RV to visualize decision trees.
   5. Identified problems when applied to ordered numerical data.
   6. Successfully applied to microarray data.

Contributions (cont’d)

2. Fuzzy Clusters
   1. Developed a method to visualize fuzzy clusters using RV.
   2. Developed a method to visually compare results of fuzzy clusters and cluster ensembles applied to the same dataset.

Recommendations

• Adding information to plotted points
• Ordering of dimensions within classification sectors
• Selection of base classifier
• Investigate visualization of complex decision trees
• Investigate the optimum position of the classification sectors

Thank you
