radviz extensions with applications
Post on 24-Feb-2016
RadViz Extensions with Applications
Dissertation Defense
John Sharko
October 26, 2009
Committee
• Prof. Georges Grinstein (Advisor)
• Prof. Kenneth Marx
• Prof. Haim Levkowitz
• Dr. Patrick Hoffman
• Dr. Alex Gee
Outline
• Introduction
  – RadViz
  – Cluster Ensembles
  – Fuzzy Clusters
• Methodology
• Contributions
• Recommendations
RadViz Example
Description of Traditional RadViz
Each dimension in a dataset is represented by a point, called an anchor point, on the circumference of a circle.
Each record in the dataset is positioned as if it were being pulled by a spring attached to each anchor point where the strength of the spring is proportional to that record’s coordinate or value for the dimension related to that anchor point.
RadViz Example: All Coordinate Values Equal
RadViz Example: Two Coordinate Values Equal
RadViz Example: Range of Coordinate Values
Terminology
• Dimensional Anchor (Anchor Point) – point on the circle representing a dimension
• Point – representation of record(s) within the circle
RadViz Mathematical Formulation
$$x_i = \frac{\sum_{j=1}^{d} a_{i,j}\cos\theta_j}{\sum_{j=1}^{d} a_{i,j}}, \quad i = 1,\ldots,n$$
$$y_i = \frac{\sum_{j=1}^{d} a_{i,j}\sin\theta_j}{\sum_{j=1}^{d} a_{i,j}}, \quad i = 1,\ldots,n$$
where:
• $x_i$ and $y_i$ are the resulting transformed coordinates for record $i$
• $\theta_j$ is the angular position on the circle corresponding to dimension $j$
• $a_{i,j}$ is the value for dimension $j$ for record $i$
• $d$ is the number of dimensions and $n$ the number of records
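The formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `radviz`, the equally spaced anchors starting at angle 0, and the convention of placing an all-zero record at the center are assumptions made for the example.

```python
import math

def radviz(records, angles=None):
    """Project records (rows of non-negative values) to RadViz (x, y) points."""
    d = len(records[0])
    if angles is None:
        # Assumption: anchors are equally spaced, starting at angle 0.
        angles = [2 * math.pi * j / d for j in range(d)]
    points = []
    for row in records:
        total = sum(row)
        if total == 0:
            # The formula is undefined here; the center is an assumed convention.
            points.append((0.0, 0.0))
            continue
        x = sum(a * math.cos(t) for a, t in zip(row, angles)) / total
        y = sum(a * math.sin(t) for a, t in zip(row, angles)) / total
        points.append((x, y))
    return points

# Three records over four dimensions (anchors at 0, 90, 180 and 270 degrees).
points = radviz([[1, 1, 1, 1], [1, 0, 0, 0], [2, 2, 0, 0]])
```

With these inputs, the record with all values equal lands at the center, the record with a single non-zero value lands on its anchor, and the record with two equal values lands midway between the two anchors.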
Impact of Exchanging Dimensional Anchors
[Figure: two RadViz layouts with the anchors A, B, C and D in different orders; the record (1, 0, 1, 0) is plotted in each.]
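The effect of exchanging anchors can be checked numerically for the record (1, 0, 1, 0). The sketch below assumes equally spaced anchors, and the two anchor orders are chosen for illustration; they are not read from the slide. The helper `radviz_point` is a hypothetical name.

```python
import math

def radviz_point(values, anchor_order, dimension_names):
    """Position of one record when anchors are placed in anchor_order around the circle."""
    d = len(anchor_order)
    # Assumption: anchors are equally spaced, the first one at angle 0.
    angle = {name: 2 * math.pi * k / d for k, name in enumerate(anchor_order)}
    total = sum(values)
    x = sum(v * math.cos(angle[n]) for v, n in zip(values, dimension_names)) / total
    y = sum(v * math.sin(angle[n]) for v, n in zip(values, dimension_names)) / total
    return round(x, 6), round(y, 6)

names = ["A", "B", "C", "D"]
record = (1, 0, 1, 0)

# A and C sit opposite each other, so their pulls cancel.
opposite = radviz_point(record, ["A", "B", "C", "D"], names)
# A and C sit next to each other, so the record moves toward them.
adjacent = radviz_point(record, ["A", "C", "B", "D"], names)
```

In the first order the record falls at the center; in the second it moves to (0.5, 0.5), away from the center. The same record therefore carries more visible information under one anchor order than under the other.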
Example of Repositioning Anchor Points Using Layout Algorithm
[Figure panels: Before repositioning, After repositioning]
Multiple Clustered Datasets
• Clustering algorithms are heuristic, not optimal
• Different clustering algorithms tend to generate different clusters
Sample Multiple Clustered Dataset
Record  Algorithm A  Algorithm B  Algorithm C
1       1            2            3
2       1            2            3
3       1            2            1
4       2            1            1
5       2            3            2
6       2            3            4
Stable Group of Records
Record  Algorithm A  Algorithm B  Algorithm C
1       1            2            3
2       1            2            3
3       1            2            1
4       2            1            1
5       2            3            2
6       2            3            4
Uniquely Clustered Record
Record  Algorithm A  Algorithm B  Algorithm C
1       1            2            3
2       1            2            3
3       1            2            1
4       2            1            1
5       2            3            2
6       2            3            4
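One way to look for stable groups in the sample table is to group records by their full tuple of cluster assignments. The sketch below assumes that "stable" means identical assignments under every algorithm; the slides may use a looser definition, and the names `group_by_signature`, `stable_groups` and `singletons` are introduced only for this example.

```python
from collections import defaultdict

# Cluster assignments from the sample table: record -> (Algorithm A, B, C).
assignments = {
    1: (1, 2, 3),
    2: (1, 2, 3),
    3: (1, 2, 1),
    4: (2, 1, 1),
    5: (2, 3, 2),
    6: (2, 3, 4),
}

def group_by_signature(assignments):
    """Group records that share the same cluster under every algorithm."""
    groups = defaultdict(list)
    for record, signature in assignments.items():
        groups[signature].append(record)
    return dict(groups)

groups = group_by_signature(assignments)
# Assumed definition: a stable group is a set of two or more records
# that every algorithm places together.
stable_groups = [recs for recs in groups.values() if len(recs) > 1]
# Records whose combination of assignments is shared with no other record.
singletons = [recs[0] for recs in groups.values() if len(recs) == 1]
```

Under this definition, records 1 and 2 form the only stable group in the table, and records 3 to 6 each have a combination of assignments that no other record shares.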
Fuzzy Clusters
• A record belongs to multiple clusters
• Varying strengths of association

Record  Cluster 1  Cluster 2  Cluster 3  Cluster 4
1       .8         .1         .1         0
2       .5         .4         0          .1
3       .3         .2         .3         .2
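Because the membership strengths are already numeric, they can be fed to the RadViz formula with one anchor per cluster. The following sketch does this for the three records in the table; the equally spaced anchor placement and the helper name `radviz_position` are assumptions for illustration.

```python
import math

# Membership strengths from the table: record -> (Cluster 1, 2, 3, 4).
memberships = {
    1: (0.8, 0.1, 0.1, 0.0),
    2: (0.5, 0.4, 0.0, 0.1),
    3: (0.3, 0.2, 0.3, 0.2),
}

def radviz_position(values):
    """RadViz position with one equally spaced anchor per cluster (assumed layout)."""
    d = len(values)
    total = sum(values)
    x = sum(v * math.cos(2 * math.pi * j / d) for j, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * j / d) for j, v in enumerate(values)) / total
    return round(x, 3), round(y, 3)

positions = {rec: radviz_position(m) for rec, m in memberships.items()}
```

Record 1, which is strongly associated with Cluster 1, lands close to that anchor, while record 3, whose memberships are nearly balanced, lands at the center under this layout.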
Cluster Ensemble vs. Fuzzy Clustering
Cluster ensemble                                         | Fuzzy cluster
Multiple instances of hard clustering                    | A record belongs to multiple clusters
Each record is assigned to one cluster in each instance  | Varying strengths of association
Using RadViz to Analyze Multiple Clustered Datasets
• RadViz typically deals with real numbers
• A cluster number is a nominal label, not a magnitude, so it cannot be used directly as a RadViz coordinate value
• How do you produce a meaningful RadViz visualization?
Flattening of Categorical Data
• Break up each original dimension into multiple dimensions
• Each new dimension represents a value of the original dimension
Flattening a Dimension
Original dimensions: Manufacturer, Model, Type, Price
Flattened dimensions: Manufacturer, Model, Small, Large, Sporty, Van, Price
Original Record: (Cadillac, Deville, Large, 33)
Flattened Record: (Cadillac, Deville, 0, 1, 0, 0, 33)
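The flattening step for the car record can be written as a small helper. This is a sketch under the assumption that records are tuples and that the category list (Small, Large, Sporty, Van) is known in advance; the function name `flatten_dimension` is hypothetical.

```python
def flatten_dimension(record, index, categories):
    """Replace the categorical value at `index` with one 0/1 value per category."""
    value = record[index]
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    indicators = tuple(1 if value == c else 0 for c in categories)
    return record[:index] + indicators + record[index + 1:]

# Category order taken from the flattened dimension list.
type_categories = ("Small", "Large", "Sporty", "Van")
original = ("Cadillac", "Deville", "Large", 33)
flattened = flatten_dimension(original, 2, type_categories)
```

Here `flattened` reproduces the flattened record shown above: the single Type value becomes four indicator values, with a 1 only in the Large position.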
Flattening Multi Cluster Dataset
Original dimensions: Algorithm A (clusters 1, 2), Algorithm B (clusters 1, 2, 3), Algorithm C (clusters 1, 2, 3, 4)
Flattened dimensions: A-1, A-2, B-1, B-2, B-3, C-1, C-2, C-3, C-4
Sample record: (2, 1, 4) becomes (0, 1, 1, 0, 0, 0, 0, 0, 1)
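The same idea applied to a multi-cluster record can be sketched as follows. The function names and the "A-1" style labels are assumptions that follow the Cluster Set-Cluster Number convention; cluster numbers are assumed to start at 1.

```python
def flatten_ensemble(record, clusters_per_algorithm):
    """Turn one cluster number per algorithm into a single 0/1 vector."""
    flat = []
    for cluster, k in zip(record, clusters_per_algorithm):
        if not 1 <= cluster <= k:
            raise ValueError(f"cluster {cluster} outside 1..{k}")
        # One indicator per possible cluster of this algorithm.
        flat.extend(1 if cluster == c else 0 for c in range(1, k + 1))
    return tuple(flat)

def flattened_labels(algorithm_names, clusters_per_algorithm):
    """Labels for the flattened dimensions, e.g. 'A-1' for Algorithm A, cluster 1."""
    return [f"{name}-{c}"
            for name, k in zip(algorithm_names, clusters_per_algorithm)
            for c in range(1, k + 1)]

# Algorithm A has 2 clusters, B has 3, C has 4, as in the sample dataset.
flat = flatten_ensemble((2, 1, 4), (2, 3, 4))
labels = flattened_labels(("A", "B", "C"), (2, 3, 4))
```

Each flattened record contains exactly one 1 per algorithm, so in RadViz every record is pulled equally by as many anchors as there are cluster sets.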
Simple Example
• Iris dataset
• Three cluster sets
  – KM1: K-means clustering with 1000 iterations
  – KM2: K-means clustering with 100,000 iterations
  – HC: hierarchical clustering
• Ten clusters per cluster set
Flattened Multi-cluster Iris Dataset
[Figure: RadViz plot; legend: KM1 Color Scale (1 to 10); labeled anchor: HC-6.]
Flattened Multi-cluster Iris Dataset - Jittered
[Figure: the same plot with jitter applied; legend: KM1 Color Scale (1 to 10); labeled anchor: HC-6.]
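Records with identical cluster assignments map to exactly the same RadViz position, which is why a jittered view helps. The slides do not say how the jitter was produced; the sketch below assumes a small uniform random offset per point, with a hypothetical `radius` parameter and a fixed seed so the result is repeatable.

```python
import random

def jitter(points, radius=0.02, seed=0):
    """Add a small uniform random offset to each (x, y) point."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same layout
    jittered = []
    for x, y in points:
        dx = rng.uniform(-radius, radius)
        dy = rng.uniform(-radius, radius)
        jittered.append((x + dx, y + dy))
    return jittered

# Three records that fall on exactly the same position (illustrative values).
overlapping = [(0.25, 0.1)] * 3
spread = jitter(overlapping)
```

After jittering, the three points no longer coincide, yet each stays within `radius` of its original position, so the number of overlapping records becomes visible without distorting the layout much.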
Repositioning Dimensional Anchors
• Move points away from the center
• Separate points
• Increase displayed information content
Class Discrimination Layout Algorithm
• Select a dimension that classifies the records
• Assign each dimension to the class with the highest values with respect to the other classes
• Move the dimensional anchors assigned to the same class next to each other to form a classification sector
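The three steps above can be sketched as follows. This is an illustration only: it assumes "highest values" is measured by the per-class mean of each dimension, which may differ from the statistic used in the dissertation, and the function name, toy rows and dimension names are hypothetical.

```python
from collections import defaultdict

def class_discrimination_order(rows, classes, dimension_names):
    """Assign each dimension to the class with the highest mean, then group anchors."""
    d = len(dimension_names)
    sums = defaultdict(lambda: [0.0] * d)
    counts = defaultdict(int)
    for row, cls in zip(rows, classes):
        counts[cls] += 1
        for j, value in enumerate(row):
            sums[cls][j] += value
    # Assumption: a class "has the highest values" when its mean is largest.
    means = {c: [s / counts[c] for s in sums[c]] for c in sums}

    sectors = defaultdict(list)
    for j, name in enumerate(dimension_names):
        best = max(sorted(means), key=lambda c: means[c][j])
        sectors[best].append(name)

    # Anchors of the same class are placed next to each other, sector by sector.
    order = [name for c in sorted(sectors) for name in sectors[c]]
    return dict(sectors), order

# Toy data: four records, two classes, four dimensions (illustrative values).
rows = [[5, 1, 4, 0], [6, 0, 5, 1], [1, 7, 0, 6], [0, 8, 1, 5]]
classes = [1, 1, 2, 2]
sectors, order = class_discrimination_order(rows, classes, ["d1", "d2", "d3", "d4"])
```

In the toy data, d1 and d3 are highest for class 1 and d2 and d4 for class 2, so the resulting anchor order puts d1 next to d3 and d2 next to d4, forming two classification sectors.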
Example of Class Discrimination Layout Algorithm
[Figure panels: Before, After; the After panel marks Classification Sector 1 and Classification Sector 2; legend: Class 1, 2.]
After Repositioning Dimensional Anchors
[Figure sequence of seven slides: RadViz plots after repositioning; legend: KM1 Cluster Size (5, 10, 20 and 30 records); anchor KM1-2 is labeled on one slide.]
Moving Similar Classification Sectors Close to Each Other
• Dimensions have been grouped together into classification sectors
• Determine which record classes are most similar to each other using Euclidean distances
• Move those classification sectors closer to each other using a greedy algorithm
• Records will tend to be moved away from the center
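One possible greedy strategy for ordering the sectors is a nearest-neighbor chain over class centroids. The slides do not specify the greedy rule, so the sketch below is an assumption throughout: it starts from the first class, repeatedly appends the closest remaining class by Euclidean distance, and uses made-up centroids.

```python
import math

def greedy_sector_order(centroids):
    """Order classes so that each one is followed by its nearest remaining class."""
    remaining = sorted(centroids)
    order = [remaining.pop(0)]  # assumption: start from the first class label
    while remaining:
        last = centroids[order[-1]]
        nearest = min(remaining, key=lambda c: math.dist(last, centroids[c]))
        remaining.remove(nearest)
        order.append(nearest)
    return order

# Hypothetical class centroids (mean record of each class), not from the slides.
centroids = {1: [0.0, 0.0], 2: [10.0, 0.0], 3: [1.0, 0.0], 4: [9.0, 1.0]}
order = greedy_sector_order(centroids)
```

For these centroids the chain is 1, 3, 4, 2: similar classes end up in neighboring sectors. A greedy chain like this is fast but not guaranteed to find the best overall arrangement.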
Repositioning Classification Sectors
[Figure: RadViz plot after repositioning classification sectors; legend: KM1 Color Scale (1 to 10); labels 1 to 10 in the plot; anchor labels: HC-8, HC-7, KM2-3, KM2-8, KM2-1, HC-10, KM2-9, KM1-6, HC-2, KM2-4, KM1-5, KM2-6, HC-1, KM1-9, KM2-7, HC-4, KM1-10, KM2-2, KM1-1, HC-6, KM1-2, HC-5, KM2-10, KM1-3, KM2-5, KM1-7, HC-9, KM1-4, HC-3, KM1-8.]
Interpreting Vectorized RadViz (VRV)
[Figure sequence of five slides: Iris records plotted against Sepal length and Petal length; legend: Setosa, Versicolor, Virginica.]
Salamander Gene Expression Levels
[Figure: line plot; axes: Time, Expression Levels.]
Salamander Class 9 Genes
[Figure: genes Nvg00226, Nvg00155, Nvg00111, Nvg00091.]
• Nvg00111
  – “Key” gene
  – CXC chemokine, ligand 10
• Nvg00226
  – No homology
• Nvg00155
  – Keratin type II cytoskeletal
• Nvg00091
  – Annexin
Fuzzy Clusters
Description of Fuzzy Clusters
• K-means clustering algorithm used
• Four clusters
• Applied to Iris dataset
RadViz Visualization of Fuzzy Clusters
[Figure: RadViz plot with anchors Cluster 1, Cluster 2, Cluster 3 and Cluster 4; legend: Setosa, Versicolor, Virginica, Outlier; annotation: area of Versicolor and Virginica overlap.]
Scatterplot Visualization of Iris Dataset
[Figure: scatterplot; axes: Sepal Length, Petal Length; legend: Setosa, Versicolor, Virginica, Outlier.]
Comparing Visualizations of Fuzzy Clusters
[Figure: scatterplot (axes: Sepal Length, Petal Length) shown with a RadViz plot (anchors: Cluster 1, Cluster 2, Cluster 3, Cluster 4).]
RadViz Visualization of Fuzzy Clusters
[Figure: RadViz plot with anchors Cluster 1, Cluster 2, Cluster 3 and Cluster 4; legend: Setosa, Versicolor, Virginica; annotated regions: Virginica outlier, Overlap, Central.]
Vectorized RadViz Visualization of Iris Cluster Ensemble
[Figure: VRV plot; legend: Setosa, Versicolor, Virginica; annotated regions: Virginica outlier, Overlap, Central.]
Key to dimension labeling: Cluster Set-Cluster Number, e.g. KM1-3 is K-means set 1, cluster number 3
Comparison of RadViz Visualizations
[Figure, shown on two slides: panels Fuzzy Clusters (anchors Cluster 1, Cluster 2, Cluster 3, Cluster 4) and Cluster Ensemble - VRV; annotation on the second slide: Virginica outlier.]
RV Visualization of Fuzzy Clusters: Newt Microarray Dataset
[Figure: RV plot with labeled regions Central, Group A, Group B, Group C.]
VRV Visualization of Cluster Ensemble: Newt Microarray Dataset
[Figure: VRV plot with labeled regions Group A, Group B, Group C.]
Decision Trees
Decision to Play Tennis
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
VRV Applied to an Ordered Numerical Dataset
Adult Income Dataset
Income category (<$50,000, >$50,000) as a function of:
• Age
• Work class
• Education
• Marital status
• Occupation
• Relationship
• Race
• Gender
• Capital gain
• Capital loss
• Hours per week
• Native country
VRV Applied to the Adult Dataset
[Figure: VRV plot of the Adult dataset; legend: < $50,000, > $50,000.]
VRV as a Classifier
• < $50,000: 48% correct
• > $50,000: 89% correct
Records Predicted as High Income: Moderate Case
Records Predicted as High Income: Extreme Case
Summary of Results of VRV Classification of Adult Dataset
[Bar chart: Percent Correct (0 to 100) by Income Category (<=50K, >50K, Total); series: Split in half, Increased low income, Extreme low income.]
Summary of Results of VRV Classification of Adult Dataset Compared to J48 Algorithm
[Bar chart: same axes and series, with the J48 Classification Algorithm added for comparison.]
Problems Binning Quantitative Data
Source: Iris dataset
Contributions
1. Vectorized RadViz
   1. Application to cluster ensembles provides the capability to visually simulate the identification of stable and unstable clusters.
   2. Identified several methods to evaluate stability of clusters using characteristics of a VRV visualization.
   3. Improved the dimensional anchor layout algorithm by moving classification sectors.
   4. Used RV to visualize decision trees.
   5. Identified problems when applied to ordered numerical data.
   6. Successfully applied to microarray data.
Contributions (cont’d)
2. Fuzzy Clusters
   1. Developed method to visualize fuzzy clusters using RV.
   2. Developed method to visually compare results of fuzzy clusters and cluster ensembles applied to the same dataset.
Recommendations
• Adding information to plotted points
• Ordering of dimensions within classification sectors
• Selection of base classifier
• Investigate visualization of complex decision trees
• Investigate the optimum position of the classification sectors
Thank you