Recent Advances in Multivariate Data Visualisation
iCSC2010, Benjamin Radburn-Smith, Manchester/RAL
Recent Advances in Multivariate Data Visualisation
Benjamin Radburn-Smith
University of Manchester/STFC Rutherford Appleton Laboratory
Inverted CERN School of Computing, 8-9 March 2010
Outline
Introduction: conventional data visualisations and multivariate data
Overview of data visualisations available
Parallel Coordinates
Grand Tour
General techniques for interactive data mining/exploration
Classifiers and their links to visualisations
Conclusions
Introduction
Visualising data is a very powerful technique
In High Energy Physics (HEP) we can use computer graphics to see the detector geometry and visualise the particle hits in an event
For example in cmsShow/Fireworks:
But visualising data still relies on conventional techniques
Introduction
The majority of data visualisations in the world are either scatter plots or histograms
The same is true in particle physics
These are limited to 2 dimensions, or 3 with the use of computer graphics, with a 4th shown using colour (for example)
Introduction
Exception in HEP: Dalitz Plot
Used in flavour physics for three-body decays
Scatter plot showing the relevant kinematic information
Each axis shows the invariant mass squared of two of the three particles
Shows the signatures of particles produced at an intermediate stage in the decay, as well as other interesting effects, e.g. interference
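As a minimal sketch of what goes on each axis of a Dalitz plot, the invariant mass squared of a pair of particles can be computed from their four-momenta. The function name and the example decay (three massless particles, each with energy 1, emitted at 120 degrees to each other from a parent at rest, in units where c = 1) are illustrative assumptions, not taken from the talk.

```python
import math

def invariant_mass_squared(p_a, p_b):
    """Invariant mass squared of two particles given as (E, px, py, pz)."""
    e = p_a[0] + p_b[0]
    px = p_a[1] + p_b[1]
    py = p_a[2] + p_b[2]
    pz = p_a[3] + p_b[3]
    return e * e - (px * px + py * py + pz * pz)

# Hypothetical final state: three massless particles in the parent rest frame.
h = math.sqrt(3.0) / 2.0
p1 = (1.0, 1.0, 0.0, 0.0)
p2 = (1.0, -0.5, h, 0.0)
p3 = (1.0, -0.5, -h, 0.0)

m12_sq = invariant_mass_squared(p1, p2)  # one Dalitz axis
m23_sq = invariant_mass_squared(p2, p3)  # the other Dalitz axis
m13_sq = invariant_mass_squared(p1, p3)  # fixed by the other two
```

For a three-body decay the three pair masses are not independent: their squares sum to the parent mass squared plus the three daughter masses squared, which is why two axes are enough.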
Introduction
A lot of data in the world has many attributes/variables. It is multivariate!
An example of this is particle physics data:
(From TMVA)
Introduction
So we have some nice methods of visualising bivariate, low-dimensional data. However, a lot of data is multivariate
Problem: visualising data in more than 3D is rather hard
  Curse of dimensionality: it is difficult to find interesting projections of high-dimensional data sets
Many data visualisations are available:
  More on scatter plots, e.g. scatter plot matrix
  Heat map (extension of scatter plot) and height map
  Polar charts
  RadViz and PolyViz
  Parallel coordinates
  Grand tour
Visualisation Overview
Scatter plots can be binned to become box plots, linked to create line graphs or fitted with curves
It is possible to show a scatter plot matrix: a grid containing the 2D scatter plot for every pair of variables
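A short sketch of why the scatter plot matrix stops scaling: the number of distinct panels grows as p(p-1)/2 with the number of variables p. The variable names below are placeholders.

```python
from itertools import combinations

variables = ["A", "B", "C", "D", "E"]  # placeholder variable names

# Every unordered pair of variables gets one 2D scatter plot.
pairs = list(combinations(variables, 2))
n_panels = len(pairs)  # p * (p - 1) / 2 distinct panels
```

With 5 variables this is still a readable grid; with a few dozen variables the number of panels runs into the hundreds.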
ggobi
Box plot
Visualisation Overview
Heat map (extension of scatter plot)
  Plot is divided into cells
  The density of data points in each cell is represented by colour, for example
A similar technique is sometimes used in HEP
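The binning behind a heat map can be sketched as follows. This is a plain-Python illustration with a hypothetical `bin_counts` helper and made-up points; a plotting package would then map each count to a colour (or, for a height map, to a height).

```python
def bin_counts(points, x_range, y_range, n_x, n_y):
    """Count points per cell on an n_x by n_y grid; result is counts[iy][ix]."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    counts = [[0] * n_x for _ in range(n_y)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # point lies outside the plotted region
        # Points exactly on the upper edge go into the last cell.
        ix = min(int((x - x_min) / (x_max - x_min) * n_x), n_x - 1)
        iy = min(int((y - y_min) / (y_max - y_min) * n_y), n_y - 1)
        counts[iy][ix] += 1
    return counts

points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.8), (0.5, 0.5)]  # made-up data
grid = bin_counts(points, (0.0, 1.0), (0.0, 1.0), 2, 2)
```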
Visualisation Overview
Height map, a further extension
  Use height instead of colour to represent the density
  Can use a combination of both
  Easy to understand
  Small cell size = continuous map
root.cern.ch
Visualisation Overview
Advantages of Heat map and Height map
  Intuitive visualisation
  Quickly able to get a feel for the data
Problems of Heat map and Height map
  Still limited to bivariate 2D data
  Can be somewhat imprecise
Visualisation Overview
Polar Charts & Radar Charts
  Line charts wrapped around a centre point
  Circular plot which uses polar coordinates
  Map data onto a 2D surface according to angle and radius
advsofteng.com
Visualisation Overview
RadViz: Radial Coordinate Visualization
  Each variable is set around the edge of the plot as a ‘dimensional anchor’
  Springs are attached between each data point and the anchors, each with a strength proportional to the data point’s value for that dimension
  Each data point is drawn where the sum of the spring forces on it is 0
  A lot of work in this area, e.g. vectorised RadViz: separating out multiple clusters
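The spring picture has a simple closed form: if the spring to anchor j has strength equal to the (scaled) value v_j, the forces balance at the weighted average of the anchor positions. The sketch below assumes the anchors are evenly spaced on the unit circle and that each variable has already been scaled to [0, 1]; the function name is hypothetical.

```python
import math

def radviz_position(values):
    """Equilibrium position of one data point; values already scaled to [0, 1]."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls; placing it at the centre is a convention
    # Anchor j sits at angle 2*pi*j/n on the unit circle.
    x = sum(v * math.cos(2 * math.pi * j / n) for j, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * j / n) for j, v in enumerate(values)) / total
    return (x, y)

on_anchor = radviz_position([1.0, 0.0, 0.0, 0.0])  # pulled entirely to the first anchor
centre = radviz_position([1.0, 1.0, 1.0, 1.0])     # equal pulls cancel out
```

The second call shows the crowding problem: very different data points (all values high, all values low but equal) land on the same spot.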
Visualisation Overview
Advantages of RadViz
  Do not need to use projections to view the data
  Provides a global view of the data
  Can identify data relations and patterns
Disadvantages of RadViz
  Layout problem: the position of the dimensions is important
    There are some algorithms available which find the best positions
  Crowded plots: too much data occupying a small space
Visualisation Overview
PolyViz
  Similar to RadViz, but the dimensional anchors are lines, not points
  Gives distributions of the data for each dimension
  But suffers from the same problems as RadViz
Parallel Coordinates
Dates back to 1885, when Maurice d’Ocagne used it as a method of geometric transformations
Invented by Alfred Inselberg in 1985
Developed by Edward Wegman in 1990 as a multivariate data analysis tool
Principle:
A      B
2.63   805.31
2D Cartesian ↔ 2D Parallel
Parallel Coordinates
Not limited to 2 or 3 dimensions
For example a 5D plot:
Can think of each axis as a 1 dimensional view with the maximum value at the top and the minimum value at the bottom
A      B        C       D        E
2.63   805.31   36.19   -13.54   0.05
Parallel Coordinates
With more data
From this:
A      B        C       D        E
2.63   805.31   36.19   -13.54   0.05
3.28   648.77   97.16   86.18    -0.08
6.4    1056.5   55.46   164.57   -1.78
To this:
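The mapping from table rows to polylines can be sketched as follows, using the three rows from this slide. Each variable is scaled so that its minimum sits at the bottom of its axis (height 0) and its maximum at the top (height 1). The function name and the (axis index, height) vertex format are assumptions for illustration.

```python
rows = [
    [2.63, 805.31, 36.19, -13.54, 0.05],
    [3.28, 648.77, 97.16, 86.18, -0.08],
    [6.4, 1056.5, 55.46, 164.57, -1.78],
]

def to_polylines(rows):
    """One polyline per row, as a list of (axis_index, height) vertices."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for row in rows:
        line = []
        for k, value in enumerate(row):
            span = highs[k] - lows[k]
            # A constant column has no spread; draw it at mid-height.
            height = (value - lows[k]) / span if span else 0.5
            line.append((k, height))
        polylines.append(line)
    return polylines

polylines = to_polylines(rows)
```

Drawing each list of vertices as a connected line, with axis k at horizontal position k, gives the parallel coordinate plot.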
Parallel Coordinates
Data correlations between the variables on adjacent axes
Positive correlation: Lines are parallel
Negative correlation: Lines intersect
Uncorrelated: Lines are random
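One way to make "parallel" versus "intersecting" quantitative is to count how often pairs of polylines cross between two adjacent axes. Two lines cross exactly when the two data instances swap order between the axes. This is a sketch with a hypothetical function name and made-up values; it assumes both axes are drawn with the same orientation and that there are no tied values.

```python
from itertools import combinations

def crossing_fraction(a_values, b_values):
    """Fraction of pairs of polylines that cross between adjacent axes A and B."""
    pairs = list(combinations(range(len(a_values)), 2))
    crossings = sum(
        1
        for i, j in pairs
        if (a_values[i] - a_values[j]) * (b_values[i] - b_values[j]) < 0
    )
    return crossings / len(pairs)

a = [1.0, 2.0, 3.0, 4.0]
same_order = crossing_fraction(a, [10.0, 20.0, 30.0, 40.0])      # no crossings
reversed_order = crossing_fraction(a, [40.0, 30.0, 20.0, 10.0])  # every pair crosses
```

A fraction near 0 corresponds to positive correlation, near 1 to negative correlation, and near 0.5 to uncorrelated data. Without ties, Kendall's rank correlation equals 1 minus twice this fraction.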
Parallel Coordinates
Examples of data correlations
Mostly positive correlation
Mostly uncorrelated
Parallel Coordinates
Dualities: Point ↔ Line
A point in 2D Cartesian is represented by a line in parallel coordinates
A line of points in 2D Cartesian is represented by a series of lines that intersect at a point in parallel coordinates
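The point ↔ line duality can be checked numerically. Assume the A axis is drawn at horizontal position 0 and the B axis at position d. A Cartesian point (a, b) becomes the line through (0, a) and (d, b), and all points on B = m*A + c (with m not equal to 1) give lines through the single point (d/(1-m), c/(1-m)). The function names and the example values are illustrative.

```python
def parallel_line(a, b, d=1.0):
    """Cartesian point (a, b) as a line in the parallel view: (slope, intercept)."""
    return ((b - a) / d, a)

def dual_point(m, c, d=1.0):
    """Common intersection of the lines dual to points on B = m*A + c (m != 1)."""
    return (d / (1.0 - m), c / (1.0 - m))

m, c = -2.0, 3.0                   # example Cartesian line B = -2*A + 3
x_star, y_star = dual_point(m, c)  # where its dual lines should meet

heights = []
for a in [-1.0, 0.0, 2.5]:         # three points on the Cartesian line
    slope, intercept = parallel_line(a, m * a + c)
    heights.append(slope * x_star + intercept)  # height of the dual line at x_star
```

For negative m the meeting point lies between the two axes, which matches the rule that negatively correlated data gives intersecting lines. For m = 1 the dual lines are parallel and never meet.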
Parallel Coordinates
Dualities: Rotation ↔ Translation
Rotating the line in the 2D Cartesian system moves the intersection point in the parallel coordinate plot
Moving a point in 2D Cartesian along an axis (e.g. along B=0) rotates the corresponding line in the parallel view
Parallel Coordinates
Dualities:
  Hyperbola ↔ Ellipse
  Cusp ↔ Inflection
So a circular spread of data in a 2D Cartesian plot appears as a hyperbolic collection of lines in the parallel view
Parallel Coordinates
Cuts / Pruning
  These shapes show the effect of deleting data instances in the parallel coordinate view on the corresponding scatter plot
  All data instances which pass through the selected region, marked blue, are removed from the plots
Parallel Coordinates
Curved Line Interpolation
  It can be difficult to trace where straight lines go if multiple instances meet at the same point on an axis
  Curved lines can help the user trace the path of a data instance, i.e. its polyline
Parallel Coordinates
Parallel coordinate density plot: made by using the transparency of the data on the plot
  Can see the internal structure of the data
  Can view large amounts of data
High transparency
Low transparency
Parallel Coordinates
Advantages of parallel coordinates:
  Ability to see all the multivariate data on one plot
  Find interesting variables to investigate quickly
  Find interesting data to investigate quickly (e.g. outliers that skew the datasets)
  Seek patterns which may help classify the data
Problems with parallel coordinates:
  Ordering of the axes is important when looking at correlations between the variables
  May take a little while to get used to
  Suffers from overplotting, unless density plots are used
Grand Tour
Work on using computer graphics to view projections of high-dimensional data started at SLAC in the 1970s and 1980s
Started with Mary Fisherkeller, John Tukey and Jerome Friedman et al. on the PRIM-9 system
  Picturing, Rotation, Isolation and Masking in up to 9D
  Developed at the Graphics Interpretation Facility (GIF) at SLAC using particle physics datasets (bubble chamber)
  Rotate pairs of axes and view the result of the rotation in a 2D projection
This work led to the idea of Projection Pursuit
  Automatically finds interesting low-dimensional projections of multivariate data by optimising a projection index
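A toy version of projection pursuit, to show what "optimising a projection index" means in practice. The index used here (the absolute excess kurtosis of a 1D projection) is chosen only for brevity and is not the index from the original work; the two-cluster dataset, the fixed random seed and the brute-force scan over angles are likewise assumptions.

```python
import math
import random

def excess_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

def projection_index(data, angle):
    """Toy index: how non-Gaussian the 1D projection along `angle` looks."""
    c, s = math.cos(angle), math.sin(angle)
    return abs(excess_kurtosis([c * x + s * y for x, y in data]))

rng = random.Random(0)
# Made-up data: two clusters separated along x, plain Gaussian noise along y.
data = [
    (rng.choice([-3.0, 3.0]) + rng.gauss(0.0, 0.5), rng.gauss(0.0, 1.0))
    for _ in range(2000)
]

# Brute-force "pursuit": scan projection directions in 1 degree steps.
angles = [math.radians(k) for k in range(180)]
best_angle = max(angles, key=lambda t: projection_index(data, t))
```

The scan should settle on a direction close to the x axis, the one along which the two clusters separate. Real implementations replace the scan with a numerical optimiser and work in many more dimensions.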
Grand Tour
Invented by Daniel Asimov in 1985 and by Asimov and Andreas Buja in 1986
The grand tour shows high-dimensional data rotations, in a similar fashion to a 3D data rotation
  But in a 3D rotation: rotate an object in space
  While in higher dimensions: rotate a lower-dimensional projection in the high-dimensional space
Rotating data in 2D is around a point
Rotating data in 3D is around a line (axis)
Rotating data in nD is around a hyperplane, where n>3
  A hyperplane is a generalisation of a plane, where the plane is defined as a 2D subspace in 3D
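The basic building block of a high-dimensional rotation can be sketched as a rotation in a single coordinate plane, which is the "rotate a pair of axes" operation. In p dimensions such a rotation changes only the two chosen coordinates and leaves the other p-2 untouched. The helper names and the 5D example point are illustrative.

```python
import math

def plane_rotation(p, i, j, theta):
    """p x p matrix that rotates by theta in the plane of axes i and j."""
    r = [[1.0 if a == b else 0.0 for b in range(p)] for a in range(p)]
    c, s = math.cos(theta), math.sin(theta)
    r[i][i], r[i][j] = c, -s
    r[j][i], r[j][j] = s, c
    return r

def apply(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

point = [1.0, 2.0, 3.0, 4.0, 5.0]
# Quarter turn in the plane spanned by the first and fourth axes.
rotated = apply(plane_rotation(5, 0, 3, math.pi / 2), point)
```

Only the first and fourth coordinates change; the second, third and fifth pass through unchanged. General rotations can be built up by multiplying several of these plane rotations together.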
Grand Tour
Goal: show a series of projections, originally 2D planes, of a higher dimensional space
The series of projections is smooth, giving the effect of a movie showing (close to) all the possible 2D projections of the data
Unlike Projection Pursuit where the result is an index, the result of a grand tour is the movie itself
Grand Tour
The conditions of a grand tour
The sequence of planes (projections) should:
  be dense in the space of all planes, so that it comes close to any 2D projection
  become dense rapidly, by using an efficient algorithm
  be uniformly distributed, so that it doesn't linger in one area
  be continuous, to be comprehensible
  be reconstructable, e.g. an interesting plane should be recovered easily after the tour
Grand Tour
How the grand tour chooses the projections to show depends on which algorithm is used
The algorithm used has to obey the conditions mentioned on the previous slide
  It has to be a continuous, space-filling path through the set of 2D subspaces in p-dimensional space
Various algorithms can be used
  Torus winding method
  Random curve
  Fractal algorithm
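A simplified sketch in the spirit of the torus winding method: every coordinate plane gets its own rotation angle, each angle advances at its own fixed rate, and the first two columns of the combined rotation give the 2D projection frame at time t. The rates below (square roots of distinct primes) are an assumed choice, made so that no two rates have a rational ratio and the set of angles never repeats exactly. The published method uses a specific, smaller set of rotations, and this sketch makes no claim to satisfy all the conditions listed for a grand tour.

```python
import math
from itertools import combinations

def plane_rotation(p, i, j, theta):
    r = [[1.0 if a == b else 0.0 for b in range(p)] for a in range(p)]
    c, s = math.cos(theta), math.sin(theta)
    r[i][i], r[i][j] = c, -s
    r[j][i], r[j][j] = s, c
    return r

def matmul(a, b):
    n = len(a)
    return [[sum(a[row][k] * b[k][col] for k in range(n)) for col in range(n)]
            for row in range(n)]

def torus_frame(p, t, rates):
    """Two orthonormal p-vectors spanning the projection plane at time t."""
    q = [[1.0 if a == b else 0.0 for b in range(p)] for a in range(p)]
    for (i, j), rate in zip(combinations(range(p), 2), rates):
        q = matmul(q, plane_rotation(p, i, j, rate * t))
    return [q[k][0] for k in range(p)], [q[k][1] for k in range(p)]

def project(point, frame):
    e1, e2 = frame
    return (sum(a * b for a, b in zip(e1, point)),
            sum(a * b for a, b in zip(e2, point)))

p = 4
rates = [math.sqrt(n) for n in (2, 3, 5, 7, 11, 13)]  # one rate per coordinate plane
frame = torus_frame(p, 0.37, rates)
xy = project([1.0, 2.0, 3.0, 4.0], frame)  # 2D position of one 4D point in this movie frame
```

Stepping t forward in small increments and redrawing the projected points produces the movie; recording t makes an interesting frame easy to recover afterwards.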
Grand Tour
Using parallel coordinates, a grand tour of the data does not necessarily have to be via 2D projections
Instead it would be a movie of (p-n)-dimensional projections of a p-dimensional space, where n<p
Also possible to have guided tours, where the tour is dictated by the data
  By optimising the Projection Pursuit index
  Find interesting hyperplanes to cut along or classify with
You watch a movie showing a series of projections (2D or higher if using parallel coordinates) of the higher dimensional space and pause it when an interesting projection appears
Grand Tour
Advantages of the grand tour:
  Can lead to interesting hyperplanes in which a cut or classification can be made
  Easy to use due to the automation: you watch a movie!
Problems with the grand tour:
  Difficult to understand the high-dimensional rotations
  Complicated underlying mathematics
General Techniques
Linked Plots:
  Give users the ability to explore their data and find patterns interactively → Exploratory Data Analysis
Brushing
  Highlighting data instances with a colour in one plot automatically updates the same instances in other plots with that colour
Pruning/Deleting Data
  When data is deleted from one plot, the other plots are updated with the relevant data removed.
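The idea behind linked plots can be sketched as shared per-instance state: every view reads the same colour and visibility lists, so a brush or a prune made through one variable shows up in every other view. The record values, function names and colour strings are hypothetical, and no real plotting library is involved.

```python
records = [
    {"A": 2.63, "B": 805.31, "C": 36.19},
    {"A": 3.28, "B": 648.77, "C": 97.16},
    {"A": 6.4, "B": 1056.5, "C": 55.46},
]
colours = ["grey"] * len(records)  # one shared colour per data instance
alive = [True] * len(records)      # one shared visibility flag per data instance

def brush(variable, low, high, colour):
    """Colour every instance whose `variable` lies in [low, high]."""
    for k, rec in enumerate(records):
        if low <= rec[variable] <= high:
            colours[k] = colour

def prune(variable, low, high):
    """Delete every instance whose `variable` lies in [low, high]."""
    for k, rec in enumerate(records):
        if low <= rec[variable] <= high:
            alive[k] = False

def view(x, y):
    """What a scatter plot of x against y would draw: (x, y, colour) tuples."""
    return [(r[x], r[y], colours[k]) for k, r in enumerate(records) if alive[k]]

brush("A", 3.0, 4.0, "red")   # selection made on the A axis
prune("B", 1000.0, 2000.0)    # deletion made on the B axis
bc_view = view("B", "C")      # a different plot picks up both changes
```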
General Techniques
Alpha channel:
  Set the transparency of the data on the plot
  Ability to see all the data, not just those which were drawn last, using RGBA colours
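A small sketch of why transparency turns overplotting into a density estimate. With standard "over" alpha compositing, each additional translucent line covers a fraction alpha of whatever background is still showing through, so the opacity of a pixel grows with the number of lines crossing it. The function name is hypothetical, and identical colour and alpha for every line is an assumption.

```python
def stacked_opacity(alpha, n_lines):
    """Opacity of a pixel after n_lines identical translucent lines are drawn over it."""
    return 1.0 - (1.0 - alpha) ** n_lines

one_line = stacked_opacity(0.5, 1)
two_lines = stacked_opacity(0.5, 2)
ten_faint_lines = stacked_opacity(0.1, 10)
```

A low alpha (high transparency) keeps sparse regions nearly invisible and reserves strong colour for regions where many instances overlap; a high alpha saturates quickly and hides the internal structure.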
Classifiers
Try to classify the data into subgroups
  Such as discriminating between signal and background
Usually done with algorithms such as:
  Neural Networks
  Support Vector Machines
  Linear Discriminant Analysis
  Boosted Decision Trees
The classifiers are often seen as black boxes, where the user does not know how they obtain their results
Classifiers
Can use visualisations to give hints as to which algorithms would work well on the data
In some cases the visualisation techniques can show what the classifier is doing
For example, using visualisations linked to a (supervised) SVM we can:
  Interpret results from the SVM and evaluate its quality
  Choose parameters/kernel functions to boost SVM efficiency (time etc.)
Cooperative method: select data points near the separating boundary. These points are used as support vectors by the RSVM algorithm.
Classifiers
A grand tour of parallel coordinates can give hints on which algorithm would work best on the dataset:
  If the data separates during a tour using the normal dimensions, this suggests an SVM would work well on the data
  If a separation occurs on a combination of dimensions, i.e. along a new variable created by multiplying variables together, this suggests a NN would work well on the data
Sometimes the visualisation may be more powerful than a standard classifier, e.g.
  An SVM is limited to 2 classes, while with parallel coordinates you can classify into as many classes as you want
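A toy illustration of the second case, where the classes do not separate along either original variable but do separate along their product. The dataset (uniform points labelled by the sign of x*y), the fixed seed and the separation measure (difference of class means divided by the overall spread) are all assumptions made for this sketch.

```python
import random
import statistics

def separation(signal, background):
    """|difference of class means| divided by the spread of all values together."""
    spread = statistics.pstdev(signal + background)
    return abs(statistics.mean(signal) - statistics.mean(background)) / spread

rng = random.Random(1)
events = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(2000)]
# Made-up labels: "signal" when x and y have the same sign.
signal = [(x, y) for x, y in events if x * y > 0]
background = [(x, y) for x, y in events if x * y <= 0]

# Along an original dimension the two classes sit on top of each other...
sep_x = separation([x for x, _ in signal], [x for x, _ in background])
# ...but along the product variable they pull apart.
sep_xy = separation([x * y for x, y in signal], [x * y for x, y in background])
```

In a parallel coordinate view this would show up as two colours tangled together on the x and y axes but cleanly split on an added x*y axis.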
Conclusions
Other data visualisations are available for data analysis!
Parallel coordinates is a powerful method of viewing multivariate data
Using the parallel coordinate density plot we can view large amounts of multivariate data and find internal structure
A grand tour of the data is another powerful high dimensional data visualisation – we can find interesting classifying hyperplanes
The coupling of grand tour and parallel coordinates gives us another powerful technique
There are links between algorithm classifiers and visualisations
Conclusions
Shouldn’t be like this:
We should make proper use of data visualisations
  Not just for the sake of it, but to actually help you analyse the data: Exploratory Data Analysis
Question whether a particular visualisation is giving us the most relevant information from the data
Bibliography and Information
A. Inselberg, "Parallel Coordinates: Visual Multidimensional Geometry and Its Applications", 2009
D. Cook and D. F. Swayne, "Interactive and Dynamic Graphics for Data Analysis", 2007
W. L. Martinez and A. R. Martinez, "Exploratory Data Analysis with MATLAB", 2005
Further information, including links to the movies shown, can be found at: http://www.hep.manchester.ac.uk/u/benjamin/iCSC.html