Recent Advances in Multivariate Data Visualisation

iCSC2010, Benjamin Radburn-Smith, Manchester/RAL

Benjamin Radburn-Smith

University of Manchester/STFC Rutherford Appleton Laboratory

Inverted CERN School of Computing, 8-9 March 2010

Outline

Introduction: conventional data visualisations; multivariate data

Overview of data visualisations available

Parallel Coordinates

Grand Tour

General techniques for interactive data mining/exploration

Classifiers and their links to visualisations

Conclusions

Introduction

Visualising data is a very powerful technique

In High Energy Physics (HEP) we can use computer graphics to see the detector geometry and visualise the particle hits in an event

For example in cmsShow/Fireworks:

But visualising data still relies on conventional techniques

Introduction

The majority of data visualisations in the world are either scatter plots or histograms

The same is true in particle physics

These are limited to 2 dimensions, or 3 with the use of computer graphics, with a 4th shown using colour (for example)

Introduction

Exception in HEP: Dalitz Plot

Used in flavour physics for three-body decays

Scatter plot showing the relevant kinematic information

Plot the invariant mass squared of two of the three possible pairs of decay products, one on each axis

Shows particle signatures of an intermediate stage in the decay, as well as other interesting effects, e.g. interference
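
The two Dalitz-plot coordinates for one decay can be sketched as below. This is a minimal illustration, not taken from the talk: the four-vectors, the masses (0.14 in arbitrary units) and the helper names `four_vector` and `invariant_mass_squared` are all assumptions, and natural units (c = 1) are used.

```python
import math

def four_vector(mass, px, py, pz):
    """Return (E, px, py, pz) for a particle with the given mass and momentum."""
    energy = math.sqrt(mass**2 + px**2 + py**2 + pz**2)
    return (energy, px, py, pz)

def invariant_mass_squared(a, b):
    """Invariant mass squared of the pair a + b: E^2 - |p|^2 of the summed four-vector."""
    energy = a[0] + b[0]
    px, py, pz = a[1] + b[1], a[2] + b[2], a[3] + b[3]
    return energy**2 - (px**2 + py**2 + pz**2)

# Hypothetical decay products in the parent rest frame (momenta sum to zero).
p1 = four_vector(0.14, 1.0, 0.0, 0.0)
p2 = four_vector(0.14, -0.5, 0.5, 0.0)
p3 = four_vector(0.14, -0.5, -0.5, 0.0)

# One point on the Dalitz plot: (m12^2, m23^2).
dalitz_x = invariant_mass_squared(p1, p2)
dalitz_y = invariant_mass_squared(p2, p3)
```

Repeating this for every decay in a sample and drawing the (dalitz_x, dalitz_y) pairs as a scatter plot gives the Dalitz plot.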

Introduction

A lot of data in the world has many attributes/variables. It is multivariate!

An example of this is particle physics data:

(From TMVA)

Introduction

So we have some nice methods of visualising bivariate (low dimensional) data. However, a lot of data is multivariate

Problem: visualising data in more than 3D is rather hard
  Curse of Dimensionality: it is difficult to find interesting projections of high dimensional data sets

Many data visualisations are available:
  More on scatter plots, e.g. scatter plot matrix
  Heat map (extension of scatter plot) and height map
  Polar charts
  RadViz and PolyViz
  Parallel Coordinates
  Grand Tour

Visualisation Overview

Scatter plots can be binned to become box plots, linked to create line graphs or fitted with curves

It is possible to show a scatter plot matrix, with every pairwise combination of variables shown as a 2D scatter plot in a grid

[Figure labels: ggobi; Box plot]

Visualisation Overview

Heat map (extension of scatter plot)
  The plot is divided into cells
  The density of data points in each cell is represented by, for example, colour

A similar technique is sometimes used in HEP
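
The binning step behind a heat map can be sketched with NumPy's `histogram2d`. The data here are randomly generated purely for illustration, and the 20 x 20 grid is an arbitrary choice; a plotting library would then map `colour_scale` onto a colour palette.

```python
import numpy as np

# Illustrative data only: two loosely correlated variables.
rng = np.random.default_rng(seed=1)
x = rng.normal(0.0, 1.0, size=1000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=1000)

# Divide the plot into cells and count the data points falling in each one.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(20, 20))

# Scale the densities to [0, 1] so they can be mapped onto a colour range.
colour_scale = counts / counts.max()
```

Using the same `counts` array as a height instead of a colour gives the height map of the next slide.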

Visualisation Overview

Height map, a further extension
  Use height instead of colour to represent the density
  Can use a combination of both
  Easy to understand
  Small cell size = continuous map

root.cern.ch

Visualisation Overview

Advantages of heat map and height map
  Intuitive visualisation
  Quickly able to get a feel for the data

Problems of heat map and height map
  Still limited to bivariate (2D) data
  Can be somewhat imprecise

Visualisation Overview

Polar Charts & Radar Charts
  Line charts wrapped around a circle
  Circular plot which uses polar coordinates
  Map data onto a 2D surface according to angle and radius

advsofteng.com

Visualisation Overview

RadViz: Radial Coordinate Visualization
  Each variable is set around the edge of the plot as a ‘dimensional anchor’
  Springs are attached between each data point and the anchors, with a strength proportional to the data point’s value for that dimension
  Each data point is drawn where the sum of the spring forces on it = 0
  A lot of work in this area, e.g. vectorised RadViz: separating out multiple clusters
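
The spring balance has a closed form: if the spring to anchor j has strength w_j, the forces cancel at the weighted mean of the anchor positions, p = sum(w_j * a_j) / sum(w_j). The sketch below assumes anchors spaced evenly on a unit circle and each variable rescaled to [0, 1]; the function name `radviz_positions` is hypothetical.

```python
import numpy as np

def radviz_positions(data):
    """Return the 2D RadViz position of every row of `data`."""
    data = np.asarray(data, dtype=float)
    n_dims = data.shape[1]

    # Rescale each variable to [0, 1] so the spring strengths are comparable.
    low = data.min(axis=0)
    span = data.max(axis=0) - low
    span[span == 0] = 1.0
    weights = (data - low) / span

    # Dimensional anchors, evenly spaced around the unit circle.
    angles = 2 * np.pi * np.arange(n_dims) / n_dims
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Equilibrium of the springs: weighted mean of the anchor positions.
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # rows with all-zero weights sit at the centre
    return weights @ anchors / totals

positions = radviz_positions([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
```

A point that is large in only one variable is pulled onto that variable's anchor, while a point with equal values in every variable sits at the centre, which is why the ordering of the anchors matters.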

Visualisation Overview

Advantages of RadViz
  Do not need to use projections to view the data
  Provides a global view of the data
  Can identify data relations and patterns

Disadvantages of RadViz
  Layout problem: the position of the dimensions is important
    There are some algorithms available which find the best positions
  Crowded plots: too much data occupying a small space

Visualisation Overview

PolyViz
  Similar to RadViz, but the dimensional anchors are lines, not points
  Gives the distribution of the data for each dimension
  But suffers from the same problems as RadViz

Parallel Coordinates

Dates back to 1885: Maurice d’Ocagne, as a method of geometric transformations

Invented by Alfred Inselberg in 1985

Developed by Edward Wegman in 1990 as a multivariate data analysis tool

Principle:

A      B
2.63   805.31

[Figure panels: 2D Cartesian; 2D Parallel]

Parallel Coordinates

Not limited to 2 or 3 dimensions

For example a 5D plot:

Can think of each axis as a 1 dimensional view with the maximum value at the top and the minimum value at the bottom

A      B        C       D        E
2.63   805.31   36.19   -13.54   0.05
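
The mapping from a data row to its polyline can be sketched as follows. Each axis is scaled independently so that its minimum sits at height 0 and its maximum at height 1. The function name `to_polylines` is hypothetical; the two extra rows come from the example table used on the next slide, since a single row cannot define an axis range.

```python
def to_polylines(rows):
    """Convert data rows into polylines: one (axis index, height) vertex per variable."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]

    polylines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # Height 0 is the bottom (minimum) of the axis, 1 is the top (maximum).
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        polylines.append(line)
    return polylines

rows = [
    (2.63, 805.31, 36.19, -13.54, 0.05),
    (3.28, 648.77, 97.16, 86.18, -0.08),
    (6.4, 1056.5, 55.46, 164.57, -1.78),
]
polylines = to_polylines(rows)
```

Joining the vertices of each polyline with straight segments, with the axes drawn as equally spaced vertical lines, gives the parallel coordinate plot.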

Parallel Coordinates

With more data

From this:

A      B        C       D        E
2.63   805.31   36.19   -13.54   0.05
3.28   648.77   97.16   86.18    -0.08
6.4    1056.5   55.46   164.57   -1.78

To this:

Parallel Coordinates

Data correlations between variables on adjacent axes

Positive correlation: the lines between the two axes are (roughly) parallel

Negative correlation: the lines intersect between the two axes

Uncorrelated: the lines cross at random

Parallel Coordinates

Examples of data correlations

Mostly positive correlation

Mostly uncorrelated

Parallel Coordinates

Dualities: Point ↔ Line

A point in 2D Cartesian is represented by a line in parallel coordinates

A line of points in 2D Cartesian is represented by a series of lines that intersect at a point in parallel coordinates
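
The point-line duality can be checked numerically. The sketch assumes axis A is drawn at horizontal position x = 0 and axis B at x = 1, both with the same vertical scale; the function names are hypothetical. A Cartesian point (a, b) becomes the line y = a + (b - a) x, and all points on the Cartesian line b = m a + c give lines that meet at x = 1 / (1 - m), y = c / (1 - m).

```python
def line_height(a, b, x):
    """Height at horizontal position x of the parallel-coordinate line for the point (a, b)."""
    return a + (b - a) * x

def dual_point(m, c):
    """Intersection point shared by the lines of all Cartesian points on b = m*a + c."""
    if m == 1:
        raise ValueError("slope 1: the lines are parallel and never intersect")
    return 1.0 / (1.0 - m), c / (1.0 - m)

# Points on the Cartesian line b = -a + 2 (a negative correlation).
x_dual, y_dual = dual_point(m=-1.0, c=2.0)
heights = [line_height(a, -a + 2.0, x_dual) for a in (0.0, 1.0, 3.0)]
```

For a negative slope the intersection lies between the two axes, and for slope 1 the lines are parallel, which matches the correlation patterns described earlier.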

Parallel Coordinates

Dualities: Rotation ↔ Translation

Rotating the line in the 2D Cartesian system moves the intersection point in the parallel coordinate plot

Moving a point in 2D Cartesian along an axis (e.g. along B=0) rotates the corresponding line in the parallel view

Parallel Coordinates

Dualities: Hyperbola ↔ Ellipse, Cusp ↔ Inflection

So an elliptical (e.g. circular) spread of data in a 2D Cartesian plot appears as a hyperbolic collection of lines in the parallel view

Parallel Coordinates

Cuts / Pruning
  These shapes show the effect of deleting data instances in the parallel coordinate view on the corresponding scatter plot
  All data instances which pass through the selected region, marked blue, are removed from the plots
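
A cut of this kind can be sketched as a filter on one axis. The assumption here is that the selected region is an interval [low, high] on a single axis; the function name `prune` and the chosen interval are illustrative, and the rows reuse the example table from the earlier slides.

```python
def prune(rows, axis, low, high):
    """Remove every data instance whose polyline crosses `axis` inside [low, high]."""
    return [row for row in rows if not (low <= row[axis] <= high)]

rows = [
    (2.63, 805.31, 36.19, -13.54, 0.05),
    (3.28, 648.77, 97.16, 86.18, -0.08),
    (6.4, 1056.5, 55.46, 164.57, -1.78),
]

# Select the region 50..100 on axis C (index 2) and delete what passes through it.
kept = prune(rows, axis=2, low=50.0, high=100.0)
```

Because the surviving rows are whole data instances, redrawing any other plot from `kept` shows the same cut there.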

Parallel Coordinates

Curved Line Interpolation
  There can be difficulties in tracing where straight lines go if multiple instances meet at the same point on an axis
  Curved lines can help the user trace the data instance path, i.e. the polyline

Parallel Coordinates

Parallel coordinate density plot: made by using the transparency of the data on the plot
  Can see the internal structure of the data
  Can view large amounts of data

High transparency

Low transparency

Parallel Coordinates

Advantages of parallel coordinates:
  Ability to see all the multivariate data on one plot
  Find interesting variables to investigate quickly
  Find interesting data to investigate quickly (e.g. outliers that skew the datasets)
  Seek patterns which may help classify the data

Problems with parallel coordinates:
  Ordering of the axes is important when looking at correlations between the variables
  May take a little while to get used to
  Suffers from overplotting, unless density plots are used

Grand Tour

Work on using computer graphics to view projections of high dimensional data started at SLAC in the 1970s and 1980s

Started with Mary Fisherkeller, John Tukey and Jerome Friedman et al. on the PRIM-9 system
  Picturing, Rotating, Inspecting and Masking in up to 9D
  Developed at the Graphics Interpretation Facility (GIF) at SLAC using particle physics datasets (bubble chamber)
  Rotate pairs of axes and view the result of the rotation in a 2D projection
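
Rotating a pair of axes and then viewing a 2D projection can be sketched as below. This is an illustration of the idea only, not PRIM-9's actual implementation; the function names are hypothetical.

```python
import numpy as np

def rotate_axis_pair(data, i, j, angle):
    """Rotate the data by `angle` radians in the plane spanned by axes i and j."""
    data = np.asarray(data, dtype=float)
    rotation = np.eye(data.shape[1])
    c, s = np.cos(angle), np.sin(angle)
    rotation[i, i] = c
    rotation[j, j] = c
    rotation[i, j] = -s
    rotation[j, i] = s
    # Each row is a data point, so apply the rotation matrix to every row.
    return data @ rotation.T

def project_2d(data, x_axis=0, y_axis=1):
    """View the data through two of its axes, ignoring the rest."""
    return np.asarray(data)[:, [x_axis, y_axis]]

# A single 4D point on axis 0, rotated a quarter turn towards axis 2.
point = np.array([[1.0, 0.0, 0.0, 0.0]])
rotated = rotate_axis_pair(point, i=0, j=2, angle=np.pi / 2)
view = project_2d(rotated, x_axis=0, y_axis=2)
```

All other axes are left untouched by the rotation, so structure hidden along axis j gradually comes into the view that shows axis i.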

This work led to the idea of Projection Pursuit
  Automatically finds interesting low-dimensional projections of multivariate data by optimising a projection index

Grand Tour

Invented by Daniel Asimov in 1985, and by Asimov and Andreas Buja in 1986

The grand tour shows high dimensional data rotations, in a similar fashion to a 3D data rotation
  But in a 3D rotation: rotate an object in space
  While in higher dimensions: rotate a lower dimensional projection in the high dimensional space

Rotating data in 2D is around a point

Rotating data in 3D is around a line (axis)

Rotating data in nD, where n>3, is around a hyperplane (strictly, an (n-2)-dimensional subspace)
  A hyperplane is a generalisation of a plane, where a plane is defined as a 2D subspace in 3D

Grand Tour

Goal: show a series of projections, originally 2D planes, of a higher dimensional space

The series of projections is smooth, giving the effect of a movie showing (close to) all the possible 2D projections of the data

Unlike Projection Pursuit where the result is an index, the result of a grand tour is the movie itself

Grand Tour

The conditions of a grand tour

The sequence of planes (projections) should:
  be dense in the space of all planes, so that it comes close to any 2D projection
  become dense rapidly, by using an efficient algorithm
  be uniformly distributed, so that it doesn't linger in one area
  be continuous, to be comprehensible
  be reconstructable, e.g. an interesting plane should be recovered easily after the tour

Grand Tour

The way the grand tour finds the interesting projections to show depends on which algorithm is used

The algorithm used has to obey the conditions mentioned on the previous slide
  It has to be a continuous, space-filling path through the set of 2D subspaces in p dimensional space

Various algorithms can be used
  Torus Winding Method
  Random curve
  Fractal algorithm
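
A sketch in the spirit of the torus winding method is given below. One rotation angle is assigned to every pair of axes, each angle advances at its own rate, and the first two columns of the combined rotation are used as the projection plane at time t. The choice of rates (square roots of distinct primes, so that no two angles stay in step) and all function names are assumptions of this sketch, not details from the talk.

```python
import numpy as np
from itertools import combinations

def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % q for q in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def torus_frame(p, t):
    """Return a p x 2 orthonormal frame: the projection plane of the tour at time t."""
    pairs = list(combinations(range(p), 2))
    rates = np.sqrt(first_primes(len(pairs)))  # assumed, mutually incommensurate rates
    q = np.eye(p)
    for (i, j), rate in zip(pairs, rates):
        c, s = np.cos(rate * t), np.sin(rate * t)
        g = np.eye(p)
        g[i, i] = g[j, j] = c
        g[i, j], g[j, i] = -s, s
        q = q @ g  # compose the rotations of every axis pair
    return q[:, :2]

def tour(data, times):
    """Project the data onto the tour's plane at each time: one movie frame per time."""
    data = np.asarray(data, dtype=float)
    return [data @ torus_frame(data.shape[1], t) for t in times]

frames = tour(np.random.default_rng(0).normal(size=(10, 5)), np.linspace(0.0, 1.0, 25))
```

Small time steps give neighbouring frames that differ only slightly, which is what makes the sequence play as a smooth movie.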

Grand Tour

Using parallel coordinates, a grand tour of the data does not necessarily have to be via 2D projections

Instead it would be a movie of p-n projections of a p-dimensional space; where n<p

Also possible to have guided tours, where the tour is dictated by the data
  By optimising the Projection Pursuit index
  Find interesting hyperplanes to cut along or classify with

You watch a movie showing a series of projections (2D or higher if using parallel coordinates) of the higher dimensional space and pause it when an interesting projection appears

Grand Tour

Advantages of the grand tour:
  Can lead to interesting hyperplanes in which a cut or classification can be made
  Easy to use due to the automation: you watch a movie!

Problems with the grand tour:
  Difficult to understand the high dimensional rotations
  Complicated underlying mathematics

General Techniques

Linked Plots:
  Give users the ability to explore their data and find patterns interactively → Exploratory Data Analysis

Brushing
  Highlighting data instances with a colour in one plot automatically updates the same instances in other plots with that colour

Pruning/Deleting Data
  Deleting data from one plot; the other plots are updated with the relevant data removed.
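
Linking works because every plot draws from the same per-instance state. A minimal sketch, assuming brushing is done by selecting an interval on one variable; the class `LinkedViews` and its methods are hypothetical names, and the data reuse the example table from the parallel coordinates slides.

```python
import numpy as np

class LinkedViews:
    """Shared brushing and pruning state for several plots of the same data."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.brushed = np.zeros(len(self.data), dtype=bool)  # highlighted instances
        self.alive = np.ones(len(self.data), dtype=bool)     # instances not yet deleted

    def brush(self, axis, low, high):
        """Highlight every instance whose value on `axis` lies in [low, high]."""
        values = self.data[:, axis]
        self.brushed |= (values >= low) & (values <= high)

    def prune_brushed(self):
        """Delete the highlighted instances from every plot."""
        self.alive &= ~self.brushed
        self.brushed[:] = False

    def view(self, x_axis, y_axis):
        """Return the points and highlight flags that one scatter plot should draw."""
        rows = self.alive
        return self.data[rows][:, [x_axis, y_axis]], self.brushed[rows]

views = LinkedViews([
    (2.63, 805.31, 36.19, -13.54, 0.05),
    (3.28, 648.77, 97.16, 86.18, -0.08),
    (6.4, 1056.5, 55.46, 164.57, -1.78),
])
views.brush(axis=0, low=3.0, high=4.0)       # brush in a plot showing variable A
points_cd, highlight_cd = views.view(2, 3)   # the C-D plot sees the same highlight
```

Calling `views.prune_brushed()` afterwards removes the brushed instance from every view at once.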

General Techniques

Alpha channel:
  Set the transparency of the data on the plot
  Ability to see all the data, not just those which were drawn last, by using RGBA colours
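
The effect of the alpha channel can be sketched with the standard "over" blending rule: each new layer contributes a fraction alpha of its colour and lets through a fraction (1 - alpha) of what is already there. The function names are hypothetical.

```python
def composite_over(background, colour, alpha):
    """Blend one RGB colour over a background with the given alpha (0 to 1)."""
    return tuple(alpha * c + (1.0 - alpha) * b for c, b in zip(colour, background))

def stacked_opacity(alpha, n_layers):
    """Total opacity after drawing n_layers overlapping items with the same alpha."""
    return 1.0 - (1.0 - alpha) ** n_layers

# Two half-transparent black lines drawn over a white background.
white, black = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)
pixel = composite_over(composite_over(white, black, 0.5), black, 0.5)
```

Since the opacity grows with the number of overlapping items, dense regions appear darker, which is what turns a transparent parallel coordinate plot into a density plot.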

Classifiers

Try to classify the data into subgroups
  Such as discriminating between a signal and a background

Usually done with algorithms such as:
  Neural Networks
  Support Vector Machines
  Linear Discriminant Analysis
  Boosted Decision Trees

The classifiers are often seen as black boxes, where the user does not know how they obtain their results

Classifiers

Can use visualisations to give hints as to which algorithms would work well on the data

In some cases the visualisation techniques can show what the classifier is doing

For example, using visualisations linked to a (supervised) SVM we can:
  Interpret results from the SVM and evaluate its quality
  Choose parameters/kernel functions to boost SVM efficiency (time etc.)

Cooperative method: select data points near the separating boundary; these points are used as support vectors by the RSVM algorithm.

Classifiers

A grand tour of parallel coordinates can give hints on which algorithm would work best on the dataset:
  If the data separates during a tour using the normal dimensions, this suggests an SVM would work well on the data
  If a separation occurs on a combination of dimensions, i.e. along a new variable created by multiplying variables together, this suggests a NN would work well on the data

Sometimes the visualisation may be more powerful than a standard classifier, e.g. an SVM is limited to 2 classes, while with parallel coordinates you can classify into as many classes as you want

Conclusions

Other data visualisations are available for data analysis!

Parallel coordinates is a powerful method of viewing multivariate data

Using the parallel coordinate density plot we can view large amounts of multivariate data and find internal structure

A grand tour of the data is another powerful high dimensional data visualisation – we can find interesting classifying hyperplanes

The coupling of grand tour and parallel coordinates gives us another powerful technique

There are links between algorithm classifiers and visualisations

Conclusions

Shouldn’t be like this:

We should make proper use of data visualisations
  Not just for the sake of it, but to actually help you analyse the data: Exploratory Data Analysis

Question whether a particular visualisation is giving us the most relevant information from the data

Bibliography and Information

A. Inselberg, "Parallel Coordinates: Visual Multidimensional Geometry and Its Applications", 2009

D. Cook and D. F. Swayne, "Interactive and Dynamic Graphics for Data Analysis", 2007

W. L. Martinez and A. R. Martinez, "Exploratory Data Analysis with MATLAB", 2005

Further information, including links to the movies shown, can be found at: http://www.hep.manchester.ac.uk/u/benjamin/iCSC.html