visual data mining


IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 7, NO. 1, JANUARY-MARCH 2002 100

Information Visualization and Visual Data Mining

Daniel A. Keim

Abstract—Never before in history has data been generated at such high volumes as it is today. Exploring and analyzing these vast volumes of data is becoming increasingly difficult. Information visualization and visual data mining can help to deal with the flood of information. The advantage of visual data exploration is that the user is directly involved in the data mining process. A large number of information visualization techniques have been developed over the last decade to support the exploration of large data sets. In this paper, we propose a classification of information visualization and visual data mining techniques which is based on the data type to be visualized, the visualization technique, and the interaction and distortion technique. We exemplify the classification using a few examples, most of them referring to techniques and systems presented in this special issue.

Keywords—Information Visualization, Visual Data Mining, Visual Data Exploration, Classification

I. Introduction

The progress made in hardware technology allows today’s computer systems to store very large amounts of data. Researchers from the University of California, Berkeley estimate that every year about 1 Exabyte (= 1 million Terabytes) of data is generated, of which a large portion is available in digital form. This means that in the next three years more data will be generated than in all of human history before. The data is often automatically recorded via sensors and monitoring systems. Even simple transactions of everyday life, such as paying by credit card or using the telephone, are typically recorded by computers. Usually, many parameters are recorded, resulting in data with a high dimensionality. The data in all of these areas is collected because people believe that it is a potential source of valuable information, providing a competitive advantage (at some point). Finding the valuable information hidden in the data, however, is a difficult task. With today’s data management systems, it is only possible to view quite small portions of the data. If the data is presented textually, the amount of data which can be displayed is in the range of some one hundred data items, but this is like a drop in the ocean when dealing with data sets containing millions of data items. Without a way to adequately explore the large amounts of data which have been collected because of their potential usefulness, the data becomes useless and the databases become data ‘dumps’.

Daniel A. Keim is currently with AT&T Shannon Research Labs, Florham Park, NJ, USA and the University of Constance, Germany. E-mail: ke [email protected] .com. This is an extended version of [6], portions of which are copyrighted by ACM.

Benefits of Visual Data Exploration

For data mining to be effective, it is important to include the human in the data exploration process and combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today’s computers. Visual data exploration aims at integrating the human in the data exploration process, applying human perceptual abilities to the large data sets available in today’s computer systems. The basic idea of visual data exploration is to present the data in some visual form, allowing the human to get insight into the data, draw conclusions, and directly interact with the data. Visual data mining techniques have proven to be of high value in exploratory data analysis and they also have a high potential for exploring large databases. Visual data exploration is especially useful when little is known about the data and the exploration goals are vague. Since the user is directly involved in the exploration process, the exploration goals can be shifted and adjusted whenever necessary.

The visual data exploration process can be seen as a hypothesis generation process: The visualizations of the data allow the user to gain insight into the data and come up with new hypotheses. The verification of the hypotheses can also be done via visual data exploration, but it may also be accomplished by automatic techniques from statistics or machine learning. In addition to the direct involvement of the user, the main advantages of visual data exploration over automatic data mining techniques from statistics or machine learning are:

• visual data exploration can easily deal with highly inhomogeneous and noisy data
• visual data exploration is intuitive and requires no understanding of complex mathematical or statistical algorithms or parameters.

As a result, visual data exploration usually allows a faster data exploration and often provides better results, especially in cases where automatic algorithms fail. In addition, visual data exploration techniques provide a much higher degree of confidence in the findings of the exploration. This leads to a high demand for visual exploration techniques and makes them indispensable in conjunction with automatic exploration techniques.

Visual Exploration Paradigm

Visual data exploration usually follows a three-step process: overview first, zoom and filter, and then details-on-demand (which has been called the Information Seeking Mantra [1]). First, the user needs to get an overview of the data. In the overview, the user identifies interesting patterns and focuses on one or more of them. To analyze the patterns, the user needs to drill down and access details of the data. Visualization technology may be used for all three steps of the data exploration process: Visualization techniques are useful for showing an overview of the data, allowing the user to identify interesting subsets. In this step, it is important to keep the overview visualization while focusing on the subset using another visualization technique. An alternative is to distort the overview visualization in order to focus on the interesting subsets. To further explore the interesting subsets, the user needs a drill-down capability in order to get the details about the data. Note that visualization technology not only provides the base visualization techniques for all three steps but also bridges the gaps between the steps.

II. Classification of Visual Data Mining Techniques

Information visualization focuses on data sets lacking inherent 2D or 3D semantics and therefore also lacking a standard mapping of the abstract data onto the physical screen space. There are a number of well-known techniques for visualizing such data sets, such as x-y plots, line plots, and histograms. These techniques are useful for data exploration but are limited to relatively small and low-dimensional data sets. In the last decade, a large number of novel information visualization techniques have been developed, allowing visualizations of multidimensional data sets without inherent two- or three-dimensional semantics. Good overviews of the approaches can be found in a number of recent books [2] [3] [4] [5]. The techniques can be classified based on three criteria (see figure 1) [6]: the data to be visualized, the visualization technique, and the interaction and distortion technique used.

The data type to be visualized [1] may be

• One-dimensional data, such as temporal data as used in ThemeRiver (see figure 2 in [7])
• Two-dimensional data, such as geographical maps as used in Polaris (see figure 3(c) in [8]) and MGV (see figure 9 in [9])
• Multidimensional data, such as relational tables as used in Polaris (see figure 6 in [8]) and the Scalable Framework (see figure 1 in [10])
• Text and hypertext, such as news articles and Web documents as used in ThemeRiver (see figure 2 in [7])
• Hierarchies and graphs, such as telephone calls and Web documents as used in MGV (see figure 13 in [9]) and the Scalable Framework (see figure 7 in [10])
• Algorithms and software, such as debugging operations as used in Polaris (see figure 7 in [8])

The visualization technique used may be classified into

• Standard 2D/3D displays, such as bar charts and x-y plots as used in Polaris (see figure 1 in [8])
• Geometrically transformed displays, such as landscapes and parallel coordinates as used in Scalable Framework (see figures 2 and 12 in [10])
• Icon-based displays, such as needle icons and star icons as used in MGV (see figures 5 and 6 in [9])
• Dense pixel displays, such as the recursive pattern and circle segments techniques (see figures 3 and 4) [11] and the graph sketches as used in MGV (see figure 4 in [9])
• Stacked displays, such as treemaps [12] [13] or dimensional stacking [14]

Fig. 1. Classification of Information Visualization Techniques

The third dimension of the classification is the interaction and distortion technique used. Interaction and distortion techniques allow users to directly interact with the visualizations. They may be classified into

• Interactive Projection as used in the GrandTour system [15]
• Interactive Filtering as used in Polaris (see figure 6 in [8])
• Interactive Zooming as used in MGV and the Scalable Framework (see figure 8 in [10])
• Interactive Distortion as used in the Scalable Framework (see figure 7 in [10])
• Interactive Linking and Brushing as used in Polaris (see figure 7 in [8]) and the Scalable Framework (see figures 12 and 14 in [10])

Note that the three dimensions of our classification - data type to be visualized, visualization technique, and interaction & distortion technique - can be assumed to be orthogonal. Orthogonality means that any of the visualization techniques may be used in conjunction with any of the interaction techniques as well as any of the distortion techniques for any data type. Note also that a specific system may be designed to support different data types and that it may use a combination of multiple visualization and interaction techniques.

III. Data Type to be Visualized

In information visualization, the data usually consists of a large number of records, each consisting of a number of variables or dimensions. Each record corresponds to an observation, measurement, transaction, etc. Examples are customer properties, e-commerce transactions, and physical experiments. The number of attributes can differ from data set to data set: One particular physical experiment, for example, can be described by five variables, while another may need hundreds of variables. We call the number of variables the dimensionality of the data set. Data sets may be one-dimensional, two-dimensional, or multidimensional, or may have more complex data types such as text/hypertext or hierarchies/graphs. Sometimes, a distinction is made between dense (or grid) dimensions and the dimensions which may have arbitrary values. Depending on the number of dimensions with arbitrary values, the data is sometimes also called univariate, bivariate, etc.

One-dimensional data

One-dimensional data usually has one dense dimension. A typical example of one-dimensional data is temporal data. Note that with each point in time, one or multiple data values may be associated. Examples are time series of stock prices (see figure 3 and figure 4 for an example) or the time series of news data used in the ThemeRiver examples (see figures 2-5 in [7]).

Two-dimensional data

Two-dimensional data has two distinct dimensions. A typical example is geographical data, where the two distinct dimensions are longitude and latitude. X-Y plots are a typical method for showing two-dimensional data, and maps are a special type of x-y plot for showing two-dimensional geographical data. Examples are the geographical maps used in Polaris (see figure 3(c) in [8]) and in MGV (see figure 9 in [9]). Although it seems easy to deal with temporal or geographic data, caution is advised. If the number of records to be visualized is large, temporal axes and maps quickly become cluttered and may not help to understand the data.

Multi-dimensional data

Many data sets consist of more than three attributes and therefore do not allow a simple visualization as 2-dimensional or 3-dimensional plots. Examples of multidimensional (or multivariate) data are tables from relational databases, which often have tens to hundreds of columns (or attributes). Since there is no simple mapping of the attributes to the two dimensions of the screen, more sophisticated visualization techniques are needed. An example of a technique which allows the visualization of multidimensional data is the Parallel Coordinate Technique [16] (see figure 2), which is also used in the Scalable Framework (see figure 12 in [10]). Parallel Coordinates display each multidimensional data item as a polygonal line which intersects the horizontal dimension axes at the position corresponding to the data value for the corresponding dimension.

Fig. 2. Parallel Coordinate Visualization © IEEE

Text & Hypertext

Not all data types can be described in terms of dimensionality. In the age of the world wide web, one important data type is text and hypertext as well as multimedia web page contents. These data types differ in that they cannot be easily described by numbers, and therefore most of the standard visualization techniques cannot be applied. In most cases, a transformation of the data into description vectors is necessary before visualization techniques can be used. An example of a simple transformation is word counting (see ThemeRiver [7]), which is often combined with a principal component analysis or multidimensional scaling (for example, see [17]).
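The word-counting transformation mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by ThemeRiver: the function name `description_vectors`, the lowercase tokenization rule, and the alphabetically sorted shared vocabulary are all assumptions made for the example.

```python
from collections import Counter
import re


def description_vectors(documents):
    """Turn each text into a word-count vector over a shared vocabulary.

    Hypothetical helper: tokenization (lowercase runs of a-z) and the
    sorted vocabulary order are assumptions, not taken from the paper.
    """
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # One vector component per distinct word seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocab, vecs = description_vectors(["data mining of data", "visual data"])
print(vocab)  # ['data', 'mining', 'of', 'visual']
print(vecs)   # [[2, 1, 1, 0], [1, 0, 0, 1]]
```

The resulting vectors are ordinary multidimensional records, so they could then be reduced (for example by principal component analysis) or passed to any of the multidimensional techniques discussed in this paper.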

Hierarchies & Graphs

Data records often have some relationship to other pieces of information. Graphs are widely used to represent such interdependencies. A graph consists of a set of objects, called nodes, and connections between these objects, called edges. Examples are the e-mail interrelationships among people, their shopping behavior, the file structure of the hard disk, or the hyperlinks in the world wide web. There are a number of specific visualization techniques that deal with hierarchical and graphical data. A good overview of hierarchical information visualization techniques can be found in [18], an overview of web visualization techniques in [19], and an overview book on all aspects related to graph drawing is [20].

Algorithms & Software

Another class of data are algorithms & software. Coping with large software projects is a challenge. The goal of visualization is to support software development by helping to understand algorithms, e.g. by showing the flow of information in a program, to enhance the understanding of written code, e.g. by representing the structure of thousands of source code lines as graphs, and to support the programmer in debugging the code, e.g. by visualizing errors. There are a large number of tools and systems which support these tasks. A good overview can be found in [21].

IV. Visualization Techniques

There is a large number of visualization techniques which can be used for visualizing the data. In addition to standard 2D/3D techniques such as x-y (x-y-z) plots, bar charts, line graphs, etc., there are a number of more sophisticated visualization techniques. The classes correspond to basic visualization principles which may be combined in order to implement a specific visualization system.

Fig. 3. Dense Pixel Displays: Recursive Pattern Technique © IEEE

Geometrically-Transformed Displays

Geometrically transformed display techniques aim at finding “interesting” transformations of multidimensional data sets. The class of geometric display techniques includes techniques from exploratory statistics such as scatterplot matrices [22] [23] and techniques which can be subsumed under the term “projection pursuit” [24]. Other geometric projection techniques include Prosection Views [25] [26], Hyperslice [27], and the well-known Parallel Coordinates visualization technique [16]. The parallel coordinate technique maps the k-dimensional space onto the two display dimensions by using k equidistant axes which are parallel to one of the display axes. The axes correspond to the dimensions and are linearly scaled from the minimum to the maximum value of the corresponding dimension. Each data item is presented as a polygonal line, intersecting each of the axes at the point which corresponds to the value of the considered dimension (see figure 2).
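The parallel coordinate mapping described above (k equidistant axes, each scaled linearly from the minimum to the maximum of its dimension) can be sketched as follows. The function name, the choice of vertical axes spaced along x, and the handling of constant dimensions are assumptions made for illustration only.

```python
def parallel_coordinate_polylines(records, width=1.0, height=1.0):
    """Return one polyline (list of (x, y) vertices) per record.

    Assumptions for this sketch: axes are vertical and spread evenly
    over [0, width]; a dimension whose values are all equal is drawn
    at mid-height.
    """
    k = len(records[0])
    mins = [min(r[d] for r in records) for d in range(k)]
    maxs = [max(r[d] for r in records) for d in range(k)]
    # k equidistant axis positions along the horizontal display dimension.
    xs = [width * d / (k - 1) if k > 1 else 0.0 for d in range(k)]
    polylines = []
    for r in records:
        line = []
        for d in range(k):
            span = maxs[d] - mins[d]
            # Linear scaling from the minimum to the maximum of dimension d.
            t = (r[d] - mins[d]) / span if span else 0.5
            line.append((xs[d], t * height))
        polylines.append(line)
    return polylines


lines = parallel_coordinate_polylines([(0, 10, 5), (10, 20, 5)])
print(lines[0])  # [(0.0, 0.0), (0.5, 0.0), (1.0, 0.5)]
print(lines[1])  # [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
```

Each returned vertex list can be drawn as a connected line with any plotting library; the sketch deliberately stops at the geometry.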

Iconic Displays

Another class of visual data exploration techniques are the iconic display techniques. The idea is to map the attribute values of a multi-dimensional data item to the features of an icon. Icons can be arbitrarily defined: They may be little faces [28], needle icons as used in MGV (see figure 5 in [9]), star icons [14], stick figure icons [29], color icons [30], [31], and TileBars [32]. The visualization is generated by mapping the attribute values of each data record to the features of the icons. In the case of the stick figure technique, for example, two dimensions are mapped to the display dimensions and the remaining dimensions are mapped to the angles and/or limb lengths of the stick figure icon. If the data items are relatively dense with respect to the two display dimensions, the resulting visualization presents texture patterns that vary according to the characteristics of the data and are therefore detectable by preattentive perception.

Fig. 4. Dense Pixel Displays: Circle Segments Technique © IEEE

Dense Pixel Displays

The basic idea of dense pixel techniques is to map each dimension value to a colored pixel and group the pixels belonging to each dimension into adjacent areas [11]. Since in general dense pixel displays use one pixel per data value, the techniques allow the visualization of the largest amount of data possible on current displays (up to about 1,000,000 data values). If each data value is represented by one pixel, the main question is how to arrange the pixels on the screen. Dense pixel techniques use different arrangements for different purposes. By arranging the pixels in an appropriate way, the resulting visualization provides detailed information on local correlations, dependencies, and hot spots.

Well-known examples are the recursive pattern technique [33] and the circle segments technique [34]. The recursive pattern technique is based on a generic recursive back-and-forth arrangement of the pixels and is particularly aimed at representing data sets with a natural order according to one attribute (e.g. time series data). The user may specify parameters for each recursion level, and thereby controls the arrangement of the pixels to form semantically meaningful substructures. The base element on each recursion level is a pattern of height h_i and width w_i as specified by the user. First, the elements correspond to single pixels which are arranged within a rectangle of height h_1 and width w_1 from left to right, then below backwards from right to left, then again forward from left to right, and so on. The same basic arrangement is done on all recursion levels, with the only difference that the basic elements which are arranged on level i are the patterns resulting from the level (i − 1) arrangements. In figure 3, an example recursive pattern visualization of financial data is shown. The visualization shows twenty years (January 1974 - April 1995) of daily prices of the 100 stocks contained in the Frankfurt Stock Index (FAZ).

Fig. 5. Dimensional Stacking Visualization of Oil Mining Data (used by permission of M. Ward, Worcester Polytechnic © IEEE)

The idea of the circle segments technique [34] is to represent the data in a circle which is divided into segments, one for each attribute. Within the segments, each attribute value is again visualized by a single colored pixel. The arrangement of the pixels starts at the center of the circle and continues to the outside by plotting on a line orthogonal to the segment halving line in a back-and-forth manner. The rationale of this approach is that close to the center all attributes are close to each other, enhancing the visual comparison of their values. Figure 4 shows an example circle segment visualization of the same data (50 stocks) as shown in figure 3.
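The back-and-forth arrangement of the recursive pattern technique can be sketched as a function from a data item's position in the natural order to a pixel coordinate. This is a simplified sketch under stated assumptions: the function names and the `levels` parameter (a list of (w_i, h_i) pairs, level 1 first) are hypothetical, and sub-patterns are placed without mirroring, which may differ from the arrangement in [33].

```python
def back_and_forth(index, width):
    """Place element `index` in rows of `width`, reversing every other row."""
    row, col = divmod(index, width)
    if row % 2 == 1:
        col = width - 1 - col  # odd rows run from right to left
    return col, row


def recursive_pattern_position(index, levels):
    """Map the index-th data value to an (x, y) pixel position.

    `levels` is a list of (w_i, h_i) pairs, level 1 first (an assumed
    parameterization). On level i, the elements being arranged are the
    complete patterns produced by level i - 1.
    """
    x = y = 0
    block_w = block_h = 1  # pixel size of one element on the current level
    for w, h in levels:
        index, local = divmod(index, w * h)
        col, row = back_and_forth(local, w)
        x += col * block_w
        y += row * block_h
        block_w *= w
        block_h *= h
    return x, y


# One level, 3 pixels wide and 2 high: left to right, then back.
print([recursive_pattern_position(i, [(3, 2)]) for i in range(6)])
# [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```

For a time series, `levels` might be chosen so that one level corresponds to days within a week and the next to weeks within a month; that choice is up to the user and is what gives the sub-patterns their meaning.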

Stacked Displays

Stacked display techniques are tailored to present data partitioned in a hierarchical fashion. In the case of multidimensional data, the data dimensions to be used for partitioning the data and building the hierarchy have to be selected appropriately. An example of a stacked display technique is Dimensional Stacking [35]. The basic idea is to embed one coordinate system inside another coordinate system, i.e. two attributes form the outer coordinate system, two other attributes are embedded into the outer coordinate system, and so on. The display is generated by dividing the outermost level coordinate system into rectangular cells, and within the cells the next two attributes are used to span the second level coordinate system. This process may be repeated one more time. The usefulness of the resulting visualization largely depends on the data distribution of the outer coordinates, and therefore the dimensions which are used for defining the outer coordinate system have to be selected carefully. A rule of thumb is to choose the most important dimensions first. A dimensional stacking visualization of oil mining data with longitude and latitude mapped to the outer x and y axes, as well as ore grade and depth mapped to the inner x and y axes, is shown in figure 5. Other examples of stacked display techniques include Worlds-within-Worlds [36], Treemap [12] [13], and Cone Trees [37].
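The embedding of one coordinate system inside another can be sketched for the two-level case. The sketch assumes that every attribute is first discretized into a fixed number of equal-width bins; the helper names `to_bin` and `stacked_cell` and the equal-width binning are illustrative assumptions, not details taken from [35].

```python
def to_bin(value, low, high, bins):
    """Discretize `value` from [low, high] into one of `bins` equal-width bins."""
    if high == low:
        return 0
    b = int((value - low) / (high - low) * bins)
    return min(max(b, 0), bins - 1)  # clamp so that `high` falls in the last bin


def stacked_cell(outer_x, outer_y, inner_x, inner_y, inner_bins_x, inner_bins_y):
    """Grid cell of a record under two-level dimensional stacking.

    The outer bins select a rectangular cell; the inner bins select a
    position inside that cell.
    """
    return (outer_x * inner_bins_x + inner_x,
            outer_y * inner_bins_y + inner_y)


# Example in the spirit of figure 5: longitude/latitude outside,
# ore grade/depth inside, each with 4 bins (values are made up).
lon = to_bin(5.0, 0.0, 10.0, 4)     # -> 2
lat = to_bin(3.0, 0.0, 10.0, 4)     # -> 1
grade = to_bin(0.1, 0.0, 1.0, 4)    # -> 0
depth = to_bin(90.0, 0.0, 100.0, 4) # -> 3
print(stacked_cell(lon, lat, grade, depth, 4, 4))  # (8, 7)
```

Because the outer attributes decide which large cell a record lands in, they dominate the layout, which is consistent with the rule of thumb above to assign the most important dimensions to the outer coordinate system.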

V. Interaction and Distortion Techniques

In addition to the visualization technique, for an effective data exploration it is necessary to use some interaction and distortion techniques. Interaction techniques allow the data analyst to directly interact with the visualizations and dynamically change the visualizations according to the exploration objectives, and they also make it possible to relate and combine multiple independent visualizations. Distortion techniques help in the data exploration process by providing means for focusing on details while preserving an overview of the data. The basic idea of distortion techniques is to show portions of the data with a high level of detail while others are shown with a lower level of detail. We distinguish between the terms dynamic and interactive depending on whether the changes to the visualizations are made automatically or manually (by direct user interaction).

Dynamic Projections

The basic idea of dynamic projections is to dynamically change the projections in order to explore a multidimensional data set. A classic example is the GrandTour system [15], which tries to show all interesting two-dimensional projections of a multi-dimensional data set as a series of scatter plots. Note that the number of possible projections is exponential in the number of dimensions, i.e. it is intractable for a large dimensionality. The sequence of projections shown can be random, manual, precomputed, or data driven. Systems supporting dynamic projection techniques are XGobi [38] [39], XLispStat [40], and ExplorN [41].

Interactive Filtering

In exploring large data sets, it is important to interactively partition the data set into segments and focus on interesting subsets. This can be done by a direct selection of the desired subset (browsing) or by a specification of properties of the desired subset (querying). Browsing is very difficult for very large data sets and querying often does not produce the desired results. Therefore a number of interaction techniques have been developed to improve interactive filtering in data exploration. An example of a tool which can be used for interactive filtering is Magic Lenses [42] [43]. The basic idea of Magic Lenses is to use a tool like a magnifying glass to support filtering the data directly in the visualization. The data under the magnifying glass is processed by the filter, and the result is displayed differently than the remaining data set. Magic Lenses show a modified view of the selected region, while the rest of the visualization remains unaffected. Note that several lenses with different filters may be used; if the filters overlap, all filters are combined. Other examples of interactive filtering techniques and tools are InfoCrystal [44], Dynamic Queries [45] [46] [47], and Polaris [8] (see figure 6 in [8] for an example).
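The lens behaviour described above (a filter applies only to the data under the lens, and overlapping lenses combine their filters) can be sketched as follows. Everything concrete here is an assumption for illustration: lenses are axis-aligned rectangles, filters are Python predicates, overlapping filters are combined with a logical AND, and the three state names are made up.

```python
def apply_lenses(points, lenses):
    """Classify each point by the lenses that cover it.

    `points` are dicts with "x" and "y" keys plus arbitrary attributes.
    `lenses` is a list of ((x0, y0, x1, y1), predicate) pairs.
    Combining overlapping filters by conjunction is an assumption.
    """
    states = []
    for p in points:
        covering = [pred for (x0, y0, x1, y1), pred in lenses
                    if x0 <= p["x"] <= x1 and y0 <= p["y"] <= y1]
        if not covering:
            states.append("unchanged")      # outside every lens
        elif all(pred(p) for pred in covering):
            states.append("highlighted")    # passes all covering filters
        else:
            states.append("filtered_out")   # under a lens but rejected
    return states


points = [
    {"x": 1, "y": 1, "price": 5},
    {"x": 2, "y": 2, "price": 50},
    {"x": 9, "y": 9, "price": 50},
]
lenses = [
    ((0, 0, 3, 3), lambda p: p["price"] > 10),
    ((1.5, 1.5, 4, 4), lambda p: p["price"] < 100),
]
print(apply_lenses(points, lenses))
# ['filtered_out', 'highlighted', 'unchanged']
```

A renderer would then draw the three states differently, leaving the points outside all lenses exactly as in the base visualization.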

Interactive Zooming

Zooming is a well-known technique which is widely used in a number of applications. In dealing with large amounts of data, it is important to present the data in a highly compressed form to provide an overview of the data but at the same time allow a variable display of the data at different resolutions. Zooming does not only mean displaying the data objects larger; it also means that the data representation automatically changes to present more details on higher zoom levels. The objects may, for example, be represented as single pixels on a low zoom level, as icons on an intermediate zoom level, and as labeled objects on a high resolution.

Fig. 6. Table Lenses (used by permission of R. Rao, Xerox PARC © ACM)

An interesting example applying the zooming idea to large tabular data sets is the TableLens approach [48]. Getting an overview of large tabular data sets is difficult if the data is displayed in textual form. The basic idea of TableLens is to represent each numerical value by a small bar. All bars have a one-pixel height and the lengths are determined by the attribute values. This means that the number of rows on the display can be nearly as high as the vertical resolution and the number of columns depends on the maximum width of the bars for each attribute. The initial view allows the user to detect patterns, correlations, and outliers in the data set. In order to explore a region of interest the user can zoom in, with the result that the affected rows (or columns) are displayed in more detail, possibly even in textual form. Figure 6 shows an example of a baseball database with a few rows being selected in full detail. Other examples of techniques and systems which use interactive zooming include PAD++ [49] [50] [51], IVEE/Spotfire [52], and DataSpace [53]. A comparison of fisheye and zooming techniques can be found in [54].
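The TableLens overview, in which each numerical value becomes a one-pixel-high bar whose length reflects the value, can be sketched as a simple scaling step. The function name and the decision to scale each column from its minimum to its maximum are assumptions; the TableLens system in [48] may scale differently.

```python
def table_lens_bars(rows, column_width_px):
    """Return a bar length in pixels for every cell of a numeric table.

    Assumption for this sketch: each column is scaled linearly from its
    own minimum (length 0) to its own maximum (full column width).
    """
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    bars = []
    for row in rows:
        bar_row = []
        for value, lo, hi in zip(row, mins, maxs):
            if hi > lo:
                bar_row.append(round((value - lo) / (hi - lo) * column_width_px))
            else:
                bar_row.append(column_width_px)  # constant column: full bars
        bars.append(bar_row)
    return bars


table = [(0, 100), (5, 300), (10, 200)]
print(table_lens_bars(table, 20))  # [[0, 0], [10, 20], [20, 10]]
```

Since each row needs only one pixel of height, a display with a vertical resolution of about 1,000 pixels can show on the order of 1,000 rows at once, which is the compression the overview relies on.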

Interactive Distortion

Interactive distortion techniques support the data exploration process by preserving an overview of the data during drill-down operations. The basic idea is to show portions of the data with a high level of detail while others are shown with a lower level of detail. Popular distortion techniques are hyperbolic and spherical distortions, which are often used on hierarchies or graphs but may also be applied to any other visualization technique. An example of spherical distortions is provided in the Scalable Framework paper (see figure 5 in [10]). An overview of distortion techniques is provided in [55] and [56]. Examples of distortion techniques include Bifocal Displays [57], Perspective Wall [58], Graphical Fisheye Views [59] [60], Hyperbolic Visualization [61] [62], and Hyperbox [63].
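As a rough illustration of how a distortion can enlarge the region around a focus while compressing the rest, the sketch below applies a one-dimensional fisheye transformation. The text above gives no formula, so the function G(x) = (d + 1)x / (dx + 1), a form commonly associated with graphical fisheye views, is an assumption here, as are the function name and the parameter names.

```python
def fisheye_1d(pos, focus, distortion=3.0, lo=0.0, hi=1.0):
    """Move `pos` away from `focus` so that the focus region is magnified.

    Assumed transformation: G(x) = (d + 1) * x / (d * x + 1), applied to
    the normalized distance x between the focus and the display boundary
    on the side of `pos`. `distortion` = 0 leaves positions unchanged.
    """
    if pos == focus:
        return pos
    boundary = hi if pos > focus else lo
    d_max = boundary - focus
    if d_max == 0:
        return pos
    x = (pos - focus) / d_max  # normalized distance in [0, 1]
    g = (distortion + 1) * x / (distortion * x + 1)
    return focus + g * d_max


# Points near the focus (0.5) are pushed apart; the boundaries stay fixed.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(fisheye_1d(p, focus=0.5), 3))
```

Applying the same function independently to the x and y coordinates of each displayed object gives a simple two-dimensional distortion that keeps the whole data set on screen, which is the overview-preserving property described above.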

Interactive Linking and Brushing

There are many possibilities to visualize multidimensional data, but all of them have some strengths and some weaknesses. The idea of linking and brushing is to combine different visualization methods to overcome the shortcomings of single techniques. Scatterplots of different projections, for example, may be combined by coloring and linking subsets of points in all projections. In a similar fashion, linking and brushing can be applied to visualizations generated by all visualization techniques described above. As a result, the brushed points are highlighted in all visualizations, making it possible to detect dependencies and correlations. Interactive changes made in one visualization are automatically reflected in the other visualizations. Note that connecting multiple visualizations through interactive linking and brushing provides more information than considering the component visualizations independently.

Typical examples of visualization techniques which are combined by linking and brushing are multiple scatterplots, bar charts, parallel coordinates, pixel displays, and maps. Most interactive data exploration systems allow some form of linking and brushing. Examples are Polaris (see figure 7 in [8]) and the Scalable Framework (see figures 12 and 14 in [10]). Other tools and systems include S Plus [64], XGobi [38] [65], Xmdv [14], and DataDesk [66] [67].
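The mechanism behind linking and brushing can be sketched as a selection that several views share. The classes `LinkedViews` and `View` below are hypothetical helpers written for this illustration; they do not correspond to the API of any of the systems named above, and rendering is left out entirely.

```python
class View:
    """A stand-in for one visualization, e.g. a scatterplot of two attributes."""

    def __init__(self, name, x_attr, y_attr):
        self.name = name
        self.x_attr = x_attr
        self.y_attr = y_attr
        self.highlighted = set()

    def update(self, selected):
        # A real view would redraw here; this sketch only stores the indices.
        self.highlighted = set(selected)


class LinkedViews:
    """Holds the records and propagates every brush to all registered views."""

    def __init__(self, records):
        self.records = records
        self.selected = set()
        self.views = []

    def add_view(self, view):
        self.views.append(view)

    def brush(self, attribute, low, high):
        """Select all records whose attribute lies in [low, high]."""
        self.selected = {
            i for i, rec in enumerate(self.records)
            if low <= rec[attribute] <= high
        }
        for view in self.views:
            view.update(self.selected)


records = [
    {"age": 25, "income": 30, "height": 170},
    {"age": 40, "income": 80, "height": 180},
    {"age": 60, "income": 50, "height": 165},
]
linked = LinkedViews(records)
plot_a = View("age vs. income", "age", "income")
plot_b = View("height vs. income", "height", "income")
linked.add_view(plot_a)
linked.add_view(plot_b)

# Brushing on "age" also highlights the same records in the view without "age".
linked.brush("age", 30, 65)
```

Because the selection is stored as record indices rather than as screen coordinates, any view of the same data set can highlight the brushed records, whatever projection it shows.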

VI. Conclusion

The exploration of large data sets is an important but difficult problem. Information visualization techniques may help to solve the problem. Visual data exploration has a high potential and many applications such as fraud detection and data mining will use information visualization technology for an improved data analysis.

Future work will involve the tight integration of visualization techniques with traditional techniques from such disciplines as statistics, machine learning, operations research, and simulation. Integration of visualization techniques and these more established methods would combine fast automatic data mining algorithms with the intuitive power of the human mind, improving the quality and speed of the visual data mining process. Visual data mining techniques also need to be tightly integrated with the systems used to manage the vast amounts of relational and semistructured information, including database management and data warehouse systems. The ultimate goal is to bring the power of visualization technology to every desktop to allow a better, faster and more intuitive exploration of very large data resources. This will not only be valuable in an economic sense but will also stimulate and delight the user.

References

[1] B. Shneiderman, “The eyes have it: A task by data type taxonomy for information visualizations,” in Visual Languages, 1996.

[2] S. Card, J. Mackinlay, and B. Shneiderman, Readings in Information Visualization, Morgan Kaufmann, 1999.

[3] C. Ware, Information Visualization: Perception for Design, Morgan Kaufmann, 2000.

[4] B. Spence, Information Visualization, Pearson Education Higher Education publishers, UK, 2000.

[5] H. Schumann and W. Müller, Visualisierung: Grundlagen und allgemeine Methoden, Springer, 2000.

[6] D. Keim, “Visual exploration of large databases,” Communications of the ACM, vol. 44, no. 8, pp. 38–44, 2001.

[7] L. Nowell, S. Havre, B. Hetzler, and P. Whitney, “ThemeRiver: Visualizing thematic changes in large document collections,” Transactions on Visualization and Computer Graphics, 2001.

[8] D. Tang, C. Stolte, and P. Hanrahan, “Polaris: A system for query, analysis and visualization of multi-dimensional relational databases,” Transactions on Visualization and Computer Graphics, 2001.

[9] J. Abello and J. Korn, “MGV: A system for visualizing massive multi-digraphs,” Transactions on Visualization and Computer Graphics, 2001.

[10] N. Lopez, M. Kreuseler, and H. Schumann, “A scalable framework for information visualization,” Transactions on Visualization and Computer Graphics, 2001.

[11] D. Keim, “Designing pixel-oriented visualization techniques: Theory and applications,” Transactions on Visualization and Computer Graphics, vol. 6, no. 1, pp. 59–78, Jan–Mar 2000.

[12] B. Shneiderman, “Tree visualization with treemaps: A 2D space-filling approach,” ACM Transactions on Graphics, vol. 11, no. 1, pp. 92–99, 1992.

[13] B. Johnson and B. Shneiderman, “Treemaps: A space-filling approach to the visualization of hierarchical information,” in Proc. Visualization ’91 Conf, 1991, pp. 284–291.

[14] M. O. Ward, “XmdvTool: Integrating multiple methods for visualizing multivariate data,” in Proc. Visualization 94, Washington, DC, 1994, pp. 326–336.

[15] D. Asimov, “The grand tour: A tool for viewing multidimensional data,” SIAM Journal of Science & Stat. Comp., vol. 6, pp. 128–143, 1985.

[16] A. Inselberg and B. Dimsdale, “Parallel coordinates: A tool for visualizing multi-dimensional geometry,” in Proc. Visualization 90, San Francisco, CA, 1990, pp. 361–370.

[17] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow, “Visualizing the non-visual: Spatial analysis and interaction with information from text documents,” in Proc. Symp. on Information Visualization, Atlanta, GA, 1995, pp. 51–58.

[18] C. Chen, Information Visualisation and Virtual Environments, Springer-Verlag, London, 1999.

[19] M. Dodge, “Web visualization,” http://www.geog.ucl.ac.uk/casa/martin/geography of cyberspace.html, Oct 2001.

[20] G. D. Battista, P. Eades, R. Tamassia, and I. G. Tollis, Graph Drawing, Prentice Hall, 1999.

[21] J. Trilk, “Software visualization,” http://wwwbroy.informatik.tu-muenchen.de/~trilk/sv.html, Oct 2001.

[22] D. F. Andrews, “Plots of high-dimensional data,” Biometrics, vol. 29, pp. 125–136, 1972.

[23] W. S. Cleveland, Visualizing Data, AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit NJ, 1993.

[24] P. J. Huber, “Projection pursuit,” The Annals of Statistics, vol. 13, no. 2, pp. 435–474, 1985.

[25] G. W. Furnas and A. Buja, “Prosection views: Dimensional inference through sections and projections,” Journal of Computational and Graphical Statistics, vol. 3, no. 4, pp. 323–353, 1994.

[26] R. Spence, L. Tweedie, H. Dawkes, and H. Su, “Visualization for functional design,” in Proc. Int. Symp. on Information Visualization (InfoVis ’95), 1995, pp. 4–10.

[27] J. J. van Wijk and R. D. van Liere, “Hyperslice,” in Proc. Visualization ’93, San Jose, CA, 1993, pp. 119–125.

[28] H. Chernoff, “The use of faces to represent points in k-dimensional space graphically,” Journal Amer. Statistical Association, vol. 68, pp. 361–368, 1973.

[29] R. M. Pickett and G. G. Grinstein, “Iconographic displays for visualizing multidimensional data,” in Proc. IEEE Conf. on Systems, Man and Cybernetics, IEEE Press, Piscataway, NJ, 1988, pp. 514–519.

[30] H. Levkowitz, “Color icons: Merging color and texture perception for integrated visualization of multiple parameters,” in Proc. Visualization 91, San Diego, CA, 1991, pp. 22–25.

[31] D. A. Keim and H.-P. Kriegel, “VisDB: Database exploration using multidimensional visualization,” Computer Graphics & Applications, vol. 6, pp. 40–49, Sept. 1994.

[32] M. Hearst, “TileBars: Visualization of term distribution information in full text information access,” in Proc. of ACM Human Factors in Computing Systems Conf. (CHI ’95), 1995, pp. 59–66.

[33] D. A. Keim, H.-P. Kriegel, and M. Ankerst, “Recursive pattern: A technique for visualizing very large amounts of data,” in Proc. Visualization 95, Atlanta, GA, 1995, pp. 279–286.

[34] M. Ankerst, D. A. Keim, and H.-P. Kriegel, “Circle segments: A technique for visually exploring large multidimensional data sets,” in Proc. Visualization 96, Hot Topic Session, San Francisco, CA, 1996.

[35] J. LeBlanc, M. O. Ward, and N. Wittels, “Exploring n-dimensional databases,” in Proc. Visualization ’90, San Francisco, CA, 1990, pp. 230–239.

[36] S. Feiner and C. Beshers, “Visualizing n-dimensional virtual worlds with n-vision,” Computer Graphics, vol. 24, no. 2, pp. 37–38, 1990.

[37] G. G. Robertson, J. D. Mackinlay, and S. K. Card, “Cone trees: Animated 3D visualizations of hierarchical information,” in Proc. Human Factors in Computing Systems CHI 91 Conf., New Orleans, LA, 1991, pp. 189–194.

[38] D. F. Swayne, D. Cook, and A. Buja, User’s Manual for XGobi: A Dynamic Graphics Program for Data Analysis, Bellcore Technical Memorandum, 1992.

[39] A. Buja, D. F. Swayne, and D. Cook, “Interactive high-dimensional data visualization,” Journal of Computational and Graphical Statistics, vol. 5, no. 1, pp. 78–99, 1996.

[40] L. Tierney, LispStat: An object-oriented environment for statistical computing and dynamic graphics, Wiley, New York, NY, 1991.

[41] D. B. Carr, E. J. Wegman, and Q. Luo, “ExplorN: Design considerations past and present,” Technical Report No. 129, Center for Computational Statistics, George Mason University, 1996.

[42] E. A. Bier, M. C. Stone, K. Pier, W. Buxton, and T. DeRose, “Toolglass and magic lenses: The see-through interface,” in Proc. SIGGRAPH ’93, Anaheim, CA, 1993, pp. 73–80.

[43] K. Fishkin and M. C. Stone, “Enhanced dynamic queries via movable filters,” in Proc. Human Factors in Computing Systems CHI ’95 Conf., Denver, CO, 1995, pp. 415–420.

[44] A. Spoerri, “InfoCrystal: A visual tool for information retrieval,” in Proc. Visualization ’93, San Jose, CA, 1993, pp. 150–157.

[45] C. Ahlberg and B. Shneiderman, “Visual information seeking: Tight coupling of dynamic query filters with starfield displays,” in Proc. Human Factors in Computing Systems CHI ’94 Conf., Boston, MA, 1994, pp. 313–317.

[46] S. G. Eick, “Data visualization sliders,” in Proc. ACM UIST, 1994, pp. 119–120.

[47] J. Goldstein and S. F. Roth, “Using aggregation and dynamic queries for exploring large data sets,” in Proc. Human Factors in Computing Systems CHI ’94 Conf., Boston, MA, 1994, pp. 23–29.

[48] R. Rao and S. K. Card, “The table lens: Merging graphical and symbolic representation in an interactive focus+context visualization for tabular information,” in Proc. Human Factors in Computing Systems CHI 94 Conf., Boston, MA, 1994, pp. 318–322.

[49] K. Perlin and D. Fox, “Pad: An alternative approach to the computer interface,” in Proc. SIGGRAPH, Anaheim, CA, 1993, pp. 57–64.

[50] B. Bederson, “Pad++: Advances in multiscale interfaces,” in Proc. Human Factors in Computing Systems CHI ’94 Conf., Boston, MA, 1994, p. 315.

[51] B. B. Bederson and J. D. Hollan, “Pad++: A zooming graphical interface for exploring alternate interface physics,” in Proc. UIST, 1994, pp. 17–26.

[52] C. Ahlberg and E. Wistrand, “IVEE: An information visualization and exploration environment,” in Proc. Int. Symp. on Information Visualization, Atlanta, GA, 1995, pp. 66–73.

[53] V. Anupam, S. Dar, T. Leibfried, and E. Petajan, “DataSpace: 3D visualization of large databases,” in Proc. Int. Symp. on Information Visualization, Atlanta, GA, 1995, pp. 82–88.

[54] D. Schaffer, Z. Zuo, L. Bartram, J. Dill, S. Dubs, S. Greenberg, and Roseman, “Comparing fisheye and full-zoom techniques for navigation of hierarchically clustered networks,” in Proc. Graphics Interface (GI ’93), Toronto, Ontario, 1993, in: Canadian Information Processing Soc., Toronto, Ontario, Graphics Press, Cheshire, CT, 1993, pp. 87–96.

[55] Y. Leung and M. Apperley, “A review and taxonomy of distortion-oriented presentation techniques,” in Proc. Human Factors in Computing Systems CHI ’94 Conf., Boston, MA, 1994, pp. 126–160.

[56] M. S. T. Carpendale, D. J. Cowperthwaite, and F. D. Fracchia, “IEEE Computer Graphics and Applications, special issue on information visualization,” IEEE Journal Press, vol. 17, no. 4, pp. 42–51, July 1997.

[57] R. Spence and M. Apperley, “Data base navigation: An office environment for the professional,” Behaviour and Information Technology, vol. 1, no. 1, pp. 43–54, 1982.

[58] J. D. Mackinlay, G. G. Robertson, and S. K. Card, “The perspective wall: Detail and context smoothly integrated,” in Proc. Human Factors in Computing Systems CHI ’91 Conf., New Orleans, LA, 1991, pp. 173–179.

[59] G. Furnas, “Generalized fisheye views,” in Proc. Human Factors in Computing Systems CHI 86 Conf., Boston, MA, 1986, pp. 18–23.

[60] M. Sarkar and M. Brown, “Graphical fisheye views,” Communications of the ACM, vol. 37, no. 12, pp. 73–84, 1994.

[61] J. Lamping, R. Rao, and P. Pirolli, “A focus + context technique based on hyperbolic geometry for visualizing large hierarchies,” in Proc. Human Factors in Computing Systems CHI 95 Conf., 1995, pp. 401–408.

[62] T. Munzner and P. Burchard, “Visualizing the structure of the world wide web in 3D hyperbolic space,” in Proc. VRML ’95 Symp, San Diego, CA, 1995, pp. 33–38.

[63] B. Alpern and L. Carter, “Hyperbox,” in Proc. Visualization ’91, San Diego, CA, 1991, pp. 133–139.

[64] R. Becker, J. M. Chambers, and A. R. Wilks, The New S Language, Wadsworth & Brooks/Cole Advanced Books and Software, Pacific Grove, CA, 1988.

[65] R. A. Becker, W. S. Cleveland, and M.-J. Shyu, “The visual design and control of trellis display,” Journal of Computational and Graphical Statistics, vol. 5, no. 2, pp. 123–155, 1996.

[66] P. F. Velleman, Data Desk 4.2: Data Description, Data Desk, Ithaca, NY, 1992.

[67] A. Wilhelm, A. R. Unwin, and M. Theus, “Software for interactive statistical graphics - a review,” in Proc. Int. Softstat 95 Conf., Heidelberg, Germany, 1995.

Biography

DANIEL A. KEIM is working in the area of information visualization and data mining. In the field of information visualization, he developed several novel techniques which use visualization technology for the purpose of exploring large databases. He has published extensively on information visualization and data mining; he has given tutorials on related issues at several large conferences including Visualization, SIGMOD, VLDB, and KDD; he has been program co-chair of the IEEE Information Visualization Symposia in 1999 and 2000; he is program co-chair of the ACM SIGKDD conference in 2002; and he is an editor of TVCG and the Information Visualization Journal.

Daniel Keim received his diploma (equivalent to an MS degree) in Computer Science from the University of Dortmund in 1990 and his Ph.D. in Computer Science from the University of Munich in 1994. He has been assistant professor at the CS department of the University of Munich, associate professor at the CS department of the Martin-Luther-University Halle, and full professor at the CS department of the University of Constance. Currently, he is on leave from the University of Constance, working at AT&T Shannon Research Labs, Florham Park, NJ, USA.