Visual Data Mining in Large Geospatial Point Sets

Daniel A. Keim, Christian Panse, and Mike Sips, University of Constance, Germany
Stephen C. North, AT&T Labs

Visual Analytics
September/October 2004. Published by the IEEE Computer Society. 0272-1716/04/$20.00 © 2004 IEEE

The Wide Area Layout Data Observer (Waldo) complements uniquely human abilities with current computing technologies to find location-related patterns in large geospatial data sets.

Many existing and emergent applications collect and reference data by geospatial location. Credit card transactions, for example, include addresses of both the place of purchase and the purchaser; telephone records include addresses and sometimes cell phone zones and geocoordinates; and population census tables contain addresses and other location information. These data sets are sources of potentially valuable information that can give their holders a competitive advantage. Government agencies also publish a wealth of statistical information that data analysts can apply to key problems in public health and safety or combine with proprietary data. The difficulty lies in finding the details that reveal the fine structures hidden in this data.

Many approaches to analyzing such data exist, for example, statistical models, clustering, and association rules. Effective spatial data mining, however, must focus on finding location-related patterns and relationships. Interactive visual data exploration is important to spatial data mining [1,2]. The Wide Area Layout Data Observer (Waldo) involves the analyst in data exploration, thus complementing human perceptual skills, imagination, and flexibility with current computer systems to process large volumes of data and generate sophisticated displays. In this setting, the analyst directly interacts with the data, solving problems by applying domain expertise and general background knowledge to form and validate new hypotheses.

In recent decades, visual data-mining techniques have proven valuable in exploratory data analysis, and they have strong potential in the exploration of large databases [3]. Visual data exploration is particularly useful when little is known about the data and when goals are indistinct. Because users directly guide the exploration process, they can easily shift or adjust the goals as needed. However, analyzing the torrent of spatial details available in these large (terabytes and beyond) databases to extract interesting knowledge or general characteristics is almost impossible for users. Thus we need new, more scalable visual techniques.

Visualizing geospatial data

Geospatial data describe real-world objects or phenomena with specific locations and associated statistical values or attributes. By considering just one statistical attribute at a time, we can interpret geospatial data sets as points in a 3D data space, that is, two geographical dimensions and a statistical dimension. Because real-world data set distributions are often nonuniform, the data points form readily identifiable 3D point clouds. Figure 1a shows a household income distribution in a 3D data space spanned by longitude, latitude, and median household income. Figure 1b shows an xy-plot of the 3D point clouds.

Visualizing large geospatial data sets involves mapping the two geographical dimensions to screen coordinates and encoding the statistical value by color. (Keim et al. give a good overview of visual data-mining techniques for geospatial data sets [4].) The difficulty is finding a useful mapping function f.

When using a simple dot plot mapping function f, developers encounter two important visualization challenges:

• Overplotting obscures data points in densely populated areas; however, sparsely populated areas waste space while conveying scant detailed information.
• Small clusters are difficult to find. In general, they aren't noticeable enough in conventional maps and are often occluded by large clusters.

These difficulties lead to three important visual exploration goals for geospatial data, which we express as mapping constraints: no overlap, position preservation, and clustering. (The "Visual Exploration Goals" sidebar describes these goals.)



IEEE Computer Graphics and Applications 37

Visual Exploration Goals

We define the visualization of georeferenced data as a mapping of input data points, with their associated positions and statistical attributes, to unique positions on an output map. The mapping function must satisfy three main constraints: no overlap, position preservation, and clustering. We formally define these constraints as follows.

Let $A = \{a_0, \ldots, a_{N-1}\}$ be the set of input points, where $a_i = (a_i^x, a_i^y)$ is each point's original position and $S_1(a_i), \ldots, S_k(a_i)$ are its associated statistical parameters. We assume $A$ is large, so we will likely have many data points $i$ and $j$ for which the original positions are very close or even identical, that is, $a_i \approx a_j$. We define the data display space (screen or window space) $DS \subset \mathbb{Z}^2$ as $DS = \{0, \ldots, x_{max} - 1\} \times \{0, \ldots, y_{max} - 1\}$, where $x_{max}$ and $y_{max}$ are the display bounds. The algorithm attempts to determine a mapping function $f$ of the original data set to a solution set $B = \{b_0, \ldots, b_{N-1}\}$, with $0 \le b_i^x \le x_{max} - 1$ and $0 \le b_i^y \le y_{max} - 1$, such that $f : A \to B$, $f(a_i) = b_i \;\; \forall i \in \{0, \ldots, N-1\}$; that is, $f$ determines the new position $b_i$ of $a_i$.

Figure A shows graphical representations of the mapping constraints. Visual exploration techniques aim to balance the position preservation and clustering constraints under the condition that the no-overlap constraint is always satisfied.

No overlap. All data points are individually visible, with each assigned a unique pixel position (Figure A1). We express this formally as $i \ne j \Rightarrow b_i \ne b_j \;\; \forall i, j \in \{0, \ldots, N-1\}$.

Position preservation. New positions should be as close as possible to the original positions. We measure this constraint using the points' absolute distance from their original positions (Figure A2) or their relative distance from each other (Figure A3). This gives us the following optimization goals:

Figure A. Problem definition constraints: (1) no overlap, (2) absolute position preservation, (3) relative position preservation, and (4) clustering constraint. Visual exploration aims to find a good tradeoff between 2, 3, and 4 such that 1 is always satisfied.

(Sidebar continues below.)

Figure 1. Plotting of geospatial data points: (a) in a 3D data space spanned by longitude, latitude, and a statistical attribute (income in this example), where 3D point clouds arise even in small real-world data sets (1 percent of the data); and (b) as an xy-plot of the 3D point clouds. The goal is to display all 3D point clouds in a single continuous display without overlap. (Example shows a small sample, 1 percent, from the US Census; see http://www.census.gov, 1999 New York Household Income data set.) [Plots omitted; axes are longitude, latitude, and income, with x, y, and income scaled from 0.0 to 1.0.]

We bring visualization to data analysts' desktops to encourage more intuitive and productive exploration of large geospatial data resources. High-resolution pixelated displays are increasingly available in both wall-sized and desktop units. Although extra display pixels let us show more data, this technology alone doesn't eliminate overplotting. Figure 2 shows the resulting visualization when zooming in. Figure 3 shows how the degree of overlap (that is, the degree to which data points share a pixel position) varies with screen resolution. Although the number of points assigned to the same position decreases as resolution increases, even large, high-resolution displays, such as the display in Figure 4a, can't achieve zero overlap and could lose potentially interesting patterns.

Another common solution is to aggregate all data points in each region and show only a summary. With this approach, the visualization reflects all the data points but doesn't show all available information in the dense regions. (See the "Related Work" sidebar for a discussion of other approaches.)
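To make the degree of overlap concrete, the following minimal Python sketch counts how many points would land on a pixel that an earlier point already occupies. It is an illustration only: the function name `degree_of_overlap`, the normalization of coordinates to [0, 1), and the exact formula (one minus the share of distinct pixels) are assumptions chosen to match the description above, not code from Waldo.

```python
def degree_of_overlap(points, width, height):
    """Fraction of points that fall on a pixel already used by an earlier point.

    points: iterable of (x, y) pairs, assumed normalized to [0, 1).
    width, height: screen resolution in pixels.
    """
    occupied = set()
    total = 0
    for x, y in points:
        # Simple dot-plot mapping: scale to the screen and truncate to a pixel.
        px = min(int(x * width), width - 1)
        py = min(int(y * height), height - 1)
        occupied.add((px, py))
        total += 1
    if total == 0:
        return 0.0
    return 1.0 - len(occupied) / total


# Three nearby points and one far away: a coarse screen merges the first three.
sample = [(0.10, 0.10), (0.11, 0.10), (0.12, 0.11), (0.90, 0.90)]
for resolution in (4, 16, 256):
    print(resolution, degree_of_overlap(sample, resolution, resolution))
```

In this toy sample the overlap only disappears at the highest resolution; with real census data, many points share identical coordinates, so no resolution removes the overlap entirely.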

Wide area layout data observer

In addition to a basic visualization technique, successful data exploration often involves adjusting the visual representation of data to suit the task at hand. Waldo is a pixel-based visual exploration system combining several relevant interaction techniques. Such techniques let data analysts directly interact with the geospatial data visualizations, dynamically changing them according to the exploration objectives. Waldo also lets analysts relate and combine multiple visualizations.

Figure 2. Zooming solves neither the overplotting nor the pixel coherence problems. (a) Overplotting on a conventional map with interactive zooming, showing only a small sample of the data. (b) PixelMap showing 100 percent of the data without overplotting. (c) Household income histogram. Green represents low income and red represents high income.

Visual Exploration Goals (continued)

• absolute position preservation,

$$\sum_{i=0}^{N-1} d(a_i, b_i) \to \min$$

• relative position preservation,

$$\sum_{i=0}^{N-1} \; \sum_{j=0,\, j \ne i}^{N-1} \left( d(b_i, b_j) - d(a_i, a_j) \right)^2 \to \min$$

The application determines the weighting between absolute and relative position preservation to be used.

We define the distance function $d$ by an $L_m$ norm ($m = 1$ or $2$):

$$d(b_i, b_j) = \sqrt[m]{\left| b_i^x - b_j^x \right|^m + \left| b_i^y - b_j^y \right|^m}$$

Clustering. The clustering constraint involves repositioning the data points so those with high similarity in a statistical attribute $S_i$ (where $i \in \{1, \ldots, k\}$) are near each other (Figure A4). (We assume clustering depends on the statistical attribute $S \in \{S_1, \ldots, S_k\}$.) In other words, other points in the neighborhood of any given data point should have similar values, yielding pixel coherence. To formalize this constraint, we define the neighborhood $NH$ of a data point $b_i$ and a distance function $d_S$ on the statistical attribute $S$:

$$\sum_{i=0}^{N-1} \; \sum_{b_j \in NH(b_i)} d_S\left( S(b_i), S(b_j) \right) \to \min$$

This expression sums up all the differences in $S$ between each point and its neighboring points. We define the neighborhood as $NH(b_i) = \{ b_j \mid d(b_i, b_j) < \varepsilon \}$.

Because $S_i$ can have a highly nonuniform distribution, applying nonlinear scaling to $S$ before computing distances $d_S$ might also be necessary. In addition, in some situations many similar points might be in some regions of the map, while only a few are in others. In this case, varying $\varepsilon$ in the region under consideration might be helpful.
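The constraints of the "Visual Exploration Goals" sidebar can be evaluated directly for a candidate layout. The sketch below is a hedged illustration rather than part of Waldo: the function names are hypothetical, $d_S$ is assumed to be the absolute difference of attribute values, and the relative term uses the squared change in pairwise distance as one plausible reading of the relative position preservation goal.

```python
from itertools import combinations


def lm_distance(p, q, m=2):
    """L_m distance between two 2D points (m = 1 or 2)."""
    return (abs(p[0] - q[0]) ** m + abs(p[1] - q[1]) ** m) ** (1.0 / m)


def no_overlap(b):
    """True if every output position in b is used by exactly one point."""
    return len(set(b)) == len(b)


def absolute_cost(a, b, m=2):
    """Sum of distances between original positions a and new positions b."""
    return sum(lm_distance(ai, bi, m) for ai, bi in zip(a, b))


def relative_cost(a, b, m=2):
    """Sum over ordered pairs i != j of the squared change in pairwise distance."""
    total = 0.0
    for i, j in combinations(range(len(a)), 2):
        diff = lm_distance(b[i], b[j], m) - lm_distance(a[i], a[j], m)
        total += 2 * diff * diff  # the pair counts once as (i, j) and once as (j, i)
    return total


def clustering_cost(b, s, eps, m=2):
    """Sum of |S(b_i) - S(b_j)| over all b_j closer than eps to b_i (assumed d_S)."""
    total = 0.0
    for i, bi in enumerate(b):
        for j, bj in enumerate(b):
            if i != j and lm_distance(bi, bj, m) < eps:
                total += abs(s[i] - s[j])
    return total


# Two points share an original position; the layout b moves one of them aside.
a = [(0, 0), (0, 0), (3, 4)]
b = [(0, 0), (1, 0), (3, 4)]
s = [10, 12, 50]
print(no_overlap(a), no_overlap(b))
print(absolute_cost(a, b, m=1), relative_cost(a, b, m=1), clustering_cost(b, s, eps=2))
```

A layout algorithm would search for positions b that keep `no_overlap` true while trading off the three costs; the weighting between them is left to the application, as the sidebar notes.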

Waldo is more effective than stand-alone automatic data-mining techniques in that it

• yields results quickly, with a high degree of user satisfaction and confidence in findings;
• lets analysts guide the search and shift or adjust goals on the fly;
• deals with nonhomogeneous and noisy data;
• requires less understanding of complex mathematical or statistical algorithms or parameters; and
• provides a qualitative overview of data, letting analysts isolate unexpected phenomena for further quantitative analysis.

Figure 3. Varying degree of pixel overlap depending on screen resolution. Even with a screen resolution of 1,600 × 1,200, the degree of overlap is about 0.3: 30 percent of our data points (about 12,000 points) from the US Year 2000 Census Household Income data set can't be directly placed without overwriting occupied pixels. [Plot omitted; x-axis: resolution in pixels, from 0 × 0 to 2,000 × 2,000; y-axis: degree of overlap, from 0.0 to 1.0; annotations: "Map-like visualization no longer useful," "Screen resolution: 30% of all data points can't be directly placed," and "Powerwall."]

Figure 4. Large displays solve neither the overplotting nor the pixel coherence problems. Only alternative pixel-based visualization techniques can solve these problems. (a) In a conventional map, overplotting obscures data points even on high-resolution displays. (b) In Waldo, we avoid overplotting on a regular LCD display.

Related Work

Rather than aggregate data, Gridfit [1] avoids overlap in the 2D display by repositioning pixels locally. In areas with high overlap, however, the repositioning depends on the ordering of the points in the database, which might be arbitrary. Gridfit places the first data item found in the database at its correct position, and moves subsequent overlapping data points to nearby free positions, making their placement quasirandom.

Cartograms are another common technique dealing with advanced map distortion [2]. Cartogram techniques let data analysts trade shape against area and preserve the map's topology to improve map visualization by scaling polygonal elements according to an external parameter. Thus, in cartogram techniques, the rescaling of map regions is independent of a local distribution of the data points. A cartogram-based map distortion provides much better results, but solves neither the overlap nor the pixel coherence problems. Even if the cartogram provides a perfect map distortion (in many cases, achieving a perfect distortion is impossible), many data points might be at the same location, and there might be little pixel coherence. Therefore, cartogram-based distortion is primarily a preprocessing step.

References

1. D.A. Keim and A. Herrmann, "The Gridfit Algorithm: An Efficient and Effective Approach to Visualizing Large Amounts of Spatial Data," Proc. IEEE Visualization Conf., IEEE CS Press, 1998, pp. 181-188.
2. D.A. Keim, S.C. North, and C. Panse, "Cartodraw: A Fast Algorithm for Generating Contiguous Cartograms," IEEE Trans. Visualization and Computer Graphics (TVCG), vol. 10, no. 1, 2004, pp. 95-110.

Basic visualization technique

We use PixelMaps [5] as our basic visualization technique. PixelMaps rescales map subregions to better fit dense, nonuniformly distributed points to unique output positions. The technique is novel in at least two ways:

• It provides meaningful and intuitive graphical representations of large data sets.
• It combines well-founded clustering algorithms with pixel-oriented visualization, thus exploiting a computer's data processing and graphics power and the flexibility, creativity, and domain knowledge of human data analysts.

PixelMaps aims to represent dense areas while preserving some of the key structures of the original geographical space $(a_i^x, a_i^y)$, and to allocate all data points to unique display pixels, even in dense regions. To provide nonoverlapping pixel displays, PixelMaps follows a four-step process.

Density-based map distortion. PixelMaps uses recursive partitioning to approximate equal density in the two geographical dimensions $a_i^x$, $a_i^y$. Splitting the data set at low-density positions (less than 10 percent of (l + r)/2 of the data points) achieves efficient partitioning (gridfile-like operations). Applying every split to two areas with an equal number of points but different input screen space determines the map's distortion. In the first split, for example, PixelMaps considers two areas that each have about 50 percent of the data points but unequal screen space. It then applies distortion to give each half of the data equal area in the output map.
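The idea of giving each half of the data equal output area can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PixelMaps implementation: it always splits at the median with alternating axes, ignores the low-density split criterion mentioned above, and the names `distort`, `_recurse`, `_split`, and `_rescale` are hypothetical.

```python
def distort(points, src, dst, depth):
    """Return distorted positions for points, in input order.

    points: list of (x, y); src and dst: rectangles as (x0, y0, x1, y1).
    depth: number of recursive equal-count splits to apply.
    """
    out = [None] * len(points)
    _recurse(list(enumerate(points)), src, dst, depth, 0, out)
    return out


def _recurse(items, src, dst, depth, axis, out):
    if depth == 0 or len(items) < 2:
        for i, p in items:
            out[i] = _rescale(p, src, dst)
        return
    items = sorted(items, key=lambda ip: ip[1][axis])
    half = len(items) // 2
    # Input cut: halfway between the two middle points, so each side has ~50% of them.
    cut = (items[half - 1][1][axis] + items[half][1][axis]) / 2.0
    # Output cut: the middle of the output rectangle, so both sides get equal area.
    mid = (dst[axis] + dst[axis + 2]) / 2.0
    lo_src, hi_src = _split(src, axis, cut)
    lo_dst, hi_dst = _split(dst, axis, mid)
    _recurse(items[:half], lo_src, lo_dst, depth - 1, 1 - axis, out)
    _recurse(items[half:], hi_src, hi_dst, depth - 1, 1 - axis, out)


def _split(rect, axis, at):
    """Cut rect perpendicular to axis at coordinate at; return (low part, high part)."""
    lo, hi = list(rect), list(rect)
    lo[axis + 2] = at
    hi[axis] = at
    return tuple(lo), tuple(hi)


def _rescale(p, src, dst):
    """Linearly map point p from rectangle src into rectangle dst."""
    result = []
    for k in (0, 1):
        span = src[k + 2] - src[k]
        t = 0.5 if span == 0 else (p[k] - src[k]) / span
        result.append(dst[k] + t * (dst[k + 2] - dst[k]))
    return tuple(result)


# Three points crowd the left edge; one split spreads them over half the output width.
pts = [(0.0, 0.5), (0.1, 0.5), (0.2, 0.5), (1.0, 0.5)]
new = distort(pts, (0, 0, 1, 1), (0, 0, 100, 100), depth=1)
print(new)
```

After one split, the two leftmost points occupy the left half of the output while the sparse right side shrinks, which is the shrink-and-expand effect the text describes.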

Allocation and scaling. For efficient rescaling, we perform quadtree split operations on the extent of the 2D screen space, causing empty areas to shrink and dense areas to expand.

We propose a new data structure to simultaneously manage allocation and scaling of both data and screen space. It combines gridfiles (to manage input point partitioning) and quadtrees (to manage new screen space positions). The computed rescaling reduces the size of virtually empty regions, reallocating the unused pixels to dense regions. Figure 5 illustrates the rescaling of certain map regions.

Array-based clustering. PixelMaps next computes an array-based clustering of each partition. It divides the third (statistical) dimension into intervals, from minimal to maximal value. The number of intervals depends on the application scenario and can be user specified. The PixelMaps data structure stores each interval's end points in an array. Each interval corresponds to a class (income class, for example), and the class of each statistical value can be quickly determined using a binary search. PixelMaps then colors pixels according to cluster class indices.
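The class lookup described above can be sketched with Python's standard `bisect` module. The equal-width intervals and the helper names `make_breaks` and `class_index` are assumptions for illustration; the article only states that interval end points are stored in an array and searched with a binary search.

```python
from bisect import bisect_right


def make_breaks(lo, hi, n_classes):
    """Upper end points of n_classes equal-width intervals over [lo, hi] (assumed)."""
    width = (hi - lo) / n_classes
    return [lo + width * (k + 1) for k in range(n_classes)]


def class_index(value, breaks):
    """Index of the interval containing value, found by binary search."""
    # Values equal to the overall maximum are clamped into the last class.
    return min(bisect_right(breaks, value), len(breaks) - 1)


breaks = make_breaks(0, 100_000, 4)
classes = [class_index(v, breaks) for v in (5_000, 25_000, 60_000, 100_000)]
print(classes)  # → [0, 1, 2, 3]
```

Each class index can then be mapped to a color, so the lookup cost per point stays logarithmic in the number of classes.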

Figure 5. Rescaling reduces the size of virtually empty regions, reallocating the unused pixels to dense regions. We created this series by moving Waldo's distortion slider.

Cluster positioning heuristic. Finally, after rescaling and clustering, PixelMaps assigns data points to pixels, starting with the densest regions and choosing the smallest cluster in each region first. Figure 6 shows our cluster-positioning heuristic. To determine the placement sequence, we sort all final partitions (leaves of the PixelMaps data structure) by the number of data points contained.

The pixel placement step provides visualizations that trade off position, distance, and cluster preservation.
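The core operation in the placement step, finding the closest free pixel when the desired one is taken, can be illustrated as follows. This is a minimal sketch, not the heuristic of Figure 6: it places single points in the order given (the caller would supply the density-based ordering), searches square rings around the target, and uses hypothetical names `closest_free_pixel` and `place`.

```python
def closest_free_pixel(occupied, target, width, height):
    """Nearest unoccupied pixel to target, searching square rings outward.

    Rings are ordered by Chebyshev distance; ties within a ring are broken
    by Euclidean distance. Returns None if the display is full.
    """
    tx, ty = target
    for r in range(max(width, height)):
        ring = [(x, y)
                for x in range(tx - r, tx + r + 1)
                for y in range(ty - r, ty + r + 1)
                if max(abs(x - tx), abs(y - ty)) == r
                and 0 <= x < width and 0 <= y < height
                and (x, y) not in occupied]
        if ring:
            return min(ring, key=lambda p: (p[0] - tx) ** 2 + (p[1] - ty) ** 2)
    return None


def place(targets, width, height):
    """Assign each target pixel a unique pixel; returns {point index: pixel}."""
    occupied, result = set(), {}
    for i, target in enumerate(targets):
        pixel = closest_free_pixel(occupied, target, width, height)
        if pixel is None:
            raise ValueError("more points than pixels")
        occupied.add(pixel)
        result[i] = pixel
    return result


# Three points want the same pixel; two of them are moved to adjacent free pixels.
layout = place([(1, 1), (1, 1), (1, 1)], 3, 3)
print(layout)
```

Because every point ends up on its own pixel, the no-overlap constraint holds by construction; how far points drift from their targets depends on the placement order, which is why the article sorts partitions before placing them.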

Exploratory data analysis

Visual data exploration involves three steps in a process so common that researchers have called it the information-seeking mantra [6]:

• Overview: an analyst examines a summary of the data;
• Zoom and filter: the examination might reveal interesting patterns or data subsets meriting further investigation; and
• Details on demand: the analyst focuses on the patterns identified in the previous step, inspecting details to form or validate hypotheses.

A PixelMaps overview of geospatial data reveals subsets with interesting structures by allocating larger display areas to dense regions with many potentially interesting subsets and smaller areas to less interesting items. PixelMaps provides the basic visualization technology in Waldo and bridges the gap between the three visual exploration steps.

Visual exploration using Waldo resembles a hypothesis-generation process: PixelMaps lets analysts gain insight into data and thereby develop and confirm new hypotheses. To complement visualization, we can use automatic techniques from statistics, pattern recognition, or machine learning to verify the hypotheses.

Interaction with PixelMaps

Waldo uses several relevant interaction techniques to adjust the visual representation of data to suit the task at hand.

First, relate and combine lets analysts display data from several maps in multiple linked views, often with identical coordinate systems. Secondary statistical parameters typically appear on alternative maps, with data points at the same positions but colored by other parameters. This makes it easy to compare parameters and detect local correlations, dependencies, and similar patterns.

Next, interactive distortion sliders let analysts adjust the level of detail by changing the distortion level. Figure 5 shows the effect of changing spatial distortion.

A selection mechanism lets analysts isolate a subset of the displayed data for further processing, such as highlighting, filtering, or quantitative analysis. Analysts can select data on the visualization itself (direct manipulation) or through dialog boxes and other queries (indirect manipulation).

Finally, linking and brushing lets analysts relate selected items to their representations in other views. For example, an analyst might compare points in PixelMaps to traditional displays such as 2.5D aggregated plots and bar maps.

Application examples

An important issue in visual data mining is determining the effectiveness of the proposed visualizations. Our evaluation compares PixelMaps displays with traditional approaches and provides examples using census records and a telephone call volume data set.

Figure 7 shows a zoomed view of New York using a traditional map and a PixelMap made by Waldo. The degree of overlap for a 1,200 × 1,200 screen resolution is 0.82 for the region. We based both visualizations on US Census Interest and Dividend Household Income data.

Figure 6. Cluster-positioning heuristic.

    Data: P: data points belonging to the same partition P; DS: display space
    Result: PixelMap
    for Pi ∈ P do
        if ||Pi|| < min ∧ Var(Pi, Cntrd(Pi)) > √||Pi|| then
            CNoise ← CNoise ∪ Pi;
        else
            C ← C ∪ Pi;
    C ← sort C according to Pi, with Pi ∈ C;
    for Ci ∈ C do
        if Ci pixels are free around Cntrd(Ci) in DS then
            DS ← SetPixels(Ci, Cntrd(Ci));
        else
            /* Find closest free pixels */
            fp ← FndClsstFrPxls(Ci, Cntrd(Ci), DS);
            DS ← SetPixels(Ci, fp, DS);
    for p ∈ CNoise do
        if DS[pos(p)] == 0 then
            DS ← SetPixel(p, pos(p), DS);
        else
            /* Find closest free pixel */
            fp ← FndClsstFrPxl(p, pos(p), DS);

Figure 7. Traditional map versus PixelMaps displays using New York state interest and dividend income data for 2000. [Color legend: low to high.]

The traditional map provides random results in areaswith a high degree of overlap (Manhattan, for example)but leaves sparsely populated areas virtually empty. Pix-elMaps increases space allotted to the densest regionsso all data points can be close together. We ran Pix-elMaps on the most detailed data we have at the censusblock level. To demonstrate its scalability, we createdindividual data points for each household, initially plac-ing them at the block centers. As the figure shows, clus-ters of households with very high investment incomeare in Manhattan and Queens, and households with lowinvestment income are in the Bronx and Brooklyn. Asalient cluster of wealthy households are on the east sideof Central Park.

Census demographic analysis

We performed a census demographics analysis using data sets from the US Census Bureau. For the analysis, we extracted household income, investment income, and the asking price of vacant homes for every state in the US.

The average number of data points assigned to the same position in each state’s input data set heavily influences PixelMaps performance, as Figure 8 shows. California, Texas, New York, and Florida had the most points assigned to the same position and were therefore the most interesting states for PixelMaps. For these four states, PixelMaps ran in less than 20 seconds, which is still suitable for efficient data exploration; for all other states, it ran in real time. We ran the experiments on a 2.4-GHz Intel Xeon computer with 4 Gbytes of main memory.
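To make the quantity on the x-axis of Figure 8 concrete, the following sketch counts how many data points share a screen position with at least one other point at a given resolution. It is an illustration only: the linear longitude/latitude-to-pixel mapping, the bounding-box parameter, and the function names are assumptions, and the article does not state whether it counts all points on a shared pixel or only the surplus ones.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical bounding box: (min_lon, min_lat, max_lon, max_lat)
BBox = Tuple[float, float, float, float]


def to_pixel(lon: float, lat: float, bbox: BBox,
             width: int, height: int) -> Tuple[int, int]:
    """Linearly map a coordinate inside bbox to an integer pixel position."""
    min_lon, min_lat, max_lon, max_lat = bbox
    fx = (lon - min_lon) / (max_lon - min_lon)
    fy = (max_lat - lat) / (max_lat - min_lat)  # screen y grows downward
    return (min(int(fx * width), width - 1),
            min(int(fy * height), height - 1))


def shared_position_count(points: Iterable[Tuple[float, float]], bbox: BBox,
                          width: int, height: int) -> int:
    """Count points that land on a pixel also used by another point."""
    counts = Counter(to_pixel(lon, lat, bbox, width, height)
                     for lon, lat in points)
    return sum(n for n in counts.values() if n > 1)


points = [(0.0, 0.0), (0.0, 0.0), (0.26, 0.26), (1.0, 1.0)]
unit_box = (0.0, 0.0, 1.0, 1.0)
coarse = shared_position_count(points, unit_box, 2, 2)    # three points collide
fine = shared_position_count(points, unit_box, 10, 10)    # only the duplicates collide
```

Raising the resolution separates nearby points but never exact duplicates, which is why dense states keep a high count even on large displays.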

As Figure 9 shows, household income is strongly correlated with investment income. The figure also shows that California has only a few vacant homes with low or medium asking prices (blue areas indicate nonvacant homes) and that New York has a few vacant homes in less desirable neighborhoods with lower asking prices. Florida has relatively more vacant homes, and the price asked for these houses is strongly correlated with household income in these areas. Overall, median household income and investment income are strongly correlated in each state; in particular, wealthy households are noticeable on the east side of Central Park, on Florida’s Gold Coast, and on the California coast.

A detailed analysis of PixelMaps efficiency and effectiveness with respect to the defined visual exploration goals is available elsewhere.5,7

Call volume analysis

Marketing analysts and network engineers look for interesting patterns in network usage data to help them recognize and respond to changing conditions quickly. One of Waldo’s key motivations is the need to analyze extremely large customer service data sets. The example visualization in Figure 10 shows the call volume of a telephone service during a 24-hour period.

The traditional map (Figure 10a) gives random results in areas with a high degree of overlap while leaving sparsely populated areas virtually empty. The Waldo visualizations (Figures 10b−d) show the advantages of the PixelMaps algorithm. The maps show that New York City and Los Angeles County are the population areas with the highest call volume in the US. The PixelMaps displays can show the local distribution of call volumes in these regions.

8 Computation time based on the number of points assigned to the same xy-position using a standard screen resolution. Most PixelMaps can be computed in less than 5 seconds. In case of high overplotting, the computation time is suitable for efficient data exploration. (Plot with one labeled point per US state; x-axis: number of data points assigned to the same position, 0 to 250,000; y-axis: PixelMaps computation time in seconds. Cumulated time: 83.411 seconds; total data points assigned to the same position: 1,472,687.)

Conclusion

Detecting interesting local patterns in large data sets is a key research challenge. Particularly challenging today is finding and deploying efficient and scalable visualization strategies for exploring large geospatial data sets. One way forward is to combine ideas from the statistics and machine-learning disciplines with ideas and methods from the information visualization and geovisualization disciplines. PixelMaps in the Waldo system demonstrates how data mining can be successfully integrated with interactive visualization. The increasing scale and complexity of data analysis problems will require tighter integration of interactive geospatial data visualization with statistical data-mining algorithms.

Further information on visual analysis of massive geospatial data sets, as well as an implementation of the PixelMaps algorithm and Waldo, is available at the PixelMaps Project Web site at http://dbvis.inf.uni-konstanz.de/~sips/pixel_based_dm/.


9 PixelMaps results for US Census demographics analysis showing interest and dividend income, median household income, and price asked for (a) California, (b) Texas, (c) New York, and (d) Florida.

10 Call volume analysis using (a) traditional maps and (b−d) PixelMaps with increasing screen resolution: (b) 800 × 347 pixels, (c) 1,024 × 445 pixels, and (d) 1,600 × 695 pixels.


Acknowledgments

The Information Society Technologies Programme of the European Commission, Future and Emerging Technologies, partially funded this work under the IST-2001-33058 PANDA project (2001-2004). We thank Waldo Tobler for his very helpful comments, Dave Belanger and Mike Wish for encouraging this investigation, and Eleftherios Koutsofios for providing data and other assistance.

References

1. A.S. Fotheringham and P. Rogerson, Spatial Analysis and GIS, Taylor and Francis, 1994.
2. K. Koperski, J. Adhikary, and J. Han, “Spatial Data Mining: Progress and Challenges,” Research Issues on Data Mining and Knowledge Discovery, ACM Press, 1996.
3. D.A. Keim et al., “Pushing the Limit in Visual Data Exploration: Techniques and Applications,” Proc. Advances in Artificial Intelligence, 26th Ann. German Conf. AI, LNAI 2821, Springer-Verlag, 2003, pp. 37-51.
4. D.A. Keim, C. Panse, and M. Sips, “Information Visualization: Scope, Techniques, and Opportunities for Geovisualization,” to be published in Exploring Geovisualization, J. Dykes, A. MacEachren, and M.-J. Kraak, eds., Elsevier, 2004, pp. 15-44.
5. D.A. Keim et al., “PixelMaps: A New Visual Data Mining Approach for Analyzing Large Spatial Data Sets,” Proc. 3rd IEEE Int’l Conf. Data Mining (ICDM 03), IEEE CS Press, 2003, pp. 565-568.
6. B. Shneiderman, “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations,” Proc. IEEE Visual Languages Conf., IEEE CS Press, 1996, pp. 336-343.
7. D.A. Keim et al., “Pixel Based Visual Mining of Geospatial Data,” Computers and Graphics (CAG), vol. 28, no. 3, June 2004, pp. 327-344.

Daniel A. Keim is a professor of computer science at the University of Constance, Germany. His research interests include information visualization and data mining. Keim has a PhD in computer science from the University of Munich. He is an editor of the IEEE Transactions on Visualization and Computer Graphics, the IEEE Transactions on Knowledge and Data Engineering, and the Palgrave Information Visualization Journal.

Christian Panse is pursuing a PhD in the Data Mining and Visualization Group at the University of Constance. His research interests include visual data mining on large spatial data and cartogram drawing. Panse has an MS in computer science from the Martin-Luther-University Halle-Wittenberg, Germany. He is a member of the IEEE Computer Society.

Mike Sips is completing his PhD studies in the Data Mining and Visualization Group at the University of Constance. His research interests include visual data mining on large spatial data, spatial data transformation, information visualization, and advanced visual interfaces. Sips has an MS in computer science from the Martin-Luther-University Halle-Wittenberg. He is a member of the IEEE Computer Society, the ACM, and the German Society for Informatics.

Stephen C. North is head of Information Visualization Research at AT&T Labs. His research interests include software visualization, applied computational geometry, reusable software design, dynamic and large-scale graph layout, and spatial data transformation. North has a PhD in computer science from Princeton University. He is a senior member of the IEEE and a member of the ACM.

Readers may contact Daniel A. Keim at Dept. of Computer and Information Science, Univ. of Konstanz, Universitatsstr. 10, Box D78, D78457 Konstanz, Germany; [email protected].

Visual Analytics

44 September/October 2004
