a cartographic approach to visualizing conference abstractskclarke/g200b/ieeecga2002.pdf · of...

9
T he map metaphor has become popular in recent works on information visualiza- tion. However, as we attempt to visualize information spaces in a low-dimensional display space, crucial impediments to the metaphor’s usefulness appear: pro- jection techniques break down because of a lack of scalability, visu- alizations suffer from graphic com- plexity, and labels are imbued with too little interpretable meaning and are hard to position without con- flict. 1 Cartographers have tradi- tionally addressed many of these problems, albeit in a strictly geo- graphic context. They’ve found numerous ways to represent a given portion of the infinitely complex earth surface on a finite map sur- face, be it on paper or on a comput- er screen. With this in mind, I present a dis- tinctly cartographic approach to mapping nongeographic informa- tion. Focusing on the text content of a set of conference abstracts, we can derive 2D visualizations of information spaces that address complexity and automation. Cartographically informed abstraction Apart from the depiction of geographic space itself, information visualizations rarely make use of carto- graphic principles, particularly with respect to issues of graphic complexity. There are hopeful signs, however, that this might be changing as evidenced by Foley’s list of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction that con- veys key ideas while suppressing irrelevant detail, we need to draw on the vast knowledge of car- tographers and animators. 2 Cartographers and geographers are now involved in nongeographic information visualization in several ways. Combined with the influence of cognitive lin- guists, the desire to extend certain geographic notions and principles to nongeographic information drives much of this work. 3,4 This refers particularly to ques- tions regarding the nature of geographic space, the objects that inhabit it, and the ways in which humans conceptualize it. In an earlier paper, I discussed a broad range of geo- graphic considerations and cartographic techniques as they relate to the visualization of text documents. 1 Here, I show how we can further improve map-like visualiza- tions of nongeographic information. The principal approach and specific techniques I use relate to noted document visualization efforts such as ThemeScapes, 5 ET-Map, 6 Depict, 7 and WebSOM. 8 I attempt to create visualizations that subjectively look like maps, thereby forcing the audience to use the same cognitive skills typ- ically associated with geographic maps. The experiment that I describe in this article is based on a set of 2,220 abstracts submitted to the Annual Meeting of the Association of American Geographers (AAG), held in Honolulu, Hawaii, in March 1999. The whole range of the geography discipline is represented at the annual meeting. Ideally, a visual representation of the corpus of abstracts should thus paint a fairly com- prehensive picture of the current state of the field. Infor- mation visualization should also convey valuable insight into the status and semantic relationships of the vari- ous research interests represented by the AAG’s 50 or so specialty groups. My method is informed by the way in which we derive topographic maps and some thematic maps. First, we create a detailed base map in which each element of an information space occupies a discrete 2D position. As the available display surface decreases—cartographers refer to this as scale reduction—generalized versions are cre- ated so that graphic density is reduced and high-level structures of the depicted space are brought to the fore. 0272-1716/02/$17.00 © 2002 IEEE Information Visualization 50 January/February 2002 A cartographic approach to mapping nongeographic information helps to manage graphic complexity in visualizations. It aids domain comprehension by forcing us to use the same cognitive skills we use when viewing geographic maps. André Skupin University of New Orleans A Cartographic Approach to Visualizing Conference Abstracts

Upload: others

Post on 28-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

The map metaphor has become popular inrecent works on information visualiza-

tion. However, as we attempt to visualize informationspaces in a low-dimensional display space, crucialimpediments to the metaphor’s usefulness appear: pro-

jection techniques break downbecause of a lack of scalability, visu-alizations suffer from graphic com-plexity, and labels are imbued withtoo little interpretable meaning andare hard to position without con-flict.1 Cartographers have tradi-tionally addressed many of theseproblems, albeit in a strictly geo-graphic context. They’ve foundnumerous ways to represent a givenportion of the infinitely complexearth surface on a finite map sur-face, be it on paper or on a comput-er screen.

With this in mind, I present a dis-tinctly cartographic approach tomapping nongeographic informa-tion. Focusing on the text content ofa set of conference abstracts, we can

derive 2D visualizations of information spaces thataddress complexity and automation.

Cartographically informed abstractionApart from the depiction of geographic space itself,

information visualizations rarely make use of carto-graphic principles, particularly with respect to issues ofgraphic complexity. There are hopeful signs, however,that this might be changing as evidenced by Foley’s listof “The Ten Top Problems Left” in computer graphics:

When we want to create an abstraction that con-veys key ideas while suppressing irrelevant detail,we need to draw on … the vast knowledge of car-tographers and animators.2

Cartographers and geographers are now involved innongeographic information visualization in severalways. Combined with the influence of cognitive lin-guists, the desire to extend certain geographic notionsand principles to nongeographic information drivesmuch of this work.3,4 This refers particularly to ques-tions regarding the nature of geographic space, theobjects that inhabit it, and the ways in which humansconceptualize it.

In an earlier paper, I discussed a broad range of geo-graphic considerations and cartographic techniques asthey relate to the visualization of text documents.1 Here,I show how we can further improve map-like visualiza-tions of nongeographic information. The principalapproach and specific techniques I use relate to noteddocument visualization efforts such as ThemeScapes,5

ET-Map,6 Depict,7 and WebSOM.8 I attempt to createvisualizations that subjectively look like maps, therebyforcing the audience to use the same cognitive skills typ-ically associated with geographic maps.

The experiment that I describe in this article is basedon a set of 2,220 abstracts submitted to the AnnualMeeting of the Association of American Geographers(AAG), held in Honolulu, Hawaii, in March 1999. Thewhole range of the geography discipline is representedat the annual meeting. Ideally, a visual representation ofthe corpus of abstracts should thus paint a fairly com-prehensive picture of the current state of the field. Infor-mation visualization should also convey valuable insightinto the status and semantic relationships of the vari-ous research interests represented by the AAG’s 50 or sospecialty groups.

My method is informed by the way in which we derivetopographic maps and some thematic maps. First, wecreate a detailed base map in which each element of aninformation space occupies a discrete 2D position. As theavailable display surface decreases—cartographers referto this as scale reduction—generalized versions are cre-ated so that graphic density is reduced and high-levelstructures of the depicted space are brought to the fore.

0272-1716/02/$17.00 © 2002 IEEE

Information Visualization

50 January/February 2002

A cartographic approach to

mapping nongeographic

information helps to

manage graphic complexity

in visualizations. It aids

domain comprehension by

forcing us to use the same

cognitive skills we use when

viewing geographic maps.

André SkupinUniversity of New Orleans

A CartographicApproach toVisualizingConferenceAbstracts

Administrator
Copyright 2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Page 2: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

In traditional cartography, this is done through a com-bination of geometric and conceptual operations. In thisresearch I produce generalized versions by merging indi-vidual features into groups through hierarchical clus-tering, based on feature similarity. Since the computationof the original base map is also driven by feature simi-larity, the high-dimensional merging of features mani-fests itself as a merging of geometric elements in the 2Ddisplay surface. Differences between various levels ofabstraction are reflected not only in changing geometricconfigurations but also in the text labels associated withindividual and clustered features. Rich, scale-dependentlabeling helps observers make sense of the many seman-tic facets inherent in a document corpus.

Base map creationCreation of the base map is driven by the desire to

express document similarity through geometric prox-imity. At the same time, the computational proceduresshould be scalable toward larger document collectionsand result in a geometric structure that easily supportsfeature aggregation across large-scale ranges. The resultis a methodology for transforming a document corpusinto a tessellated configuration in which each documentcorresponds to a unique polygon.

Keyword index and vector-space modelContent-based visualization of text documents typi-

cally starts with creating a keyword index. This remainsone of the most problematic issues, especially if an indexis to be created fully automatically, without prior knowl-edge of the particular domain and if entries are to berestricted to the kind of meaningful terms that humanindexers would choose.

The procedure that I describe allows the fully auto-mated creation of an index while restricting the key-word set to relatively meaningful terms. However, it’srestricted to information spaces in which the vast major-ity of documents have author-chosen keywords associ-ated with them. Conference abstracts fall into thiscategory. Authors of conference abstracts are usuallyasked to provide between three and five keywords.

While these keywords are a prerequisite of the index-ing process, I use them here in a manner that differs fromother approaches. Instead of forming the index itself, Ibroke the author-chosen keywords into single words. Forexample, I’d break an original keyword of “glacial geo-morphology” into the components “glacial” and “geo-morphology.” The exclusive use of keyword componentschosen by the authors ensures that a meaningful indexcan later be created without major processing effort andwithout resorting to more advanced methods aimed atthe extraction of meaning-bearing terms. Although Iused some stemming (that is, I reduced words to suchstems as “geogr” or “geol”), I retained the original termsfor the most part, which admittedly leads to some dupli-cation of meaning when such words as “forest” and“forests” are treated as separate terms.

I then matched keyword components against the fulltext of all abstracts to create the actual index. For eachabstract, I recorded whether and how many times it con-tains a certain keyword component. This approach has

two major advantages over the exclusive use of author-chosen keywords for each abstract. First, it allows anautomatic indexing of those abstracts that aren’t accom-panied by any keywords, as long as most authors addkeywords to their abstracts. (Authors of 174 abstracts, or7.8 percent, didn’t submit any keywords.) Second, myapproach leads to a richer vector-space model thatallows more differentiation when comparing the arti-cles’ content. For example, two abstracts sharing both“glacial” and “geomorphology” would express more sim-ilarity than two abstracts that have only one of thosekeywords in common.

To express the relationship between a set of keywordsand a document corpus, we can use the vector-spacemodel approach widely publicized by the work of Ger-ard Salton. In my experiment, I created a term-docu-ment matrix filled with the raw keyword counts for eachdocument. A vector of term counts thus takes the placeof the full-text document for all further processing.9

Self-organizing maps for documentvisualization

Self-organizing maps (SOMs), also known as Koho-nen maps, are commonly used to process the kind ofhigh-dimensional vector space model presented here(see the sidebar “Self-Organizing Maps”, next page).Use of the SOM method for document visualization wasfirst demonstrated 10 years ago.10 The scalability of thetechnique for large data sets and the attractiveness ofthe resulting visualizations have continued to spark theinterest of numerous researchers working on documentvisualization.6-8

Input to the SOM method consists of a matrix con-taining objects (rows) and their respective attribute val-ues (columns). When applied to document spaces, thiscan correspond to a term-document matrix, as I dis-cussed earlier. The number of terms associated with doc-uments can vary widely. In the experiment’s data set itranged from three to 57 keywords (Figure 1, next page).This can have unintended consequences whenever weuse Euclidean and similar metric measures, as is typi-cally the case for SOM algorithms. Documents with fewkeywords will tend to be drawn together while thosewith many keywords will be pushed apart, despite actu-al keyword matches that might occur between short andlong documents. After testing a number of normaliza-tion schemes, I addressed this problem by filtering thekeyword index around the mean keyword count. Iremoved 1,052 documents containing less than 18 ormore than 28 keywords from the term-document matrixprior to SOM training, leaving 1,148 documents.

The training phase is the most time-consuming por-tion of the SOM method. Although users choose thenumber of iterations, they usually range from severalthousand for small SOMs to several hundred thousandfor very high-dimensional data sets. The result of thelearning process is a 2D, raster-like representation of n-dimensional term space, in which raster elements cor-respond to individual neurons.

The time it takes to train a SOM and the way in whichit can be used depend on its size, or the number of nodesit has. We can train a small SOM much faster, which

IEEE Computer Graphics and Applications 51

Page 3: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

amounts to an unsupervised clustering of documents.For example, a SOM consisting of 4-by-4 nodes woulddivide documents into up to 16 clusters. This division ofdocuments resembles k-means clustering, whose objec-tive function is identical to Kohonen’s algorithm. Manydocument visualizations involve the creation of rela-tively small SOMs. Lin trained a SOM of 140 nodes,10

while the Depict system described by Rushall and Illgen

had a default size of 400 nodes.7

Training a much finer SOM, par-ticularly with high-dimensionaldata, can increase computing timesby several orders of magnitude. Atthe extreme end there’s the Web-SOM project, which included thecreation of a SOM for almost sevenmillion patent abstracts. The Web-SOM research group reports thatmultistep training of more than onemillion nodes took about six weekson a six-processor system.8

For this experiment, I trained aSOM of relatively fine resolution tocreate a base configuration in which

the SOM recognizes and geometrically preserves manyof the finer differences among abstracts. I trained a SOMconsisting of 4,800 nodes (Figure 2) using the filteredindex of 1,148 documents and 741 keyword compo-nents. The training stage alone, without subsequentclustering and visualization, took three hours on a SunUltra 1 workstation (200 MHz, 128 Mbytes), usingSOM_PAK 3.1.

Information Visualization

52 January/February 2002

Self-Organizing MapsThe self-organizing map (SOM) method is one of

the more widely used forms of artificial neuralnetworks (ANN). Input to the technique is a set ofn-dimensional observations. The output layerconsists of a network of neurons or nodes, whichare typically arranged as a 2D lattice, forming aconfiguration akin to raster data models. Nodes areconnected to their neighbors according to either asquare or hexagonal neighborhood definition (seeFigures 2 and 3a). Contrary to other ANNapproaches, there’s no hidden layer. Associatedwith each neuron is a reference vector of the samedimensionality as the input data. Input vectors trainthe neuron grid so that topological relationshipsamong input observations are preserved.Adjustments to the reference vector of a particularneuron aren’t done in isolation but propagate to itsnetwork neighbors. Training a SOM can be time-consuming if the dimensionality of input vectorsand/or the number of network nodes are high.

We can use the trained SOM in several ways. Forexample, we can visualize individual vectorcomponents to see areas of the map that are ofparticular significance. A popular form ofvisualization is known as the U-matrix method, inwhich a focal operator computes differencesamong neighboring reference vectors. Theresulting patterns can be interpreted as clusters.We can also compute clusters more explicitly byapplying standard clustering techniques to all theneurons. New observations can be mapped ontothe trained SOM quickly, since it only involvesfinding the best matching reference vector. Forexample, in the case of conference abstracts, we

could find out how our own research interests fitin with the rest of the conference program.

For an in-depth discussion of the SOM method,see Kohonen’s book.1 A number of collections ofSOM applications and case studies are alsoavailable.2,3

I based the visualization of conference abstractspresented in this article on a SOM trained withSOM_PAK 3.1, which is freely available from theLaboratory of Computer and Information Science,Helsinki University of Technology (http://www.cis.hut.fi/research/som_pak/). Final visualization,including automated label placement, was done inESRI ArcGIS 8.1 (http://www.esri.com/), sinceSOM_PAK is limited in terms of interactivity andgraphic output. SOM software also comes in theform of standalone, commercial packages, likeViscovery SOMine (http://www.eudaptics.com),which includes various visualization options. Finally,SOM functionality is increasingly available inconnection with existing statistical andmathematical software products. The NeuralNetwork Toolbox is an example (http://www.mathworks.com).

References1. T. Kohonen, Self-Organizing Maps, Springer, Berlin,

1995.2. G. Deboeck and T. Kohonen, Visual Explorations in

Finance with Self-Organizing Maps, Springer, London,1998.

3. E. Oja and S. Kaski, eds., Kohonen Maps, Elsevier Sci-ence, Amsterdam, 1999.

Number of keywords per document

0

Freq

uenc

y

5 10 15 20 25 30 35 40 45 50 55 600

20

40

60

80

100

120

1401 Conferenceabstracts canhave a widerange in thenumber ofkeywordcomponents.Documentsaround themean of thisdistribution areused to trainthe SOM.

Page 4: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

Visualization of individualdocuments

Once we establish the 2D config-uration of neurons, it can help usdetermine the 2D positions of doc-uments. Individual documents areassigned to the most similar neuronby comparing document vectors toneuron vectors. A single neuronmay become associated with mul-tiple documents. While this raster-type geometric configuration ofneurons is typical for most SOMs,the goal in this experiment is to

derive a set of discrete vector-typelocations for each document. Thisallows direct access to single docu-ments as well as to the attributesthat may be associated with them,such as author name, modificationdate, and so on. Such a representa-tion then serves as a kind of basemap of the information, fromwhich we can later derive general-ized versions (analogous to the roleof topographic base scales for thederivation of medium-scale topo-graphic maps).

The approach I propose herestarts with a SOM of relatively fineresolution (see Figures 2 and 3a). After applying thisSOM to a set of documents, neurons will have a varyingnumber of documents associated with them (Figure3b). I assign documents unique coordinate locations byrandomly distributing them within the boundaries ofthe respective neuron grid cell (Figure 3c). This resultsin a geometric configuration containing regions of vary-ing point density. To better enable the scale-dependentmerging of individual documents into clusters, I assignpoints distinct portions of the 2D display area by usingThiessen or Voronoi polygons (Figure 3d). I applied thisprocedure to the complete set of conference abstracts.The result is a base map of the document corpus, con-sisting of 2,220 polygons (Figure 4).

Cluster-based generalizationWhen applied to a large set of documents, repre-

sentations of the complete base map will quicklybecome too complex, especially if documents are to beaccompanied by meaningful label terms. As a result,we need to devise methods to simplify the 2D base mapso that content remains legible and meaningful at var-ious scales. In this work, I simplified the visualizationby merging neighboring document polygons if theywere part of a statistically determined high-dimen-sional cluster. Whichever clustering method we adopt,we want to let viewers employ geographic notions oftopology, proximity, clustering, and regionalization inthe visualization of an information space.

IEEE Computer Graphics and Applications 53

2 Geometricarrangement of60 × 80 neuronsfor a self-organizingmap. Eachneuron’s explicitconnection withits six neighborspreserves topo-logical relation-ships of theinput dataduring training.

3 Deriving a point visualization from a self-organizing map and partitioning of map space into distinct polygonsfor each document. (a) Geometry of a square-based SOM. (b) Multiple documents (observations) may be associat-ed with single neurons. Here, I indicate the number of observations per neuron. (c) Assignment of random loca-tions inside neurons. (d) Complete tessellation of map space with Thiessen polygons.

4 Geometricbase mapconfiguration ofconferenceabstracts.

(a) (b) (c) (d)

Page 5: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

Scale-dependence is another factor driving the choiceof a specific clustering approach. It should be possibleto control the amount of graphic detail presented tousers so that less detail implies a higher level of abstrac-

tion. The fewer clusters used to partition the completeinformation space, the smaller we can make the visual-ization’s scale (see the sidebar “Hierarchies in Scale-Dependent Visualization”).

Information Visualization

54 January/February 2002

Hierarchies in Scale-Dependent VisualizationFew approaches lend themselves to geometric and

semantic abstraction like a hierarchical organization ofinformation. Consider the administrative division ofgeographic space into countries, states or provinces, andcounties, parishes, or districts. The nature of this structureas a nested hierarchy is reflected in the cartographic meansused to visualize them, for example, by choosing symbolsthat convey a visual hierarchy. It’s also standard practice ingeographic information system (GIS) interfaces to definescale ranges within which certain layers will be eitherdisplayed or hidden.

Similarly, hierarchies can be great enablers for scale-dependent visualization of nongeographic information. Thetree map method1 was among the first to demonstrate this.A number of related products have found widespread use inrecent years in such application areas as stock marketvisualizations (see http://www.smartmoney.com/maps).Formal investigation and computational modeling of scaledependency in information visualization are still rare,though. When it’s done, as in the case of space-scalediagrams2 or the zoomable user interface of Pad++,3

similarities to the geographic treatment of scale intraditional cartography and modern GIS become obvious.

Hierarchical clusteringThe choice of hierarchical clustering for a scale-

dependent merging of individual features is natural. Manymethods for computing clustering trees have been widelyaccepted and understood for several decades4,5 and therespective hierarchical data structures are easy to build andmaintain. Linking a clustering tree with a 2D base mapallows control over the amount of detail shown at a givenzoom level. The four maps shown in Figure A illustrate howwe can manage graphic complexity in this manner. Basemap polygons are derived from point locations as Voronoidiagrams. Each map corresponds to a certain level ofaggregation in the clustering tree. In addition to suchzoom-oriented interaction, we can also explore data by

sliding up and down along the clustering tree andobserving the merging or splitting of clusters in a mapwindow of constant size.

Other hierarchiesMost zoomable information visualizations aren’t based on

the kind of computed hierarchies discussed in this article.They’re typically based either on existing structuralhierarchies or on constructed and managed contenthierarchies. File system structures are an example for theformer while Netscape’s Open Directory Project(http://www.dmoz.org) exemplifies the latter. ODP dataform the basis of a number of competing map-like Webvisualization interfaces (compare the interface at http://www.webmap.com to the one at http://maps.map.net). What allof these approaches have in common is that the 2D locationof features is primarily defined by their place in a nestedhierarchy, not by interfeature relationships.

References1. B. Johnson and B. Shneiderman, “Tree Maps: A Space-Filling

Approach to the Visualization of Hierarchical Information Struc-tures,” Proc. IEEE Visualization 91, IEEE CS Press, Los Alamitos,Calif., 1991, pp. 275-282.

2. G. Furnas and B. Bederson, “Space-Scale Diagrams: Understand-ing Multiscale Interfaces,” Proc. ACM Conf. Human Factors in Com-puting Systems (CHI 95), ACM Press, New York, 1995, pp.234-241.

3. B. Bederson and J. Hollan, “Pad++: A Zooming Interface forExploring Alternative Interface Physics,” Proc. ACM Symp. UserInterface Software and Technology (UIST 94), ACM Press, New York,1994, pp. 17-22.

4. P. Sneath and R. Sokal, Numerical Taxonomy: The Principles andPractice of Numerical Classification, W.H. Freeman, San Francisco,1973.

5. H. Späth, Cluster Analysis Algorithms for Data Reduction and Clas-sification of Objects, Halsted Press, New York, 1980.

2 clusters

4 clusters6 clusters

G F B I A C E H

A

C

H

E

I

G FB

A

C

EH

BI

G F

ACEH

BI

GF

EHAC

BIGF

Clustering tree Base map 6 clusters 4 clusters 2 clusters

A Use of hierarchical clustering to derive scale-dependent visualizations from a base map.

Page 6: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

In an interactive setting, users will first be presentedwith a highly generalized version. Zooming in will revealmore detailed information. One of the challenges in thisprocess is to not simply hide detail, but imbue each gen-eralized version with scale-dependent meaning. Eachstep of geometric merging must thus be accompaniedby the derivation of label terms appropriate to therespective scale level.

A complete hierarchical clustering solution based onthe computation of interdocument similarities wouldsuffer from the previously discussed wide range in thenumber of keywords per document. Instead, I comput-ed a complete clustering solution for the SOM’s 4,800neurons (Figure 5). I then based the membership of indi-vidual documents in neuron clusters on the associationof documents with neurons (see Figure 3c).

Feature labelingLabeling individual and grouped features is an inte-

gral part of every cartographic depiction. We should giveit equal attention when dealing with map-like informa-tion visualization.1 It can be challenging to label a doc-ument corpus meaningfully, particularly when the goalof the final visualization isn’t to identify documents butto understand the semantic structures and relationshipsamong documents.

Labeling individual documentsAs far as the vector-space model is concerned, the set

of keywords represents the document. When choosingmeaningful document labels automatically, the vector-space model suggests which label terms from the docu-ment vectors to select. The keyword that bestcharacterizes a document while distinguishing it fromother documents should be the label term. The termweighting formula in Equation 1 expresses this best,since it weighs global term frequency against the fre-quency of a term within a document:9

(1)

The weight of a term Tj in document Di is based on thenumber of occurences of the term within that document(tfij) and the log of the inverse document frequency

N is the total number of documents and dfj is the num-ber of documents in which the term appears.

The keyword with the highest weight for a particulardocument will become its label term. To express the con-tent of each document more meaningfully, I computedthree label terms for each document in descending orderof term weights.

Labeling document clustersCompared to labeling individual documents, it’s more

difficult to find a label term that’s both representative ofa cluster and sets it apart from other clusters. Depend-ing on the chosen algorithm, this may have consequencesbetween two extremes. One possibility would be that thechosen cluster label is specific enough to clearly distin-guish clusters from each other. In this case, there’s thedanger that the chosen term is really only contained in alimited number of documents within the cluster. Thisraises the question of whether it’s then still representa-tive of the cluster. Another possibility is that the clusterlabel is computed in a manner that ensures that a largeproportion of the cluster members actually contain thelabel term. This raises the likelihood that other clustersmight produce the same label. Therefore, the labelwouldn’t express the cluster’s distinct character.

The potential conflict between these two casesbecomes more apparent as clusters grow in size. It mayin fact be best to deploy two term weighting procedures,one for high-level clusters and another for low-level clus-ters. For high-level clusters, I found it sufficient to addup term counts within each cluster and choose the termwith the highest count to become the cluster label. Forlow-level clusters, I used a variant of the term weight-ing formula in Equation 1 that treated each cluster as asuperdocument containing all the terms of its memberdocuments.

VisualizationThe geographic merging of base-map polygons on the

basis of hierarchical clustering and the computation ofmeaning-bearing labels for the resulting clusters leavesus with a data set to which we can apply traditional prin-ciples of cartographic design. For example, when creat-ing a map of counties, we should also display higherlevel landmark boundaries such as state boundaries. Insuch strictly hierarchical systems, a higher level bound-ary always also coincides with a lower level boundary,which means that the added layer of information pro-vides important context without adding much visual

(log )N

dfj

w tf

N

dfij ij

j= log

IEEE Computer Graphics and Applications 55

5 Hierarchical clustering tree for 4,800 SOM nodes. Also shown are three horizontal cuts corresponding to a 10-cluster solution (blue), a 100-cluster solution (red), and an 800-cluster solution (green).

Page 7: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

complexity. When we apply this principle to nongeo-graphic data via hierarchical clustering, it provides forconsistency and context during geographic informationsystem (GIS)-like zoom operations (Figure 6). It alsomakes the simultaneous display of multiple abstractionlevels possible, since visual hierarchies can convey clus-ter hierarchies (Figure 7).

Individual scale-dependent levelsFigure 6 illustrates the relationship between the

amount of the displayed total base map area and thevisualized level of the hierarchical cluster solution.From left to right, users will first see an overview of theinformation space consisting of a 10-cluster solution(Figure 6a). Upon zooming in, they see a 100-clustersolution (Figure 6b), and finally an 800-cluster solu-tion (Figure 6c). The three highest ranked labelsaccompany every cluster.

Multilevel overlayCompared to other clustering techniques, the nested

hierarchy produced by hierarchical clustering is espe-cially advantageous when it comes to displaying a num-ber of different cluster levels simultaneously. However,we must choose map symbols carefully, so that the hier-archical structure is visually conveyed. Figure 7 showsa simultaneous visualization of three levels of the clus-tering tree, with 10, 25, and 100 clusters, respectively.Cluster labels are ranked according to their computedweight and scaled so that users are first drawn to thehigher ranked labels. The point locations of all abstractsare added as faint symbols in the background.

Discussion of resultsThe resulting visualizations (Figures 6 and 7) allow

us to ascertain some advantages and pitfalls of theapproach described in this article. In the absence of for-mal user testing, the following discussion focuses onissues related to the creation and manipulation of thedocument index and on an interpretation of domain-spe-cific structures and relationships from the visualizations.

TechniquePreprocessing of source data is one of the critical ele-

ments in providing successful information visualiza-

tions. In this experiment I decided to feed raw termcounts to the SOM training procedure. This makes forfairly direct interpretability of component values in thetrained SOM and allows the mapping of raw observa-tions onto the SOM. I’ve experimented with traditionalterm weighting schemes for the document index, suchas the one used for document labeling, but haven’t foundthe results to warrant the added complexity.

Document length normalization deserves to be revis-ited. The filtering of the training data set around themean keyword count led to improved results, comparedto previous experiments. Nevertheless, many of the doc-uments with a low keyword count still congregatedrather than being distributed across the map. At the clus-tering stage, this congregation tends to merge too easi-ly with neighboring regions, which leads to anincongruous cluster in the center of the map. Visually,this cluster grows as we move higher in the clusteringtree, as indicated by the labels landscape/popula-tion/sediment in the 10-cluster solution (Figure 7).

Further stemming would lead to a tighter word index,but we need to balance it with the desire to provide richcluster labeling. The appearance of the term “new” as acluster label (Figures 6a and 7) is originally caused bythe inclusion of such author-chosen keywords as “NewZealand” or “new world order.” After breaking these upinto individual components, all abstracts using the term“new” are indexed accordingly. Terms like this shouldbe added to the stop word list, which excludes certainhigh-frequency terms from the index.

InterpretationThe work presented here is part of a continuing effort

to explore the development and state of geographic sci-ence. One typical division of the discipline considers theexistence of three distinct areas of geographic work:

� Human geography: the study of the human environ-ment, within which urban, transportation, popula-tion, economic, and feminist geography are only someof a number of more specific areas of interest andapproaches.

� Physical geography: the geographic study of the nat-ural environment, including aspects of geomorphol-ogy, climate, vegetation, and so forth.

Information Visualization

56 January/February 2002

(a) (b) (c)

6 Three differ-ent zoom levelsin a visualiza-tion of confer-ence abstracts:(a) completemap shown in a10-cluster solu-tion and mapportions for (b) a 100-clusterand (c) 800-cluster solution.Higher levelboundaries areaccentuated toprovide contextduring zoomoperations.

Page 8: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

� Techniques: a term that has traditionally referred towork in GIS, cartography, and remote sensing. Ratherthan merely supporting the work of human and phys-ical geographers, geographic efforts in this area areincreasingly part of an emerging cross-disciplinaryresearch field known as geographic information sci-ence (GIScience).

Abstract authors tend to use the various subcategories(such as “climate”) to identify their particular researchtopic. The three-tier division emerges implicitly, as relat-ed areas of work are arranged in relative proximity. Asa result, human geography occupies the right half of thevisualization (Figures 6a and 7). Aspects of physicalgeography dominate the upper left quadrant. The pro-cessing and modeling of geographic data dominate thelower left quadrant. The cluster labeled geogr/new/research (Figure 6a) contains many abstracts that dealwith the teaching of geography, such as research intothe development of new teaching tools and techniques.Issues surrounding resource management dominate thetop of the map. This is a heterogeneous area at the inter-section of human and natural environments, rangingfrom the more urban–suburban policy and communityquestions (note how “land use” appears as “land” and“use” in Figure 7) on the right of Figure 7 to the man-agement of forest and water resources on the left.

Note that geographic concepts and topics dominate thehigher level clusters. The names of specific geographicfeatures (such as “Wyoming”) appear only as clusterlabels at lower levels of aggregation (see Figure 6c).

Combining similarity-based mapping and clusteringencompasses the respective strengths and mitigates

some of the problems associated with the two approach-es. Similarity-based geometric configurations lack theexplicit categories necessary for effective communica-tion. While statistical clustering can provide those cat-egories, it lacks the overarching context, particularlywith respect to intercluster relationships, that can beprovided with a map-like representation. For example,note how areas of geographic research dealing withwater tend to be drawn toward each other, even acrosscluster boundaries. That’s why a cluster labeled trans-port/sediment/suspended (center of Figure 7) is neara cluster labeled water/soil/wetland, across a bound-ary of two clusters already separated at the 10-clusterlevel. Population geography is positioned closer to thespatial/data/information cluster (bottom left in Figure7) because it’s the part of human geography in whichquantitative modeling and the use of spatially refer-enced data, like those provided by the US CensusBureau, are most prevalent.

ConclusionsThe procedures presented here could be integrated

into a fully automated system for visually exploring con-ference abstracts. One of the remaining issues concernsthe mechanism for matching the level of semantic andgeometric abstraction to the display scale. (I subjectivelychose the display scales for the figures for this article.)I’m currently working on using cartographic general-ization principles to determine proper display scalesautomatically, depending on abstraction level andgraphic density.

Thus far, I haven’t conducted formal subject testingto investigate how well these visualization techniques

IEEE Computer Graphics and Applications 57

7 Visualization of 2,220 conference abstracts with simultaneous overlay of three levels of a hierarchical clusteringtree of SOM neurons: 10-cluster solution (red); 25-cluster solution (green); and 100-cluster solution (black). Clusterlabels are scaled according to rank within the respective cluster.

Page 9: A Cartographic Approach to Visualizing Conference Abstractskclarke/G200B/IEEECGA2002.pdf · of “The Ten Top Problems Left” in computer graphics: When we want to create an abstraction

actually work. For this article I chose two extremes: asparse visualization with a focus on interactivity (Fig-ure 6) and a much richer form, with the potential forstatic output, leisurely study, and instigation for dis-cussion (Figure 7). In the end, actual solutions shouldprobably fall somewhere in between, providing userswith a rich and interactive, yet not overwhelming map.

One of the more persistent comments I’ve received sofar, particularly with respect to the multilevel overlay(Figure 7), is how reminiscent the results are of carto-graphic depictions of geographic space. The rich label-ing and intricacy of cluster outlines make it hard forsome to believe that they’re not actually looking at geo-graphic structures located on the earth’s surface. Thisis exactly the kind of reaction I’m hoping for, because itraises the possibility that users of a visualization systemwould be forced to use the same spatio-cognitive skillsthey employ when dealing with geographic maps. �

References1. A. Skupin, “From Metaphor to Method: Cartographic Per-

spectives on Information Visualization,” Proc. IEEE Symp.Information Visualization 2000 (InfoVis 2000), IEEE CSPress, Los Alamitos, Calif., 2000, pp. 91-97.

2. J. Foley, “Getting There: The Ten Top Problems Left,” IEEEComputer Graphics and Applications, vol. 20, no. 1,Jan./Feb. 2000, pp. 66-68.

3. H. Couclelis, “Worlds of Information: The GeographicMetaphor in the Visualization of Complex Information,”Cartography and Geographic Information Systems, vol. 25,no. 4, Oct. 1998, pp. 209-220.

4. S. Fabrikant and B. Buttenfield, “Formalizing SemanticSpaces for Information Access,” Annals of the Assoc. Amer-ican Geographers, vol. 91, no. 2, June 2001, pp. 263-280.

5. J. Wise et al., “Visualizing the Non-Visual: Spatial Analysisand Interaction with Information from Text Documents,”Proc. IEEE Symp. Information Visualization 1995 (InfoVis95), IEEE CS Press, Los Alamitos, Calif., 1995, pp. 51-58.

6. H. Chen, C. Schuffels, and R. Orwig, “Internet Catego-

rization and Search: A Self-Organizing Approach,” J. Visu-al Comm. and Image Representation, vol. 7, no. 1, Mar. 1996,pp. 88-102.

7. D. Rushall and M. Illgen, “DEPICT: Documents Evaluatedas Pictures,” Proc. IEEE Symp. Information Visualization1996 (InfoVis 96), IEEE CS Press, Los Alamitos, Calif.,1996, pp. 100-107.

8. T. Kohonen et al., “Self Organization of a Massive Text Doc-ument Collection,” Kohonen Maps, E. Oja and S. Kaski, eds.,Elsevier Science, Amsterdam, 1999, pp. 171-182.

9. G. Salton, Automated Text Processing: The Transformation,Analysis, and Retrieval of Information by Computer, Addi-son-Wesley, Reading, Mass., 1989.

10. X. Lin, “Visualization for the Document Space,” Proc. IEEEVisualization 92, IEEE CS Press, Los Alamitos, Calif., 1992,pp. 274-281.

André Skupin is an assistant pro-fessor of geography at the Universityof New Orleans. His research inter-ests include text document visualiza-tion, geographic visualization,cartographic animation and hyper-media, and cartographic generaliza-

tion. He received a Dipl.-Ing. degree in cartography fromthe Technical University Dresden, Germany (1992) and aPhD in geography from the State University of New Yorkat Buffalo (1998). He did graduate research with theNational Center for Geographic Information and Analy-sis (NCGIA) and has worked in the GIS industry in the US,Germany, and South Africa.

Readers may contact André Skupin at the Dept. of Geog-raphy, 261 Liberal Arts Bldg., Univ. of New Orleans, NewOrleans, LA 70148, email [email protected].

For further information on this or any other computingtopic, please visit our Digital Library at http://computer.org/publications/dlib.

Information Visualization

58 January/February 2002