

Available online at www.sciencedirect.com

Decision Support Systems 45 (2008) 697–713

www.elsevier.com/locate/dss

A stack-based prospective spatio-temporal data analysis approach☆

Wei Chang a,⁎, Daniel Zeng b, Hsinchun Chen b

a Katz Graduate School of Business, University of Pittsburgh, Pittsburgh, PA 15260, USA
b Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, USA

Available online 17 December 2007

Abstract

Spatio-temporal data analysis has recently gained considerable attention from both the research and practitioner communities because of the increasing availability of datasets with prominent spatial and temporal data elements. In this paper, we develop a new spatio-temporal data analysis approach aimed at discovering abnormal spatio-temporal clustering patterns. We also propose a quantitative evaluation framework and compare our approach against a widely used space–time scan statistic-based method under this framework. Our approach is based on a robust clustering engine using support vector machines and incorporates ideas from existing online surveillance methods to track incremental changes over time. Initial experimental results using both simulated and real-world datasets indicate that our approach is able to detect abnormal areas with irregular shapes more accurately than the space–time scan statistic-based method.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Space–time scan; Support vector machine; Algorithm design; Spatio-temporal surveillance method

☆ An earlier abridged version of this paper appeared in the Proceedings of the Fifteenth Annual Workshop on Information Technology and Systems (WITS'05).

⁎ Corresponding author. E-mail addresses: [email protected] (W. Chang), [email protected] (D. Zeng), [email protected] (H. Chen).

0167-9236/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2007.12.008

1. Introduction

Recent years have witnessed significant interest in spatio-temporal data analysis [13]. The main reason for this interest is the availability of datasets containing important spatial and temporal data elements across a wide spectrum of applications ranging from public health (disease case reports), public safety (crime case reports), search engines (search keyword geographical distributions over time), transportation systems (data from Global Positioning Systems (GPS)), to product lifecycle management (data generated by Radio Frequency Identification (RFID) devices), and financial fraud detection (financial transaction tracking data).

Consider public health and public safety as examples. In the U.S., records of infectious diseases are being tracked and reported regularly at both the state and federal levels [20]. For instance, the Centers for Disease Control and Prevention (CDC) has collected incident reports on 52 infectious disease types with spatial and temporal coordinates throughout the country since 1998. Recent significant health events such as the West Nile Virus and SARS outbreaks have motivated public health departments around the globe to use pre-diagnostic information to detect outbreaks at an early stage. The spatial and temporal information contained in the pre-diagnostic reports plays a pivotal role in such outbreak detection efforts. In public safety applications, most police departments track the location and time of each reported criminal activity and incident [12]. Often police officers use the spatial and temporal coordinates of these reports in conjunction with crime types to identify crime “hotspots” and then allocate resources (e.g., patrolling) accordingly.

In the literature, many different types of spatio-temporal data mining and knowledge discovery approaches have been proposed in the last decade [5]. These approaches are from a number of academic disciplines including biostatistics, environmental statistics, Geographic Information Systems (GIS), data mining/machine learning, and information visualization, among others. Roddick and Spiliopoulou [14] provide an extensive bibliography of this body of literature. Gunopulos [6] characterizes geospatial data as data containing distance and topological information and representing changes in the geolocation of objects over time. He further distinguishes two types of geospatial data analyses. In the first type, exploratory data mining, users query the data, strive to understand the overall picture (e.g., through visualizing the data), or test some specific hypotheses. The second type of analysis is about automatically accomplishing specific goals such as clustering, classification, outlier detection, spatial association rule learning, and trend detection.

Despite the significant advances made recently, many existing spatio-temporal data analysis approaches take a static view of the geospatial phenomena [21]. These approaches typically extract a set of data points with spatio-temporal information based on user-provided criteria (e.g., the location and time of interest) and perform various types of analyses. The important questions that are answered by such approaches are how to identify areas having exceptionally high or low measures and how to determine whether the unusual measures can be attributed to known random variations or are statistically significant. In many cases, however, this static or retrospective perspective is inadequate as data often arrive dynamically and continuously, and in many applications there is a critical need for detecting and analyzing emerging spatial patterns on an online basis. In the statistical literature, methods that aim to detect abnormal patterns in an online context are called prospective surveillance approaches. Such approaches monitor observations with spatial elements continuously and disseminate alert messages when anomalies are identified. Applications of prospective surveillance techniques range from manufacturing process monitoring, infectious disease outbreak surveillance, to event monitoring for homeland security.

In this paper, we focus on prospective spatio-temporal data analysis. The main intended contribution of this paper is to propose a new prospective method that can detect clusters with irregular shapes in a timely manner, and develop a quantitative evaluation framework to compare this new algorithm against a widely used technique based on the space–time scan statistic. The rest of the paper is organized as follows. Section 2 surveys two major types of prospective surveillance approaches. A quick review of retrospective analysis and univariate surveillance approaches that have provided a technical foundation for these prospective analysis methods is also provided. In Section 3, we introduce a new prospective analysis method based on a robust support vector machine (SVM)-based spatial clustering technique. This model is able to overcome some of the key computational problems faced by existing approaches. Section 4 reports on computational experiments based on simulated datasets. This experimental study includes a comparative component evaluating our approach against a popular space–time scan statistic-based approach and helps quantify the effectiveness of our approach. In Section 5, we summarize two case studies applying the proposed approach to two real-world datasets. Section 6 concludes the paper by summarizing our research contributions and discussing directions for future research.

2. Literature review

Retrospective spatio-temporal data analysis and univariate surveillance provide a modeling and algorithmic foundation for the development of prospective surveillance approaches. We briefly survey the existing research in these two areas in Sections 2.1 and 2.2 before discussing the representative prospective surveillance methods in Section 2.3.

2.1. Retrospective spatio-temporal data analysis

Retrospective approaches determine whether observations or measures are randomly distributed over space and time for a given region. Clusters of data points or measures that are unlikely under the random distribution assumption are reported as anomalies. From a modeling and operational standpoint, retrospective methods require the user to specify a “normal” background or baseline dataset, which is used to calculate the expected data distribution. This expected distribution is then compared against data points of interest to determine whether unusual clusters of events, relative to the baseline data, have emerged from the data points under investigation. A key difference between retrospective analysis and standard clustering lies in this concept of “baseline” data. For standard clustering, data points are grouped together directly based on the distances between them. Retrospective analysis, on the other hand, is not concerned with such clusters. Rather, it aims to find out whether unusual clusters formed by the data points of interest exist relative to the baseline data points. These baseline data points represent how the normal data should be spatially distributed given known factors or background information. Clusters identified in this relative sense provide clues about dynamic changes in spatial patterns and indicate the possible existence of unknown factors or emerging phenomena that may warrant further investigation. As such, retrospective analysis can be conceptualized as a spatial “before and after” comparison.

Below we discuss two major types of retrospective analysis methods: scan statistic-based and clustering-based. A comparative study of these two types of retrospective approaches can be found in [22].

2.1.1. Scan statistic-based retrospective analysis

Various types of scan statistics have been developed in the past four decades for surveillance and monitoring purposes in a wide range of application contexts. For spatio-temporal data analysis, a representative method is the spatial scan statistic approach developed by Kulldorff [10]. This method has become one of the most popular methods for detection of geographical disease clusters and is being widely used by public health departments and researchers. In this approach, the number of events, e.g., disease cases, may be assumed to be either Poisson or Bernoulli distributed. Algorithmically, the spatial scan statistic method imposes a circular window on the map under study and lets the center of the circle move over the area so that at different positions the window includes different sets of neighboring cases. While analyzing the data, the method creates a large number of distinct circular windows (other regular shapes such as rectangles and ellipses have also been used), each with a different set of neighboring areas. Each of these windows represents a possible candidate for containing an unusual cluster of events. A likelihood ratio is defined on each circle to measure how unlikely it is that the cases of interest fall into that circle by pure chance. The circle with the maximum likelihood ratio is in turn reported as a spatial anomaly or hotspot.
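The window enumeration described above can be sketched as follows. This is a minimal illustration rather than the implementation used in [10]: the function name `circular_windows`, the choice of centering circles on the data points themselves, and the `max_radius` cutoff are all assumptions made for the example. Each yielded window would then be scored with a likelihood ratio.

```python
import math

def circular_windows(points, max_radius):
    """Yield (center_index, radius, member_indices) for candidate circles.

    Assumption: circles are centered on the data points, and the radii tried
    are the distances from the center to the other points, so every distinct
    neighbor set around a center is produced exactly once.
    """
    for i, (cx, cy) in enumerate(points):
        # Sort all points by distance from the current center.
        dists = sorted(
            (math.hypot(x - cx, y - cy), j) for j, (x, y) in enumerate(points)
        )
        members = []
        for pos, (d, j) in enumerate(dists):
            if d > max_radius:
                break
            members.append(j)
            # Emit a window only once all points tied at distance d are inside.
            is_last = pos + 1 == len(dists)
            if is_last or dists[pos + 1][0] > d:
                yield i, d, tuple(members)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
windows = list(circular_windows(points, max_radius=3.0))
```

The number of candidate windows grows roughly quadratically with the number of points, which is one reason the search is restricted to simple shapes.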

2.1.2. Clustering-based retrospective analysis

Despite the success of the spatial scan statistic and its variations in spatial anomaly detection, one of the major computational problems faced by this type of method is that the scanning windows are limited to simple, fixed symmetrical shapes for analytical and search efficiency reasons. As a result, when the real underlying clusters do not conform to such shapes, the identified regions are often not well localized.

To overcome this major computational limitation, in our previous work we have developed an alternative and complementary modeling approach called Risk-adjusted Support Vector Clustering (RSVC) [23]. Recall that retrospective analysis differs from standard clustering in that clustering has to be performed in a relative sense considering the baseline data points. In RSVC, we applied the “risk adjustment” idea from a crime hotspot analysis approach called Risk-adjusted Nearest Neighbor Hierarchical Clustering (RNNH) [12] to incorporate baseline information in clustering. The basic intuition behind RSVC is as follows. A robust SVM-based clustering mechanism called Support Vector Clustering (SVC) [1], which supports detection of clusters with arbitrary shapes based on distances defined over pairs of data points, is used as the underlying clustering engine. Risk adjustment refers to the idea of adjusting distance measures proportionally to the estimated density of the baseline factor such that in areas with high baseline density, it is more difficult to group data points together as clusters since the distances between these data points have been adjusted upward. An outline of our RSVC approach is summarized in Section 3.3.

2.2. Univariate surveillance

Univariate surveillance methods monitor one-dimensional data streams (typically time series without spatial information) and focus on quick and accurate detection and response in cases where unusual events take place. These methods vary in the alarm functions used and the procedures followed to observe the time series. A comprehensive review of univariate surveillance approaches in the context of public health surveillance can be found in [20]. Three types of surveillance strategies are commonly used: (a) the cumulative sum (CUSUM) method monitoring the number of events in a fixed interval, (b) Chen's set method monitoring the time intervals between consecutive events [3], and (c) Frisen's approach monitoring the likelihood of an observed occurrence [4]. Among them, the CUSUM approach is the easiest to implement in practice since surveillance analysis is typically performed regularly and tracking the number of events between two consecutive surveillance runs can be easily done. For instance, the CUSUM approach has been extensively used in the industry to monitor the number of defective products in a manufacturing process for the purpose of quality control.

CUSUM operates by accumulating the deviations between the observations and expectations. Formally, assume that X is the variable that we are keeping track of and $X_t$ its value at time t. Denote by $z_t$ the normalized deviation of $X_t$, $z_t = (X_t - \mu)/\sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation. The accumulated deviation at time t, denoted by $S_t$, can then be given as

$$S_t = \max(S_{t-1} + z_t - k,\ 0), \qquad S_0 = 0 \qquad (1)$$

where k is the normal varying range of X. When the accumulated deviation $S_t$ is over some pre-defined threshold value, an alarm will be generated indicating an increase in the mean of the underlying variable of interest X. From Eq. (1), we note that $S_t$ will be reset to 0 when the time series comes back to the normal status (i.e., the accumulated deviation is less than the normal varying range).
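As a concrete illustration of Eq. (1), the following sketch runs a one-sided CUSUM over a short series of counts. The function name `cusum`, the sample counts, and the parameter values (mean 10, standard deviation 2, k = 0.5, threshold 4) are assumptions chosen for the example and do not come from the paper.

```python
def cusum(observations, mean, std, k, threshold):
    """Return the S_t path and the index of the first alarm (or None).

    Implements S_t = max(S_{t-1} + z_t - k, 0) with S_0 = 0, where
    z_t = (X_t - mean) / std is the normalized deviation.
    """
    s = 0.0
    path = []
    first_alarm = None
    for t, x in enumerate(observations):
        z = (x - mean) / std
        s = max(s + z - k, 0.0)  # resets to 0 when the series is back to normal
        path.append(s)
        if first_alarm is None and s > threshold:
            first_alarm = t
    return path, first_alarm

# Hypothetical weekly counts: stable around 10, then drifting upward.
counts = [10, 11, 9, 10, 14, 15, 16, 15]
path, alarm = cusum(counts, mean=10.0, std=2.0, k=0.5, threshold=4.0)
print(path)   # [0.0, 0.0, 0.0, 0.0, 1.5, 3.5, 6.0, 8.0]
print(alarm)  # 6
```

No single observation in the drifting part is extreme on its own; the alarm comes from the accumulation of moderate deviations.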

Several evaluation metrics have been developed to measure the performance of univariate surveillance methods [20]. Among them, ARL0 and ARL1 are two conjugated and widely accepted measures. ARL0 is the average run length until the first alarm is triggered under the null hypothesis and ARL1 is the average run length until the first alarm is triggered under the alternative hypothesis. In other words, ARL0 estimates how fast a surveillance algorithm might trigger a false alarm and ARL1 estimates how fast an algorithm can detect an anomaly if it does occur. ARL0 and ARL1 are conceptually similar to the type 1 and type 2 errors from statistical hypothesis testing. In addition to statistical power evaluation, a common evaluation method used in the surveillance literature is to fix ARL0 and compare ARL1 for different approaches. This evaluation amounts to measuring how fast a surveillance approach can detect a true abnormal event given a fixed level of false alarm rate. Another less frequently-used measure is expected delay, which calculates the expected time between the time an anomaly occurs and the time an alarm is triggered. This measure is suitable in cases where the distribution of the anomaly occurring time is known and the time needed to trigger an alert after an anomaly occurs depends on the occurring time of the anomaly.
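ARL0 and ARL1 can be estimated by simulation. The sketch below does so for the CUSUM scheme of Eq. (1), assuming the normalized deviations are standard normal under the null hypothesis and shifted by one standard deviation under the alternative. The function names, the shift size, and the values of k and the threshold are illustrative assumptions.

```python
import random

def run_length(rng, shift, k, threshold, max_steps=100_000):
    """Number of steps until the CUSUM statistic first exceeds the threshold."""
    s = 0.0
    for t in range(1, max_steps + 1):
        z = rng.gauss(shift, 1.0)  # normalized deviation at step t
        s = max(s + z - k, 0.0)
        if s > threshold:
            return t
    return max_steps  # censored run; rare for the settings used here

def average_run_length(shift, k=0.5, threshold=4.0, runs=500, seed=0):
    """Monte Carlo estimate of the average run length for a given mean shift."""
    rng = random.Random(seed)
    total = sum(run_length(rng, shift, k, threshold) for _ in range(runs))
    return total / runs

arl0 = average_run_length(shift=0.0)  # no change: time to a false alarm
arl1 = average_run_length(shift=1.0)  # mean shifted: time to detection
print(f"ARL0 ~ {arl0:.1f}, ARL1 ~ {arl1:.1f}")
```

Raising the threshold lengthens both quantities, which is why comparisons are usually made at a fixed ARL0.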

2.3. Prospective spatio-temporal surveillance

In the public health domain, the threats of bioterrorism, catastrophic natural disasters, and major infectious disease outbreaks have recently generated great interest in developing and deploying prospective spatio-temporal surveillance systems for timely event detection and preemptive reactions. A major advantage that prospective approaches have over retrospective approaches is that they do not require the separation between the baseline cases and cases of interest in the input data. Such a requirement is necessary in retrospective analysis and is a major source of confusion and difficulty for end users. Prospective methods bypass this problem and process data points continuously in an online context.

Two types of prospective spatio-temporal data analysis approaches have been developed in the statistics literature. The first type segments the surveillance data into chunks by arrival time and then applies a spatial clustering algorithm to identify abnormal changes. In essence, this type of approach reduces a spatio-temporal surveillance problem into a series of spatial surveillance problems. The second type explicitly considers the temporal dimension and clusters data points directly based on both spatial and temporal coordinates. We briefly summarize representative approaches for both types of methods including Rogerson's method and the space–time scan statistic.

2.3.1. Rogerson's methods

Rogerson has developed CUSUM-based surveillance methods to monitor spatial statistics such as Tango and Knox statistics, which capture spatial distribution patterns existing in the surveillance data [15,16]. Let $C_t$ be the spatial statistic (e.g., Tango or Knox) at time t. The surveillance variable is defined as

$$Z_t = \frac{C_t - E(C_t \mid C_{t-1})}{\sigma(C_t \mid C_{t-1})}$$

Refer to [15,16] for the derivation of the conditional expected value $E(C_t \mid C_{t-1})$ and the corresponding conditional standard deviation $\sigma(C_t \mid C_{t-1})$. Following the CUSUM surveillance approach as shown in Eq. (1), when the accumulated deviation of $Z_t$ exceeds a threshold value, the system will trigger an alarm. Rogerson's methods have successfully detected the onset of Burkitt's lymphoma in Uganda during 1961 to 1975. Variations and other applications of Rogerson's approaches can be found in [17–19].

2.3.2. Space–time scan statistic

Kulldorff has extended his retrospective 2-dimensional spatial scan statistic to a 3-dimensional space–time scan statistic, which can be used as a prospective analysis method [11]. The basic intuition is as follows. Instead of using a moving circle to search the area of interest, one can use a cylindrical window in three dimensions. The base of the cylinder represents space, exactly as with the spatial scan statistic, whereas the height of the cylinder represents time. For each possible circle location and size, the algorithm considers every possible starting and ending time. The likelihood ratio test statistic for each cylinder is constructed in the same way as for the spatial scan statistic. After a computationally-intensive search process, the algorithm can tell the user where the abnormal cluster is with the corresponding geolocations and time period. The space–time scan statistic has successfully detected an increased rate of male thyroid cancer in Los Alamos, New Mexico during 1989–1992 [11].

3. Prospective Support Vector Clustering (PSVC)

3.1. Technical motivation

Although well-grounded in theoretical development, both Rogerson's methods and the space–time scan statistic have major computational problems. Rogerson's approaches can monitor a given target area but they cannot search for problematic areas or identify the geographic shapes of these areas. The space–time scan statistic method performs poorly when the true abnormal areas do not conform to simple shapes such as circles. Furthermore, the 3-dimensional search needed by the scan statistic method is time consuming. The lack of a prospective method that is able to detect unusual geographical regions with arbitrary shapes in a timely manner provides direct motivation for our technical research on prospective spatio-temporal data analysis. In Section 3.2, we introduce the basic ideas behind our approach, which is called Prospective Support Vector Clustering (PSVC). In Section 3.3, we present in detail the PSVC algorithm. Another intended contribution of our research is to develop a unified quantitative framework to evaluate and compare different prospective spatio-temporal surveillance approaches. In Section 4, we report this framework and an experimental study using simulated datasets to compare PSVC and the space–time scan statistic.

3.2. Algorithm design

Our PSVC approach follows the design of the first type of spatio-temporal surveillance method discussed in Section 2.3, which involves repeated spatial clusterings over time. More specifically, the time horizon is first discretized based on the specific characteristics of the data stream under study. Whenever a new batch of data arrives, PSVC treats the data collected during the previous time frame as the baseline data and runs the retrospective method, RSVC, a robust analysis method developed in our prior research [23], to identify abnormal clusters.

After obtaining a potential abnormal area, PSVC tries to determine how statistically significant the identified spatial anomaly is. Many indexes have been developed to assess the significance of the results of clustering algorithms in general. Halkidi et al. [7,8] summarize and categorize these indexes into three categories: (a) external criteria which evaluate the results of a clustering algorithm based on a subjective, pre-specified partition structure, (b) internal criteria which describe how data tend to group together using the dataset itself alone, and (c) relative criteria which compare different clustering results from the same algorithm but with different parameter values. However, all these criteria assess clustering in an absolute sense without considering baseline information. Thus they are not readily suitable for prospective spatio-temporal data analysis.

Kulldorff's likelihood ratio [10] L(Z) as defined in Eq. (2) is to our best knowledge the only statistic that explicitly takes baseline information into account.

$$L(Z) = \left(\frac{c}{n}\right)^{c} \left(1 - \frac{c}{n}\right)^{n-c} \left(\frac{C-c}{N-n}\right)^{C-c} \left(1 - \frac{C-c}{N-n}\right)^{(N-n)-(C-c)} \qquad (2)$$

In this definition, C and c are the number of cases in the entire dataset and the number of cases within the scanned area Z, respectively. N and n are the total number of cases and baseline points in the entire dataset and the total number of the cases and the baseline points within Z, respectively. Since the distribution of the statistic L(Z) is unknown, the Monte Carlo simulation approach is an alternative to calculate statistical significance measured by the p-value. In this context, the distribution used in Monte Carlo simulation is the distribution under the null hypothesis. Since our null hypothesis is that points are spatially randomly distributed, the coordinate pair (x, y) of each simulated data point is subject to a jointly uniform distribution. In our experiment, supports for x and y coordinates are both set to [0, 20]. Using this method, we first generate T replications of the dataset. We then calculate the likelihood ratio L(Z) on the same area Z for each replication. Finally, we rank these likelihood ratios and if L takes the X'th position, the p-value is set to X / (T+1).
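The significance test just described can be sketched in a few lines. The code below evaluates the logarithm of Eq. (2) and ranks the observed value against T simulated replications. It is a sketch under stated assumptions, not the authors' implementation: the region Z is passed as a hypothetical predicate `in_region(x, y)`, each replication places all N points (the first C of them playing the role of cases) uniformly on [0, 20] × [0, 20], and the observed counts in the usage example are invented.

```python
import math
import random

def _term(k, total):
    # k * log(k / total), using the convention 0 * log(0) = 0.
    return k * math.log(k / total) if k > 0 else 0.0

def log_likelihood_ratio(c, n, C, N):
    """Logarithm of L(Z) in Eq. (2)."""
    return (
        _term(c, n) + _term(n - c, n)
        + _term(C - c, N - n) + _term((N - n) - (C - c), N - n)
    )

def monte_carlo_p_value(observed, in_region, C, N, T=199, side=20.0, seed=0):
    """Rank the observed statistic among T null replications: p = X / (T + 1)."""
    rng = random.Random(seed)
    rank = 1  # position X of the observed value, counting from the largest
    for _ in range(T):
        n = c = 0
        for idx in range(N):
            if in_region(rng.uniform(0.0, side), rng.uniform(0.0, side)):
                n += 1
                c += idx < C  # the first C simulated points are treated as cases
        if log_likelihood_ratio(c, n, C, N) >= observed:
            rank += 1
    return rank / (T + 1)

# Hypothetical example: Z is a circle of radius 2 centered at (10, 10).
def in_circle(x, y):
    return math.hypot(x - 10.0, y - 10.0) <= 2.0

observed = log_likelihood_ratio(c=12, n=15, C=100, N=400)
p = monte_carlo_p_value(observed, in_circle, C=100, N=400)
```

Working with logarithms avoids numerical underflow, and because the logarithm is monotone the ranking is the same as for L(Z) itself.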

Note that in a straightforward implementation of the above algorithmic design, alerts are triggered only when adjacent data batches have significant changes in terms of data spatial distribution. This localized myopic view, however, may lead to significant delay in alarm triggering or even false negatives because in some circumstances, unusual changes may manifest gradually. In such cases, there might not be any significant changes between adjacent data batches. However, the accumulated changes over several consecutive batches can be significant and should trigger an alarm. This observation suggests that a more “global” perspective beyond comparing adjacent data batches is needed.

It turns out that the CUSUM approach provides a suitable conceptual framework to help design a computational approach with such a global perspective. The analogy is as follows. In the CUSUM approach, accumulative deviations from the expected value are explicitly kept track of. In prospective analysis, it is difficult to design a single one-dimensional statistic to capture what the normal spatial distribution should look like and to measure the extent to which deviations occur. However, conceptually the output of a retrospective surveillance method such as RSVC can be viewed as the differences or discrepancies between two data batches, with the baseline data representing the expected data distribution. In addition, accumulative discrepancies can be computed by running RSVC with properly-set baseline and case data separation. For an efficient implementation, we use a stack as a control data structure to keep track of RSVC runs which now include comparisons beyond data from adjacent single periods. The detailed control strategy is described below.

When clusters generated in two consecutive RSVC runs have overlaps, we deem that the areas covered by these clusters are risky areas. We use the stack to store the clusters along with the data batches from which these risky clusters are identified. Then we run RSVC to compare the current data batch with each element (in the form of a data batch) of the stack sequentially from the top to the bottom to examine whether significant spatial pattern changes have occurred. The objective of the stack is similar to that of variable S in Eq. (1) as part of the CUSUM approach, which accumulates the deviation Z between the observed value and expected value of the monitored variable. Stacks whose top data batch is not the current data batch under examination can be emptied since the risky areas represented by them are no longer “alive.” This operation resembles one of the steps in the CUSUM calculation where the accumulated deviation is reset to 0 when the monitored variable is no longer within the risky range.

3.3. PSVC algorithm

Since the RSVC algorithm provides a critical computational component that is used in several steps of the PSVC algorithm, we first briefly describe the RSVC algorithm itself [23]. First, using only the baseline points, a density map is constructed using standard approaches such as the kernel density estimation method. Second, the case data points are mapped implicitly to a high-dimensional feature space defined by a kernel function (typically the Gaussian kernel). The width parameter in the Gaussian kernel function is dynamically adjusted based on the kernel density estimates obtained in the previous step. The basic intuition is as follows. When the baseline density is high, a larger width value is used to make it harder for points to be clustered together. Third, following the SVM approach, RSVC finds a hypersphere in the feature space with a minimal radius to contain most of the data. The problem of finding this hypersphere can be formulated as a quadratic or linear program depending on the distance function used. Fourth, the function estimating the support of the underlying data distribution is then constructed using the kernel function and the parameters learned in the third step. When projected back to the original data space, the identified hypersphere is mapped to (possibly multiple) clusters. These clusters are then returned as the output of RSVC.
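The first two steps can be illustrated with a small sketch. The density estimate below is a standard two-dimensional Gaussian kernel density estimator. The adjustment rule in `adjusted_widths`, which scales a base width `q0` by the baseline density relative to its mean over the case points, is a hypothetical stand-in chosen only to show the direction of the adjustment; the exact rule used by RSVC is given in [23] and is not reproduced here.

```python
import math

def gaussian_kde(baseline, bandwidth):
    """Return a function estimating the baseline density at (x, y)."""
    norm = 1.0 / (len(baseline) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        return norm * sum(
            math.exp(-((x - bx) ** 2 + (y - by) ** 2) / (2.0 * bandwidth ** 2))
            for bx, by in baseline
        )

    return density

def adjusted_widths(cases, density, q0):
    """HYPOTHETICAL rule: width grows with the baseline density at each case."""
    d = [density(x, y) for x, y in cases]
    mean_d = sum(d) / len(d)  # assumes at least one case has nonzero density
    return [q0 * di / mean_d for di in d]

# Baseline points concentrated near the origin; one case there, one far away.
baseline = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, -0.5)]
cases = [(0.0, 0.0), (5.0, 5.0)]
widths = adjusted_widths(cases, gaussian_kde(baseline, bandwidth=1.0), q0=1.0)
```

The case located in the dense baseline area receives the larger width, so it is harder for it to be grouped into a cluster, as the text describes.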

We now explain the main steps of the PSVC algorithm as shown in Fig. 1. Each cluster stack represents a candidate abnormal area and the array clusterstacks holds a number of cluster stacks keeping track of all candidate areas at stake. Initially (line 1) clusterstacks is empty. The steps from lines 3 to 35 are run whenever a new data batch enters the system. First, the RSVC retrospective method is executed (line 3) to compare the spatial distribution of the new data batch with that of the previous data batch. The resulting abnormal clusters are saved in rsvcresult. Any statistically significant cluster in rsvcresult will immediately trigger the alert (line 5).

Emerging candidate areas that are not yet statistically significant are kept in clusterstacks. Lines 7 to 32 of the PSVC algorithm describe the operations to be performed on each of these candidate clusters C. If no cluster stack exists, we simply create a new cluster stack which contains only C as its member (line 30), and update the array clusterstacks accordingly (line 31). If cluster stacks already exist, for each of these cluster stacks S, we determine whether the current cluster C overlaps with the most recent cluster (the top element) in S (line 10). If the current cluster C does overlap with an existing candidate area, further investigation beyond the comparison between adjacent data batches is warranted.

The operations described from lines 11 to 15 implement these further investigative steps. First, cluster C is added onto stack S (line 11). Then the current data batch is compared against all remaining data batches in S in turn from the top to the bottom. Should any significant spatial distribution change be detected, an alert will be triggered (lines 13 to 15).


Fig. 1. PSVC algorithm.


If cluster C does not overlap with the most recent cluster in any of the existing cluster stacks, a new cluster stack is created with C as its only element and the array clusterstacks is updated accordingly (lines 22 and 23). After processing the candidate cluster C, we remove all inactive cluster stacks whose top clusters were not generated at the present time (equal to the creation time of C) (line 26). Note that two stacks may have the same top element. However, because the accumulated deviation of spatial distribution stored in these two stacks might be different, and this accumulated deviation may provide valuable information for deciding whether to trigger an alert, we do not merge these two stacks.
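The stack bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm of Fig. 1: clusters are reduced to circles, `compare` is a hypothetical stand-in for RSVC plus the significance test, and two details are assumptions made here for simplicity, namely that a stack is extended at most once per batch and that inactive stacks are pruned once per batch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cluster:
    x: float                   # centre of a circular stand-in for the cluster region
    y: float
    r: float
    significant: bool = False
    time: int = -1             # index of the batch that produced this cluster


def overlaps(a: Cluster, b: Cluster) -> bool:
    """Two circles overlap when the centre distance is below the sum of radii."""
    return (a.x - b.x) ** 2 + (a.y - b.y) ** 2 < (a.r + b.r) ** 2


def psvc_step(stacks: List[List[Cluster]], batches: list, now: int,
              compare: Callable[[object, object], List[Cluster]]) -> List[Cluster]:
    """Process batch `now` and return the clusters that trigger an alert.

    `stacks` is mutated in place; each stack is a list with its top at the end.
    `compare(new_batch, old_batch)` is a hypothetical stand-in for RSVC.
    """
    alerts: List[Cluster] = []
    if now == 0:
        return alerts                        # no previous batch to compare with
    for c in compare(batches[now], batches[now - 1]):
        c.time = now
        if c.significant:                    # significant at once: alert
            alerts.append(c)
            continue
        matched = alerted = False
        for s in stacks:
            # Assumption: skip stacks already extended by this batch.
            if s[-1].time == now or not overlaps(c, s[-1]):
                continue
            matched = True
            s.append(c)                      # push C onto the overlapping stack
            for old in reversed(s[:-1]):     # walk older entries, top to bottom
                if old.time == now - 1:
                    continue                 # adjacent batch was compared above
                found = compare(batches[now], batches[old.time])
                if any(f.significant for f in found):
                    if not alerted:
                        alerts.append(c)
                        alerted = True
                    break
        if not matched:
            stacks.append([c])               # C starts a new candidate area
    # Drop inactive stacks: those whose top was not produced by this batch.
    stacks[:] = [s for s in stacks if s[-1].time == now]
    return alerts
```

Calling `psvc_step` once per incoming batch, with the same `stacks` list each time, reproduces the accumulate-then-alert behaviour: a slowly drifting candidate area can stay below significance against the adjacent batch yet trigger an alert when compared against an older batch deeper in its stack.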

4. An experimental study

This section reports an experimental study designed to evaluate the proposed PSVC method quantitatively and compare its performance with that of existing prospective analysis methods represented by the space–time scan statistic. We first discuss the evaluation measures used in our study in Section 4.1 and then present research hypotheses to be examined in Section 4.2. In Section 4.3 we describe the simulated datasets used in our experiments and briefly present various components of a research prototype that has provided a testbed for the reported experiments. Experimental findings are reported in Section 4.4.

4.1. Evaluation measures

To evaluate a prospective spatio-temporal data analysis method, we need to consider both spatial and temporal evaluation measures. From a spatial perspective, the goal is to evaluate how accurate the detected clusters are geographically, relative to the location of the true clusters. To this end, we adopt well-known information retrieval measures, namely precision, recall, and F-measure, which have been shown to be meaningful and informative in the retrospective analysis context [23]. Specifically, let A denote the size of the abnormal area identified by a given algorithm, B the size of the true abnormal area, and C the size of the overlapped area between the abnormal area identified by the algorithm and the true abnormal area. Precision is defined as C/A. Recall is defined as C/B. F-measure is defined as the harmonic mean of precision and recall (2 × Precision × Recall / (Precision + Recall)). Observe that high recall indicates low false negatives and that high precision indicates low false positives.
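As a small illustration of these definitions, the sketch below computes the three spatial measures for regions represented as sets of grid cells. The grid discretisation and the helper names `spatial_scores` and `disc_cells` are assumptions introduced here for illustration; the paper does not prescribe how the areas A, B, and C are measured.

```python
def spatial_scores(detected: set, true: set):
    """Return (precision, recall, F-measure) for two regions given as sets of grid cells."""
    a = len(detected)             # A: size of the detected abnormal area
    b = len(true)                 # B: size of the true abnormal area
    c = len(detected & true)      # C: size of their overlap
    precision = c / a if a else 0.0
    recall = c / b if b else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def disc_cells(cx: float, cy: float, r: float, size: int = 20) -> set:
    """Unit grid cells on a size x size map whose centres lie within a disc."""
    return {(i, j) for i in range(size) for j in range(size)
            if (i + 0.5 - cx) ** 2 + (j + 0.5 - cy) ** 2 <= r ** 2}
```

For example, `spatial_scores(disc_cells(10, 10, 4), disc_cells(11, 10, 3))` compares a detected disc against a smaller, shifted true disc on the [0, 20] × [0, 20] map used in the simulations.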

To gain further insights about the performance characteristics of the algorithm, we also use the ROC curve to show the performance tradeoffs between sensitivity and specificity. ROC curves have been widely used to evaluate various types of machine learning and data mining approaches. These curves are based on the general observation that the parameters of an algorithm can be tuned such that the algorithm achieves higher precision at the expense of lowered recall or the other way around. Sensitivity, plotted on the y-axis of the ROC curve, is the same as recall. 1-specificity, plotted on the x-axis of the ROC curve, is the fraction of false positives over all the negatives (false positives and true negatives), which indicates to what extent the algorithm makes wrong predictions.
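One point on such a curve can be computed as in the sketch below, again under the assumption (made here, not in the paper) that the map is discretised into grid cells so that positives and negatives can be counted. The function name `roc_point` is hypothetical.

```python
def roc_point(detected: set, true: set, universe: set):
    """Return (1 - specificity, sensitivity) for one parameter setting.

    All three arguments are sets of grid cells; `universe` is the whole map.
    """
    tp = len(detected & true)              # abnormal cells that were flagged
    fn = len(true - detected)              # abnormal cells that were missed
    fp = len(detected - true)              # normal cells flagged by mistake
    tn = len(universe - detected - true)   # normal cells left alone
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return false_positive_rate, sensitivity
```

Sweeping a control parameter of the method, recording one such point per setting, and sorting the points by their first coordinate yields the curve.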

As for the temporal evaluation measures, we introduced ARL0 and ARL1 as two widely used ones in univariate surveillance. ARL1 reveals how timely an algorithm can detect an anomaly and ARL0 how easily an algorithm tends to trigger a false alarm. In our study, we adopt the ARL1 measure and rename it to “Alarm Delay,” which is defined as the delay between the time an anomaly occurs and the time the algorithm triggers the corresponding alert. Using ARL0 can be difficult in practice as it would require the system to run for a long time under the normal condition to collect false alarm data. As an alternative, we have followed the following performance data collection procedure: We apply the prospective analysis method under study to a simulated data stream for a relatively long period of time. This data stream contains some anomalies generated according to known patterns. When a suspicious area reported by the method does not overlap with the true abnormal area (i.e., both precision and recall are 0) or the report date is earlier than the actual date of the abnormal occurrence, we consider it as a false alarm. In some cases, the system fails to trigger any alarms during the entire monitoring period. We count how many times an algorithm triggers false alarms and how many times it fails to detect the true anomalies as surrogate measures for ARL0. Note that, in real-world practice, “false alarm” may have different meanings for different experts. When reporting the experimental findings based on simulated datasets, we use the definition above to evaluate our approach. For the experimental study involving real-world datasets, however, we will not report any results under the false alarm measure.

Fig. 2. Components of the testbed.

4.2. Research hypotheses

We have chosen the space–time scan statistic as the benchmark method since it has been widely tested and deployed, especially in public health applications, and its implementation is freely available through a software package called SaTScan. Both the spatial scan statistic (for retrospective surveillance purposes) and the space–time scan statistic (for prospective surveillance purposes) are available in the SaTScan package. In our experiments, we use the space–time scan statistic implementation exclusively. To simplify the exposition, in Sections 4 and 5, we use SaTScan to denote the space–time scan statistic method. SaTScan aims to identify the area with the maximum value of the likelihood function L(z) as defined in Eq. (2). For an area with elevated risk, the portion of the cases falling into the area should be higher than the portion of general points (cases and baseline points) falling into it. Mathematically, for a risky area, c/C > n/N. It is easy to prove that if c/C > n/N, the first derivative of L(z) with respect to c is positive. Hence, given the same quantities of the baseline points, the scanning cylinder of SaTScan tends to include more positive cases to result in a higher likelihood value. In other words, SaTScan inherently tends to reach a higher level of recall at the expense of lowered precision. This, plus the shape limitation of the scanning window, may impede its clustering performance.

Fig. 3. A problem instance of the “emerging” scenario.

From our previous work on spatio-temporal data analysis [23], we found that our RSVC approach can achieve a higher F-measure than the 2-dimensional spatial scan statistic. With respect to spatial evaluation measures, we expect that PSVC is able to outperform the 3-dimensional space–time scan statistic similarly.

In terms of temporal measures, it is unclear which method will be able to detect the anomaly in a more timely manner. Our pilot experiments [2] seem to indicate that in most circumstances, PSVC and SaTScan can detect the anomaly at about the same time. For consistency with the other hypotheses, we include in our study the alternative hypothesis stating that there is a difference between these two methods in alarm delay.
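The temporal measures from Section 4.1 that this comparison relies on can be made concrete with a short sketch. It assumes, for illustration only, that each alert is summarised as a report day plus a flag saying whether the reported area overlaps the true abnormal area; the function name and the dictionary keys are hypothetical.

```python
from typing import List, Optional, Tuple


def temporal_scores(alerts: List[Tuple[int, bool]], anomaly_start: int) -> dict:
    """Summarise one monitoring run.

    alerts: (report_day, overlaps_true_area) pairs, in any order.
    An alert is a false alarm if it does not overlap the true area or if it
    is reported before the anomaly actually starts.
    """
    false_alarms = 0
    first_hit: Optional[int] = None
    for day, overlaps_true_area in sorted(alerts):
        if not overlaps_true_area or day < anomaly_start:
            false_alarms += 1
        elif first_hit is None:
            first_hit = day                  # earliest valid detection
    return {
        "false_alarms": false_alarms,
        "alarm_delay": None if first_hit is None else first_hit - anomaly_start,
        "failed_to_detect": first_hit is None,
    }
```

Averaging `alarm_delay` over the problem instances in which the anomaly was detected, and summing the other two entries, gives the kind of per-scenario summary reported in the tables of Section 4.4.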

Fig. 4. Snapshots of an “emerging” scenario problem instance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
Average performance of SaTScan and PSVC over 30 “emerging” scenario instances

          Precision  Recall  F-measure  Alarm delay  False alarm  Fail to detect  Computing time
          (%)        (%)     (%)        (days)       (times)      (times)         (seconds)
SaTScan   66.2       83.6    69.5       5.4          5            2               607
PSVC      88.5       55.2    64.8       6.0          0            2               95

The SaTScan method relies on an exhaustive search every time it is invoked to process the data stream. PSVC, on the other hand, compares each incoming data batch with only the most recent data batch and possibly with a carefully controlled, selected set of historic data batches. Computationally, in the SaTScan searching process, a set of three points can determine a candidate cylindrical cluster upon which a statistical test can be performed. Among these three points, the first is the center of the base circle; the second helps determine the radius of the base circle (given as the distance between the first point and second point); and the third point identifies the height of the candidate cylinder (given as the distance between the third point and the base determined by the first and second points). Given N input data points, SaTScan needs to perform statistical tests a number of times on the order of N^3 to search for the most significant unusual clusters. In contrast, the major step in our PSVC algorithm is the kernel computation, which involves calculating the distance between every pair of data points. This O(N^3) complexity of SaTScan and O(N^2) complexity of PSVC lead us to form the hypothesis stating that PSVC runs faster than SaTScan.
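The two growth rates can be illustrated with a few lines of Python. The sketch only counts work items; it does not implement either method. Counting ordered triples of distinct points is an assumption made here to mirror the three roles (center, radius point, height point) described above.

```python
from itertools import combinations
from math import dist


def cylinder_candidates(n: int) -> int:
    """Ordered triples (centre, radius point, height point) of distinct points."""
    return n * (n - 1) * (n - 2)


def pairwise_distances(points):
    """Distance for every unordered pair of points: N(N-1)/2 entries."""
    return {(i, j): dist(p, q)
            for (i, p), (j, q) in combinations(enumerate(points), 2)}
```

With N = 1000 input points, `cylinder_candidates(1000)` is close to 10^9 candidate cylinders, while the pairwise distance table has fewer than 5 × 10^5 entries. As the running-time discussion in Section 4.4.5 shows, such counts alone do not determine the observed running times.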

We formalize the above expectations in five hypotheses to be examined in our experimental study.

H1. PSVC achieves higher precision than SaTScan.

H2. SaTScan achieves higher recall than PSVC.

H3. PSVC outperforms SaTScan in overall spatial performance quantified by the F-measure.

H4. There is a significant difference between PSVC and SaTScan in alarm delay.

H5. PSVC runs faster than SaTScan.

Table 2
Results of hypothesis testing for the “emerging” scenario

Hypotheses                                  p-value   Results
H1: PSVC achieves higher precision.         <0.001    Accept
H2: SaTScan achieves higher recall.         <0.001    Accept
H3: PSVC achieves higher F-measure          0.126     Reject
H4: There is a difference in alarm delay    0.416     Reject
H5: PSVC runs faster than SaTScan           0.105     Reject

4.3. Datasets and a research testbed

We have conducted a series of computational studies to evaluate PSVC and SaTScan based on the performance measures discussed in Section 4.1. In order to compute precise measures and compare these two methods' performance quantitatively, we need to know the precise location and the effective time period of the true abnormal clusters. To this end, we have used simulated datasets with the generation of the true clusters fully under our control (of course, neither of the approaches under study has access to such information about these true clusters). In Section 5, we report another study using two real-world datasets to compare PSVC and SaTScan qualitatively. In that case, quantitative performance measures cannot be computed as the true clusters are unknown. But the computational insights gained and the related anecdotal evidence from the domain experts in these two case studies complement the more quantitative lessons learned using controlled experiments.

Three sets of simulated datasets have been used in our evaluation study. They represent three common scenarios in spatio-temporal data analysis. For ease of exposition, throughout this section, we use the public health application context to illustrate these scenarios, which include the “emerging” scenario, the “expanding” scenario, and the “moving” scenario. The emerging scenario corresponds to disease outbreaks that start from some location where very few disease incidents occurred before. In the expanding scenario, the disease cases are first concentrated on a particular infected area and then spread to the neighboring area. The moving scenario captures the movement of the infected area along certain directions possibly due to some environmental factors, such as river and wind. For each of these three scenarios, we created 30 problem instances by randomly changing the size, location, starting date, and the speed of expansion of these simulated abnormal clusters.

A prototype system has been developed as a testbed to conduct the evaluation study. The major system components of this testbed are shown in Fig. 2. The “Data Generator” module generates simulated datasets for these “emerging”, “expanding”, and “moving” scenarios. Each data record includes geolocation (x, y) coordinates and a time stamp. The “Data Splitter” module divides each complete problem/scenario instance into a number of data batches that are then fed to the prospective data analysis module sequentially according to the data records' time stamps. The SaTScan implementation of the space–time scan statistic method is used in our study. All the modules related to PSVC are home-grown, including the RSVC engine [23] and the “Cluster Evaluator,” which identifies statistically significant clusters using Kulldorff's likelihood ratio and the Monte Carlo technique as discussed in Section 3.2. Finally, the experimental results including the detected abnormal clusters and the detection times are recorded and can be visualized.
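The role of the “Data Splitter” can be illustrated with a minimal sketch. The fixed-length time window, the `origin` parameter, and the function name are assumptions introduced here; the actual module in the testbed may batch records differently.

```python
from typing import List, Tuple

Record = Tuple[float, float, float]    # (x, y, time stamp in days)


def split_into_batches(records: List[Record], interval: float,
                       origin: float = 0.0) -> List[List[Record]]:
    """Group records into consecutive time windows of length `interval`.

    Batch k holds the records with origin + k*interval <= t < origin + (k+1)*interval.
    Empty windows are kept so that the batch index still reflects elapsed time.
    Assumes every time stamp is at or after `origin`.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda rec: rec[2])
    last_index = int((ordered[-1][2] - origin) // interval)
    batches: List[List[Record]] = [[] for _ in range(last_index + 1)]
    for rec in ordered:
        batches[int((rec[2] - origin) // interval)].append(rec)
    return batches
```

With `interval=7`, this corresponds to the weekly prospective analysis used for the emerging scenario; the resulting list can be fed to the analysis module one batch at a time.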

4.4. Experimental findings

In this section we first describe, for each of the three scenarios, the detailed data generation and experimental procedures and summarize the related experimental findings as to spatial and temporal evaluation measures. We then report additional experimental results concerning the tradeoff between specificity and sensitivity using ROC curves for both methods under study. Towards the end, we summarize our experimental findings and report on the running time of both methods. As a convention, we use x and y axes to represent the spatial coordinates and the z-axis as the time.

Fig. 5. A problem instance of the “expanding” scenario.

4.4.1. The emerging scenario

For all “emerging” scenario problem instances, both x and y axes have the support of [0, 20]. The range for time is from 0 to 50 days. We first generated 300 data points in this 3-dimensional space ([0, 20] × [0, 20] × [0, 50]) as the background. We then generated another 300 data points inside a cylinder whose bottom circle is centered at (xl, yl) with radius rl. The height of this cylinder is set to 50, covering the entire time range. This cylinder is designed to test whether a prospective spatio-temporal data analysis method might identify the pure spatial cluster by mistake.

Consider the dense cone-shaped area in the left sub-figure of Fig. 3. An abnormal circular cluster which is centered at (xr, yr) emerges on some date startT. This circle starts with radius startR and continuously expands until the radius reaches endR on the last day, day 50. In contrast to the cylinder to the left, which has roughly the same number of data points every day, the cone-shaped area represents an emerging phenomenon. To approximate exponential expansion, we let the number of points inside the cone-shaped area on any given day follow the expression:

$$a \cdot (\mathrm{current\_date} - \mathrm{start\_date} + 1)^{\mathrm{increaserate}}$$

where a is the number of points inside the area on the anomaly starting date and increaserate indicates how fast an outbreak expands. Fig. 4 shows three snapshots of an emerging scenario problem instance projected to the spatial map at three different times. The red crosses represent the new data batch for the current time frame during which the analysis is being conducted. The blue stars represent the data points from the last time frame. As shown in these snapshots, until day 22 there is no notable spatial pattern change during two consecutive weeks. But during the week from day 22 to day 29, we can clearly observe an emerging circle.

Table 3
Average performance of SaTScan and PSVC over 30 “expanding” scenario instances

          Precision  Recall  F-measure  Alarm delay  False alarm  Fail to detect  Computing time
          (%)        (%)     (%)        (days)       (times)      (times)         (seconds)
SaTScan   66.0       25.6    34.8       8.2          7            6               855
PSVC      92.7       38.3    51.2       12.7         0            6               516

Table 4
Results of hypothesis testing for the “expanding” scenario

Hypotheses                                  p-value   Results
H1: PSVC achieves higher precision.         <0.001    Accept
H2 a: PSVC achieves higher recall.          <0.001    Accept
H3: PSVC achieves higher F-measure          <0.001    Accept
H4: There is a difference in alarm delay    0.056     Reject
H5: PSVC runs faster than SaTScan           0.155     Reject

a This is the inverse of the original version of H2.

When generating data points for 30 replications of the emerging scenario, we aimed to experiment with the cone-shaped area and the cylinder of varying sizes and locations under two constraints: (a) neither area is completely inside the other area, and (b) both areas are confined within the boundary of the 3-dimensional space. Under this guideline, we carefully generated the experimental parameters as follows: xl, yl, rl, xr, yr are uniformly distributed on intervals [5,22], [3,22], [13,14], [4,15], and [1,10], respectively; the anomaly starting date startT is uniformly distributed on [18,35], and the starting and ending radii of the emerging circle, startR and endR, are uniformly distributed on [1,2] and [3,5], respectively; a and increaserate are uniformly distributed on [2,4] and [0.2,1.5], respectively. Prospective analysis was conducted on a weekly basis with each batch containing around 80–100 data points.
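A rough sketch of how the emerging cluster and the background could be generated is given below. It is not the “Data Generator” module itself. Several details are assumptions made here: the radius grows linearly from startR to endR, the daily count is rounded to the nearest integer, all points of a day carry that integer day as their time stamp, and the parameter values in the usage note are arbitrary picks within the stated ranges.

```python
import math
import random


def daily_count(a: float, day: int, start_day: int, rate: float) -> int:
    """Number of points on `day`: a * (day - start_day + 1) ** rate, rounded."""
    if day < start_day:
        return 0
    return round(a * (day - start_day + 1) ** rate)


def emerging_cluster(rng: random.Random, xr: float, yr: float,
                     start_r: float, end_r: float, start_day: int,
                     a: float, rate: float, last_day: int = 50):
    """Points (x, y, day) inside a circle that appears on start_day and grows."""
    points = []
    span = max(last_day - start_day, 1)
    for day in range(start_day, last_day + 1):
        # Assumption: the radius grows linearly from start_r to end_r.
        radius = start_r + (end_r - start_r) * (day - start_day) / span
        for _ in range(daily_count(a, day, start_day, rate)):
            rho = radius * math.sqrt(rng.random())    # uniform over the disc
            theta = 2 * math.pi * rng.random()
            points.append((xr + rho * math.cos(theta),
                           yr + rho * math.sin(theta), day))
    return points


def background(rng: random.Random, n: int = 300,
               side: float = 20.0, days: float = 50.0):
    """Uniform background points in [0, side] x [0, side] x [0, days]."""
    return [(rng.uniform(0, side), rng.uniform(0, side), rng.uniform(0, days))
            for _ in range(n)]
```

For instance, `emerging_cluster(random.Random(0), 10, 10, 1.5, 4.0, 20, 3, 1.0)` produces a cluster that starts on day 20 with three points and gains three more points per day until day 50.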

The right sub-figure of Fig. 3 illustrates the results of the analyses using SaTScan and PSVC on the problem instance shown in the left sub-figure. As expected, both methods reported an emerging abnormal area. Neither reported the pure spatial cluster (cylinder), which is a positive result. The average performance of PSVC and SaTScan over the 30 problem instances is summarized in Table 1. Related hypothesis testing results based on the hypotheses proposed in Section 4.2 are presented in Table 2. We observe that for the emerging scenario, SaTScan achieves a higher level of recall and PSVC a higher level of precision. These two methods do not differ significantly with respect to the overall spatial performance given by the F-measure. This is due to the following fact: when dealing with scenarios where abnormal areas follow simple symmetric shapes, SaTScan performs very well and PSVC can only achieve similar performance (PSVC outperforms SaTScan significantly when dealing with irregular shapes).

4.4.2. The expanding scenario

Fig. 5 illustrates a problem instance in the expanding scenario. Initially the data points are confined to a circle; starting at a certain time, however, the data points gradually spread out to the neighboring area of the original circle. In this case, the true abnormal area does not follow a simple shape. Instead it takes on an irregular shape which is the area under a large outer circle (the original circle plus the neighboring area) but excluding a smaller inner circle (the original circle). The average performance of PSVC and SaTScan and related hypothesis testing results are summarized in Tables 3 and 4. We observe that the average recall of SaTScan drops significantly from 83.6% to 25.6% largely due to its shape limitation. Overall, PSVC demonstrated superior spatial performance and generated fewer false alarms.

4.4.3. The moving scenario

In the “moving” scenario, as illustrated in Fig. 6, most of the data points are initially confined to a rectangle in the south-west corner of the map. At a certain time, the rectangle begins to move to the northeast. The true abnormal area in such a problem instance is the relocated rectangle but excluding the overlapped area with the original rectangle. Tables 5 and 6 summarize the average performance and the results of hypothesis testing, respectively. As shown in the right sub-figure of Fig. 6, SaTScan resulted in two small circles to cover the true cluster whereas PSVC returned one connected area with an irregular shape. Compared with the “expanding” scenario, the recalls of both methods improve significantly (SaTScan from 25.6% to 59.7% and PSVC from 38.3% to 66.3%, respectively) at the expense of decreased precisions (SaTScan from 66.8% to 60.0% and PSVC from 92.7% to 73.2%, respectively). In general, PSVC has achieved significantly higher precision and F-measure.

Fig. 6. A problem instance of the “moving” scenario.

Table 5
Average performance of SaTScan and PSVC over 30 “moving” scenario instances

          Precision  Recall  F-measure  Alarm delay  False alarm  Fail to detect  Computing time
          (%)        (%)     (%)        (days)       (times)      (times)         (seconds)
SaTScan   60.0       59.7    58.5       8.0          3            0               197
PSVC      73.2       66.3    68.6       8.3          3            0               279

4.4.4. ROC curves

To gain a more comprehensive understanding of the performance of PSVC and SaTScan, we chose two representative problem instances, one from the emerging scenario, the other from the expanding scenario, and tuned the control parameters of both methods to produce the ROC curves between sensitivity and 1-specificity. These curves are shown in Fig. 7. In the “emerging” scenario, as SaTScan seeks better sensitivity, its specificity performance drops dramatically. PSVC loses sensitivity quickly when it attempts to achieve better specificity. In the “expanding” scenario, the ROC curve of PSVC is completely above that of SaTScan, showing its superiority in both sensitivity and specificity.

4.4.5. Summary

We now summarize the experimental findings learned from all three scenarios.

• Both SaTScan and PSVC can effectively identify the abnormal areas demonstrating changes in the spatial distribution pattern over time and correctly ignore pure spatial clusters.

• When the abnormal area follows a simple regular shape (e.g., a circle in the emerging scenario), PSVC achieves better precision while SaTScan achieves better recall.

• Because of the detection power of the underlying RSVC method, PSVC outperforms SaTScan, which is limited to detecting areas with simple regular shapes. The performance gap as measured by spatial evaluation measures is particularly large in scenarios involving areas with complex, irregular shapes as in the case of the expanding and moving scenarios.

• The stack-based accumulation framework is meant to adapt RSVC, a retrospective method, for prospective use. Although PSVC and SaTScan deliver similar performance as to the speed of detection in our simulations, because PSVC can identify abnormal clusters more accurately, it has fewer false alarms than SaTScan. This is particularly true when abnormal areas do not conform to simple regular shapes.

Table 6
Results of hypothesis testing for the “moving” scenario

Hypotheses                                  p-value   Results
H1: PSVC achieves higher precision.         <0.001    Accept
H2 a: PSVC achieves higher recall.          0.062     Reject
H3: PSVC achieves higher F-measure          0.005     Accept
H4: There is a difference in alarm delay    0.664     Reject
H5 b: SaTScan runs faster than PSVC         <0.001    Accept

a This is the inverse of the original version of H2.
b This is the inverse of the original version of H5.

We conclude this section by making some comments on the running times of PSVC and SaTScan. We used a Pentium IV computer with a 2.8 GHz CPU and 1.0 GB of RAM to run our experiments. Generally speaking, for both methods, it takes less than 5 min to produce the results if the number of input data points is less than 1000. Computational results based on three simulated scenarios indicate that Hypothesis 5 regarding computation speed should be rejected. We conducted a detailed analysis of computing times of various steps of both approaches and summarize our findings below, which help explain why Hypothesis 5 was rejected. Our general observation is that in addition to the number of input data points, there are many other factors affecting both approaches' computation speed. In the case of PSVC, the complexity of the shape of the resulting clusters can greatly influence its running time. For SaTScan, the allowable maximum size of the circles used to scan the entire area can greatly affect the Monte Carlo simulation and thus influence the overall running time. In addition, we have observed high variances in the running times for both PSVC and (to a lesser degree) SaTScan across problem instances. As such, even though PSVC has a better worst-case computing time (O(N^2)) than SaTScan (O(N^3)), with N measuring the size of the problem instance (the number of input data points), the actual time performance data based on the simulated scenarios (which are relatively small in size) do not support this highly simplified computing time comparison result.

Fig. 7. ROC curves.

5. Two case studies: public health surveillance and crime analysis

In many security-related applications, an issue of central importance is to identify the regions of potential interest or unusual activities as soon as possible as candidates for further investigation. In such applications, knowing the precise locations of such regions and the start and end of unusual events is critical. Additional refined measures such as intensity of activities within these regions play a role only after the area is identified. Our research exclusively focuses on the identification of spatio-temporal areas. In this section, we analyze two real-world datasets to demonstrate how we can apply PSVC and SaTScan to detect the areas at risk. The first study concerns public health surveillance and the research goal is to identify emerging geographical disease clusters as quickly as possible. The dataset used in this study contains the dead bird sightings in the state of New York in the spring and summer of 2002. As dead bird clusters have been proven to be highly indicative of West Nile Virus (WNV) outbreaks, we applied PSVC and SaTScan to monitor the dead bird sighting data to identify possible abnormal clustering effects. In our dataset, there are 364 sightings in total. Before May 2002, there are fewer than 10 records per week. We chose a 2-week data monitoring interval for PSVC. From the results shown in Fig. 8, we note that most sightings stayed inside Long Island before April 29th. However, in the next two weeks, more and more sightings started to show up north of Long Island along the Hudson River. Both PSVC and SaTScan detected an abnormal cluster forming on May 12, which is much earlier than May 26, the first day a dead bird was diagnosed with WNV. This automated advance warning capability, albeit anecdotal, is of great interest and importance to the public health community from the viewpoints of infectious disease monitoring, modeling, and related resource allocation and counter-measure planning. Fig. 8 also shows that the irregularly shaped area detected by PSVC is more informative than the large circle detected by SaTScan.

Fig. 8. WNV migration patterns identified by PSVC and SaTScan.

Fig. 9. Emerging residence larceny incident clusters identified by PSVC and SaTScan.

Our second case study is in crime analysis using a dataset consisting of 4705 residence larceny incidents in a middle-sized city in the U.S. from January 1, 2003 to March 31, 2005. While processing these 2 years and 3 months' worth of data, PSVC and SaTScan triggered 5 and 8 alarms, respectively. We report the most significant cluster identified by each method. Fig. 9 shows the northwest part of the city where a lot of criminal activities take place. The left sub-figure shows the hotspot area identified by PSVC on January 3, 2004 and the right sub-figure by SaTScan on June 4, 2004. The red crosses represent the most recent larceny incidents while the blue stars represent the larceny incidents that occurred 2 weeks ago. Both methods identified a high-risk area with emerging criminal activities worth further investigation. To demonstrate the potential usefulness of such prospective analysis methods, we quote below an experienced police officer who commented on our case study. “If this continuously monitoring system can be hooked up with some easy-to-use interface, it would be very helpful for us to dispatch patrol officers and arrange the patrol routes. Our patrol officers usually are only in charge of a relative small area and are well aware of the criminal activities within that area. But they do not have the big picture of what the characteristics of criminal activities outside their areas. This tool may potentially help us better understand the connection of criminal activities from one district to another. If it can notify me via email when any alert is triggered, I will be very happy to investigate what is going on in that risky area. It will save me a lot of time to study the distribution of criminal activities. Just hope it won't trigger too many false alarms.”

6. Conclusions and future work

Compared with retrospective methods, prospective analysis methods provide a more powerful data analysis framework. Prospective methods are aimed at identifying spatio-temporal interaction patterns in an online context and do not require preprocessing data points into baseline and cases of interest. In this paper, we propose a new prospective approach called PSVC based on our previous work on support vector clustering-based retrospective analysis and the basic design ideas behind the well-known CUSUM approach. We report three computational studies using simulated datasets to quantitatively evaluate our approach and compare it against a well-known prospective method based on the space–time scan statistic. We also present two case studies in public health surveillance and crime analysis to demonstrate the potential value of prospective surveillance in real-world applications.

Although PSVC shares a number of similarities with CUSUM and Rogerson's approaches in the overall design, PSVC is able to detect risky regions while the other approaches cannot. Following our proposed quantitative evaluation framework, we show through computational studies that PSVC is better able to identify risky areas of irregular shapes, thanks to the power of the underlying RSVC method, whereas scan statistic-based methods can only detect regions of regular shapes and tend to identify large circles, resulting in low precision. We also note that the empirical testing of PSVC in terms of its computational efficiency indicates that it needs to be further improved in order to outperform SaTScan, despite the fact that PSVC has a lower worst-case running time.

We conclude this paper by discussing future research. First, it is possible to adapt Rogerson's approach, which can monitor a given area for anomalies, by (a) partitioning the area under surveillance into a grid, (b) monitoring each grid cell separately, and (c) grouping/clustering the anomalous cells into a hotspot area. This adaptation scheme, however, poses significant computational challenges. One key problem is how to guide the partitioning step to find the appropriate level of granularity. A comparison of this adapted version of Rogerson's approach with SaTScan and PSVC would be interesting, and our current research is exploring this direction. Another interesting comparison is to examine the performance difference between PSVC and a method that applies the spatial scan statistic in a stack-based framework. We suspect, however, that the performance of this adapted spatial scan statistic method might not be promising, as RSVC outperforms the spatial scan statistic in several important regards. Second, one possible improvement to the PSVC algorithm is to speed up the computation through better control of the complexity of the shape of the resulting clusters. Third, we plan to conduct further empirical and theoretical research to analyze how quickly PSVC can detect an anomaly and how it fares against SaTScan in terms of detection response time. Fourth, another major area of extension is concerned with how to deal with multiple incidents occurring at exactly the same locations. In public health and many other applications, including crime analysis, the events are recorded with spatial coordinates corresponding to the location of service centers such as hospitals, as opposed to the precise location of the incidents. Finally, for privacy reasons, records sometimes carry only aggregated spatial coordinates, such as ZIP codes or county-level identifiers. How to process such aggregated spatial information presents a technical challenge and a research opportunity of practical relevance. From an application perspective, issues such as user interface, alert generation and filtering, and interfacing between the prospective surveillance system and the underlying data sources are also worth exploring.
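The grid-based adaptation outlined above (partition the area, monitor each cell, group the anomalous cells) can be illustrated with a minimal sketch. It is not Rogerson's actual statistic: a simple one-sided CUSUM on standardized counts stands in for the per-cell monitor, and the function names, the reference value `k`, the threshold `h`, and the fixed grid size are all hypothetical choices made for illustration. The granularity question raised above is left open; here the grid dimensions are simply passed in.

```python
from collections import deque


def bin_events(events, bounds, n_rows, n_cols):
    """(a) Partition: count (x, y) events per grid cell.

    bounds = (xmin, ymin, xmax, ymax); points on the upper edge fall in the last cell.
    """
    xmin, ymin, xmax, ymax = bounds
    counts = [[0] * n_cols for _ in range(n_rows)]
    for x, y in events:
        col = min(int((x - xmin) / (xmax - xmin) * n_cols), n_cols - 1)
        row = min(int((y - ymin) / (ymax - ymin) * n_rows), n_rows - 1)
        counts[row][col] += 1
    return counts


def cusum_alarms(count_history, baseline, k=0.5, h=4.0):
    """(b) Monitor: run a one-sided CUSUM independently in every cell.

    count_history is a list of per-period count grids; baseline holds the
    expected count per cell. k and h are assumed tuning parameters.
    """
    n_rows, n_cols = len(baseline), len(baseline[0])
    s = [[0.0] * n_cols for _ in range(n_rows)]
    alarm = [[False] * n_cols for _ in range(n_rows)]
    for counts in count_history:
        for r in range(n_rows):
            for c in range(n_cols):
                mu = baseline[r][c]
                # Poisson-style standardization (assumption); guard against mu < 1.
                z = (counts[r][c] - mu) / max(mu, 1.0) ** 0.5
                s[r][c] = max(0.0, s[r][c] + z - k)
                if s[r][c] > h:
                    alarm[r][c] = True
    return alarm


def group_cells(alarm):
    """(c) Group: merge 4-connected alarmed cells into hotspot areas."""
    n_rows, n_cols = len(alarm), len(alarm[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    hotspots = []
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if not alarm[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, cells = deque([(r0, c0)]), []
            while queue:  # breadth-first flood fill over alarmed neighbors
                r, c = queue.popleft()
                cells.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and alarm[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            hotspots.append(sorted(cells))
    return hotspots


if __name__ == "__main__":
    baseline = [[1.0] * 4 for _ in range(4)]
    period = [[1] * 4 for _ in range(4)]
    period[1][1] = period[1][2] = 4  # two adjacent cells with elevated counts
    print(group_cells(cusum_alarms([period] * 3, baseline)))
```

Because each cell is monitored independently, the cost of step (b) grows with the number of cells, which is one way to see why the choice of granularity matters computationally in this scheme.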

Acknowledgments

Research reported in this paper was supported in part by the U.S. National Science Foundation through Grant #IIS-0428241. The second author is an affiliated professor at the Institute of Automation, the Chinese Academy of Sciences, and wishes to acknowledge the support from a research grant (60573078) from the National Natural Science Foundation of China, an international collaboration grant (2F05N01) from the Chinese Academy of Sciences, a National Basic Research Program of China (973) grant (2006CB705500) from the Ministry of Science and Technology, and an Innovative Research Group Grant (60621001) from the National Science Foundation of China. We wish to thank Dr. Millicent Eidson, Dr. Ivan Gotham, Ms. Jenny Schroeder, and Mr. Tim Peterson for providing the datasets used in this study and related discussions. We also thank other members of the NSF-funded BioPortal [9] and Coplink projects for the informative and constructive discussions.

References

[1] A. Ben-Hur, D. Horn, H.T. Siegelmann, V. Vapnik, Support vector clustering, Journal of Machine Learning Research 2 (2001) 125–137.

[2] W. Chang, D. Zeng, H. Chen, A novel spatio-temporal data analysis approach based on Prospective Support Vector Clustering, presented at Workshop on Information Technologies and Systems (WITS), Las Vegas, Nevada, 2005.

[3] R. Chen, A surveillance system for congenital malformations, Journal of the American Statistical Association 73 (1978) 323–327.

[4] M. Frisen, J.D. Mare, Optimal surveillance, Biometrika 78 (1991) 271–280.

[5] M. Gahegan, Data mining and knowledge discovery in the geographical domain, National Academies white paper, 2001.

[6] D. Gunopulos, Data mining techniques for geospatial applications, National Academies white paper.

[7] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Cluster validity methods: part 1, SIGMOD Record 31 (2) (2002) 40–45.

[8] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Clustering validity checking methods: part II, SIGMOD Record 31 (3) (2002) 19–27.

[9] P.J.-H. Hu, D. Zeng, H. Chen, C. Larson, W. Chang, C. Tseng, J. Ma, A system for infectious disease information sharing and analysis: design, implementation, and evaluation, IEEE Transactions on Information Technology in Biomedicine 11 (4) (2007) 483–492.

[10] M. Kulldorff, A spatial scan statistic, Communications in Statistics - Theory and Methods 26 (1997) 1481–1496.

[11] M. Kulldorff, Prospective time periodic geographical disease surveillance using a scan statistic, Journal of the Royal Statistical Society. Series A 164 (2001) 61–72.

[12] N. Levine, CrimeStat III: A Spatial Statistics Program for the Analysis of Crime Incident Locations, The National Institute of Justice, Washington, DC, 2002.

[13] H.J. Miller, J. Han, Geographic Data Mining & Knowledge Discovery: An Overview, Taylor and Francis, London, 2001.

[14] J.F. Roddick, M. Spiliopoulou, A bibliography of temporal, spatial and spatio-temporal data mining research, SIGKDD Explorations 1 (1999).

[15] P.A. Rogerson, Surveillance systems for monitoring the development of spatial patterns, Statistics in Medicine 16 (1997) 2081–2093.

[16] P.A. Rogerson, Monitoring point patterns for the development of space–time clusters, Journal of the Royal Statistical Society. Series A 164 (2001) 87–96.

[17] P.A. Rogerson, Y. Sun, Spatial monitoring of geographic patterns: an application to crime analysis, Computers, Environment and Urban Systems 25 (2001) 538–556.

[18] P.A. Rogerson, I. Yamada, Alternative approaches for syndromic surveillance when data consist of small regional counts, 2003.

[19] P.A. Rogerson, I. Yamada, Monitoring change in spatial patterns of disease: comparing univariate and multivariate cumulative sum approaches, Statistics in Medicine 23 (2004) 2195–2214.

[20] C. Sonesson, D. Bock, A review and discussion of prospective statistical surveillance in public health, Journal of the Royal Statistical Society. Series A 166 (2003) 5–21.

[21] X. Yao, Research issues in spatio-temporal data mining, presented at UCGIS workshop on Geospatial Visualization and Knowledge Discovery, Lansdowne, Virginia, 2003.

[22] D. Zeng, W. Chang, H. Chen, A comparative study of spatio-temporal hotspot analysis techniques in security informatics, presented at Proceedings of the 7th IEEE International Conference on Intelligent Transportation Systems, Washington, 2004.

[23] D. Zeng, W. Chang, H. Chen, Clustering-based spatio-temporal hotspot analysis techniques in security informatics, IEEE Transactions on Intelligent Transportation Systems, in press.

Wei Chang received his Bachelor's and Master's degrees in Management Information Systems from Tsinghua University, China, and the University of Arizona, U.S.A., respectively. He is currently pursuing his PhD degree in Operations Research and Decision Science at the University of Pittsburgh.

Dr. Daniel Dajun Zeng received the M.S. and Ph.D. degrees in industrial administration from Carnegie Mellon University, Pittsburgh, PA, and the B.S. degree in economics and operations research from the University of Science and Technology of China, Hefei, China. Currently, he is an Associate Professor and Honeywell Fellow in the Department of Management Information Systems at the University of Arizona, Tucson, Arizona, U.S.A. He is also the Director of the Intelligent Systems and Decisions Laboratory and is affiliated with the Institute of Automation, the Chinese Academy of Sciences, Beijing, China. His research interests include security informatics, infectious disease informatics, spatio-temporal data analysis, software agents and their applications, computational support for auctions and negotiations, and recommender systems. He has co-edited eight books and published more than 80 peer-reviewed articles in Information Systems and Computer Science journals, edited books, and conference proceedings. He serves on editorial boards of eight Information Technology-related journals. He is active in MIS and IEEE professional organizations and conference activities and is Vice President for Technical Activities for the IEEE Intelligent Transportation Systems Society and Chair of INFORMS College on Artificial Intelligence.

Dr. Hsinchun Chen is McClelland Professor of Management Information Systems at the University of Arizona and Andersen Consulting Professor of the Year (1999). He is also the founding director of the Artificial Intelligence Lab, an internationally recognized research group in the areas of digital libraries, intelligent retrieval, collaborative computing, knowledge management, medical informatics, and security informatics. He received the B.S. degree from the National Chiao-Tung University in Taiwan, the MBA degree from SUNY Buffalo, and the Ph.D. degree in Information Systems from New York University. He is the author/editor of eighteen books and 150 SCI journal articles, as well as more than one hundred refereed conference articles covering Web computing, search engines, digital library, intelligence analysis, biomedical informatics, data/text/web mining, and knowledge management. His recent books include Medical Informatics: Knowledge Management and Data Mining in Biomedicine and Intelligence and Security Informatics for International Security: Information Sharing and Data Mining, both published by Springer. He serves on ten editorial boards, including ACM Transactions on Information Systems, IEEE Transactions on Intelligent Transportation Systems, IEEE Transactions on Systems, Man, and Cybernetics, Journal of the American Society for Information Science and Technology, Decision Support Systems, International Journal on Digital Library, and others. Dr. Chen is a Fellow of IEEE and AAAS, and has also received numerous awards in information technology and knowledge management education and research, including the AT&T Foundation Award, the SAP Award, the University of Arizona Technology Innovation Award, and the National Chiao-Tung University Distinguished Alumnus Award.