

ARTICLE IN PRESS

Contents lists available at ScienceDirect

Information Systems

Information Systems 35 (2010) 637–661

journal homepage: www.elsevier.com/locate/infosys

A meta-index for querying distributed moving object database servers

Mauricio Marin a, M. Andrea Rodríguez b,*

a University of Santiago of Chile, Chile
b University of Concepción, Chile

Article info

Article history:
Received 20 October 2008
Received in revised form 20 October 2009
Accepted 13 November 2009
Recommended by: F. Korn

Keywords:
Moving-object applications
Distributed moving-object databases
Spatio-temporal access methods

Abstract

Distributed moving object database servers offer a feasible solution to the scalability problems of centralized database systems. In these potentially large-scale systems, querying about the time-varying location of specific moving objects can be particularly expensive in terms of running time. This work proposes a meta-index based strategy that can significantly speed up the processing of these queries. The meta-index acts as an entry point for spatio-temporal queries and quickly drives the search process to the database servers that contain solutions. It also enables very fast approximated solutions to queries such as top-k and spatio-temporal range queries.

© 2009 Elsevier B.V. All rights reserved.

0306-4379/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.is.2009.11.001

* Corresponding author.
E-mail addresses: [email protected] (M. Marin), andrea@udec.cl (M.A. Rodríguez).

1. Introduction

Moving object databases have attracted considerable attention in recent years due to developments in wireless and mobile technologies. There are a number of proposals for data models [12,36,9,8] and access methods [28,29,41] for handling the time-varying location of moving objects. Most of these strategies provide solutions for handling and processing queries about moving objects in an architecture with a single database server.

For a large-scale data infrastructure, using a single database server is becoming impractical or unrealistic. A system with dynamic data often needs to collect and store data in multiple dispersed servers. Furthermore, the use of a global spatial data infrastructure implies the search for data in multiple and heterogeneous data sets, wherein each data set can be handled by a server with different software and hardware technologies.


Such a distributed architecture is useful for applications including wireless sensor networks, stream data processing, and RFID-enabled ubiquitous computing. In addition, the domain of applications can readily be extended by casting the main ideas in settings in which the objects move around in a non-physical or virtual space. One such setting is the WWW, where it is not uncommon to query a large set of proxy servers to obtain information about what IPs (moving objects) have visited a given collection of Web servers (virtual space) over time.

In applications with distributed spatio-temporal database servers (we call them servers), some components of the system are responsible for distributing queries to the servers that may contain the desired data. These systems have typically answered classical time-slice and time-interval queries by defining a global and distributed spatio-temporal index that organizes servers and data in these servers in terms of spatial and temporal partitions [7,18,21,23]. They require a coordination of the data indexed at different levels of spatial or temporal granularity. These strongly coupled architectures limit the scalability and adaptability of the systems under changes in the configuration of servers and, in addition, the support for fault tolerance becomes critical. They are also less robust under incomplete information, which occurs when servers are unable to register all locations of moving objects over time.

This work proposes a decoupled architecture of servers whose core is a meta-index structure. This meta-index contains summarized or coarse data stored in distributed servers, and is useful for giving approximated or exact answers to different types of spatio-temporal queries concerning the past trajectory of moving objects. The focus of this work is on analyzing whether or not the use of such a meta-index decreases the number of servers that the search process has to visit in order to obtain the desired answer. We define a distributed update strategy for the meta-index which is enhanced with a data compression method to optimize the use of space and communication bandwidth. We also present bounds that predict the quality of approximated answers to queries.

Unlike previous works on distributed systems of moving objects, this work aims at solving not only time-slice and time-interval queries, but also queries about the location of objects, both current and historical. The latter type of query imposes particular challenges for large-scale systems since no spatial range (window) can be used as a filter for guiding the search process across servers. The location of the target object is unknown before or at the instant in which the query is executed. We are also interested in fast (though approximated) solutions of aggregation and trend-prediction queries that provide information about the evolution of object motion. In this case the envisaged application software can display to the user a first glance of the solution whilst it goes for a more refined solution by contacting the involved servers.

Note that the techniques proposed in this paper can be used in combination with previous schemes that require a more structured spatio-temporal organization and cooperation of database servers. This paper does not claim to solve or even enumerate all possible situations arising in complex and large moving object systems, but it does study an alternative strategy whose suitability (to the best of our knowledge) has not been previously tested in this context. We propose query algorithms and heuristics specifically tailored to the spatio-temporal domain and requirements. We have implemented and tested them in demanding moving object systems, which include a novel very large scale application related to the WWW modeled as a virtual space in which moving objects are normal user queries and the space is given by Web domains.

Database servers are required to keep track of the events associated with the moving objects within some geographic or virtual scope. To keep the meta-index properly updated we use a crawler whose robots periodically contact servers to get a subset of these events. The robots are threads located at the crawler side and they are scheduled to establish concurrent connections with servers to download data. Thus what is required of every database server is the support of a common interface in order to service every visit of the robots. Data transfers take time, so the quality of the information held in the meta-index crucially depends on the robots' ability to visit the most relevant servers first. Relevance is defined by a combination of measures of how active objects are and how many there are in each server.
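The scheduling idea described above can be sketched as follows. This is only an illustration under stated assumptions: the relevance score (a weighted sum of normalized activity and object count), the weight `alpha`, the `ServerStats` record, and the `fetch_events` callback are hypothetical names introduced here, and they are not the ranking rules of [20].

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerStats:
    server_id: str
    activity: float   # hypothetical measure of how active the objects are
    num_objects: int  # how many objects the server last reported


def rank_servers(stats, alpha=0.5):
    """Order server ids by decreasing relevance (assumed scoring formula)."""
    max_act = max((s.activity for s in stats), default=0.0) or 1.0
    max_num = max((s.num_objects for s in stats), default=0) or 1

    def score(s):
        return alpha * s.activity / max_act + (1 - alpha) * s.num_objects / max_num

    return [s.server_id for s in sorted(stats, key=score, reverse=True)]


def crawl(stats, fetch_events, num_robots=4, alpha=0.5):
    """Visit servers in rank order with a pool of robot threads.

    fetch_events(server_id) stands in for the common interface that every
    server is assumed to expose; it returns the events to be indexed.
    """
    order = rank_servers(stats, alpha)
    with ThreadPoolExecutor(max_workers=num_robots) as pool:
        # map() submits tasks in rank order and yields results in that order.
        return dict(zip(order, pool.map(fetch_events, order)))
```

With a bounded pool of robots, the most relevant servers are contacted first and the remaining ones are served as threads become free.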

In [20] we proposed a set of rules to dynamically rank servers and tested their suitability using a single but demanding synthetic system. We have also studied crawling and indexing in the context of P2P environments in [14,15]. The present paper focuses on the centralized query processing in the meta-index, which resembles the services provided by large Web search engines.

The query types we solve by using the meta-index proposed in this paper are as follows:

Object-location queries: These are queries about the location of a particular object at a given instant or interval of time, for which the meta-index allows the query solver to quickly get to the servers where the exact answer can be computed. An alternative to the meta-index would be to broadcast queries to all servers. Clearly, this kind of strategy is not scalable, wastes bandwidth, and does not enable the additional queries supported by the meta-index. The meta-index combines statistical and coarse trace data. For instance, in the benchmark systems used in this paper, the statistical data are the number of objects per type in specific areas and time instants, and the coarse traces represent lists of servers visited by objects, which are not necessarily in the temporal order in which objects visited the servers. We have found that this combination of statistical and trace data performs better than other alternative designs. Certainly the particular application domain can enable the use of more refined statistical and trace data tailored to the application. These statistics could improve the effectiveness of the search operations implemented on top of the proposed meta-index. Nevertheless, we show that with little information stored in the meta-index, good performance can be obtained.

Top-k queries: An example of these queries is to find the top-k servers or regions with the largest number of moving objects at a given time instant. For this type of query, the meta-index represents a sample of the whole distributed system and, as such, allows the query solver to find very fast approximated answers to queries concerning the overall evolution of the system. The quality of the sample is directly related to the speed and focus of the crawling. For the highly dynamic benchmark systems used in this paper, we show that our crawling and meta-index scheme allows top-k and trend-prediction queries to be solved in a manner that matches well observations in the real system.

Spatio-temporal range queries: Spatio-temporal range queries are the classical time-slice and time-interval queries that retrieve objects that were in a rectangular spatial region at a given time instant or time interval, respectively. Unlike previous works that solve such queries by using a global and distributed spatio-temporal index, we use a two-level architecture that maintains independent local indexes in each server and uses the meta-index as the entry point to direct the process to the right servers covering the spatial range of queries. The meta-index provides information about which servers are in charge of which regions. This information can be dynamically inferred from the data stored in the meta-index, which can also be used to provide fast approximated answers to this kind of query.

For the sake of simplicity, all descriptions of algorithms and heuristics are given in the context of the systems used in our empirical studies. These systems are very simple but representative of the essential requirements. Extensions to more complex systems should be straightforward from the descriptions. We validate our proposals with an experimental evaluation which combines updates of the meta-index and query processing by using both well-known spatio-temporal workloads and a real application of moving objects where physical or geographic considerations do not apply. For this last application we also describe a practical parallelization of the meta-index strategy as understood in a centralized data center where the meta-index is distributed on a set of processors forming a cluster of computers.

We emphasize that our proposal is intended for very large scale systems in which there is no cooperation among servers. Unlike previous strategies, the meta-index scheme provides efficient support for location, top-k and range queries, and does not require servers to communicate with each other on some specific communication topology. Objects can move arbitrarily to any server during their lifetimes and can appear/disappear in any of them at any time. Servers can go down intermittently and new ones can be incorporated dynamically. We aim at a meta-index design that provides scalability, adaptability and proper support for fault tolerance in these systems.

The remainder of this paper is organized as follows: Section 2 describes related work in the area of spatio-temporal query processing and distributed systems, Section 3 describes the meta-index in terms of its structure, search algorithms, and strategies for data collection and updates, Section 4 presents an extensive evaluation of the proposed strategies and, finally, Section 5 presents conclusions and outlines directions for future work.

2. Related work

This section reviews previous work on indexing data structures for processing historical (as opposed to future) spatio-temporal queries and on distributed systems for spatio-temporal databases.

2.1. Spatio–temporal query processing

Before looking at previous work, recall the three types of spatio-temporal queries for historical positions of moving objects [31]:

• Coordinate-based queries, which are the traditional spatio-temporal range queries, also known as time-slice and time-interval queries, and nearest-neighbor queries. Examples are "find objects or trajectories in a region at a particular time instant or during some time interval". Another important example of coordinate-based queries is "find the k-closest objects with respect to a given point at a given time instant".

• Trajectory-based queries, which involve topology of trajectories (e.g., overlap and disjoint) and information (e.g., speed, area, and heading) that can be derived from the combination of time and space. An example of such queries would be "find objects or trajectories that satisfy a spatial predicate (e.g., overlap, meet, and disjoint) at a particular time instant or time interval".

• Combined queries, which involve a selection of trajectories and a selection of parts of trajectories. For example, "find the trajectory of a particular object at a given time instant or time interval". In this work we call object-location queries the combined queries about the location of a specific object at a given time instant or time interval.

A number of spatio-temporal access methods have been recently proposed in the literature. Most of them are devised to answer time-slice and time-interval queries. They differ in the way they incorporate time in a spatial-oriented data structure: time as another dimension [46], overlapping or evolving structures [25,24], multiversions [42,40,41], and methods based on snapshots and events [11].

Concerning trajectory-based queries, access methods can be seen as trajectories on spatial line segments in time. Thus, extensions to classical spatial access methods, such as the R-tree [13], can be used [30,29]. Trajectory-oriented access methods have addressed time-slice and time-interval queries with respect to trajectories; for example, "find all trajectories that cross a spatial window at a time instant or time interval". In this context, the scalable and efficient trajectory index (SETI) [3] is an indexing strategy for managing a large number of trajectories. SETI decouples the spatial and temporal dimensions, since it considers that the spatial dimension of a trajectory changes slowly in comparison with the continuously increasing time dimension. It then uses an indexing data structure to partition the spatial dimension statically. Within each partition, another data structure indexes the 1D (time) dimension.

There are also related works concerning spatio-temporal aggregations [27,43] and querying imprecise data in moving object databases [4]. For very large spatio-temporal datasets, spatio-temporal applications may require summarized results instead of information about individual objects. The RB-tree [43,48] is an aggregated spatio-temporal indexing data structure that can answer queries about the number of objects in a spatio-temporal window. Considering the uncertain nature of the exact location in moving object environments, some studies defined uncertainty spatio-temporal queries [4,48]. In [4], range and nearest-neighbor queries result in a set of objects with their corresponding probability estimates of the validity of the answer, which is defined in terms of an uncertainty model of the possible locations of objects using line-segment or free movements. In [48], six Boolean predicates re-defined a range query for objects with uncertain trajectories: possibly_sometime_inside, possibly_always_inside, always_possibly_inside, definitely_always_inside, definitely_sometime_inside, and sometime_definitely_inside. Focusing on probabilistic-range retrieval, the work in [38,39] defines an indexing data structure that uses a set of heuristics over 2D probabilistic constrained data to prune and validate possible query answers.

Although our work concerns spatio-temporal queries about the past trajectory of objects, a related work that optimizes the maintenance of continuous predictive queries about moving objects is [6]. In that work, triggers in an object-relational moving object database activate the re-evaluation of range and k-nearest neighbor queries when bulk updates of objects' trajectories occur. The authors propose techniques that combine the updates to the trajectories with the updates to the query answers. This could be useful if we consider that queries for the past trajectory of moving objects can be re-evaluated with continuous updates to the meta-index. Along the same lines, the work in [16] proposes maintenance algorithms for k-NN (nearest neighbors) and spatial join queries on continuously moving points that support updates, and the work in [32] analyzes query indexing and velocity constrained indexing for the efficient and scalable evaluation of multiple continuous range queries on moving objects.

To summarize, there are many approaches for spatio-temporal indexing data structures, whose performance depends on the type of queries they are designed to solve. In addition, aggregated queries and uncertainty spatio-temporal queries are useful in the context of spatio-temporal applications where models and indexing data structures have been defined. In contrast to previous studies, we propose in this paper an indexing data structure that enables answering different types of spatio-temporal queries (from range to aggregate queries) using exact or approximated answers. Our approximated answers are enriched with bounds used to provide estimates of the quality of answers.

2.2. Distributed systems

To the best of our knowledge, not much research has been conducted on the kind of systems and restrictions we are considering in this paper. Advances in this direction use distributed spatial indexing data structures, such as the Quadtree and R-tree structures [34,22,37].

Regarding distributed and heterogeneous databases, the work in [35] proposes a distributed query plan for queries about the location of moving objects with predefined paths and schedules. Such a query plan is based on a new deviation filter that exploits the temporal and spatial characteristics of the source in order to improve selectivity. The work in [7] proposes a Web architecture for scalable moving object servers. The authors conclude that it is useful to divide the space into grid cells, so that each location of the space is contained in exactly one grid cell. Their work investigates how to answer queries in which an object enters, crosses or leaves a query window. Other works that use partitions of the space are [49,21]. They propose indexing data structures at two different levels: a low level for movements of objects and a high level that organizes the grid of partitions. These studies based on partitions, however, have addressed range queries (i.e., time-instant and time-interval queries) but not queries about locations of particular objects.

MobiEyes is a real-time location monitoring system for processing moving queries over moving objects in a mobile environment [10]. MobiEyes uses the computational power of the moving objects, reducing the communication cost by reducing the number of messages, as well as the server load, when compared with solutions based on a central processing of information. Important assumptions of this work which make it different from the techniques described in this paper are: moving objects have synchronized clocks, moving objects have computational capabilities to carry out some tasks, moving objects can determine their velocity vector, and moving objects are able to locate their positions.

The work in [18,23] describes an architecture named gracefully aging location information system (GALIS). GALIS is a cluster-based distributed computing system architecture, where several data processors are dedicated to keeping data associated with different geographic and time zones. This work proposes a data structure that handles the location of moving items in a short-term location data system (SLDS) and a long-term location data system (LLDS). Such systems are organized by space partitions (macro and micro cells) and a time partition into time zones. The query processing involves distributed computing operations of multiple nodes under a schema of global time based coordination [17]. The results reported for this architecture focus on showing the advantages of using distributed processing. The evaluation of the query processing considers only time-slice and time-range queries, and not trajectory-based or combined queries. In addition, the architecture assumes a coordination between the SLDS and LLDS, whose cost and scalability are not analyzed.

A recent work that addresses the problem of efficient aggregation of answers for distributed processing of range queries is presented in [47]. This work assumes a grid-like distribution of database servers. The main idea is that, since a query may span several cells, an efficient strategy can aggregate and sort, if needed, partial answers transmitted along a path that involves neighboring cells.

In summary, previous works on distributed moving object applications have typically addressed time-slice and time-interval queries (range queries), and have proposed architectures that rely on some grid-like structures partitioning the spatial region of interest, and/or coordination among the moving objects and distributed servers. The proposal in this paper, in contrast, pursues a flexible and scalable architecture that enables the independence of servers, and answers not only spatio-temporal range queries, but also aggregated queries and queries about the location of particular objects (i.e., object-location queries).


3. Meta-index definition

A fundamental element of the system proposed in this paper is the meta-index structure. The meta-index is an indexing data structure that guides the search process, is stored separately from the servers' data, and handles partial data contained in database servers. We propose here a meta-index called statistics- and trace-based meta-index (STM-index), which is designed to solve time-slice or time-interval queries, and queries about the location of a particular object at a given time instant.

The goal of the meta-index is to reduce the query response time when compared against two alternative strategies that do not make use of a meta-index structure. The first one effects brute-force or random accesses to servers until finding the desired object at query time. The second one broadcasts the query to all servers. Intuitively, in both situations, the query can be solved in a reasonable running time for a sufficiently small number of servers and good bandwidth, but there is no benefit from the additional (approximated) queries that can be solved using the sample of the whole system stored in the meta-index. In general, to process approximated answers to spatio-temporal queries, we only need to search in the STM-index whereas, for more accurate answers, we first need to search in the STM-index and then in the local servers.

Formally, we define the system composed of distributed moving object database servers in the form of $\mathcal{S} = (O, T, S, MI)$, where $O$ is the set of all moving objects alive in the system during some time interval, $T$ is the set of types to which objects are assigned, $S$ is the set of geographically distributed database servers that have existed, and $MI$ is the meta-index. A type here is a semantic class of objects, such as truck, motorcycle, car, bus, or train, whose number and semantics depend on the application and are known by the whole system. We assume that each object and each server has a unique system-wide identifier. We do not impose restrictions on the type of communications between moving objects and servers, nor on the computational capabilities of moving objects.

The remainder of this section describes the data content, data structure, search algorithms, and use of the STM-index for processing different types of queries.

3.1. Meta-index data content

We describe here two basic types of data stored in the meta-index: statistical (aggregated) data and coarse traces.

3.1.1. Statistical data

The meta-index stores the following summarized data per server in the form of: (1) $(t, p, n)$, where $n$ is the number of objects per type $p$ at time instant $t$, and (2) $(t, mbr)$, where $mbr$ are two extreme and opposite points defining the minimum bounding rectangle (MBR) containing all objects (of all types) that have been stored in a server until time $t$. Handling summarized data is equivalent to handling aggregated data, something already discussed within the context of historical spatio-temporal aggregation queries [27,43].
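As a rough illustration of how the MBR records could drive a range query to the relevant servers, the following sketch keeps a per-server list of (t, mbr) entries and returns the servers whose MBR intersects a query window. The names `intersects`, `candidate_servers`, and `mbr_log`, as well as the policy for choosing which entry to test, are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
MBR = Tuple[Point, Point]  # (lower-left corner, upper-right corner)


def intersects(a: MBR, b: MBR) -> bool:
    """True if two axis-aligned rectangles share at least one point."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2


def candidate_servers(mbr_log: Dict[str, List[Tuple[int, MBR]]],
                      window: MBR, t: int) -> List[str]:
    """Servers that may hold objects inside `window` at time `t`.

    mbr_log maps a server id to its (t, mbr) entries sorted by time.
    Because each MBR covers everything stored until its timestamp, the
    earliest entry collected at or after t covers time t. If no such entry
    exists, the latest known entry is used as an approximation (assumption).
    """
    result = []
    for server_id, entries in mbr_log.items():
        if not entries:
            continue
        later = [mbr for ts, mbr in entries if ts >= t]
        mbr = later[0] if later else entries[-1][1]
        if intersects(mbr, window):
            result.append(server_id)
    return sorted(result)
```

Only the returned servers would then be contacted for the exact answer, using their local indexes.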

In an ideal scenario, the meta-index contains summarized data for each time instant. In a realistic scenario, the system can only collect data at a finite number of time instants, and this number depends on the strategy used to keep the meta-index up to date. Actual systems have latency times, hence transferring information from one point to another in the interconnecting network takes time.
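Since counts are available only at the collected time instants, an approximate top-k query at time t has to fall back on a nearby sample. The sketch below uses, for each server, the latest (t, p, n) sample collected at or before the query time. The function names, the nested-dictionary layout, and the "latest sample at or before t" policy are assumptions made for illustration.

```python
import heapq
from bisect import bisect_right


def latest_sample(samples, t):
    """Count from the latest (time, n) sample with time <= t, else None.

    `samples` must be sorted by time.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None


def top_k_servers(stats, p, t, k):
    """Approximate top-k servers by number of objects of type p at time t.

    `stats` maps server id -> {type: [(time, n), ...]}.
    Servers with no sample at or before t are skipped.
    """
    scored = []
    for server_id, per_type in stats.items():
        n = latest_sample(per_type.get(p, []), t)
        if n is not None:
            scored.append((n, server_id))
    return [server_id for _, server_id in heapq.nlargest(k, scored)]
```

The answer is computed from the meta-index alone, so its quality depends on how recent the samples are.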

We also explore and compare an alternative scheme which is based on the idea of keeping statistical data about the number of objects per server in the meta-index. We call these data non-historical statistical data. In this case, the meta-index stores a normalized average number of objects (AvNO) per type and server up to the last data collection from the server. More precisely, let $f_T(s, t_i)$ be the number of objects of type $T$ in server $s$ at time $t_i$. The AvNO for server $s$ over the time interval $[0, \ldots, t_n]$ is given by

$$\mathrm{AvNO}(s) = \sum_{t_i = 0 \ldots t_n} (t_i - t_{i-1}) \times f_T(s, t_i) / t_n. \qquad (1)$$

When non-historical statistical data are used for query processing, the query time is not taken into consideration when answering queries.
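Eq. (1) is a time-weighted average of the object counts. A minimal sketch follows, assuming the observations start at time 0 and that the sum runs over consecutive pairs of observation times (so the first sample only fixes the origin); the function name `avno` is hypothetical.

```python
def avno(samples):
    """Time-weighted average number of objects, following Eq. (1).

    `samples` is a list of (t_i, f_i) pairs sorted by time, where f_i is
    the number of objects of a given type in the server at time t_i.
    Assumption: samples[0] has t_0 = 0 and contributes no interval width.
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are needed")
    t_n = samples[-1][0]
    total = 0.0
    for (t_prev, _), (t_i, f_i) in zip(samples, samples[1:]):
        total += (t_i - t_prev) * f_i
    return total / t_n
```

For example, with samples (0, 3), (2, 4), (5, 6), (10, 2) the weighted sum is 2·4 + 3·6 + 5·2 = 36, and dividing by t_n = 10 gives 3.6.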

3.1.2. Trace data

The meta-index also stores object traces, which represent subsets of object locations (server ids) over time. These data give the search algorithm, at the very least, a hint of where a particular object might have been and direct the algorithm to start the search from that point onwards. Unlike trajectory data, trace data for each server indicate a sequence of server contents or snapshots, where each snapshot registers the objects that happen to have been in the server sometime in the past and that were not registered by previous snapshots. That is, trace data do not necessarily indicate the actual time when an object was at a specific location (server), but the time when an object was detected in the data stored in the server and reported to the crawler. As we will show in the evaluation of the meta-index update strategy (see Section 3.4), using traces instead of trajectories allows us to capture data about more objects in the meta-index, increasing the performance in the search process concerning queries about the location of specific objects at particular time instants.
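One simple way to turn a trace into a starting point for the search is to try first the servers whose collection time is closest to the query time. The sketch below is only one possible heuristic, written for illustration; it is not the search algorithm of this paper, and the name `search_order` is hypothetical.

```python
def search_order(trace, query_time):
    """Servers of a trace ordered by closeness of collection time to query_time.

    `trace` is a list of (server_id, collection_time) pairs. Ties keep the
    original trace order, and each server appears once in the result.
    """
    ordered = []
    for server_id, _ in sorted(trace, key=lambda e: abs(e[1] - query_time)):
        if server_id not in ordered:
            ordered.append(server_id)
    return ordered


# Usage: a trace with seven collection points and a query at time 9.
trace_o1 = [("s1", 1), ("s10", 3), ("s12", 6), ("s16", 8),
            ("s13", 12), ("s18", 17), ("s7", 22)]
first_guess = search_order(trace_o1, 9)[0]
```

Because collection times are not the actual visiting times, the first server in this order is only a hint; the remaining servers of the trace serve as fallbacks.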

The meta-index needs to balance the cost and precision of data replication. To avoid complete replication, the meta-index deals with sparse instead of complete traces. A sparse trace means that we may not have information on every server where an object was but, more importantly, that we only know that an object was at some time instant within the geographic extent of the servers in the object's trace; the meta-index does not tell us the exact time instant (or interval) when this happened. For example, consider the trajectories of three objects o1, o2, and o3 shown in Fig. 1a. Using a server-based representation of locations, a trajectory is represented by a list of tuples of the form $(s_i, [t_j, t_k])$, where $[t_j, t_k]$ is the time interval when the trajectory was stored in server $s_i$ (Fig. 1b).

Traj.   Data representation
o1      (s1, [t1, t3)), (s10, [t3, t6)), (s12, [t6, t7)), (s16, [t7, t10)),
        (s18, [t10, t11)), (s13, [t11, t12)), (s9, [t12, t13)), (s8, [t13, t14)),
        (s7, [t14, t16)), (s6, [t16, t17)), (s14, [t17, t19)), (s15, [t19, t23])
o2      (s2, [t1, t2)), (s9, [t2, t3)), (s10, [t3, t7)), (s12, [t7, t8)),
        (s16, [t8, t12)), (s18, [t12, t18)), (s19, [t18, t21])
o3      (s16, [t5, t8)), (s13, [t8, t13)), (s8, [t13, t15)), (s7, [t15, t19)),
        (s6, [t19, t20)), (s5, [t20, t22)), (s4, [t22, t24])

Fig. 1. Object trajectory: (a) graphical representation and (b) server-based representation of objects (the bracket "]" indicates inclusion of the time instant in the interval).

For this example, we assume that the trace data in the meta-index are obtained when robots visited server s1 at instant t1, s10 at t3, s12 at t6, s16 at t8, s13 at t12, s18 at t17, and s7 at t22 (Fig. 2a). At these visits, the robots capture data about all objects that have been in a server since the latest visit to this server. As a result, the meta-index contains the trace data shown in Fig. 2b, which represents a subset of the objects' trajectories. Unlike the complete trajectory information, trace information includes only the server and the time instant of the data collection.

Collection   Objects
s1, t1       o1
s10, t3      o1, o2
s12, t6      o1
s16, t8      o1, o2, o3
s13, t12     o1, o3
s18, t17     o1, o2
s7, t22      o1, o3

Object   Trace
o1       (s1, t1), (s10, t3), (s12, t6), (s16, t8), (s13, t12), (s18, t17), (s7, t22)
o2       (s10, t3), (s16, t8), (s18, t17)
o3       (s16, t8), (s13, t12), (s7, t22)

Fig. 2. Object trace: (a) data collection and (b) trace data.

Note that some of the traces recovered from a server might be of objects whose locations are no longer registered in the server at the time of the data collection. Moreover, in theory and due to parallel data collection, the traces recovered from different updates may indicate that an object was in different servers at the same time. The non-historical alternative for the trace data in the meta-index consists of the last location of an object found in a meta-index update; however, this location may not be the current location of the object at the time of the meta-index update.

3.2. STM-index data structure

Statistical data in the meta-index are time-varying numbers of objects per type and geographic extents of servers. Queries on these data return the number of objects of a specific type or the geographic extent of a server at the time instant closest to a query time. Formally, we define the queries over the statistical data as follows: (1) function Objects(s, p, t) returns the number of objects of type p in server s at the closest time instant to query time t; if no statistical data for this server exist, it returns 0; (2) function Extent(s, t) returns the MBR of the geographic extent of server s at the closest time instant to query time t; if no statistical data for this server exist, it returns an unknown (null) extent.

To solve queries about statistical data, we use a table with direct addressing in terms of servers' ids, where each entry contains a time-ordered list of summarized data. Each node of this list contains the number of objects per type and the geographic extent at the moment of data collection. Recall that the geographic extent of a server at an instant t is the geographic extent containing all objects that have been stored in the server from the last data collection until time instant t. Query processing relies on a binary search of the query time in the time-ordered list.

Note that a more sophisticated data structure (a spatio-temporal data structure) could be considered to organize the statistical data in terms of the time-varying geographic extent of servers. Such a spatio-temporal data structure could help answer range queries, since only the statistical data of a selected number of servers are required in these queries. This can improve the performance of the meta-index, but it affects neither the quality of results nor the number of local servers visited during query processing. Nevertheless, the number of servers and the number of data collections per server are significantly smaller than the number of moving objects and their real trajectories. Thus, the cost of checking the complete table of servers (in memory) is not relevant with respect to the total cost of processing the query on the meta-index and local servers.

We store the trace data in another data structure. This structure is a hash table that uses the object ids as search key. For each object id in the table, there exists a time-ordered list of locations (i.e., a list of server ids). Note that although we could derive the number of objects and geographic extents per server at different time instants from the trace data, statistical data are stored in a separate data structure for query-processing efficiency. A query over the trace data in the meta-index returns the server where the object's trace was registered at the closest time instant to the query time. Formally, a query over the trace data is of the form Trace(o, t), which returns a server s stored in the trace of object o at the closest time to t. If no trace of object o exists in the meta-index, it returns null. The time-ordered list enables a binary search (with logarithmic cost), decreasing the search time in the list.
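The two structures described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the class name `STMIndex`, its methods, and the helper `closest` are hypothetical, a Python dictionary stands in for the direct-addressing table, time instants are assumed to be integers (t_i = i), and insertions are assumed to arrive in increasing time order. The sample values are taken from Figs. 2 and 3.

```python
from bisect import bisect_left


def closest(times, t):
    """Index of the entry of the sorted list `times` nearest to t (earlier entry wins ties)."""
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    return i if times[i] - t < t - times[i - 1] else i - 1


class STMIndex:
    """Hypothetical in-memory layout of the statistical table and the trace table."""

    def __init__(self):
        self.stats = {}   # server id -> ([collection times], [(counts per type, extent)])
        self.traces = {}  # object id -> ([collection times], [server ids])

    def add_collection(self, s, t, counts, extent):
        # Assumes collections of a server arrive in increasing time order.
        times, rows = self.stats.setdefault(s, ([], []))
        times.append(t)
        rows.append((counts, extent))

    def add_trace(self, o, s, t):
        # Assumes trace entries of an object arrive in increasing time order.
        times, servers = self.traces.setdefault(o, ([], []))
        times.append(t)
        servers.append(s)

    def objects(self, s, p, t):
        """Objects(s, p, t): 0 when no statistical data exist for s."""
        if s not in self.stats:
            return 0
        times, rows = self.stats[s]
        return rows[closest(times, t)][0].get(p, 0)

    def extent(self, s, t):
        """Extent(s, t): None (unknown extent) when no statistical data exist for s."""
        if s not in self.stats:
            return None
        times, rows = self.stats[s]
        return rows[closest(times, t)][1]

    def trace(self, o, t):
        """Trace(o, t): None when the object has no trace (a missing object)."""
        if o not in self.traces:
            return None
        times, servers = self.traces[o]
        return servers[closest(times, t)]


idx = STMIndex()
idx.add_collection("s16", 8, {"a": 2}, ((1.0, 5.25), (3.25, 4.0)))  # (UL, LR) from Fig. 3a
for s, t in [("s10", 3), ("s16", 8), ("s18", 17)]:                   # trace of o2, Fig. 2b
    idx.add_trace("o2", s, t)
```

Each lookup costs one hash access plus a binary search over the per-key list, which matches the logarithmic cost mentioned in the text.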

An important characteristic of this meta-index is that even if the meta-index is up-to-date for a query, that is, data have been collected from all servers at the time of the query, we may have objects whose traces have never been stored in the meta-index. We refer to these objects as missing objects or MO. This is because the meta-index update strategy visits only a subset of servers at certain time instants. In Section 3.4 we discuss the issues of latency and data transfer times that lead to this restriction.

3.3. Query processing

In this section we first describe the way the meta-index enables the processing of different types of queries. Then we describe in detail the algorithms for object-location, spatio-temporal range, top-k, and nearest neighbor queries. These algorithms reflect the potential use of the complete meta-index in combination with the search in local servers.

To clarify the main ideas, we will use the example in Fig. 1 when defining different types of queries. For simplification, we assume here only a single type of object, so the type ID is omitted in the index structure. The STM-index of the example contains the data shown in Fig. 3. Note that, although we include in the trace data and in the geographic extent of a server all objects that have been in the server since the last data collection in that server, the number of objects stored as statistical data is the actual number of objects at the time instant of the data collection. Thus, although we update the trace of objects o1 and o3 in the data collection at time t22 from server s7, no objects are found in that server at time t22.

(a)
Server   Time   #   UL             LR
s1       t1     1   (1.25, 1.75)   (1.5, 1.0)
s7       t22    0   (3.75, 2.75)   (5.75, 1.25)
s10      t3     2   (1.5, 3.25)    (2.25, 2.0)
s12      t6     1   (1.0, 4.25)    (1.25, 3.25)
s13      t12    1   (2.75, 4.5)    (3.5, 2.75)
s16      t8     2   (1.0, 5.25)    (3.25, 4.0)
s18      t17    1   (3.0, 5.5)     (5.25, 4.5)

Fig. 3. Meta-index: (a) statistical data (UL: upper-left point of the extent, LR: lower-right point of the extent) and (b) trace data.

3.3.1. STM-index query support

Since the meta-index by itself contains an approximated characterization of the system, it can provide data for approximated or exact answers to different types of queries. On the one hand, data in the meta-index are approximated answers to several queries. Such approximated answers are relevant in contexts where we need fast answers and where access to the real database servers may be temporarily difficult because of server and network saturation. On the other hand, the approximated data in the meta-index can guide the search to local servers to get exact answers for object-location queries and good, though not necessarily exact, approximated answers for range, top-k, and NN queries.

In the following, we discuss how the meta-index helps answer the different types of queries studied in this paper. The experimental evaluation in Section 4 shows that the meta-index is useful for solving those queries efficiently, both in terms of running time and memory space. For exact answers to queries, the meta-index effectively reduces the number of servers that are contacted to get the exact results. For approximate answers, the meta-index is capable of quickly producing fairly good results as compared to the exact answers for the same queries.

Definition 3.1 (Object-location queries). Let o be the id of an object of type p and t be a time instant. An object-location query $L_{slice}(o, p, t)$ returns the location $(x, y) \in \mathbb{R}^2$ where object o is located at time instant t. This type of query can be extended to time interval queries of the form $L_{interval}(o, p, t_1, t_2) = \{(x, y) \mid (x, y) \in \mathbb{R}^2 \wedge L_{slice}(o, p, t) \wedge t_1 \leq t \leq t_2\}$.

An approximated answer to $L_{slice}(o, p, t)$ is a server $s = Trace(o, t)$. If $s = nil$, then the approximated answer returns the set $\{s \mid s \in S \wedge Objects(s, p, t) = \max_{\forall s' \in S} Objects(s', p, t)\}$. Approximated answers can be complemented with quality bounds that estimate the validity of the answer. The refined answer uses data in the meta-index to guide the search to the local servers. In these servers, the exact answer is found. Further details on how to process this type of query are given in Section 3.3.3.

Fig. 3 (continued). (b) Trace data.
Object   Trace
o1       (s1, t1), (s10, t3), (s12, t6), (s16, t8), (s13, t12), (s18, t17), (s7, t22)
o2       (s10, t3), (s16, t8), (s18, t17)
o3       (s16, t8), (s13, t12), (s7, t22)

Fig. 4. Minimum and minmax distance between MBRs.

As an example, consider the STM-index in Fig. 3 and a query $L_{slice}(o_1, a, t_5)$, with a being a constant object type. The approximated answer to this query is s10, which in this case is the correct server from which it is possible to extract the exact location of object o1 at time t5.

Definition 3.2 (Spatio-temporal range queries). Let w be a spatial region (query window) defined by two opposite extreme points, $p \in T$ be the type of objects we are searching for, and t be a query time. A time-slice query $Q_{slice}(w, p, t)$ returns the set of objects of type p that are inside region w at time instant t. This type of query can be extended to a time interval as $Q_{interval}(w, p, t_1, t_2) = \{o \mid o \in O \wedge Q_{slice}(w, p, t) \wedge t_1 \leq t \leq t_2\}$.

An approximated answer to $Q_{slice}(w, p, t)$ returns the set of objects $\{o \mid o \in O \wedge s = Trace(o, t) \wedge Intersects(Extent(s, t), w)\}$, where $Intersects(x, y)$ is a predicate that is true if x intersects y. The refined answer searches locally in all servers s whose Extent(s, t) intersects w, filtering out all objects whose locations at time t were outside the query window w. Thus, the meta-index serves as a filter that avoids the exhaustive search in all servers; however, it is an approximation when the spatial region assigned to servers is not constant, since the geographic extent of a server depends on the time of data collection. Further details on how to process this type of query are given in Section 3.3.4.

As an example of a range query with the STM-index in Fig. 3, the approximated answer to query $Q_{slice}([(1.0, 5.0), (3.0, 4.0)], a, t_8)$ is {o1, o2, o3}, because these objects were detected in server s16 at the closest time to t8, and the geographic extent of s16 at that time intersects the query window [(1.0, 5.0), (3.0, 4.0)]. The algorithm searches in s16 and detects that only o1 and o2 were inside the query window at time instant t8.

Definition 3.3 (Top-k queries). A top-k object-server query $K_O(t, p, k)$ returns the k servers with the largest number of objects of type p at query time t. Similarly, the top-k extent-server query $K_E(t, k)$ returns the k servers with the largest geographic extent at time t.

An approximated answer determines, for each server s, Objects(s, p, t) or Extent(s, t). Using these values, it sorts servers in decreasing order of the number of objects or of the geographic extent. A refinement of this answer selects from the ranked list the servers in the top-2k positions and then recalculates the ranking with the real data obtained from these local servers. Using local search improves the quality of answers compared with a search that only uses the meta-index, but these answers are still approximations. Further details on how to process this type of query are given in Section 3.3.5.

As an example with the STM-index in Fig. 3, the approximated answer to query $K_O(t_8, a, 2)$ is {s16, s10}, and its refined answer, which checks the 2k local servers (i.e., {s16, s10, s12, s13}), returns {s16, s13}.

Definition 3.4 (Nearest neighbor queries). Let $o \in O$ and t be the query time. A nearest neighbor query $NN(o, p, t)$ returns the objects that are closest to object o of type p at time instant t.

An approximated answer to $NN(o, p, t)$ returns the set of servers that may contain the closest object to object o located in server $s = Trace(o, t)$, with $s \neq nil$. If $s = nil$, we select an $s \in S$ such that $Objects(s, p, t) = \max_{\forall s' \in S} Objects(s', p, t)$. To derive the set of servers that may contain the closest object, we use part of the strategy defined in [33], which calculates the minimum and minmax distances of any server s′ to server s at the closest time to t.

Minimum distance is the minimum Euclidean distance between sides of the MBRs defining the geographic extent of servers. Minmax distance is the maximum Euclidean distance between the closest sides of MBRs (Fig. 4). If two MBRs intersect, we consider their minimum distance equal to zero, and the minmax distance is the minimum of the minmax distances between intersecting sides.
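The minimum distance between two extents can be sketched with the usual axis-aligned rectangle computation shown below. The function names `normalize` and `min_distance` are hypothetical, and the sketch covers only the minimum distance (not the minmax distance, whose precise construction follows [33]). The sample extents are the (UL, LR) pairs of servers s16, s13, and s10 in Fig. 3.

```python
from math import hypot


def normalize(ul, lr):
    """Convert (upper-left, lower-right) corners to (xmin, ymin, xmax, ymax)."""
    return (ul[0], lr[1], lr[0], ul[1])


def min_distance(a, b):
    """Minimum Euclidean distance between two axis-aligned MBRs; 0 if they intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)  # horizontal gap, 0 when the x-ranges overlap
    dy = max(by0 - ay1, ay0 - by1, 0.0)  # vertical gap, 0 when the y-ranges overlap
    return hypot(dx, dy)


s16 = normalize((1.0, 5.25), (3.25, 4.0))
s13 = normalize((2.75, 4.5), (3.5, 2.75))
s10 = normalize((1.5, 3.25), (2.25, 2.0))

d_16_13 = min_distance(s16, s13)  # the two extents overlap
d_16_10 = min_distance(s16, s10)  # s10 lies below s16
```

With the Fig. 3 values, the extents of s16 and s13 overlap (distance 0), while s10 is separated from s16 by a vertical gap of 0.75.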

Let $MIN(mbr', mbr)$ be the minimum distance between the geographic extents of servers s′ and s at time instant t. The approximated answer to $NN(o, p, t)$ is the set of servers $\{s' \mid s' \in S \wedge Trace(o, t) = s \wedge \forall s'' \in S\ \neg(MIN(Extent(s', t), Extent(s, t)) > MIN(Extent(s'', t), Extent(s, t)))\}$. This answer will always include the server where the object was detected, but also all other servers intersecting it. Let $CS(o, t)$ be a set of "selected servers" with respect to object o and query time t. This set contains servers that potentially store the closest object to o if object o was indeed in server s. Then, the refined answer searches for closest objects in all servers $s' \in CS(o, t)$ ordered by minimum distance. The process stops when it finds the closest object in the current server and this object's distance to object o is shorter than the minimum distance between the server of o and the next server in the ordered list of "selected servers". The algorithm to process this type of query is given in Section 3.3.6.

As an example with the STM-index in Fig. 3, consider the query $NN(o_2, a, t_9)$. First, o2 is detected in server s16 at time t9, that is, $Trace(o_2, t_9) = s_{16}$. Then, the approximated answer to the query is the set of closest servers $CS(o_2, t_9) = \{s_{13}, s_{16}, s_{12}\}$ (Fig. 5). To obtain a refined answer, the process checks the set of "closest servers" and finds that object o2 was in s16 at time t9 and that, at that moment, the closest object to o2 was object o1, which was also in server s16.

3.3.2. Predicting the quality of approximated answers to queries

We define lower quality bounds for each type of approximated query. They are predicted using the data stored in the meta-index, in particular, the rate of move-out events observed in the servers during a given period of time. Therefore, they represent an approximation to the quality of results, since the number of events registered by servers depends on the speed of crawling, namely on how often the crawler visits the servers to collect data from them. We introduce the following definitions.

Fig. 5. Example of servers' geographic extents at time instant t9.

Definition 3.5 (Average mobility rates). Let $(t_1, \ldots, t_n)$ be the ordered list of time instants when data crawling occurred in server s, $t_0$ be the initial time of the system, and $r_{out}(s, t_i, t_j)$ be the number of move_out events that occurred in s between $t_i$ and $t_j$. The average rate of move_out events in server s is

$$AvgMout(s) = \sum_{t_i = t_0}^{t_{n-1}} \frac{r_{out}(s, t_i, t_{i+1})}{(t_{i+1} - t_i)} \times \frac{(t_{i+1} - t_i)}{(t_n - t_0)} = \sum_{t_i = t_0}^{t_{n-1}} \frac{r_{out}(s, t_i, t_{i+1})}{(t_n - t_0)}. \tag{2}$$

To illustrate the use of this rate, consider the example in Figs. 1–3. In this example, server s16 was visited by only one robot, at time instant t8. By that time, three move_in events had occurred for objects o1, o2, and o3, but only one of these objects left before or at time instant t8 (i.e., one move_out event). In this situation, the average mobility rate for server s16, $AvgMout(s_{16})$, during the time period of the system and assuming that $t_i = i$, is

$$AvgMout(s_{16}) = \frac{1}{t_8 - t_0} \times \frac{t_8 - t_0}{t_8 - t_0} = \frac{1}{8} = 0.125.$$

Definition 3.6 (Reduction bound). Let t′ be the closest time to t (i.e., either $t' \leq t$ or $t' > t$) when a robot visited server s, and let Objects(s, t) be the number of objects in server s at time t′. A reduction bound (minimum bound) of the number of objects that were detected at time t′ in s and remain in s at time t is defined as

$$Q(s, t) = (Objects(s, t) - AvgMout(s) \times |t - t'|). \tag{3}$$

Continuing with the example in Figs. 1–3, the reduction bound for server s16 at time instant t9 is

$$Q(s_{16}, t_9) = (Objects(s_{16}, t_9) - AvgMout(s_{16}) \times |t_9 - t_8|) = 2 - \frac{1}{8} = 1.875.$$

As defined above, Extent(s, t) is the geographic extent stored in the meta-index for server s at the time closest to t, Objects(s, t) is the number of objects stored in the meta-index for server s at the time closest to t, and Trace(o, t) is the server that the meta-index retrieves as the location of object o at time t. For simplicity, we assume here that all objects are of the same type. We now introduce quality bounds for each of the approximated queries.
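Before moving to the per-query bounds, the average mobility rate of Eq. (2) and the reduction bound of Eq. (3) can be checked numerically with the short sketch below. The function names and argument layout are hypothetical, and time instants are taken as integers (t_i = i), as in the example. Clamping negative values to 0 is an assumption made here because Table 1 reports 0 for servers whose raw value would be negative; Eq. (3) itself does not state it.

```python
def avg_mout(crawl_times, move_outs, t0=0):
    """Eq. (2): total move_out events observed up to the last crawl, divided by (t_n - t_0).

    crawl_times: sorted crawl instants t_1..t_n of one server.
    move_outs[i]: move_out events in the interval that ends at crawl_times[i].
    """
    return sum(move_outs) / (crawl_times[-1] - t0)


def reduction_bound(objects, rate, t, t_visit, clamp=True):
    """Eq. (3): objects expected to remain at query time t, given the visit at t_visit."""
    q = objects - rate * abs(t - t_visit)
    return max(q, 0.0) if clamp else q  # clamping is an assumption, see lead-in


rate_s16 = avg_mout([8], [1])                 # s16: one crawl at t8, one move_out
q_s16 = reduction_bound(2, rate_s16, 9, 8)    # query at t9, visit at t8
q_s10 = reduction_bound(2, 2 / 3, 9, 3)       # s10: rate 0.67, visit at t3
```

For s16 this reproduces the values 0.125 and 1.875 derived above; for s10 the raw value is negative, so the clamped bound is 0.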

Definition 3.7 (Object-location uniform quality bound). Let $s = Trace(o, t)$ and S be the set of all servers in the system at time t. A quality bound to the approximated answer of an object-location query $L_{slice}(o, p, t)$ is

$$Q_{L(o,t)} = \begin{cases} \dfrac{Q(s, t)}{Objects(s, t)} & \text{if } o \text{ has a trace in the meta-index}, \\[2ex] \dfrac{Q(s, t)}{\sum_{s \in S} Objects(s, t)} & \text{otherwise}, \end{cases} \tag{4}$$

where $Q(s, t)$ is the reduction bound at the closest time t′ to the respective query time t.

This bound is valid for fairly uniform systems, as it assumes that all objects in server s have the same probability of being one of the n objects leaving the server, with $n = AvgMout(s) \times |t - t'|$. For systems in which the speed of objects is non-uniform, we can assume that the above bound $Q_{L(o,t)}$ holds only for the slowest objects in the system and that a location query for any object that is faster than them must have a bound with a smaller value than $Q_{L(o,t)}$. In this case, by speed we mean the rate at which a given object moves in time from one server to another.

If we assume that the objects move uniformly across servers, the object-location uniform quality bound of object o2 at time instant t9 in the example of Figs. 1–3, with $Trace(o_2, t_9) = s_{16}$, is

$$Q_{L(o_2, t_9)} = \frac{Q(s_{16}, t_9)}{Objects(s_{16}, t_9)} = \frac{1.875}{2.0} = 0.94.$$

From the data stored for each object in the meta-index, it is possible to estimate the rate $R_{srv}(i)$ at which each object i moves from one server to another. These values can be used to calculate the bound $Q_{L(o,t)}$ for non-uniform systems, where the relative speeds of objects can be very different from one another. The slowest objects i, with the smallest values $R_{srv}(i)$, are assigned the above bound $Q_{L(o,t)}$. Let us assume that the average speed of these objects is $R_{min}$ and that the fastest objects have, on average, a speed $R_{max}$. These two values can be determined experimentally, where an object i can be considered slow if its $R_{srv}(i)$ value is below a fraction of the average $R_{srv}$. The opposite can be done for determining fast objects, and all these values can be used to determine the average $R_{min}$ and $R_{max}$ values (all this is system-dependent tuning).

Fig. 6. Range query: (a) window of a range query and (b) the geographic extent of server s16 in the meta-index in comparison to the query window.

Definition 3.8 (Object-location non-uniform quality bound). Let $r(o) \geq 1$ be a further reduction factor, which is a function of the observed rate $R_{srv}(o)$ of server change for object o. The non-uniform bound $Q^{*}_{L(o,t)}$ is defined as

$$Q^{*}_{L(o,t)} = \frac{Objects(s, t) - AvgMout(s) \times |t - t'| \times r(o)}{Objects(s, t)}, \tag{5}$$

where

$$r(o) = 1 + \left(\frac{R_{srv}(o) - R_{min}}{R_{max} - R_{min}}\right) \times \left(\frac{Objects(s, t)}{AvgMout(s) \times |t - t'|} - 1\right) \tag{6}$$

or equivalently

$$Q^{*}_{L(o,t)} = \left(1 - \left(\frac{R_{srv}(o) - R_{min}}{R_{max} - R_{min}}\right)\right) \times Q_{L(o,t)}. \tag{7}$$

Expression (6) is obtained by considering that the values of r can vary $Q^{*}_{L(o,t)}$ between $Q_{L(o,t)}$ and 0, namely $1 \leq r \leq m/n$ with $m = Objects(s, t)$ and $n = AvgMout(s) \times |t - t'|$, and that there is a linear relationship between the values of $R_{srv}(o)$ and $r(o)$, which is bounded by the respective max and min values. For systems in which, occasionally and for a few queries, it may be the case that $n > m$, we set $Q^{*}_{L(o,t)} = Q_{L(o,t)} = 0$ for each such query.

In the example of Figs. 1–3, object o1 moves faster than objects o2 and o3, with $R_{max} = R_{srv}(o_1) = 12/23 = 0.52$, calculated as the number of servers visited by o1 per time unit. The slowest objects have $R_{srv}(o_2) = 0.33$ and $R_{srv}(o_3) = 0.29$, giving an average $R_{min} = 0.31$. Using these values, the object-location non-uniform quality bound $Q^{*}_{L(o_2, t_9)}$ is

$$Q^{*}_{L(o_2, t_9)} = \left(1 - \left(\frac{R_{srv}(o_2) - R_{min}}{R_{max} - R_{min}}\right)\right) \times Q_{L(o_2, t_9)} = \left(1 - \left(\frac{0.33 - 0.31}{0.52 - 0.31}\right)\right) \times 0.94 = 0.85.$$
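The uniform and non-uniform object-location bounds can be cross-checked with the sketch below, which evaluates Eq. (7) directly and also via Eqs. (5) and (6) to confirm that both forms agree. The function names are hypothetical; the numbers are those of the running example (m = 2 objects in s16, n = 0.125 expected departures, and the rates 0.33, 0.31, and 0.52).

```python
def uniform_bound(q, objects):
    """Eq. (4), first case: Q(s, t) / Objects(s, t)."""
    return q / objects


def nonuniform_bound(q_uniform, r_obj, r_min, r_max):
    """Eq. (7): scale the uniform bound by the relative speed of the object."""
    x = (r_obj - r_min) / (r_max - r_min)
    return (1.0 - x) * q_uniform


def nonuniform_bound_via_r(m, n, r_obj, r_min, r_max):
    """Eqs. (5)-(6): the same bound through the reduction factor r(o)."""
    x = (r_obj - r_min) / (r_max - r_min)
    r = 1.0 + x * (m / n - 1.0)
    return (m - n * r) / m


m, n = 2, 0.125 * 1            # Objects(s16, t9) and AvgMout(s16) * |t9 - t8|
q_l = uniform_bound(m - n, m)  # uniform bound: 1.875 / 2
q_star = nonuniform_bound(q_l, 0.33, 0.31, 0.52)
q_star_alt = nonuniform_bound_via_r(m, n, 0.33, 0.31, 0.52)
```

Both routes give approximately 0.85 after rounding, in line with the worked example; the small difference from using the unrounded 0.9375 instead of 0.94 disappears at two decimals.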

Definition 3.9 (Range quality bound). Let S be the set of servers whose geographic extents at t intersect the query window w of the range query $R_{slice}(w, p, t)$, $Area(g)$ be the area of a region g, and $\cap_g$ be the geometric intersection of two regions. A quality bound to the approximated answer of a range query $R_{slice}(w, p, t)$ is

$$Q_{R(w,p,t)} = \frac{\sum_{s \in S} \left( \dfrac{Q(s, t)}{Objects(s, t)} \times Area(Extent(s, t) \cap_g w) \right)}{\sum_{s \in S} Area(Extent(s, t))}. \tag{8}$$

To illustrate how the range quality bound can be determined, consider the example in Figs. 1–3 and the query window w in Fig. 6a, drawn as a rectangle with thick borders. This query window intersects the area under control of servers s17 and s16. Fig. 6b shows, as a rectangle with dashed lines, the geographic extent of server s16 stored in the meta-index at time instant t8 (t8 is the closest time instant to t9 when a robot visits s16). In this example, no geographic extent is stored in the meta-index for server s17.

For this example, $S = \{s_{16}\}$, since the meta-index does not contain information about server s17. Finally, in this example the range quality bound is equivalent to

$$Q_{R(w,p,t_9)} = \frac{Q(s_{16}, t_9)}{Objects(s_{16}, t_9)} \times \frac{Area(Extent(s_{16}, t_9) \cap_g w)}{Area(Extent(s_{16}, t_9))} = \frac{1.875}{2} \times 0.45 = 0.42.$$
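A possible way to evaluate Eq. (8) over axis-aligned rectangles is sketched below. The helper names are hypothetical, rectangles are given as (xmin, ymin, xmax, ymax), and the extent and window coordinates are invented solely so that the intersection covers 45% of the extent, as in the example; the actual coordinates of Fig. 6 are not reproduced here.

```python
def area(r):
    """Area of a rectangle (xmin, ymin, xmax, ymax); None denotes an empty region."""
    if r is None:
        return 0.0
    return (r[2] - r[0]) * (r[3] - r[1])


def intersection(a, b):
    """Geometric intersection of two rectangles, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def range_quality(w, servers):
    """Eq. (8). servers: list of (Q(s,t), Objects(s,t), Extent(s,t)) for s in S."""
    num = sum(q / n * area(intersection(ext, w)) for q, n, ext in servers if n > 0)
    den = sum(area(ext) for _, _, ext in servers)
    return num / den if den > 0 else 0.0


extent_s16 = (0.0, 0.0, 2.0, 1.0)  # hypothetical extent with area 2.0
window = (0.0, 0.0, 0.9, 1.0)      # hypothetical window covering 45% of the extent
bound = range_quality(window, [(1.875, 2, extent_s16)])
```

Servers with no stored extent, such as s17 in the example, are simply left out of the list passed to `range_quality`.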

Definition 3.10 (Top-k quality bound). Let $S = (s_1, \ldots, s_{k+1})$ be the list of servers ordered by their number of objects for an approximated answer to the top-k servers $K_O(t, k)$. Let also $S' = (s_i, \ldots, s_j)$, with $1 \leq i \leq j \leq k$, be the list of servers for which $Q(s_l, t) \geq Objects(s_{k+1}, t)$, with $s_i \leq s_l \leq s_j$. A quality bound for the approximated answer to $K_O(t, k)$ is

$$Q_{K(t,k)} = \frac{|S'|}{|S|}. \tag{9}$$

For the example in Figs. 1–3, a ranking list of servers by number of objects at time instant t9 (with 1 the highest rank) and its corresponding reduction bound, excluding servers that were never visited by a robot, are shown in Table 1.

Table 1
Example of reduction bound for servers.

Server   Objects(s, t9)   Ranking   AvgMout(s)   Q(s, t9)
s1       1                2         1            0
s7       0                3         0.09         0
s10      2                1         0.67         0
s12      1                2         0.17         0.49
s13      1                2         0.17         0.49
s16      2                1         0.125        1.875
s18      1                2         0.12         0.04

Using this example and for $K_O(t_9, 1)$, with k = 1 meaning the servers with the highest number of objects, $S = \{s_{10}, s_{16}\}$ and $S' = \{s_{16}\}$. Consequently, the top-k quality bound for this query is

$$Q_{K(t,k)} = \frac{|S'|}{|S|} = 0.5.$$

Definition 3.11 (NN quality bound). Let S be the set of servers that are the approximated answer to the nearest-neighbor query $NN_{slice}(o, t)$, $s' = Trace(o, t) \in S$, and $w = Extent(s', t)$. A quality bound for the approximated answer to $NN_{slice}(o, t)$ is

$$Q_{NN_{slice}(o,t)} = \frac{\sum_{s \in S} \left( \dfrac{Q(s, t)}{Objects(s, t)} \times Area(Extent(s, t) \cap_g w) \right)}{\sum_{s \in S} Area(Extent(s, t))}. \tag{10}$$

For non-uniform systems, the value $Q(s, t)$ in expressions (8) and (10) is replaced by an average $Q^{*}_{L(o,t)}$ calculated by considering all of the objects o that are part of the approximated answer to the respective query.

For the example in Figs. 1–3 and a query $NN_{slice}(o_2, t_9)$ with objects moving uniformly, we have that $Trace(o_2, t_9) = s_{16}$ and $CS(o_2, t_9) = \{s_{13}, s_{16}, s_{12}\}$. Given $Area(Extent(s_{12}, t_9)) = 0.1875$, $Area(Extent(s_{13}, t_9)) = 1.75$, $Area(Extent(s_{16}, t_9)) = 2.1$, $Area(Extent(s_{12}, t_9) \cap_g Extent(s_{16}, t_9)) = 0.01$, $Area(Extent(s_{13}, t_9) \cap_g Extent(s_{16}, t_9)) = 0.03$, and $Area(Extent(s_{16}, t_9) \cap_g Extent(s_{16}, t_9)) = 1$, the NN quality bound is given by

$$Q_{NN_{slice}(o_2, t_9)} = \frac{0.49 \times 0.01 + 0.49 \times 0.03 + 1.875}{4.0375} = 0.47.$$

3.3.3. Object-location query processing

We now concentrate on the exact answer to queries about the location of particular objects at given time instants. We show alternative algorithms depending on the facilities provided by the actual database system. The algorithms to solve this type of query combine the search in the meta-index and the search in the database servers selected by the meta-index. The local search method performed in the servers is out of the scope of this paper, as it depends on the type of spatio-temporal indexing data structure used at each server.

Given a query $L_{slice}(o, p, t)$, the query solver makes a list of the servers found in the trace of object o, sorted by the difference between the time of data collection and query time t. If $Trace(o, t) = nil$, then the query solver creates a list with all servers $s \in S$ sorted by their number of objects at query time, Objects(s, p, t).

The result of the search in the meta-index is a ranked list of servers. Then, the process continues by visiting servers in priority order to find a location of the object in the locally stored data. Two different approaches lead to two alternative algorithms: one with path following (PF) behavior and one with no path following (NPF) behavior. A PF algorithm assumes that local servers are able to figure out from where or to where an object is moving, so that when an object is leaving a region, a move_out event with the final destination is stored in the servers (PF Algorithm 1). This type of algorithm is useful in applications with physical objects moving to contiguous locations. An NPF algorithm visits servers in priority order until the location of the desired object at the query time is found (NPF Algorithm 2). This type of algorithm makes no assumptions about the origin or destination of objects and is therefore more general, as it can be employed in applications that are not intrinsically spatial (Section 4.1 shows an application of this type).

Algorithm 1. PF algorithm to process object-location queries for time instants with the STM-index

// Query: Lslice(o, p, t)
if Trace(o, t) = nil then
    Rank servers according to Objects(s, p, t). Greater numbers imply higher priorities.
    Visit every server in statistical-based ranking order, asking for events of the object in the server.
    Let s be the first server found to have an instance of o. Select the latest event found at time t′ less than or equal to t.
    if t′ = t then
        Report solution.
    else
        Follow the object's path until reaching the server in which the time of the movement of object o is greater than t. Report solution.
    end if
else
    Find the server in the trace of object o at a time closest to time t. Let s be the first server found to have an instance of o.
    if s contains the object at time t then
        Report solution.
    else
        Start the search of object o at server s, following the object's path until reaching the server in which the time of the movement of object o is greater than t. Report solution.
    end if
end if

Algorithm 2. NPF algorithm to process object-location queries for time instants with the STM-index

// Query: Lslice(o, p, t)
if Trace(o, t) = nil then
    Rank servers according to Objects(s, p, t). Greater numbers imply higher priorities.
    Visit servers in ranking order and search for the location of object o. Stop when the object location at the query time is found in a server.
else
    Rank servers in the object's trace according to their timestamp in the trace with respect to the query time. Shorter time distances imply higher priorities.
    Visit servers in the trace-based ranking order and search for the object location. Stop if the object location at the query time is found in the server.
    If the object location at the query time was not found in the ranked list of servers, rank the remaining servers based on the statistical information (Objects(s, p, t)).
    Visit servers in the statistical-based ranking order and search for the object location. Stop when the object location at the query time is found in the server.
end if
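The visiting order produced by the NPF strategy can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the local search in each server is replaced by a callback `holds(s, o, t)`, time instants are integers, and ties among servers with equal object counts are broken by placing visited servers first and otherwise keeping server-id order, which is one possible choice. The data come from Figs. 2 and 3 and the query is Lslice(o2, p, t20).

```python
def npf_visit_order(o, t, trace, stats, all_servers):
    """Servers of the object's trace by time distance, then the rest by object count."""
    by_time = sorted(trace.get(o, []), key=lambda entry: abs(entry[1] - t))
    first = []
    for s, _ in by_time:
        if s not in first:
            first.append(s)
    rest = [s for s in all_servers if s not in first]
    # Higher counts first; among equal counts, servers with collected data first (assumed tie-break).
    rest.sort(key=lambda s: (-stats.get(s, 0), s not in stats))
    return first + rest


def npf_locate(o, t, order, holds):
    """Visit servers in order; return (server, number of servers visited)."""
    for visited, s in enumerate(order, start=1):
        if holds(s, o, t):
            return s, visited
    return None, len(order)


all_servers = [f"s{i}" for i in range(1, 20)]
trace = {"o2": [("s10", 3), ("s16", 8), ("s18", 17)]}
stats = {"s1": 1, "s7": 0, "s10": 2, "s12": 1, "s13": 1, "s16": 2, "s18": 1}

order = npf_visit_order("o2", 20, trace, stats, all_servers)
# Stand-in for the local search: per Fig. 1b, o2 is in s19 during [t18, t21].
found = npf_locate("o2", 20, order, lambda s, o, t: s == "s19")
```

With these inputs the trace-based part of the order is (s18, s16, s10), followed by (s1, s12, s13, s7) and then the servers without any data collection, so s19 is reached only after visiting the complete list.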

To illustrate how the PF and NPF algorithms perform their work, consider the examples in Figs. 1–3. Let $L_{slice}(o_2, p, t_{20})$ be an object-location query, where p is the class of objects with id o2. Trace(o2, t20) returns the sorted list of servers L = (s18, s16, s10), based on the time difference between the time of data collection stored in the meta-index and the query time. Since server s18 does not contain the location of o2 at time t20, the PF algorithm follows the path given by the move_out event that occurs when object o2 leaves server s18 and enters server s19, until it finds the exact location of object o2 in server s19 at time t20. When the NPF algorithm does not find the location of object o2 in s18 at time t20, in contrast, it continues searching in the subsequent servers of list L. Since none of those servers store the location of object o2 at time t20, the algorithm creates a list with the remaining servers not in L, sorted by the statistical information about the number of objects per server at the collection time closest to the query time. This list is L′ = (s1, s12, s13, s7, ...) and contains the four remaining servers that have been visited in the data collection process, plus all other servers that had no visits until the query time and have priority equal to 0. In this case, the process will need to visit, in the worst case, the complete list to find that object o2 was in server s19 (with no data collection) at time t20.

For a very even distribution of objects over all servers, the number of servers to be visited when using only statistical-based ranking to answer an object-location query tends to be similar to that of the O(|S|) random server selection algorithm. For a non-uniform distribution, however, the statistical-based ranked list of servers increases the probability of finding the target object, as it directs the search to the servers most likely to contain the answer first. In such cases, the expected number of servers visited before finding a location of an object should be less than O(|S|). Combining trace information with statistical information improves the performance of the search process even further, since it uses partial trace information to find a first location of the target object. This claim is supported by the experimental evaluation in Section 4.

3.3.4. Spatio-temporal range query processing

We present the algorithm for a time-slice query, noting that its extension to a time-interval query is straightforward. This algorithm considers the intersection of the query window with the geographic extent of servers in the meta-index at the time instant closest to the query time. In Algorithm 3, S is an approximated answer and O is a refined answer to query $Q_{slice}(w, p, t)$.

Algorithm 3. Algorithm to process time-slice queries

    // Query: $Q_{slice}(w, p, t)$
    Set S to an empty set of servers
    Set O to an empty set of objects
    For each server s in the system
        If $Extent(s, t)$ intersects w, then insert s into S
    Visit every server $s \in S$ and, for each object o of type p in server s at the time instant t, if the location of object o intersects w, then insert o into O
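A minimal Python sketch of Algorithm 3 follows. It assumes, for illustration only, that extents and the query window are axis-aligned rectangles `(xmin, ymin, xmax, ymax)`, that `extent` holds the extent recorded in the meta-index at the time closest to t, and that `visit(s, p, t)` is a hypothetical remote call returning `(object, x, y)` tuples from server s.

```python
def intersects(a, b):
    """True if two axis-aligned rectangles (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def time_slice_query(w, p, t, extent, visit):
    """Return (S, O): the approximated and the refined answer.

    extent: dict server -> rectangle taken from the meta-index
    visit:  callable (server, p, t) -> iterable of (object, x, y); hypothetical
    """
    # Approximated answer: servers whose recorded extent meets the window.
    S = {s for s, rect in extent.items() if intersects(rect, w)}
    # Refined answer: objects actually inside w, found by visiting only S.
    O = set()
    for s in sorted(S):
        for obj, x, y in visit(s, p, t):
            if w[0] <= x <= w[2] and w[1] <= y <= w[3]:
                O.add(obj)
    return S, O
```

The approximated answer S needs no communication with the servers, whereas the refined answer O costs one visit per server in S.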

3.3.5. Top-k query processing

We present the algorithm for a top-k query based on the number of objects in each server, noting that its modification for a query based on geographic extent is straightforward. This algorithm ranks the number of objects in servers at the time closest to the query time. In Algorithm 4, $S_1$ is an approximated answer and $S_2$ is a refined answer to a query $KO(t, p, t)$.

Algorithm 4. Algorithm to process top-k server queries based on the number of objects

    // Query: $KO(t, p, t)$
    Set $S_1$ to an empty set of servers
    Set $S_2$ to an empty set of servers
    Create set N with a tuple $(s, n)$ for each server s in the system, where $n = Objects(s, p, t)$
    Sort N in terms of the number of objects
    Insert in $S_1$ all servers s in tuples $(s, n)$ of N whose value n is among the top-k numbers of objects
    Create a set $S'$ with the tuples $(s, n)$ of N for servers s whose value n is among the top-2k numbers of objects
    Visit each server in $S'$ and replace the value n of its tuple $(s, n) \in S'$ with the real number of objects at time t in server s
    Sort $S'$ in terms of the number of objects
    Insert in $S_2$ all servers in tuples $(s, n)$ of $S'$ whose value n is among the top-k numbers of objects
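The two-stage structure of Algorithm 4 can be sketched in Python as follows. The names `estimated` (counts taken from the meta-index) and `real_count` (a hypothetical remote call that asks a server for its actual count) are assumptions made for illustration. Ties are broken by insertion order here, which simplifies the "top-k numbers of objects" wording of the algorithm.

```python
def top_k_servers(k, estimated, real_count):
    """Return (S1, S2): approximated and refined top-k server lists.

    estimated:  dict server -> number of objects recorded in the meta-index
    real_count: callable server -> actual number of objects (hypothetical)
    """
    ranked = sorted(estimated, key=estimated.get, reverse=True)
    S1 = ranked[:k]                      # answer from the meta-index alone
    candidates = ranked[:2 * k]          # S': only these servers are visited
    refreshed = {s: real_count(s) for s in candidates}
    S2 = sorted(refreshed, key=refreshed.get, reverse=True)[:k]
    return S1, S2
```

Restricting the visits to the top-2k candidates bounds the communication cost, at the price of missing a server whose real count grew while its meta-index entry ranked it below position 2k.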

3.3.6. Nearest-neighbor query processing

We present the algorithm to process NN queries. This algorithm searches for the objects closest to a target object o of type p at a given time instant t. In Algorithm 5, S is an approximated answer and O is a refined answer to a query $NN(o, p, t)$.

Algorithm 5. Algorithm to process a NN query

    // Query: $NN(o, p, t)$
    Set S to an empty set of servers
    Set O to an empty set of objects
    Set $O'$ to an empty set of objects
    Let $s' = Trace(o, t)$
    Let $(x', y')$ be the exact location of object o at time t, as derived from the object-location algorithm of Section 3.3.3
    Insert $s'$ into S
    For each server s in the system
        If $Extent(s, t)$ intersects $s'$, then insert s into S
    Visit every server $s \in S$ and insert into $O'$ the tuples $(o', x, y)$ such that $o' \neq o$ and $o'$ is an object of type p that was in s at location $(x, y)$ at time instant t
    Compute the minimal distance between $(x', y')$ and the location $(x, y)$ of each object $o'$ in set $O'$. Insert into O the objects at that minimal distance
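A possible Python rendering of Algorithm 5 is given next. It is a sketch under assumptions: extents are axis-aligned rectangles, the test "intersects the server found by the trace" is interpreted as an intersection with that server's recorded extent, and `visit(s, p, t)` is a hypothetical remote call returning `(object, x, y)` tuples.

```python
import math


def intersects(a, b):
    """True if two axis-aligned rectangles (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def nn_query(o, loc, home, extent, visit, p, t):
    """Return (S, O) for a nearest-neighbor query about object o.

    loc:    exact (x, y) location of o at time t
    home:   server holding o, as obtained from the trace lookup
    extent: dict server -> rectangle taken from the meta-index
    visit:  callable (server, p, t) -> iterable of (object, x, y); hypothetical
    """
    # Approximated answer: the home server plus servers overlapping its extent.
    S = {home} | {s for s, r in extent.items() if intersects(r, extent[home])}
    candidates = [(obj, x, y)
                  for s in sorted(S)
                  for obj, x, y in visit(s, p, t)
                  if obj != o]
    if not candidates:
        return S, set()
    # Refined answer: every object at the minimal distance from loc.
    best = min(math.dist(loc, (x, y)) for _, x, y in candidates)
    O = {obj for obj, x, y in candidates if math.dist(loc, (x, y)) == best}
    return S, O
```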

3.4. Keeping the meta-index updated

Meta-index updating relates to the strategies for collecting data from the distributed servers to keep the meta-index properly updated. It is akin to the concept of replication of information [26,19,5], since partial information of the servers is replicated in the meta-index.

We propose crawling as a method for collecting information from distributed moving object database servers, and propose a server ranking policy that lets the crawler visit servers so that the available communication bandwidth is efficiently exploited by preferentially downloading data from the best-ranked servers [20]. A highly ranked server is one whose object dynamics make it a priority to keep an up-to-date record of its activity in the meta-index. Note that the ranking value of a server can change dynamically; that is, a particular server may be ranked better than others only during a fixed period of time, depending on the evolution of the moving objects. In terms of crawling and for a given period of time, we say that a server is more relevant than another if it has a better position in the overall ranking of servers. In a production system, the overall ranking is periodically recalculated and crawling never ends, repeatedly visiting servers in ranking order.

To rank servers, our method uses the values $\beta$ and $\delta$. The value of $\beta(i, t_1, t_n)$ is a measure of the relative importance of a server i in an interval $[t_1, t_n]$, and the value $\delta(i, t_1, t_n)$ is a measure of how active server i is during $[t_1, t_n]$. We represent the ranking value of servers as points in a 2D space $\beta \times \delta$ (see Fig. 7). The ranking value of server i is obtained from the distance between its point in the $\beta \times \delta$ plane and the maximum reference point (1.0, 1.0): the distance is normalized and subtracted from 1, so that a shorter distance yields a higher ranking, namely

$$r(i, t_1, t_n) = 1 - \sqrt{\frac{(1 - \beta(i, t_1, t_n))^2 + (1 - \delta(i, t_1, t_n))^2}{2}}. \tag{11}$$

The details of b and d are as follows.

Definition 3.12 (Importance of servers). Given a total number of objects N and the number of objects $n(i, t)$ in a server i at time t, the importance of server i relative to the whole system for a time interval $[t_1, t_n]$ is given by

$$\beta(i, t_1, t_n) = \frac{AvgN(i, t_1, t_n)}{N}, \tag{12}$$

where $AvgN(i, t_1, t_n)$ is the weighted average

$$AvgN(i, t_1, t_n) = \sum_{t_k = t_1}^{t_{n-1}} n(i, t_{k+1}) \times \frac{t_{k+1} - t_k}{t_n - t_1}.$$

Fig. 7. Ranking space.

Definition 3.13 (Normalized variation). Given the total number of objects $n(i, t)$ in a server i at time instant t, the maximum average number of objects $MaxAvgN(i, t_1, t_n)$ observed in server i in the time interval $[t_1, t_n]$ is

$$MaxAvgN(i, t_1, t_n) = \max\{n(i, t_j) \text{ with } t_j = t_1, t_2, \ldots, t_n\}.$$

The normalized variation of the number of objects in a server i is defined as

$$\alpha(i, t_1, t_n) = \frac{MaxAvgN(i, t_1, t_n)}{n(i, t_1)}. \tag{13}$$

Definition 3.14 (Normalized rate of update requests). Let $n_\alpha(i, t_1, t_n)$ be the number of times that the normalized variation $\alpha(i, t_1, t_n)$ of a server i during a time interval $[t_1, t_n]$ is larger than a tolerance threshold $\tau$ (i.e., $\alpha(i, t_1, t) > \tau$, with $t \le t_n$). Then the normalized rate of update requests is defined as

$$\delta(i, t_1, t_n) = \frac{n_\alpha(i, t_1, t_n)}{t_n - t_1} \times \frac{1}{\delta_T}, \tag{14}$$

where $\delta_T$ is the normalization sum over all servers in the period.

Note that the highest values for $\beta(i, t_1, t_n)$ and $\delta(i, t_1, t_n)$ (i.e., 1.0) represent the case in which the server concentrates the total set of objects and update requests at all time instants. The formulae presented to determine $\beta$ and $\delta$ can be extended to systems containing several types of moving objects by employing a linear combination of the respective $\beta$ and $\delta$ values per type (a weight factor per type must be defined to combine the different $\beta$ and $\delta$ values).
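To make Eqs. (11)–(14) concrete, the following Python sketch computes the quantities for sampled data. It is an illustration only: the function names are hypothetical, the samples `counts[k]` stand for the number of objects in server i at `times[k]`, and the normalization sum is assumed to be the sum of the raw update-request rates of all servers.

```python
import math


def avg_n(times, counts):
    """Weighted average used in Eq. (12); counts[k] is n(i, times[k])."""
    span = times[-1] - times[0]
    return sum(counts[k + 1] * (times[k + 1] - times[k])
               for k in range(len(times) - 1)) / span


def importance(times, counts, total_objects):
    """Eq. (12): share of the N objects held by the server on average."""
    return avg_n(times, counts) / total_objects


def normalized_variation(counts):
    """Eq. (13): maximum observed count relative to the first one."""
    return max(counts) / counts[0]


def update_rates(exceed_counts, t1, tn):
    """Eq. (14) for all servers; exceed_counts[s] is the number of times the
    normalized variation of server s exceeded the tolerance threshold."""
    raw = {s: c / (tn - t1) for s, c in exceed_counts.items()}
    total = sum(raw.values())  # assumed meaning of the normalization sum
    return {s: (v / total if total else 0.0) for s, v in raw.items()}


def ranking(beta, delta):
    """Eq. (11): 1 minus the normalized distance to the point (1.0, 1.0)."""
    return 1.0 - math.sqrt(((1.0 - beta) ** 2 + (1.0 - delta) ** 2) / 2.0)
```

For instance, a server with importance and update rate both equal to 1.0 gets the ranking value 1.0, and a server with both equal to 0 gets the ranking value 0.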

At the meta-index side, the crawler is formed by a scheduler that uses a list of server addresses ordered by ranking in the $\beta \times \delta$ plane to assign the job of retrieving data from servers to a set of so-called robots. The robots are threads that connect to servers and download data from them. Normally, the scheduler fetches a server address and passes it on to the next idle robot. Once a robot retrieves the data associated with its current server, it extracts statistical data from it and passes these data to the scheduler. These data are used to recalculate the ranking of servers in the scheduler's ordered list.
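One crawling round of such a scheduler could look like the sketch below, which uses a thread pool from the Python standard library to play the role of the robots. The function `fetch` is a hypothetical placeholder for the network call that downloads data from a server, and re-ranking after the round is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_round(rank, fetch, n_robots):
    """Visit every server once, best ranked first, using n_robots threads.

    rank:  dict server -> current ranking value
    fetch: callable server -> statistics extracted from that server
           (hypothetical; in a real crawler this is a network request)
    """
    order = sorted(rank, key=rank.get, reverse=True)
    with ThreadPoolExecutor(max_workers=n_robots) as robots:
        # Jobs are submitted in ranking order; each idle robot takes the next.
        results = list(robots.map(fetch, order))
    # The caller uses these statistics to recompute the ranking values.
    return dict(zip(order, results))
```

A production crawler would call a function like this in an endless loop, recomputing the ranking between rounds.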

Note that in actual systems, it is crucial to reduce the amount of data transferred between servers and the crawler. With respect to the statistical data in the meta-index, what is delivered to the crawler during a communication action is the actual number of objects of every type in server i and the cumulative geographic extent of server i since the last visit.

The specific time instant t of the data collection action, along with the number of objects per type, is also used to keep track of the $n(i, t)$ values required to determine the ranking value of server i. The server can respond with an approximate representation of the function for the number of objects of every type in server i observed since the previous data collection. Alternatively, it can respond with an average $n(i, t)$ value since the last visit. In this case, it suffices to send just the average number of objects per type registered in the period and assign to this set a timestamp equal to the instant at which the data collection is performed. This timestamp is used to index the set in the temporal part of the statistical meta-index.

For the trace and trajectory information, what is sent is a list of events associated with the moving objects that have been in the server since the last visit to this server. As in the previous case, the type of data stored in the meta-index has a direct effect on the amount of data transmitted to the meta-index. This, in turn, has an effect on how closely the meta-index follows the present system evolution. We have observed that a significant reduction in the amount of communication (at the expense of a minor efficiency degradation) can be achieved if, at the instant of the data collection, a set of object ids is simply transferred and the crawler assigns them a timestamp equal to the time of the visit.

In the crawling strategy, the control of the update operations is kept at the meta-index side. In Section 4 we compare our crawling approach with its counterpart, namely harvesting, which is a method in which the servers autonomously send data to the meta-index. Both methods can be related to pulling and pushing strategies, respectively, for data replication [26].

In the harvesting strategy, there is a harvester at the meta-index side that receives requests from servers to update information. To trigger an update request to the meta-index, every server periodically calculates the $\alpha$ value defined by Eq. (13). If the relative variation is larger than a tolerance threshold $\tau$, the server sends an update request to the harvester. If the harvester is idle, it receives the information; otherwise, the server needs to retry later. It does so after the server has processed a given number of move-in and move-out events, waiting, as a result, a random time before trying again. Using the same $\alpha$ and $\tau$ as in the crawling strategy ensures that both approaches are compared under the same conditions.

3.5. Data compression

The communication bandwidth can be used more efficiently if data transmissions are made in compressed form. As illustrated in Figs. 2a and b, the data that are collected from the servers and stored in the meta-index are composed of a constant-size header followed by a sequence of object ID values. The data transferred from a server i have the form $(s_i, t) : o_1, \ldots, o_n$, where the pair $(s_i, t)$ indicates the server ID and the time instant t of the collection, and the sequence $o_1, \ldots, o_n$ contains the object IDs found in the server. If we assume that the sequence $o_1, \ldots, o_n$ is sorted by increasing ID values, then this sequence can be efficiently compressed by rewriting it as $o_1, d_2, \ldots, d_n$, where $d_k$ is the difference between the IDs associated with objects $o_k$ and $o_{k-1}$, so that, to produce any ID, say $o_j$, we need to calculate the sum

$$o_j = o_1 + \sum_{k=2}^{j} d_k.$$

The idea is that the differences $d_j$ can be represented using fewer bits than the absolute values $o_j$ [50]. To this end, we can store the differences using just the minimum number of bits required to represent the maximum value among the values $d_j$. In Section 4 we show that this simple scheme achieves good compression rates.
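The difference-based scheme can be sketched in a few lines of Python. The sketch only computes the gaps and the fixed bit width they require; the actual bit packing, the 32-bit size assumed for an uncompressed ID, and the function names are illustrative assumptions.

```python
def delta_encode(sorted_ids):
    """Rewrite an increasing ID list as (first ID, gaps, bits per gap)."""
    first = sorted_ids[0]
    gaps = [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    # Minimum number of bits able to represent the largest gap.
    width = max(gaps).bit_length() if gaps else 0
    return first, gaps, width


def delta_decode(first, gaps):
    """Recover the IDs by accumulating the gaps onto the first ID."""
    ids = [first]
    for gap in gaps:
        ids.append(ids[-1] + gap)
    return ids


def compressed_bits(gaps, width, id_bits=32):
    """Size of the encoded sequence: one full ID plus fixed-width gaps."""
    return id_bits + len(gaps) * width
```

For the IDs 1000, 1003, 1004, 1010 the gaps are 3, 1, 6, which fit in 3 bits each, so the sequence takes 32 + 9 = 41 bits instead of 128 under the assumed 32-bit IDs.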

On the other hand, the data stored in the meta-index for any object $o_i$ can also be compressed because they have the form $(s_1, t_1), \ldots, (s_n, t_n)$, where $s_i$ is the server ID and $t_i$ is the time of the data collection, which can be an integer value such as a Unix timestamp. The values $t_i$ are in chronological order, that is, sorted by increasing timestamp, so they can be represented using the same compression method. However, the values $s_i$ are in arbitrary order, so we need to spend one additional bit per server ID value to represent the sign of the difference between server IDs $s_k$ and $s_{k-1}$.
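For the unsorted server IDs, the extra sign bit can be illustrated as follows. The representation of each entry as a `(sign, magnitude)` pair and the function names are assumptions made for this sketch.

```python
def signed_delta_encode(server_ids):
    """Encode an unsorted ID list as first ID plus (sign, magnitude) pairs."""
    first = server_ids[0]
    pairs = []
    for prev, cur in zip(server_ids, server_ids[1:]):
        diff = cur - prev
        pairs.append((1 if diff < 0 else 0, abs(diff)))
    magnitude_bits = max((m for _, m in pairs), default=0).bit_length()
    # One additional bit per entry stores the sign of the difference.
    return first, pairs, magnitude_bits + 1


def signed_delta_decode(first, pairs):
    """Rebuild the ID list by adding or subtracting each magnitude."""
    ids = [first]
    for sign, magnitude in pairs:
        ids.append(ids[-1] - magnitude if sign else ids[-1] + magnitude)
    return ids
```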

To avoid large differences between timestamps (and thereby a large number of bits to represent them), it is convenient to store the actual timestamps in a table and use an additional consecutive ID to identify particular timestamps. This is because two consecutive data collections are not expected to take place at closely spaced timestamps. In addition, to further improve the compression rate, a discretization can be applied to merge close timestamps into the same ID.
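A timestamp table with discretization might be built as in the following sketch. The bucket size is a free parameter and the choice of the bucket start as the representative timestamp is an assumption of this example, not something prescribed by the text.

```python
def build_timestamp_table(timestamps, bucket=1):
    """Replace timestamps by small consecutive IDs.

    Timestamps falling in the same bucket of `bucket` seconds share one ID.
    Returns (table, ids): table[i] is the representative timestamp of ID i,
    and ids[k] is the ID assigned to timestamps[k].
    """
    buckets = sorted({ts // bucket for ts in timestamps})
    table = [b * bucket for b in buckets]
    id_of = {b: i for i, b in enumerate(buckets)}
    return table, [id_of[ts // bucket] for ts in timestamps]
```

Since the IDs are consecutive, their differences stay small regardless of how far apart the actual data collections are.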

4. Experimental evaluation

We first show experimental results from an actual and very large system of moving objects and then study in much more detail the advantages and limitations of the meta-index strategy we propose in this paper.

4.1. The WWW application

We evaluate the performance of the meta-index for searching the trajectory of users in the Web. The Web can be seen as a very large virtual space that is crossed by users during their navigations through different sites. As such, it resembles a spatio-temporal application of moving objects, where moving objects are users visiting Web sites. We refer to this application as a spatial–temporal system of Web navigations (STWN). A STWN system allows one to explore the temporal as well as the spatial dimension of the user navigation process. In particular, one can extract information about the users or types of users that have visited a server, whether or not a particular user was connected to the Web during a time interval, and what servers a particular user visited during a time period.

An important characteristic of this application is that the structure of the Web does not follow any topological or geometric constraint as the physical geographic space does. The concept of adjacent locations present in the geographic space does not hold in the virtual space of the Web. Therefore, an object can move from one server to any other, without visiting servers in between. Consequently, we cannot assume that the system will be able to capture the origin and destination of object movements across servers and, therefore, only the NPF search algorithm (non-path following algorithm) is applicable. Also, the users may enter and leave the system several times. All these characteristics make this application a highly demanding scenario for validating the proposal of this work.

Table 2. Performance of searching in the STWN application under 2000 location queries of three types (b1, b2 and b3).

| Partitions | Robots | Seq-b1 | Indx-b1 | Seq-b2 | Indx-b2 | Seq-b3 | Indx-b3 |
|---|---|---|---|---|---|---|---|
| 1 | 3200 | 50,132.45 | 1.14 | 51,678.23 | 146.43 | 51,228.87 | 4.74 |
| 10 | 320 | 5000.15 | 1.09 | 5132.04 | 26.81 | 5098.47 | 1.02 |
| 50 | 64 | 1000.37 | 1.02 | 1026.81 | 7.03 | 1020.13 | 1.01 |
| 100 | 32 | 500.41 | 1.01 | 513.66 | 4.30 | 510.34 | 1.02 |
| 50 | 32 | 1000.37 | 1.02 | 1026.81 | 7.27 | 1020.13 | 1.01 |
| 10 | 32 | 5000.15 | 189.04 | 5132.04 | 962.92 | 5098.47 | 85.55 |
| 1 | 32 | 49,997.19 | 34,939.31 | 51,315.92 | 37,609.36 | 50,980.25 | 36,791.77 |

Seq-X is the average number of visited servers in a sequential search for query type X and Indx-X is the average number of visited servers in a search using the meta-index for query type X, where types X = b1, b2 and b3 represent random navigation, large navigation and small navigation queries, respectively.

The basic assumptions for this system are the following.

• Users are identified by their IPs. This may not always be a realistic scenario, since dynamic assignment of IPs prevents a precise association of users with IPs; however, it is an approximation that is reliable when searching for types of users or institutions.
• There exists a set of servers that keep track of the user visits to Web sites by means of a proxy or log file. Such servers form our distributed system of spatio-temporal database servers.

We performed experiments using 112,000 Web sites from the “.com” domain and obtained navigations performed by over 4 million different IPs in a week. From proxy servers, we obtained the Unix timestamp and site at which every visit was performed, and latency times were obtained by performing direct crawling on the Web sites. We consider several latency times of servers that depend on the time of crawling. Users (“moving objects”) in the system visit seven different servers on average. Most visits concentrate on a small subset of servers (i.e., non-uniform distribution of objects in servers), with many repeated visits to the same servers.

Using the data extracted from proxy servers and the latency times determined by direct crawling to Web sites, we created a meta-index that combines statistical and trace data of the navigation trajectory of users in the Web. This meta-index is then used to answer 2000 location queries of users selected uniformly at random (b1), 2000 location queries of users with the largest navigation trajectories (b2), and 2000 location queries of users with the shortest navigation trajectories (b3). All query objects are users whose first visit to a server was prior to the query time; that is, they are existing objects in the system at the query time. All queries use the same query time, set at 80% of the total time period of the data sample (i.e., a total period of a week, equivalent to 604,800 s).

Given the large number of servers (Web sites) to be considered during searches, we resorted to the use of parallelism to improve overall efficiency. To this end, we divided the system into P partitions and allocated a crawler and a meta-index to each partition. Every partition is assigned to a different processor (computer). Web sites are distributed uniformly at random onto the partitions so that every crawler is in charge of collecting data from a different subset of the Web sites. Searches are performed by sending the query to every partition (meta-index) so that in each partition a search for the same query is started by visiting the set of servers retrieved from the respective meta-index. In each partition we employ the NPF algorithm. The search in all partitions stops as soon as the search algorithm in one partition reports the correct answer to the query.

We studied the performance under different configurations for the number of robots and bandwidth available for crawling and searches. Table 2 shows the results for P = 1, 10, 50 and 100 partitions for cases in which 3200 robots are evenly distributed onto the partitions (first part of the table) and 32 robots are assigned to each scheduler in charge of a partition (second part of the table). The first case shows different alternatives for administering the 3200 robots and represents a case of large bandwidth. The second case represents a situation of small bandwidth for performing crawling. The best performance is achieved with P = 100 partitions and 32 robots per partition (central part of the table). This case, together with the first part of the table, shows that for fast objects it is better to divide the robots among partitions rather than putting all of them to work on a single large partition. The number of missing objects (which causes statistical search) drops significantly when using partitions, as crawling on database servers can be performed faster than in the case of one partition.

4.2. Detailed performance evaluation

Our approach to the study of the comparative performance of our query algorithms is based on the use of object-oriented discrete-event simulation. These simulations allow us to evaluate the different alternatives under the same conditions and explore demanding scenarios. Our simulators use event traces generated by well-known spatio-temporal data-set generators. On top of that, we simulate the actions of the crawling strategy by taking into consideration the cost of latency and data transfers between database servers and crawler robots. At the same time, after a warm-up period, we introduce a constant stream of random spatio-temporal queries to evaluate the efficiency of the query algorithms and study their advantages and limitations.

From the experience obtained with the crawling of the WWW application, the values of latency and rate of data transfers between crawler and servers were adjusted to approximate a real-life setting in the discrete-event simulator. Had those values been set too small, the crawling would have been so fast that all strategies would have achieved nearly the same performance.

In the experiments below, we refer to the cumulative effect of communication latency, rate of data transfer (bytes per second) between robots and servers, and the number of crawler robots as crawling bandwidth. This defines the capability of the crawler to keep the meta-index properly updated. We present experiments with low and high crawling bandwidth. In each case, we use the same crawling bandwidth in the simulator to compare the performance of the alternative strategies for the meta-index.

We compare different alternatives of meta-index data content to support the proposal of a combined statistical and trace meta-index (STM-index). In particular, we evaluate query processing that only uses statistical, trace or trajectory data. Like trace data, trajectory data is a list with a subset of locations visited by an object, but in this case, the list is in temporal order. Trajectories in the meta-index are created/updated for only those objects alive in the servers at the time of data collection. Recall that for trace data, the meta-index is updated with all objects that have been registered in the server, even when these objects may have moved to a region controlled by another server. Thus, we obtain a sequence with the subset of servers visited by an object, which is in chronological order.

Due to restrictions on the maximum number of objects imposed by the data-set generators and memory limitations of the computer running our simulators, we simulate systems with a relatively small number of servers (typically five hundred). However, in our case this is just a scaling issue, as we tune the simulation parameters to reflect a proper relationship between crawling bandwidth and number of servers. The WWW application of the previous section is an example showing that current Internet bandwidths are able to keep data coming from hundreds of thousands of servers fairly synchronized at the meta-index side. For the systems described below, the amount of information to be transferred between servers and crawler in every data collection is expected to be of the same order as, for instance, Web page sizes.

4.2.1. Data sets

We use different data sets (i.e., event traces) generated by two available spatio-temporal dataset generators: the generator of spatio-temporal data (GSTD) [44,45] and the network-based generator of moving objects (NGMO) [1,2]. With GSTD we created a dataset of 50,000 objects moving with a Gaussian distribution over the space. With NGMO we created a dataset of 50,000 initial objects moving across a network. While GSTD keeps the number of objects constant, NGMO creates and eliminates objects over time. We set up the rate of new and eliminated objects in NGMO so that the number of objects was within the range [45,000–50,000].

We deploy servers over the area containing moving objects by using two strategies. The aim is to produce different distributions of objects across servers. A first strategy applies a regular cell partitioning over the space, with a number of cells equal to the number of servers. This produces high imbalance. A second strategy tries to improve the uniformity of the number of objects per server. This is achieved by placing more servers in areas of higher concentration of movements, based on a uniform cell partitioning of the space and a density analysis per cell. Cells with the highest densities were chosen for placing the servers. Fig. 8 shows the distribution of 50,000 objects over 500 servers in the different datasets. We used these data sets in our experiments and adjusted crawling bandwidth to emulate the requirements of large-scale systems (e.g., saturation, latency and transfer rates).

Once servers are assigned to different areas, we associate the initial objects with the servers that are closest to each of them. Objects and servers are points $(x, y)$ in the space. Using the $(x, y)$ coordinates of servers we define a Voronoi partition of the space, which determines the instants at which the objects enter and leave servers. We call these move-in and move-out events, respectively.
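Since the Voronoi cell of a server is the set of points closer to it than to any other server, event generation reduces to a nearest-server test. The Python sketch below illustrates this under assumptions: trajectories are given as sampled `(t, x, y)` positions, so crossing instants are approximated by the sampling times, the initial assignment is reported as a move_in event, and all names are hypothetical.

```python
import math


def nearest_server(point, servers):
    """Server whose (x, y) position is closest to the point (Voronoi cell)."""
    return min(servers, key=lambda s: math.dist(point, servers[s]))


def movement_events(obj_id, trajectory, servers):
    """Derive move_in / move_out events from sampled positions.

    trajectory: list of (t, x, y) samples in increasing time order
    servers:    dict server -> (x, y) position
    """
    events = []
    current = None
    for t, x, y in trajectory:
        server = nearest_server((x, y), servers)
        if server != current:
            if current is not None:
                events.append((t, "move_out", obj_id, current))
            events.append((t, "move_in", obj_id, server))
            current = server
    return events
```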

A last synthetic (but very demanding) data set was created by combining different data sets generated by GSTD. The aim was to create a highly non-uniform distribution of objects per server, together with a non-uniform distribution of the average number of objects per server over time. We refer to this data set as GINE. A summary of the characteristics of the different datasets is shown in Table 3, where the number of movements refers to move_in events; that is, movements between servers and not within a server.

The experimental evaluation considers search algorithms that were designed under the following conditions:

1. Historical versus non-historical data: These experiments evaluate whether or not considering time information in the meta-index speeds up the search process. We refer to this evaluation as with history (H) and without history (NH).
2. PF versus NPF algorithms: These experiments evaluate the effect of having or not having data about where an object movement comes from or goes to (i.e., explicit or implicit move-in and move-out events, respectively).

4.2.2. Crawling versus harvesting

In Fig. 9 we show the effectiveness of the triggering rules of the crawler against a case in which the crawler just selects uniformly at random the next server to be visited. For both the non-uniform and uniform GSTD systems, the results show that the triggering rules allow the crawler to collect information about the complete set of objects well before the random selection alternative does.

Fig. 8. Sets of 50,000 objects on 500 servers: (a) GSTD with uniform distribution of objects (GU), (b) GSTD with non-uniform distribution of objects (GN), (c) NGMO with uniform distribution of objects (MU), and (d) NGMO with non-uniform distribution of objects (MN).

Table 3. Characteristics of data sets.

| Data | Number of objects | Number of movements | Average trajectory length |
|---|---|---|---|
| GINE | 200,000 | 901,486 | 6.5 |
| GU | 50,000 | 241,620 | 4.6 |
| GN | 50,000 | 226,617 | 5.6 |
| MU | 45,000–55,000 | 2,119,813 | 15.8 |
| MN | 45,000–55,000 | 2,204,510 | 16.4 |

Fig. 9. Crawling the GSTD system (described in Section 4.2.1) for the trace-based searching algorithm with 50,000 objects. (a) GSTD for non-uniform initial distribution of objects to servers and (b) uniform distribution. The y-axis shows the cumulative number of different objects collected by the crawler. The x-axis shows the total simulation period. Curves: A = triggering rules, B = random selection.

Table 4 shows a comparison between crawling and harvesting, namely the alternative strategy of letting servers themselves initiate data transfers. To make the comparison fair, harvesting uses the same triggering rules as crawling but at the server side. Also, the meta-index machine accepts the same number of concurrent connections as the number of robots used by the crawler. The rows of the table show results for different average network latencies involved in sending messages from servers to the meta-index. The first column, H/C, which refers to how far behind the present time the events stored in the meta-index are, shows that crawling is able to keep the meta-index updated at a much faster speed than harvesting. As shown in the second column, C/H, this causes crawling to pay more visits to relevant servers than an equivalent situation with harvesting. In harvesting, servers have to decide by themselves, with no information about other servers, the instants at which they are prepared to transfer data to the meta-index. As shown by column Busy/Visits, this causes many data transfer attempts to find the meta-index busy, with all its concurrent connections exhausted.

Table 4. Crawling (C) versus harvesting (H) for simulations with 32 robots upon average network latencies of 0.1, 0.5 and 1.0 units of time.

| System | Latency | Lagging H/C | Visits C/H | Busy/visits |
|---|---|---|---|---|
| Uniform GSTD | 0.1 | 637.51 | 147.47 | 1.82 |
| Uniform GSTD | 0.5 | 93.61 | 20.61 | 2.92 |
| Uniform GSTD | 1.0 | 47.74 | 8.83 | 4.43 |
| Non-uniform GSTD | 0.1 | 589.50 | 137.34 | 1.78 |
| Non-uniform GSTD | 0.5 | 95.88 | 20.22 | 2.92 |
| Non-uniform GSTD | 1.0 | 48.67 | 9.05 | 4.49 |

Table 5. Compression rates for crawling and indexing.

| Robots | Crawling (Single) | Crawling (All) | Indexing (Single) | Indexing (All) |
|---|---|---|---|---|
| 8 | 0.34 | 0.31 | 0.51 | 0.32 |
| 16 | 0.35 | 0.33 | 0.48 | 0.29 |
| 32 | 0.38 | 0.35 | 0.48 | 0.28 |
| 64 | 0.41 | 0.37 | 0.47 | 0.28 |

Table 6. Average trajectory length of 200 query objects: b1 random objects, b2 objects with the largest trajectories, and b3 objects with the shortest trajectories.

| System | Objs. | b1 | b2 | b3 |
|---|---|---|---|---|
| GINE | 200,000 | 6.475 | 11.705 | 1.990 |
| GU | 50,000 | 5.755 | 11.355 | 2.990 |
| GN | 50,000 | 5.720 | 11.210 | 2.990 |
| MU | 150,000 | 17.880 | 82.815 | 2.000 |
| MN | 150,000 | 19.700 | 95.525 | 1.995 |

In the case of the MU and MN systems, the table shows the total number of new objects that were created in order to maintain an average number of moving objects between 45,000 and 55,000.

4.2.3. Compression

In Table 5 we show the compression rate achieved by the method described in Section 3.5. We define this rate as the ratio of the total bits occupied by the compressed data to the total bits occupied by the uncompressed data. The table shows results for compressing the data transferred from servers to the meta-index (crawling columns) and the space required by the data associated with each object stored in the meta-index (indexing columns). The columns “All” indicate the compression rate considering the space required by the entire data, whereas the columns “Single” indicate the average compression rate of individual chunks of data, namely a single data transfer or the data stored for a single object. These results indicate that the space requirement for the whole data can be compressed below 38%. Individual chunks of data can be compressed below 52% on average. The difference between the two quantities arises because in the experiments we performed many single actions that were dominated by short sequences, which are less compressible, and this affected the average shown in the columns labeled “Single”. However, large, highly compressible sequences dominated the space occupied by data and, in the end, they reduced the overall space requirements to below 38%.

4.2.4. Query processing

We first evaluate the cost of processing queries of the type find the location of an object x at time instant t with the three different types of data in the meta-index (i.e., statistical, trace, and trajectory data). Then, we show that, for this type of query, the combination of statistics with traces outperforms the other alternatives. The data in these meta-indexes were obtained from the simulated crawling with 4 and 8 robots and a latency time of 0.001 units. We selected these parameters because they allow us to properly consider the effect of the crawler speed on the search effectiveness during query processing. The objective is to illustrate a practical setting as opposed to situations where crawling is so fast that all strategies tend to achieve similar performance. In addition, for the meta-indexes based on trajectories and traces of objects, we show the best results from the trade-off between data-transfer time and the amount of object ids kept at the meta-index side, obtained when sending 100%, 75%, 50% or 25% of the selected objects in every visit to the servers.

We performed runs of 200 queries, selecting the target objects in three different ways: objects selected uniformly at random (labeled as b1 in the figures below), objects with the longest trajectories across servers (b2), and objects with the shortest trajectories (b3). The average number of trajectories of objects for these three types of queries is shown in Table 6. The query time was set to 0.8 (with 1.0 being the end of the simulation), and only the objects that were present in the system at time 0.8 were selected as query objects. This is particularly important for the NGMO data, since objects enter and leave the system during the simulation period. To compare the different search strategies we use the average number of servers visited per query. This number was normalized with respect to the average number of servers visited by the random server selection algorithm. Thus a value of 1.0 means that the respective search algorithm performs like the random server selection algorithm (i.e., having a meta-index does not produce any improvement at all).
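The normalized metric described above can be sketched as follows. This is a minimal illustration, not the simulator used in the experiments: the helper names, the system size of 500 servers, and the hypothetical strategy that needs roughly a quarter of the baseline visits are all assumptions.

```python
import random
from statistics import mean
from typing import Sequence


def random_selection_visits(num_servers: int, target_server: int,
                            rng: random.Random) -> int:
    """Servers probed when visiting servers in random order until the target is hit."""
    order = list(range(num_servers))
    rng.shuffle(order)
    return order.index(target_server) + 1


def normalized_visits(strategy_visits: Sequence[float],
                      random_visits: Sequence[float]) -> float:
    """Average visits of a strategy divided by the average of the random baseline."""
    return mean(strategy_visits) / mean(random_visits)


rng = random.Random(42)
# 200 queries, matching the run size in the text; 500 servers is an assumed size.
baseline = [random_selection_visits(500, rng.randrange(500), rng) for _ in range(200)]
# Hypothetical strategy that needs about a quarter of the baseline visits.
strategy = [max(1, v // 4) for v in baseline]
print(round(normalized_visits(strategy, baseline), 2))
```

A value of 1.0 is obtained when a strategy visits, on average, as many servers as the random baseline; values below 1.0 indicate that the meta-index helps.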

4.2.5. Results for the GINE system

We ran our search algorithms over this highly dynamic system. The algorithms were classified as path followers (PFs) and non-path followers (NPFs), considering a meta-index that does and does not keep historical information (prefixes H and NH, respectively). We have four combinations denoted by the letters A = HPFs, B = NHPFs, C = HNPFs and D = NHNPFs. For each of these situations, we ran the three benchmarks, namely b1 (average objects), b2 (fast objects) and b3 (slow objects), using the statistics-based meta-index (STAT), the trajectory-based meta-index (TRAJ), and the trace-based meta-index (TRACE).

The results for small crawling bandwidth (4 robots) and large crawling bandwidth (8 robots) are shown in Fig. 10. No single strategy is the best for all cases. The GINE system presents a case of high variability of the importance of servers over time, and it contains a large number of moving objects. The reduction in performance of TRAJ and TRACE is due to the cases in which the target object was not found in the meta-index (when this happens the search is performed as in the random server selection algorithm).

For this kind of system, the results show that a combination of the STAT and TRACE strategies is a good alternative. The combination is straightforward: in each trace-based data collection, the servers also send the average number of objects per type measured since the last robot visit. Clearly, for systems in which the number of different types of objects is much smaller than the total number of objects, this does not lead to a significant increase in the communication cost (intuitively, most real-life applications are expected to satisfy this requirement). In this combined scheme, the TRACE search algorithm resorts to a STAT search when the target object of the query is not present in the meta-index. Below we show performance results for this combination.
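The fallback rule of the combined scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm evaluated in the experiments: the meta-index is modeled as two dictionaries, statistics are reduced to one average object count per server, trace entries are assumed to be stored oldest first, and `holds` is a hypothetical callback standing in for a visit to a database server.

```python
from typing import Callable, Dict, List, Optional, Tuple


def stat_order(avg_objects: Dict[str, float]) -> List[str]:
    """STAT ranking: servers in decreasing order of average object count."""
    return sorted(avg_objects, key=lambda s: avg_objects[s], reverse=True)


def combined_search(obj_id: str,
                    trace_index: Dict[str, List[str]],
                    avg_objects: Dict[str, float],
                    holds: Callable[[str, str], bool]) -> Tuple[Optional[str], int]:
    """Return (server holding obj_id or None, number of servers visited)."""
    if obj_id in trace_index:
        # TRACE: servers recorded for the object, most recent entry first
        # (assumed ordering), followed by the remaining servers in STAT order.
        seen = list(dict.fromkeys(reversed(trace_index[obj_id])))
        candidates = seen + [s for s in stat_order(avg_objects) if s not in seen]
    else:
        # Object missing from the meta-index: resort to a pure STAT search.
        candidates = stat_order(avg_objects)
    for visited, server in enumerate(candidates, start=1):
        if holds(server, obj_id):
            return server, visited
    return None, len(candidates)


# Hypothetical toy system: o1 has a trace, o2 is missing from the meta-index.
trace_index = {"o1": ["s3", "s1"]}
avg_objects = {"s1": 40.0, "s2": 90.0, "s3": 10.0}
location = {"o1": "s1", "o2": "s2"}


def holds(server: str, obj: str) -> bool:
    return location.get(obj) == server


print(combined_search("o1", trace_index, avg_objects, holds))
print(combined_search("o2", trace_index, avg_objects, holds))
```

In this toy setting the missing object o2 is still found in one visit because the statistics rank its server first; without the fallback the search would proceed as random server selection.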

Note that in Fig. 10 the two sequences A-B and C-D show the impact of keeping historical information in the meta-index. Also, some strategies are more affected by the crawling bandwidth than others. STAT shows a more stable performance.

Nevertheless, it is clear that the trajectory-based strategy is more informative about the overall evolution of the moving objects system. A trajectory-based meta-index structure would enable data mining to discover general trends about object movement without having to scan the local servers.

Fig. 10. Results for the GINE system. The y-axis shows the average number of servers visited during searches represented as the ratio of meta-index-guided to random-selected servers. The x-axis shows different alternatives for the information kept at the meta-index and its corresponding type of search: cases A = HPFs, B = NHPFs, C = HNPFs and D = NHNPFs (details in Section 4.2.5). (a) Small crawling bandwidth and (b) large crawling bandwidth.

4.2.6. Results for the GSTD and NGMO systems

Fig. 11 shows results for GSTD and NGMO with uniform and non-uniform distributions of objects across servers. Unlike GINE, these systems exhibit a stable steady-state behavior, that is, the importance of servers in terms of the number of objects does not change significantly over time.

Fig. 11. Results for the GSTD and NGMO systems. The y-axis shows the average number of servers visited during searches represented as the ratio of meta-index-guided to random-selected servers. The x-axis indicates GU = uniform GSTD, GN = non-uniform GSTD, MU = uniform NGMO and MN = non-uniform NGMO (details in Section 4.2.6). (a) Large crawling bandwidth path following search and (b) large crawling bandwidth no-path following search.

These results show how detrimental systems in which the object distribution is uniform on average are to the performance of STAT. In general terms, when the statistical data collected from servers are not enough to establish a proper ranking of servers at search time, the performance can become as poor as that of the random server selection search algorithm. The trajectory-based meta-index is best suited for path-following searches, whereas the trace-based meta-index shows a more stable behavior for no-path-following searches.

Interestingly, for the statistics-based meta-index, Fig. 11 shows a better performance in the no-path-following type of search. Note also that the trace-based meta-index may contain sequences of object locations (servers) that do not follow real time order. This can lead to unstable behavior in some cases.

For trajectory and trace meta-indexes, efficiency depends crucially on whether or not the object being searched is missing from the meta-index. We have observed that increasing the crawling bandwidth has a greater positive impact on the trace-based than on the trajectory-based meta-index. In TRACE the number of missing objects decreases more drastically. This can be seen in Table 7, which shows the ratio of missing objects at a crawling bandwidth of 8 robots to missing objects at a crawling bandwidth of 4 robots (a dash indicates that no missing objects were registered at either crawling bandwidth). Smaller numbers indicate a larger reduction of missing objects when the crawling bandwidth increases from 4 to 8. This confirms the intuition that the trace-based strategy is able to maintain more objects in the meta-index than the trajectory-based one.

Table 7. Ratio between missing objects at crawling bandwidths with 8 and 4 robots.

| System | Strategy | b1 | b2 | b3 |
|---|---|---|---|---|
| GSTD-uniform | Trace | – | – | – |
| GSTD-uniform | Traj | 0.010 | 0.070 | 0.001 |
| GSTD-non-uniform | Trace | – | – | – |
| GSTD-non-uniform | Traj | 0.120 | 0.030 | 0.001 |
| NGMO-uniform | Trace | 0.001 | – | 0.010 |
| NGMO-uniform | Traj | 0.640 | 0.380 | 0.780 |
| NGMO-non-uniform | Trace | 0.001 | – | 0.001 |
| NGMO-non-uniform | Traj | 0.460 | 0.440 | 0.520 |

Table 8. Approximated queries at a certain time instant in the NGMO system.

| Type | Query | Results |
|---|---|---|
| Top-k | Servers with the largest number of objects | Fig. 13a 1st 10–250 |
| | Servers with the largest geographic extent | Fig. 13a 2nd 10–250 |
| | Spatial nearest neighbors to an object | Fig. 13a 16–64 |
| Location | Location of a given object | Fig. 13b 1st 16–64 |
| Range | Objects in spatial window | Fig. 13b 2nd 16–64 |

Fig. 12. Performance of the trace-stat combination strategy (first set of curves) against the trace strategy (second set of curves) for small crawling bandwidth and path-following search. The y-axis shows the ratio of the number of servers visited using the trace and trace+statistic strategies to the number of servers visited using the random server selection strategy. The x-axis indicates the different workloads used in our experiments.

In Fig. 12 we show the performance of the combination of the strategies TRACE and STAT against the TRACE strategy for all the systems used in our experiments. The results show that for the cases in which there is a large number of missing objects in the trace-based index, the use of statistical information enables a significant reduction in the number of servers visited during searches (this can be observed in the MU and MN systems).
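A small worked example may help to see why a statistical ranking helps for missing objects in non-uniform systems but degrades to random selection in uniform ones. The sketch below assumes a deliberately simple model that is not part of the paper: the target object is equally likely to be any object in the system, so the probability that it sits at a server is proportional to that server's object count, and the counts known to the meta-index are exact. Function names and object counts are hypothetical.

```python
from typing import Sequence


def expected_visits_stat(object_counts: Sequence[float]) -> float:
    """Expected servers visited when probing in decreasing order of object count.

    Assumes the target is at a server with probability count / total.
    """
    total = sum(object_counts)
    ranked = sorted(object_counts, reverse=True)
    return sum(rank * count / total for rank, count in enumerate(ranked, start=1))


def expected_visits_random(num_servers: int) -> float:
    """Expected servers visited when probing servers in uniformly random order."""
    return (num_servers + 1) / 2


uniform = [100, 100, 100, 100]   # hypothetical uniform object distribution
skewed = [400, 50, 30, 20]       # hypothetical non-uniform object distribution

for name, counts in (("uniform", uniform), ("skewed", skewed)):
    ratio = expected_visits_stat(counts) / expected_visits_random(len(counts))
    print(f"{name}: normalized visits = {ratio:.3f}")
```

Under this model the uniform case yields a normalized value of 1.0, that is, no gain over random server selection, whereas the skewed case yields a value well below 1.0.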

4.2.7. Approximated queries

One of the advantages of the proposed STM-index scheme is that it represents a statistical sample of the whole system. The quality of the sample depends on the speed at which the meta-index is updated by robots. In the following, we show experimental results for representative queries that can be solved using only the data stored in the meta-index and compare them with the results obtained by visiting the database servers.

In Table 8 we show the set of queries and their corresponding performance curves in Figs. 13a and b. On the x-axis, the values 10, 50 and 250 represent the k values for the top-k rankings by number of objects and by area. The values 16, 32 and 64 represent the number of robots used by the crawler in an NGMO system with 500 servers and an average of 50,000 moving objects. The results obtained in the other benchmark systems are qualitatively similar.

Fig. 13. Approximated queries on the NGMO system: (a) results for the top-k servers queries and (b) results for range search based on the meta-index alone (MI) and meta-index with access to database servers (MI-DBS) and location search using MI alone.

The first two top-k queries determine the servers with the largest number of objects and the largest geographic extents at a given time instant, respectively. The randomly generated queries retrieve the top 10, 50 and 250 servers. Fig. 13a shows Pearson’s correlation between the answer calculated by using only the data stored in the meta-index (data collected using 16, 32 and 64 crawler robots) and the exact answer calculated by an exhaustive examination of all database servers. Values closer to 1.0 indicate a better approximation to the exact results.
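For reference, Pearson's correlation between an approximated and an exact answer can be computed as in the sketch below. How the two answers are paired is an assumption made here for illustration: each position holds the object count of the same server, once as estimated from the meta-index and once as obtained from the servers. The counts are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical object counts for the same five servers.
exact_counts = [120, 95, 80, 60, 40]      # from exhaustive examination
approx_counts = [110, 100, 70, 65, 35]    # estimated from the meta-index
print(round(pearson(approx_counts, exact_counts), 3))
```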

The third part of Fig. 13a shows results for queries that retrieve the 50 objects that are closest to a given object at a given time instant. The results are shown as the fraction of objects that are part of the exact answer for average speed objects (b1), fast objects (b2) and slow objects (b3), with meta-index data collected using 16, 32 and 64 crawler robots.

The first part of Fig. 13b shows results for range queries using only the meta-index (MI) and the meta-index plus access to the database servers (MI-DBS), both compared against the exact answer obtained by exhaustive traversal of all database servers. The curves show the fraction of objects that are in the exact answer to queries generated at random by producing a spatial window and determining the objects within its area at a given time instant. Note that the MI-DBS case differs from the case in which the system knows the geographic area for which every database server is responsible and goes directly to the necessary servers to solve the range query exactly. In the MI-DBS strategy, those database servers are inferred from the object data stored in the meta-index, and then the result is refined by contacting those servers (this is useful for systems in which the geographic extent can vary dynamically over time). In the case of MI alone, the answer is computed with the objects located in the meta-index only. Fig. 13b shows results for 16, 32 and 64 crawler robots.
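The difference between the MI and MI-DBS strategies can be sketched as follows. The data layout is an assumption made for illustration: the meta-index is modeled as a dictionary that maps each known object to the server and position last recorded for it, and each database server as a dictionary of current object positions. All names and coordinates are hypothetical.

```python
from typing import Dict, Set, Tuple

Point = Tuple[float, float]
Window = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def in_window(p: Point, w: Window) -> bool:
    return w[0] <= p[0] <= w[2] and w[1] <= p[1] <= w[3]


def range_mi(meta_index: Dict[str, Tuple[str, Point]], window: Window) -> Set[str]:
    """MI alone: answer from the (possibly stale) positions in the meta-index."""
    return {obj for obj, (_, pos) in meta_index.items() if in_window(pos, window)}


def range_mi_dbs(meta_index: Dict[str, Tuple[str, Point]],
                 servers: Dict[str, Dict[str, Point]],
                 window: Window) -> Set[str]:
    """MI-DBS: infer candidate servers from meta-index hits, then refine there."""
    candidates = {srv for srv, pos in meta_index.values() if in_window(pos, window)}
    return {obj for srv in candidates
            for obj, pos in servers[srv].items() if in_window(pos, window)}


def fraction_of_exact(answer: Set[str], exact: Set[str]) -> float:
    return len(answer & exact) / len(exact) if exact else 1.0


# Hypothetical system state: the meta-index has a stale position for c and
# knows nothing about b and d.
servers = {"s1": {"a": (1.0, 1.0), "b": (2.0, 2.0)},
           "s2": {"c": (8.0, 8.0)},
           "s3": {"d": (3.0, 3.0)}}
meta_index = {"a": ("s1", (1.0, 1.0)), "c": ("s2", (2.5, 2.5))}
window = (0.0, 0.0, 4.0, 4.0)
exact = {obj for objs in servers.values()
         for obj, pos in objs.items() if in_window(pos, window)}

print(fraction_of_exact(range_mi(meta_index, window), exact))
print(fraction_of_exact(range_mi_dbs(meta_index, servers, window), exact))
```

In this toy state the refinement step removes the stale object c and recovers b, but d is still missed because no meta-index entry points to its server.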

Finally, the second part of Fig. 13b shows results for the object-location queries that are the main focus of this paper, but solved by using only the data stored in the meta-index. The curves show the fraction of queries with correct answers for the three types of object speeds (b1, b2 and b3) and for data collected using 16, 32 and 64 robots.

Overall, the results of Fig. 13 show that there are many cases in which the data stored in the meta-index can be used to provide a very fast and good approximated answer to classical queries for this kind of database system. Users can use this preliminary “hint” to further investigate the system evolution or to refine their subsequent queries.

4.2.8. Quality bounds

In Table 9 we show results for the quality bounds predicted by the method based on the rate of move-out events (Section 3.3.2). These results are for the experiments of Figs. 13a and b. The columns “Frac” show the average fraction of correct results obtained by solving each query using only the data stored in the meta-index, namely approximated queries. This fraction was obtained by comparing with the exact results for each query. The columns “Bound” are a pessimistic prediction of how good (on average) the results obtained by the approximated queries are. This prediction is also calculated by using the data stored in the meta-index, and its value is presented to the user in a system supporting approximated queries on the meta-index.

The first section (top) of Table 9 shows results for queries involving average objects (b1), namely objects selected uniformly at random for nearest neighbors (NN) and location queries. These results show that, on average, the prediction is below and close to the actual measure of quality of results. As expected, a better approximation is obtained when there are more robots, since there is a greater chance of registering more move-out events in the meta-index. However, we detected a few cases (<10%) where move-out events were not registered for the involved servers and thereby no prediction was possible for those queries.

Table 9. Fraction of correct results for queries versus predicted bounds on those results.

| Robots | Top-10 Frac | Top-10 Bound | Range Frac | Range Bound | NN-b1 Frac | NN-b1 Bound | Location-b1 Frac | Location-b1 Bound |
|---|---|---|---|---|---|---|---|---|
| 16 | 0.97 | 0.51 | 0.74 | 0.37 | 0.59 | 0.42 | 0.43 | 0.41 |
| 32 | 0.98 | 0.72 | 0.86 | 0.52 | 0.69 | 0.53 | 0.56 | 0.51 |
| 64 | 0.99 | 0.87 | 0.91 | 0.66 | 0.86 | 0.61 | 0.64 | 0.55 |

| Robots | NN-b2 Frac | NN-b2 Bound | NN-b3 Frac | NN-b3 Bound | Location-b2 Frac | Location-b2 Bound | Location-b3 Frac | Location-b3 Bound |
|---|---|---|---|---|---|---|---|---|
| 16 | 0.15 | 0.23 | 0.96 | 0.60 | 0.16 | 0.37 | 0.82 | 0.69 |
| 32 | 0.32 | 0.32 | 0.98 | 0.74 | 0.33 | 0.39 | 0.95 | 0.84 |
| 64 | 0.48 | 0.35 | 0.99 | 0.81 | 0.48 | 0.40 | 0.99 | 0.89 |

Fig. 14. Scalability experiments: (a) performance of the search strategies (stat, traj, and trace) with respect to randomly selected servers (seq) and (b) number of visits with respect to number of objects and meta-index.

The second section (bottom) of Table 9 shows results for two particular cases for the objects that are part of the queries. In b2, before executing nq queries upon the meta-index, the nq corresponding objects are chosen to be the fastest nq objects in the entire system. In b3 those objects are the slowest nq objects, which represent the most favorable case for the predictions. For the fastest objects (b2) the bounds do not make sense for slow crawling (16 and 32 robots). Nevertheless, for fast crawling they become effective, and this represents a trade-off that should be determined experimentally for each target system. In our system we found that objects with rates of server visits above three times the average rate should be prevented from computing these bounds.
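The observation about fast objects can be checked directly against the b2 columns of Table 9. The sketch below lists the entries in which the predicted bound exceeds the measured fraction, and encodes the three-times-the-average rule as a simple filter. The function names and the `factor` parameter are hypothetical; the (Frac, Bound) pairs are copied from Table 9.

```python
from typing import Dict, List, Tuple

# (Frac, Bound) pairs for the fastest objects (b2), copied from Table 9.
TABLE9_B2: Dict[str, Dict[int, Tuple[float, float]]] = {
    "NN-b2": {16: (0.15, 0.23), 32: (0.32, 0.32), 64: (0.48, 0.35)},
    "Location-b2": {16: (0.16, 0.37), 32: (0.33, 0.39), 64: (0.48, 0.40)},
}


def violated_bounds(table: Dict[str, Dict[int, Tuple[float, float]]]
                    ) -> List[Tuple[str, int]]:
    """Entries where the bound is not a lower bound (Bound > Frac)."""
    return [(query, robots)
            for query, rows in table.items()
            for robots, (frac, bound) in sorted(rows.items())
            if bound > frac]


def may_compute_bound(visit_rate: float, average_rate: float,
                      factor: float = 3.0) -> bool:
    """Rule of thumb from the text: skip objects moving faster than factor x average."""
    return visit_rate <= factor * average_rate


print(violated_bounds(TABLE9_B2))
print(may_compute_bound(visit_rate=2.0, average_rate=1.0))
print(may_compute_bound(visit_rate=3.5, average_rate=1.0))
```

All violations occur at 16 or 32 robots, which is consistent with the remark that the bounds only become effective for fast crawling.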

4.2.9. Scalability experiments

To study the scalability of the search strategies, we evaluated the behavior of the meta-index using GSTD with non-uniform distribution in a case in which the system increases both the number of objects, from 50,000 to 400,000, and the number of servers, from 500 to 4000. In particular, we evaluated the performance of query processing with a crawling bandwidth of 8 robots for queries of type b1 (we observed a very similar behavior for queries b2 and b3).

Fig. 14 shows the results, where the values on the x-axis represent the scale of the system: 50 is equivalent to 50,000 objects and 500 servers, 100 is equivalent to 100,000 objects and 1000 servers, and so on. The y-axis of Fig. 14a indicates the average number of servers visited during searches with each strategy, including values obtained with the random server selection search algorithm (labeled seq in the figure). For the trajectory-based strategy, we consider cases in which the data collection sends 100% and 50% of the available objects (curves labeled traj-100 and traj-50, respectively). We adjusted the number of object ids sent for the trace-based case to fit the same amount of data being transmitted (labels trace-100 and trace-50, respectively). For the statistical strategy (stat) the crawler collects only the average number of objects per type.

The results in Fig. 14a show that the trace- and statistics-based strategies have much better scalability properties than the trajectory-based strategy. In Fig. 14b the y-axis indicates the number of visits to servers during the crawling associated with every strategy. This figure shows that, because a smaller amount of data is transmitted by the crawler, more visits to servers are possible. The trace strategy is able to achieve good performance in terms of scalability with fewer visits.

4.2.10. Comparison against alternative strategies

A natural question is how the proposal of this paper compares against previous approaches to speeding up query processing in large spatio-temporal moving object databases. Basically, previous works propose index strategies, whereas ours is a pair (meta-index, crawler) that can be mounted on top of any existing centralized or distributed index. The crawler collects trace and statistical data from database servers and stores them in the meta-index, which is used to quickly drive user queries to the database servers able to compute the exact answer to queries. In the meantime, the meta-index content can be used to quickly present the user with a fairly good-quality approximated answer to his/her query. Also, previous works do not address location queries and are suitable for Euclidean spaces only, whereas the meta-index algorithms include location, range and aggregation queries for both virtual and Euclidean spaces. Thus previous approaches to indexing are substantially different from ours and cannot be fairly compared against the proposed meta-index under the same design aims and target distributed systems.

Table 10. Comparison against an oracle meta-index with results presented as the ratio of number of servers visited by the oracle meta-index to number of servers visited by the proposed meta-index in order to find the exact answer to queries.

| Robots | Range | Location-b1 | Location-b2 | Location-b3 | Type | NN-queries |
|---|---|---|---|---|---|---|
| 16 | 0.69 | 0.60 | 0.38 | 0.93 | b1 | 0.72 |
| 32 | 0.73 | 0.70 | 0.52 | 0.97 | b2 | 0.66 |
| 64 | 0.73 | 0.75 | 0.64 | 0.98 | b3 | 0.98 |

Top-k queries achieve 0.5. Results for the NGMO system.

Differences can be further evidenced when one tries to use an existing centralized or distributed index for the meta-index role. For instance, the underlying assumption in centralized indexes is that the object position updates are reflected in the data structure instantaneously. That approach is impractical in our setting. We work under the assumption that data takes time to be transferred from one point to another in the network and that database servers are not supposed to communicate with each other; they are just required to provide an interface to service the sporadic visits of the crawler robots. Distributed index strategies must cope with the burden of keeping the consistency (invariants) associated with the spatial data structure upon which they are built, whereas the proposed meta-index does not impose any interconnection topology or communication protocol among database servers.

To establish a baseline comparison for possible future meta-indexing variations, we obtained the results presented in Table 10 (details in the caption). This is a comparison against the best possible strategy for our setting, namely an oracle: a hypothetical system that, upon each user query, is capable of instantaneously getting a snapshot of the whole system as of the query time and determining the database servers capable of producing the exact answer to the query. The results for the different types of queries show that in many cases the proposed meta-index is able to achieve near-optimal performance on average.

5. Concluding remarks and future work

We have proposed a crawler/meta-index strategy for fully distributed systems of moving objects where it is relevant to keep track of past states of objects. The proposal combines statistical with coarse trace data to reduce the average number of servers that must be visited in order to answer different types of spatio-temporal queries. The experimental results obtained by using both an actual very large WWW application and a set of demanding benchmarks show that the proposed strategy solves the different types of queries in an efficient manner. In all cases, searching the meta-index quickly determines the reduced subset of servers that produce the exact answer to queries. The meta-index is particularly suitable for location queries such as “find the location of a given object at a given instant of time”, which we believe is an important type of query that has not been considered in previous work.

The data stored in the meta-index also enables the fast calculation of approximated answers to the most popular spatio-temporal queries. In particular, we have evaluated range queries (time-slice or time-interval) and top-k queries about the servers or regions with a high density of moving objects, as well as closest-neighbor and location queries. For these queries, the query solver obtained approximated answers which, on average, are in many instances very close to the exact answers. We also use the data stored in the meta-index to provide a lower bound on the quality of the predicted answer to queries. These bounds are useful to either (i) quickly present to the user (bound values permitting) a first version of the answer to a query whilst the system proceeds to calculate the exact answer, or (ii) efficiently cope with peaks in user query traffic by responding with approximated answers for bounds signaling good quality, whereas other queries are sent to one or more servers of the subset determined by the meta-index to improve the quality of the approximated answers. When traffic is restored to normal, the process of calculating the exact answer to queries is also restored to normal. This last case is left as future research.

We have also observed that the data stored in the meta-index is highly compressible and the search highly parallelizable. For the first case we have presented a compression method that is able to reduce the space occupied by data to below 38%. This is especially useful for efficiently exploiting the communication bandwidth so that the meta-index can be updated faster. The second case is related to the fact that the data stored in the meta-index can be distributed in disjoint partitions and updated independently by one crawler instance per partition. This allows the efficient parallelization of the searching process. We illustrated the use of this property in the WWW application. In particular, as the crawler and meta-index strategies do not require servers to communicate with each other, it is possible to arbitrarily partition the set of servers. In each partition a scheduler and a set of robots are assigned the task of periodically crawling their respective fraction of the set of servers. Searches start in parallel in each partition of the meta-index and are conducted to the respective subset of servers, where they stop as soon as the exact answer is found in one of the partitions. To ensure good load balance, the servers can be assigned uniformly at random onto partitions. For the WWW application, our results show that by using a moderate number of partitions it is possible to achieve very small searching costs for millions of objects moving across thousands of database servers.
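The partitioning idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the WWW application: servers are assigned to partitions uniformly at random, each partition is searched by a plain scan instead of a meta-index-guided search, and `holds` is a hypothetical callback standing in for a visit to a database server.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Optional


def assign_partitions(servers: List[str], num_partitions: int,
                      rng: random.Random) -> List[List[str]]:
    """Assign every server to a partition chosen uniformly at random."""
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for server in servers:
        partitions[rng.randrange(num_partitions)].append(server)
    return partitions


def search_partition(partition: List[str], obj_id: str,
                     holds: Callable[[str, str], bool]) -> Optional[str]:
    """Search one partition; a real system would use that partition's meta-index."""
    for server in partition:
        if holds(server, obj_id):
            return server
    return None


def parallel_search(partitions: List[List[str]], obj_id: str,
                    holds: Callable[[str, str], bool]) -> Optional[str]:
    """Search all partitions concurrently and return the first hit."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        futures = [pool.submit(search_partition, p, obj_id, holds) for p in partitions]
        for future in as_completed(futures):
            found = future.result()
            if found is not None:
                # The other searches are not cancelled in this sketch; the
                # executor simply waits for them when the block exits.
                return found
    return None


servers = [f"s{i}" for i in range(20)]
partitions = assign_partitions(servers, 4, random.Random(7))
location: Dict[str, str] = {"o1": "s13"}   # hypothetical object placement


def holds(server: str, obj: str) -> bool:
    return location.get(obj) == server


print(parallel_search(partitions, "o1", holds))
```

Because servers never need to communicate with each other, the partitions are independent and each could equally be served by a separate crawler and meta-index instance.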

Acknowledgment

Partially funded by CONICYT, Chile, under Grant FONDECYT 1050944. We wish to thank the anonymous referees, whose constructive comments contributed to significantly improving the quality of our paper. We also wish to thank Prof. Isaac Scherson for useful comments.

References

[1] T. Brinkhoff, A framework for generating network-based moving objects, GeoInformatica 6 (2) (2002) 153–180.

[2] T. Brinkhoff, Network-based generator of moving objects (webpage) http://fh-oow.de/institute/iapg/personen/brinkhoff/generator/, 2005.

[3] V. Prasad Chakka, A. Everspaugh, J.M. Patel, Indexing large trajectory data sets with SETI, in: Conference on Innovative Data Systems Research (CIDR), 2003.

[4] R. Cheng, D.V. Kalashnikov, S. Prabhakar, Querying imprecise data in moving object environments, IEEE Trans. Knowl. Data Eng. 16 (9) (2004) 1112–1127.

[5] L. Do, P. Ram, P. Drew, The need for distributed asynchronous transactions, ACM SIGMOD Rec. 28 (2) (1999) 534–535.

[6] H. Ding, G. Trajcevski, P. Scheuermann, Efficient maintenance of continuous queries for trajectories, GeoInformatica 12 (3) (2007) 255–288.

[7] C. du Mouza, P. Rigaux, Web architectures for scalable moving object servers, in: 10th International Symposium on Advances in Geographic Information Systems, ACM Press, New York, 2002, pp. 17–22.

[8] M. Erwig, R. Hartmut Güting, M. Schneider, M. Vazirgiannis, Spatio-temporal data types: an approach to modeling and querying moving objects in databases, GeoInformatica 3 (3) (1999) 269–296.

[9] L. Forlizzi, R. Hartmut Güting, E. Nardelli, M. Schneider, A data model and data structures for moving objects databases, in: SIGMOD Conference, ACM, New York, 2000, pp. 319–330.

[10] B. Gedik, L. Liu, Mobieyes: distributed processing of continuously moving queries on moving objects in a mobile system, EDBT, Lecture Notes in Computer Science, vol. 2992, Springer, Berlin, 2004, pp. 67–87.

[11] G.A. Gutiérrez, G. Navarro, A. Rodríguez, A.F. González, J. Orellana, A spatio-temporal access method based on snapshots and events, in: GIS, ACM, New York, 2005, pp. 115–124.

[12] R. Hartmut Güting, M.H. Böhlen, M. Erwig, C.S. Jensen, N.A. Lorentzos, E. Nardelli, M. Schneider, J.R. Rios Viqueira, Spatio-temporal models and languages: an approach based on data types, Spatio-Temporal Databases: The CHOROCHRONOS Approach, Lecture Notes in Computer Science, vol. 2520, Springer, Berlin, 2003, pp. 117–176.

[13] A. Guttman, R-Trees: a dynamic index structure for spatial searching, in: ACM SIGMOD Conference on Management of Data, ACM, New York, 1984, pp. 47–57.

[14] C. Hernández, M. Andrea Rodríguez, M. Marín, Complex queries for moving object databases in DHT-based systems, 14th European International Conference on Parallel Processing (Euro-Par), Lecture Notes in Computer Science, vol. 5168, Springer, Berlin, 2008, pp. 424–433.

[15] C. Hernández, M. Andrea Rodríguez, M. Marín, A p2p meta-index for spatio-temporal moving object databases, DASFAA, Lecture Notes in Computer Science, vol. 4947, Springer, Berlin, 2008, pp. 653–660.

[16] G. Iwerks, H. Samet, K. Smith, Maintenance of k-NN and spatial join queries on continuously moving points, ACM Transactions on Database Systems 31 (2) (2006) 485–536.

[17] K.H. (Kane) Kim, Object structures for real-time systems and simulators, IEEE Comput. 30 (8) (1997) 62–70.

[18] H. Lee, J. Hwang, J. Lee, S. Park, C. Lee, Y. Nah, S. Jeon, M.h. Kim, Long-term location data management for distributed moving object databases, in: 9th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing, IEEE-CS, 2006, pp. 451–458.

[19] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, C. Olston, Finding (recently) frequent items in distributed data streams, in: ICDE, IEEE-CS, 2005, pp. 767–778.

[20] M. Marin, A. Rodriguez, T. Fincke, C. Roman, Searching moving objects in a spatio-temporal distributed database servers system, OTM Conferences (2), Lecture Notes in Computer Science, vol. 4276, Springer, Berlin, 2006, pp. 1388–1401.

[21] A. Meka, A.K. Singh, DIST: A distributed spatio-temporal index structure for sensor networks, in: CIKM, ACM, New York, 2005, pp. 139–146.

[22] A. Mondal, Y. Lifu, M. Kitsuregawa, P2PR-tree: an R-tree-based spatial index for peer-to-peer environments, EDBT Workshops, Lecture Notes in Computer Science, vol. 3268, Springer, Berlin, 2004, pp. 516–525.

[23] Y. Nah, J. Lee, W.J. Lee, H. Lee, M.h. Kim, K.-J. Han, Distributed scalable location data management system based on the galis architecture, in: 10th IEEE International Workshop on Object-Oriented Real Time Dependable Systems, IEEE-CS, 2005, pp. 397–404.

[24] M.A. Nascimento, J.R.O. Silva, Y. Theodoridis, Access structures for moving points, Technical Report TR–33, TIME CENTER, 1998.

[25] M.A. Nascimento, J.R.O. Silva, Y. Theodoridis, Evaluation of access structures for discretely moving points, in: Spatio-Temporal Database Management, 1999, pp. 171–188.

[26] C. Olston, J. Widom, Efficient monitoring and querying of distributed, dynamic data via approximate replication, IEEE Data Eng. Bull. 28 (1) (2005) 11–18.

[27] D. Papadias, Y. Tao, P. Kalnis, J. Zhang, Indexing spatio-temporal data warehouses, in: ICDE, 2002, pp. 463–472.

[28] A. Di Pasquale, L. Forlizzi, C.S. Jensen, Y. Manolopoulos, E. Nardelli, D. Pfoser, G. Proietti, S. Šaltenis, Y. Theodoridis, T. Tzouramanis, M. Vassilakopoulos, Access methods and query processing techniques, Spatio-Temporal Databases: The CHOROCHRONOS Approach, Lecture Notes in Computer Science, vol. 2520, Springer, Berlin, 2003, pp. 117–176.

[29] D. Pfoser, Indexing the trajectories of moving objects, IEEE Data Eng. Bull. 25 (2) (2002) 3–9.

[30] D. Pfoser, C.S. Jensen, Y. Theodoridis, Novel approaches in query processing for moving object trajectories, VLDB J. (2000) 395–406.

[31] D. Pfoser, C.S. Jensen, Y. Theodoridis, Novel approaches to indexing of moving object trajectories, VLDB J. (2000) 395–406.

[32] S. Prabhakar, Y. Xia, D. Kalashnikov, W.G. Aref, S. Hambrusch, Query indexing and velocity constrained indexing: scalable technique for continuous queries on moving objects, IEEE Trans. Comput. 51 (10) (2002) 1124–1140.

[33] N. Roussopoulos, S. Kelley, F. Vincent, Nearest neighbor queries, in: SIGMOD Conference, ACM Press, New York, 1995, pp. 71–79.

[34] B. Schnitzer, S.T. Leutenegger, Master-client R-trees: a new parallel R-Tree architecture, in: SSDBM, 1999, pp. 68–77.

[35] C. Shahabi, M.R. Kolahdouzan, S. Thakkar, J. Luis Ambite, C.A. Knoblock, Efficiently querying moving objects with pre-defined paths in a distributed environment, in: ACM-GIS, ACM Press, New York, 2001, pp. 34–40.

[36] S. Spaccapietra, Editorial: spatio-temporal data models and languages, GeoInformatica 5 (1) (2001) 5–9.

[37] E. Tanin, A. Harwood, H. Samet, Using a distributed quadtree index in peer-to-peer networks, VLDB J. 16 (2) (2007) 165–178.

[38] Y. Tao, R. Cheng, X. Xiao, W. Kay Ngai, B. Kao, S. Prabhakar, Indexing multi-dimensional uncertain data with arbitrary probability density functions, in: VLDB, ACM, New York, 2005, pp. 922–933.

[39] Y. Tao, X. Xiao, R. Cheng, Range search on multidimensional uncertain data, ACM Transactions on Database Systems 32 (3) (2007) 15.

[40] Y. Tao, D. Papadias, Efficient historical R-Tree, in: SSDBM, 2001, pp. 223–232.

[41] Y. Tao, D. Papadias, MV3R-tree: a spatio-temporal access method for timestamp and interval queries, in: VLDB, 2001, pp. 431–440.

[42] Y. Tao, D. Papadias, MV3R-tree: a spatio-temporal access method for timestamp and interval queries, VLDB J. (2001) 431–440.

[43] Y. Tao, D. Papadias, Historical spatio-temporal aggregation, ACM Trans. Inf. Syst. 23 (1) (2005) 61–102.

[44] Y. Theodoridis, M.A. Nascimento, Generating spatiotemporal datasets on the WWW, SIGMOD Rec. 29 (3) (2000) 39–43.

[45] Y. Theodoridis, J.R.O. Silva, M.A. Nascimento, On the generation of spatiotemporal datasets, SSD, Lecture Notes in Computer Science, vol. 1651, Springer, Berlin, 1999.

[46] Y. Theodoridis, M. Vazirgiannis, T.K. Sellis, Spatio-temporal indexing for large multimedia applications, in: ICMCS, IEEE-CS, 1996, pp. 441–448.

[47] G. Trajcevski, H. Ding, P. Scheuermann, I.F. Cruz, Bora: routing and aggregation for distributed processing of spatio-temporal range queries, in: MDM, IEEE-CS, 2007, pp. 36–43.

[48] G. Trajcevski, O. Wolfson, F. Zhang, S. Chamberlain, The geometry of uncertainty in moving objects databases, EDBT, Lecture Notes in Computer Science, vol. 2287, Springer, Berlin, 2002, pp. 233–250.

[49] Y. Xia, S. Prabhakar, Efficient CNG indexing in location-aware services, in: 23rd International Conference on Distributed Computing Systems Workshops, IEEE Press, New York, 2003, p. 414.

[50] H. Yan, S. Ding, T. Suel, Inverted index compression and query processing with optimized document ordering, in: 18th International Conference on World Wide Web, ACM, New York, April 2009, pp. 401–410.