EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
SUBRAMANIAN ARUMUGAM
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Subramanian Arumugam
To my parents.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor, Chris Jermaine. This dissertation would
not have been possible without his excellent mentoring and guidance through the years.
Chris is a terrific teacher, a critical thinker, and a passionate researcher. He has served
as a great role model and has helped me mature as a researcher. I cannot thank him
enough for that.
My thanks also go to Prof. Alin Dobra. Through the years, Alin has been a patient
listener and has helped me structure and refine my ideas countless times. His excitement
for research is contagious!
I would like to take this opportunity to mention my colleagues at the database
center: Amit, Florin, Fei, Luis, Mingxi and Ravi. I have had many hours of fun discussing
interesting problems with them. Special thanks go to my friends Manas, Srijit, Arun,
Shantanu, and Seema for making my stay in Gainesville all the more enjoyable.
Finally, I would like to thank my parents for being a source of constant support and
encouragement throughout my studies.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Research Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    1.2.1 Data Modeling and Database Design . . . . . . . . . . . . . . . . . 15
    1.2.2 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.4 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    1.2.5 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
    1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories . . . 18
    1.3.2 Entity Resolution in Spatiotemporal Databases . . . . . . . . . . . . 19
    1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases . . . 19
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Spatiotemporal Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Probabilistic Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES . 25
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.1 Moving Object Trajectories . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.2 Closest Point of Approach (CPA) Problem . . . . . . . . . . . . . . 28
3.3 Join Using Indexing Structures . . . . . . . . . . . . . . . . . . . . . . . 30
    3.3.1 Trajectory Index Structures . . . . . . . . . . . . . . . . . . . . . 31
    3.3.2 R-tree Based CPA Join . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Join Using Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . 36
    3.4.1 Basic CPA Join using Plane-Sweeping . . . . . . . . . . . . . . . . 36
    3.4.2 Problem With The Basic Approach . . . . . . . . . . . . . . . . . . 37
    3.4.3 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Adaptive Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.2 Cost Associated With a Given Granularity . . . . . . . . . . . . . . 41
    3.5.3 The Basic Adaptive Plane-Sweep . . . . . . . . . . . . . . . . . . . 41
    3.5.4 Estimating Costα . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.5.5 Determining The Best Cost . . . . . . . . . . . . . . . . . . . . . . 44
    3.5.6 Speeding Up the Estimation . . . . . . . . . . . . . . . . . . . . . 46
    3.5.7 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.1 Test Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.2 Methodology and Results . . . . . . . . . . . . . . . . . . . . . . . 50
    3.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES . . . . . . . . 58
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Outline of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
    4.3.1 PDF for Restricted Motion . . . . . . . . . . . . . . . . . . . . . . 64
    4.3.2 PDF for Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . 65
4.4 Learning the Restricted Model . . . . . . . . . . . . . . . . . . . . . . . 66
    4.4.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 67
    4.4.2 Learning K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Learning Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . . . . 71
    4.5.1 Applying a Particle Filter . . . . . . . . . . . . . . . . . . . . . . 72
    4.5.2 Handling Multiple Objects . . . . . . . . . . . . . . . . . . . . . . 73
    4.5.3 Update Strategy for a Sample given Multiple Objects . . . . . . . . 75
    4.5.4 Speeding Things Up . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS . . 84
5.1 Problem and Background . . . . . . . . . . . . . . . . . . . . . . . . . . 86
    5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.2 The False Positive Problem . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.3 The False Negative Problem . . . . . . . . . . . . . . . . . . . . . 90
5.2 The Sequential Probability Ratio Test (SPRT) . . . . . . . . . . . . . . . 91
5.3 The End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
    5.3.1 What’s Wrong With the SPRT? . . . . . . . . . . . . . . . . . . . . 95
    5.3.2 Removing the Magic Epsilon . . . . . . . . . . . . . . . . . . . . . 96
    5.3.3 The End-Biased Algorithm . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Indexing the End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.2 Building the Index . . . . . . . . . . . . . . . . . . . . . . . . . . 104
    5.4.3 Processing Queries . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
4-1 Varying the number of objects and its effect on recall, precision and runtime. . . 80
4-2 Varying the number of time ticks. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-3 Varying the number of sensors fired. . . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Varying the standard deviation of the Gaussian cloud. . . . . . . . . . . . . . . 80
4-5 Varying the number of time ticks where EM is applied. . . . . . . . . . . . . . . 81
5-1 Running times over varying database sizes. . . . . . . . . . . . . . . . . . . . . . 109
5-2 Running times over varying query sizes. . . . . . . . . . . . . . . . . . . . . . . 109
5-3 Running times over varying object standard deviations. . . . . . . . . . . . . . . 109
5-4 Running times over varying confidence levels. . . . . . . . . . . . . . . . . . . . 109
LIST OF FIGURES
Figure page
3-1 Trajectory of an object (a) and its polyline approximation (b) . . . . . . . . . . 28
3-2 Closest Point of Approach Illustration . . . . . . . . . . . . . . . . . . . . . . . 29
3-3 CPA Illustration with trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-4 Example of an R-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-5 Heuristic to speed up distance computation . . . . . . . . . . . . . . . . . . . . 34
3-6 Issues with R-trees: fast-moving object p joins with everyone . . . . . . . . . . 35
3-7 Progression of plane-sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-8 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3-9 Problem with using large granularities for bounding box approximation . . . . . 40
3-10 Adaptively varying the granularity . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-11 Convexity of cost function illustration. . . . . . . . . . . . . . . . . . . . . . . . 45
3-12 Iteratively evaluating k cut points . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-13 Speeding up the Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-14 Injection data set at time tick 2,650 . . . . . . . . . . . . . . . . . . . . . . . . . 49
3-15 Collision data set at time tick 1,500 . . . . . . . . . . . . . . . . . . . . . . . . . 50
3-16 Injection data set experimental results . . . . . . . . . . . . . . . . . . . . . . 51
3-17 Collision data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 52
3-18 Buffer size choices for Injection data set . . . . . . . . . . . . . . . . . . . . . . 53
3-19 Buffer size choices for Collision data set . . . . . . . . . . . . . . . . . . . . . . 53
3-20 Synthetic data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 54
3-21 Buffer size choices for Synthetic data set . . . . . . . . . . . . . . . . . . . . . . 56
4-1 Mapping of a set of observations for linear motion . . . . . . . . . . . . . . . . . 60
4-2 Object path (a) and quadratic fit for varying time ticks (b-d) . . . . . . . . . . . 62
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b) 64
4-4 The baseline input set (10,000 observations) . . . . . . . . . . . . . . . . . . . . 79
4-5 The learned trajectories for the data of Figure 4-4 . . . . . . . . . . . . . . . . . 79
5-1 The SPRT in action. The middle line is the LRT statistic . . . . . . . . . . . . . 92
5-2 Two spatial queries over a database of objects with Gaussian uncertainty . . . . 97
5-3 The sequence of SPRTs run by the end-biased test . . . . . . . . . . . . . . . . 98
5-4 Building the MBRs used to index the samples from the end-biased test. . . . . . 104
5-5 Using the index to speed the end-biased test . . . . . . . . . . . . . . . . . . . . 106
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
Subramanian Arumugam
August 2008
Chair: Christopher Jermaine
Major: Computer Engineering
This work focuses on interesting data management problems that arise in the analysis,
modeling, and querying of large-scale spatiotemporal data. Such data naturally arise in the
context of many scientific and engineering applications that deal with physical processes
that evolve over time.
We first focus on the issue of scalable query processing in spatiotemporal databases.
In many applications that produce a large amount of data describing the paths of moving
objects, there is a need to ask questions about the interaction of objects over a long
recorded history. To aid such analysis, we consider the problem of computing joins over
moving object histories. The particular join studied is the “Closest-Point-Of-Approach”
join, which asks: Given a massive moving object history, which objects approached within
a distance d of one another?
Next, we study a novel variation of the classic entity resolution problem that
appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our goal is
to perform an accurate segmentation of all of the observations into sets, where each set is
associated with one object. Each set should also be annotated with the path of the object
through the area.
Finally, we consider the problem of answering selection queries in a spatiotemporal
database, in the presence of uncertainty incorporated through a probabilistic model.
We propose very general algorithms that can be used to estimate the probability that a
selection predicate evaluates to “true” over a probabilistic attribute or attributes, where
the attributes are supplied only in the form of a pseudo-random attribute value generator.
This enables the efficient evaluation of queries such as “Find all vehicles that are in close
proximity to one another with probability p at time t” using Monte Carlo statistical
methods.
CHAPTER 1
INTRODUCTION
This study is a step towards addressing some of the many issues faced in extending
current database technology to handle spatiotemporal data in a seamless and efficient
manner. This chapter motivates spatiotemporal data management and introduces
the reader to the main research issues. This is followed by a summary of the key
contributions.
1.1 Motivation
The last few years have seen significant interest in extending databases to support
spatiotemporal data as evidenced by the growing number of books, workshops, and
conferences devoted to this topic [1–4]. The advent of computational science and the
increasing use of wireless technology, sensors, and devices such as GPS has resulted in
numerous potential sources of spatiotemporal data. Large volumes of spatiotemporal data
are produced by many scientific, engineering, and business applications that track and
monitor moving objects. “Moving objects” may be people, vehicles, wildlife, products
in transit, or weather systems. Such applications often arise in the context of traffic
surveillance and monitoring, land use management in GIS, simulation in astrophysics,
climate monitoring in earth sciences, fleet management, multimedia animation, etc. The
increasing importance of spatiotemporal data can be attributed to the improved reliability
of tracking devices and their low cost, which has reduced the acquisition barrier for such
data. Tracking devices have been adopted in varying degrees in a number of scientific
and enterprise application domains. For instance, vehicles increasingly come equipped
with GPS devices which enable location-based services [3]. Sensors play an increasingly
important role in surveillance and monitoring of physical spaces [5]. Enterprises such
as Walmart, Target and organizations like the Department of Defense (DoD) plan to
track products in their supply chain through use of smart Radio Frequency Identification
(RFID) labels [6].
Extending modern database systems to support spatiotemporal data is challenging for
several reasons:
• Conventional databases are designed to manage static data, whereas spatiotemporal data describe spatial geometries that change continuously with time. This requires a unified approach to deal with aspects of spatiality and temporality.
• Current databases are designed to manage data that is precise. However, uncertainty is often an inherent property of spatiotemporal data due to the discretization of continuous movement and measurement errors. The fact that most spatiotemporal data sources (particularly polling and sampling-based schemes) provide only a discrete snapshot of continuous movement poses new problems for query processing. For example, consider a conventional database record that stores the fact “John Smith earns $200,000” and a spatiotemporal record that stores the fact “John Smith walks from point A to point B” in the form of a discretized ordered pair (A,B). In the former case, a query such as “What is the salary of John Smith?” involves dealing with precise data. On the other hand, a spatiotemporal query such as “Did John Smith walk through point C between A and B?” requires dealing with information that is often not known with certainty. Further compounding the problem is that even the recorded observations are only accurate to within a few decimal places. Thus, even queries such as “Identify all objects located at point A” may not return meaningful results unless allowed a certain margin for error.
• Due to the presence of the time dimension, spatiotemporal applications have the potential to produce a large amount of data. The sheer volume of data generated by spatiotemporal applications presents a computational and data management challenge. For instance, it is not uncommon for many scientific processes to produce spatiotemporal data on the order of terabytes or even petabytes [7]. Developing scalable algorithms to support query processing over tera- and petabyte-sized spatiotemporal data sets is a significant challenge.
• The semantics of many basic operations in a database changes in the presence of space and time. For instance, basic operations like joins typically employ equality predicates in a classic relational database, whereas equality is rare between two arbitrary spatiotemporal objects.
1.2 Research Landscape
Over the last decade, database researchers have begun to respond to the challenges
posed by spatiotemporal data. Most of the research effort is concentrated on supporting
either predictive or historical queries. Within this taxonomy, we can further distinguish
work based on whether it supports time-instance or time-interval queries.
In predictive queries, the focus is on the future position of the objects and only a
limited time window of the object positions needs to be maintained. On the other hand,
for historical queries, the interest is on efficient retrieval of past history and thus the
database needs to maintain the complete timeline of an object’s past locations. Due to
these divergent requirements, techniques developed for predictive queries are often not
suitable for historical queries.
What follows is a brief tour of the major research areas in spatiotemporal data
management. For a more complete treatment of this topic, the interested reader is
referred to [1, 3].
1.2.1 Data Modeling and Database Design
Early research focused on aspects of data modeling and database design for
spatiotemporal data [8]. Conventional data types employed in existing databases are
often not suitable for representing spatiotemporal data, which describe continuous time-varying
spatial geometries. Thus, there is a need for a spatiotemporal type system that can model
continuously moving data. Depending on whether the underlying spatial object has an
extent or not, abstractions have been developed to model a moving point, line, and region
in two- and three-dimensional space with time considered as the additional dimension
[8–11]. Similarly, early work has also focused on refining existing CASE tools to aid in the
design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and
UML present a non-temporal view of the world, and extensions to incorporate temporal
and spatial awareness have been investigated [12, 13].
Recently there has been interest in designing flexible type systems that can model
aspects of uncertainty associated with an object’s spatial location [14]. There has also
been active effort towards designing SQL language extensions for spatiotemporal data
types and operations [15].
1.2.2 Access Methods
Efficient processing of spatiotemporal queries requires developing new techniques
for query evaluation, providing suitable access structures and storage mechanisms, and
designing efficient algorithms for the implementation of spatiotemporal operators.
Developing efficient access structures for spatiotemporal databases is an important
area of research. A variety of spatiotemporal index structures have been developed to
support both predictive and historical selection queries, most based on generalizations
of the R-tree [16] that incorporate the time dimension. Indexing structures
designed to support predictive queries typically manage object movement within a small
time window and need to handle frequent updates to object locations. A popular choice
for such applications is the TPR-tree [17] and its many variants.
On the other hand, index structures designed to support historical queries need to
manage an object’s entire past movement trajectory (for this reason they can be viewed as
trajectory indexes). Depending on the time interval indexed, the sheer volume of data that
needs to be managed present significant technical challenges for overlap-allowing indexing
schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based
solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing
structures have been developed such as STRIPES [19], 3D R-trees [20], TB trees [21] and
linear quad trees [22]. Further, spatiotemporal extensions of several popular queries such
as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.
1.2.3 Query Processing
The development of efficient index structures has also led to a growing body of
research on different types of queries on spatiotemporal data, such as time-instant and
range queries [26–28], continuous queries, joins [29, 30], and their efficient evaluation
[31, 32]. In the same vein, there has also been some preliminary work on optimizing
spatiotemporal selection queries [33, 34].
Much of the work focuses specifically on indexing two-dimensional space and/or
supporting time-instance or short time-interval selection queries. Thus many indexing
structures often do not scale well for higher-dimensional spaces and have difficulty with
queries over long time intervals. Finally, historical data collections may be huge, and joins
over such data require new solutions, since the predicates involved are non-traditional (such as
closest point of approach, within, and sometimes-possibly-inside).
1.2.4 Data Analysis
Spatiotemporal data analysis allows us to obtain interesting insights from the stored
data collection. For instance:
• In a road network database, the history of movement of various objects can be used to understand traffic patterns.
• In aviation, the flight paths of various planes can be used in future path planning and in computing minimum separation constraints to avoid collisions.
• In wildlife management, one can understand animal migration patterns from the trajectories the animals trace.
• Pollutants can be traced to their source by studying the air flow patterns of aerosols stored as trajectories.
Research in this area focuses on extending traditional data mining techniques to the
analysis of large spatiotemporal data sets. Topics of interest include discovering similarities
among object trajectories [35], data classification and generalization [36], trajectory
clustering and rule mining [37–39], and supporting interactive visualization for browsing
large spatiotemporal collections [40].
1.2.5 Data Warehousing
Supporting data analysis also requires designing and maintaining large collections of
historical spatiotemporal data, which falls under the domain of data warehousing.
Conventional data warehouses are often designed around the goal of supporting
aggregate queries efficiently. However, the interesting queries in a spatiotemporal data
warehouse seek to discover the interaction patterns of moving objects and understand the
spatial and/or temporal relationships that exist between them. Facilitating such queries
in a scalable fashion over terabyte-sized spatiotemporal data warehouses is a significant
challenge. This requires extending traditional data mining techniques to the analysis
of large spatiotemporal data sets to discover spatial and temporal relationships, which
might exist at various levels of granularity involving complex data types. Research in
spatiotemporal data warehousing [41, 42] is relatively new and is focused on refining
existing multidimensional models to support continuous data and defining semantics for
spatiotemporal aggregation [43, 44].
1.3 Main Contributions
It is clear that extending modern database systems to support data management
and analysis of spatiotemporal data requires addressing issues that span almost the entire
breadth of database research. A full treatment of the various issues can be the subject of
numerous dissertations! To keep the scope of this dissertation manageable, I tackle three
important problems in spatiotemporal data management. The dissertation focuses on
data produced by moving objects, since “moving object” databases represent the most
common application domain for spatiotemporal databases [1]. The three specific problems
considered are described briefly in the following subsections.
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories
I first consider the scalability problem in computing joins over massive moving
object histories. In applications that produce a large amount of data describing the
paths of moving objects, there is a need to ask questions about the interaction of
objects over a long recorded history. This problem is becoming especially important
given the emergence of computational, simulation-based science (where simulations
of natural phenomena naturally produce massive databases containing data with
spatial and temporal characteristics), and the increased prevalence of tracking and
positioning devices such as RFID and GPS. The particular join that I study is the “CPA”
(Closest-Point-Of-Approach) join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another? I carefully consider several
obvious strategies for computing the answer to such a join, and then propose a novel,
adaptive join algorithm which naturally alters the way in which it computes the join in
response to the characteristics of the underlying data. A performance study over two
physics-based simulation data sets and a third, synthetic data set validates the utility of
my approach.
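To make the CPA predicate concrete: for a pair of objects that each move linearly over a short segment of time, the time and distance of closest approach have a closed form. The sketch below is illustrative only; the function and variable names are my own, not the implementation benchmarked in Chapter 3.

```python
def cpa_distance(p0, vp, q0, vq, t0=0.0, t1=1.0):
    """Minimum distance between two linearly moving points over [t0, t1].

    Each object follows x(t) = x0 + t * v on this segment; evaluating this
    primitive for every candidate pair is the core step of a CPA join.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w0 = [a - b for a, b in zip(p0, q0)]   # relative position at t = 0
    dv = [a - b for a, b in zip(vp, vq)]   # relative velocity
    denom = dot(dv, dv)
    # Parallel motion (or both stationary): the separation never changes.
    t_cpa = t0 if denom == 0.0 else -dot(w0, dv) / denom
    t_cpa = min(max(t_cpa, t0), t1)        # clamp to the segment's lifetime
    gap = [w + t_cpa * d for w, d in zip(w0, dv)]
    return t_cpa, dot(gap, gap) ** 0.5

# Two objects converging head-on along the x-axis meet at t = 5, so this
# pair satisfies the "within distance d" CPA predicate for any d >= 0:
t_min, d_min = cpa_distance([0, 0, 0], [1, 0, 0],
                            [10, 0, 0], [-1, 0, 0], t0=0.0, t1=10.0)
```

The join algorithms of Chapter 3 differ in how they prune pairs before this per-pair test, not in the test itself.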
1.3.2 Entity Resolution in Spatiotemporal Databases
Next, I consider the problem of entity resolution for a large database of spatiotemporal
sensor observations. The following scenario is assumed. At each time-tick, one or more of
a large number of sensors report back that they have sensed activity at or near a specific
spatial location. For example, a magnetic sensor may report that a large metal object has
passed by. The goal is to partition the sensor observations into a number of subsets so
that it is likely that all of the observations in a single subset are associated with the same
entity, or physical object. For example, all of the sensor observations in one partition may
correspond to a single vehicle driving across the area that is monitored. The dissertation
describes a two-phase, learning-based approach to solving this problem. In the first phase,
a quadratic motion model is used to produce an initial classification that is valid for a
short portion of the timeline. In the second phase, Bayesian methods are used to learn the
long-term, unrestricted motion of the underlying objects.
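The intuition behind the first phase, that observations belonging to one object should be well explained by a single smooth motion model over a short stretch of the timeline, can be illustrated with an ordinary least-squares quadratic fit. This is only a sketch of the idea; the names are illustrative, and the actual procedure is the EM-based algorithm developed in Chapter 4.

```python
import numpy as np

def fit_quadratic_path(times, positions):
    """Least-squares quadratic motion model x(t) = a + b*t + c*t^2,
    fit independently per spatial dimension.

    Returns the 3 x dims coefficient matrix and the RMS residual; a small
    residual suggests the observations are consistent with one object
    moving smoothly over a short portion of the timeline.
    """
    t = np.asarray(times, float)
    X = np.asarray(positions, float)       # shape (n_obs, dims)
    A = np.vander(t, 3, increasing=True)   # design matrix: columns 1, t, t^2
    coeffs, *_ = np.linalg.lstsq(A, X, rcond=None)
    rms = float(np.sqrt(np.mean((A @ coeffs - X) ** 2)))
    return coeffs, rms

# Observations sampled from a genuinely parabolic 2-D path fit almost
# exactly, so they would be classified as belonging to a single object:
t = np.linspace(0, 4, 9)
path = np.stack([2 + 3 * t, 1 + 0.5 * t ** 2], axis=1)
_, rms_good = fit_quadratic_path(t, path)
```

In the actual first phase, such models are fit softly to weighted subsets of the observations, with the weights and model parameters re-estimated iteratively.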
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases
Finally, I consider the problem of answering selection queries in the presence of
uncertainty incorporated through a probabilistic model. One way to facilitate the
representation of uncertainty in a spatiotemporal database is by allowing tuples to
have “probabilistic attributes” whose actual values are unknown, but are assumed
to be selected by sampling from a specified distribution. This can be supported by
including a few, pre-specified, common distributions in the database system when it is
shipped. However, to be truly general and extensible and support distributions that
cannot be represented explicitly or even integrated, it is necessary to provide an interface
that allows the user to specify arbitrary distributions by implementing a function that
produces pseudo-random samples from the desired distribution. Allowing a user to specify
uncertainty via arbitrary sampling functions creates several interesting technical challenges
during query evaluation. Specifically, evaluating time-instance selection queries such as
“Find all vehicles that are in close proximity to one another with probability p at time
t” requires the principled use of Monte Carlo statistical methods to determine whether
the query predicate holds. To support such queries, the thesis describes new methods
that draw heavily on the relevant statistical theory of sequential estimation. I also
consider the problem of indexing for the Monte Carlo algorithms, so that samples from the
pseudo-random attribute value generator can be pre-computed and stored in a structure in
order to answer subsequent queries quickly.
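The interface such a user-supplied generator must support, and the naive fixed-sample-size Monte Carlo check it enables, can be sketched as follows. All names here, and the Gaussian vehicle model, are illustrative assumptions; the sequential methods of Chapter 5 reach a decision with far fewer calls to the generator than this fixed-n estimator.

```python
import random

def predicate_probability(sample_fn, predicate, p, n=10_000, seed=0):
    """Naive fixed-n Monte Carlo check of whether Pr[predicate] >= p.

    sample_fn draws one realization of the probabilistic attribute(s);
    predicate tests the selection condition on that realization.
    """
    rng = random.Random(seed)
    hits = sum(predicate(sample_fn(rng)) for _ in range(n))
    return hits / n >= p

# Hypothetical vehicle whose reported position is Gaussian around (0, 0):
def vehicle_pos(rng):
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

# "Is the vehicle within distance 2 of the origin with probability 0.5?"
# True: roughly 86% of the 2-D standard Gaussian mass lies inside radius 2.
near_origin = lambda pos: pos[0] ** 2 + pos[1] ** 2 <= 4.0
result = predicate_probability(vehicle_pos, near_origin, p=0.5)
```

The difficulty the dissertation addresses is deciding, with statistical guarantees, when enough samples have been drawn, which a fixed n cannot do.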
Organization. The rest of this study is organized as follows. Chapter 2 provides
a survey of work related to the problems addressed in this thesis. Chapter 3 tackles the
scalability issue when processing join queries over massive spatiotemporal databases.
Chapter 4 describes an approach to handling the entity-resolution problem in cleaning
spatiotemporal data sources. Chapter 5 describes a simple and general approach to
answering selection queries over probabilistic spatiotemporal databases, that is,
spatiotemporal databases that incorporate uncertainty within a probabilistic model
framework. Chapter 6 concludes the dissertation by summarizing the contributions and
identifying potential directions for future work.
CHAPTER 2
BACKGROUND
This chapter provides a survey of literature related to the three problems addressed in
this dissertation.
2.1 Spatiotemporal Join
Though research in spatiotemporal joins is relatively new, the closely related problem
of processing joins over spatial objects has been extensively studied. The classical paper
in spatial joins is due to Brinkhoff et al. [45]. Their approach assumes the existence
of a hierarchical spatial index, such as an R-tree [16], on the underlying relations. The
join Brinkhoff et al. propose makes use of a carefully synchronized depth-first traversal of the
underlying indices to narrow down the candidate pairs. A breadth-first strategy with
several additional optimizations is considered by Huang et al. [46]. Lo and Ravishankar
[47] explore a non-index based approach to processing a spatial join. They consider how
to extend the traditional hash join algorithm to the spatial join problem and propose a
strategy based on a partitioning of the database objects into extent mapping hash buckets.
A similar idea, referred to as the partition-based spatial merge (PBSM), is considered
by Patel et al. [48]. Instead of partitioning the input data objects, they consider a grid
partitioning of the data space on to which objects are mapped. This idea is further
extended by Arge et al. [49], who propose a dynamic partitioning of the input space
into vertical strips. Their strategy avoids the data spill problem encountered by previous
approaches since the strips can be constructed such that they fit within the available main
memory.
A common theme among existing approaches is their use of the plane-sweep [50] as a
fast pruning technique. In the case of index-based algorithms, plane-sweep is used to filter
candidate node pairs enumerated from the traversal. Non-index-based algorithms make
use of the plane-sweep to construct candidate sets over partitions.
To our knowledge, the only prior work on spatiotemporal joins is due to Jeong et al.
[51]. However, they only consider spatiotemporal join techniques that are straightforward
extensions to traditional spatial join algorithms. Further, they limit their scope to
index-based algorithms for objects over limited time windows.
2.2 Entity Resolution
Research in entity resolution has a long history in databases [52–55] and has focused
mainly on integrating non-geometric, string-based data from noisy external sources. Closely
related to the work in this thesis is the large body of work on target tracking that exists
in fields as diverse as signal processing, robotics, and computer vision. The goal in target
tracking [56, 57] is to support the real-time monitoring and tracking of a set of moving
objects from noisy observations.
Various algorithms to classify observations among objects can be found in the
target tracking literature. They characterize the problem as one of data association (i.e.
associating observations with corresponding targets). A brief summary of the main ideas is
given below.
The seminal work is due to Reid [58], who proposed the multiple hypothesis tracking
(MHT) technique to solve the tracking problem. In the MHT approach, a set of hypotheses is
maintained with each hypothesis reflecting the belief on the location of an individual
target. When a new set of observations arrives, the hypotheses are updated. Hypotheses
with minimal support are deleted and additional hypotheses are created to reflect new
evidence. The main drawback of the approach is that the number of hypotheses can grow
exponentially over time. Though heuristic filters [59–61] can be used to bound the search
space, this limits the scalability of the algorithm.
Target tracking also has been studied using Bayesian approaches [62]. The Bayesian
approach views tracking as a state estimation problem. Given some initial state and a
set of observations, the goal is to predict the object's next state. An optimal solution to
the problem is given by the Bayes filter [63, 64]. The Bayes filter produces optimal estimates by
integrating over the complete set of observations. The formulation is often recursive and
involves complex integrals that are difficult to solve analytically. Hence, approximation
schemes such as particle filters [57] and sequential Monte Carlo techniques [63] are often
used in practice.
Recently, Markov Chain Monte Carlo (MCMC) [65, 66] techniques have been
proposed. MCMC techniques attempt to approximate the optimal Bayes filter for multiple
target tracking. MCMC-based methods employ sequential Monte Carlo sampling and have been shown to
perform better than existing sub-optimal approaches such as MHT for tracking objects in
highly cluttered environments.
A common theme among most of the research in target tracking is its focus on
accurate tracking and detection of objects in real time in highly cluttered environments
over relatively short time periods. In a data warehouse context, the ability of techniques
such as MCMC to make fine-grained distinctions makes them ideal candidates when
performing operations such as drilldown that involve analytics over small time windows.
However, their applicability to entity resolution in a data warehouse is limited. In such a
context, summarization and visualization of historical trajectories smoothed over long time
intervals is often more useful. The model-based approach considered in this work seems a
more suitable candidate for such tasks.
2.3 Probabilistic Databases
Uncertainty management in spatiotemporal databases is a relatively new area of
research. Earlier work has focused on aspects of modeling uncertainty and query language
support [9, 67].
In the context of query processing, one of the earliest papers in this area is the
paper by Pfoser et al. [68] where different sources of uncertainty are characterized and
a probability density function is used to model errors. Hosbond et al. [69] extended this
work by employing a hyper square uncertainty region, which expands over time to answer
queries using a TPR-tree.
Trajcevski et al. [70] study the problem from a modeling perspective. They model
trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries
over trajectories in both space and time. However, the approach does not specify how to
choose the dimensions of the cylindrical region which may have to change over time to
account for shrinking or expanding of the underlying uncertainty region.
Cheng et al. [71] describe algorithms for time instant queries (probabilistic range
and nearest neighbor) using an uncertainty model where a probability density function
(PDF) and an uncertain region are associated with each point object. Given a location in
the uncertain region, the PDF returns the probability of finding the object at that location.
A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle
time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic
process with a time-parametric uniform distribution.
CHAPTER 3
SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES
In applications that produce a large amount of data describing the paths of
moving objects, there is a need to ask questions about the interaction of objects
over a long recorded history. In this chapter, the problem of computing joins over
massive moving object histories is considered. The particular join studied is the
“Closest-Point-Of-Approach” join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another?
3.1 Motivation
Applications that make use of spatial data frequently need to ask questions about
the interactions between spatial objects. A useful operation that enables one to answer
such questions is the spatial join. The spatial join is similar to the classical relational
join except that it is defined over two spatial relations based on a spatial predicate. The
objective of the join operation is to retrieve all object pairs that satisfy a spatial
relationship. One common class of predicates involves distance measures, where we are
interested in objects that were within a certain distance of each other. The query
“Find all restaurants within a distance of 10 miles of a hotel” is an example of a spatial
join.
For moving objects, the spatial join operation involves the evaluation of both a spatial
and a temporal predicate, and for this reason the join is referred to as a spatiotemporal
join. For example, consider the spatial relations PLANES and TANKS, where each relation
represents accumulated trajectory data of planes and tanks from a battlefield simulation.
The query “Find all planes that are within distance 10 miles of a tank” is an example of a
spatiotemporal join. The spatial predicate in this case restricts the distance (10 miles) and
the temporal predicate restricts the time period to the current time instance.
In the more general case, the spatiotemporal join is issued over a moving object
history, which contains all of the past locations of the objects stored in a database. For
example, consider the query “Find all pairs of planes that came within a distance of 1,000
feet of one another during their flight paths”. Since there is no restriction on the temporal predicate,
answering this query involves an evaluation of the spatial predicate at every time instance.
The amount of data to be processed can be overwhelming. For example, in a typical
flight, the flight data recorder stores about 7 MB of data, which records, among other
things, the position and time of the flight for every second of its operation. Given
that on average US Air Traffic Control handles around 30,000 flights in a single day,
archiving all of this data would result in around 200 GB of data accumulation
in a single day. For another example, it is not uncommon for scientific simulations to
output terabytes or even petabytes of spatiotemporal data (see Abdulla et al. [7] and the
references contained therein).
In this chapter, the spatiotemporal join problem is investigated for moving object
histories in three-dimensional space, with time considered as a fourth dimension. The
spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach
Join). By closest point of approach, we refer to the position at which two moving objects
attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of
the following type: “Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that
CPA-distance (p, q) ≤ d”. The goal is to retrieve all object pairs that are within a distance
d at their closest-point-of-approach.
Surprisingly, this problem has not been studied previously. The spatial join problem
has been well-studied for stationary objects in two- and three-dimensional space [45, 47–
49], however very little work related to spatiotemporal joins can be found in the literature.
There has been some work related to joins involving moving objects [75, 76] but the work
has been restricted to objects in a limited time window and does not consider the problem
of joining object histories that may be gigabytes or terabytes in size.
The contributions can be summarized as follows:
• Three spatiotemporal join strategies for data involving moving object histories are presented.
• Simple adaptations of existing spatial join processing algorithms, based on the R-tree structure and on a plane-sweeping algorithm, are explored for spatiotemporal histories.
• To address the problems associated with straightforward extensions of these techniques, a novel join strategy for moving objects based on an extension of the basic plane-sweeping algorithm is described.
• A rigorous evaluation and benchmarking of the alternatives is provided. The performance results suggest that significant speedups in execution time can be obtained with the adaptive plane-sweeping technique.
The rest of this chapter is organized as follows: In Section 3.2, the closest point
of approach problem is reviewed. In Sections 3.3 and 3.4, two obvious alternatives for
implementing the CPA Join – using R-trees and plane-sweeping – are described. In Section
3.5, a novel adaptive plane-sweeping technique that considerably outperforms competing
techniques is presented. Results from our benchmarking experiments are given in Section
3.6. Section 3.7 outlines related work.
3.2 Background
In this section, we discuss the motion of moving objects and give an intuitive
description of the CPA problem. This is followed by an analytic solution to the CPA
problem over a pair of points moving in a straight line.
3.2.1 Moving Object Trajectories
Trajectories describe the motion of objects in two- or three-dimensional space. Real-world
objects tend to have smooth trajectories and storing them for analysis often involves
approximation to a polyline. A polyline approximation of a trajectory connects object
positions, sampled at discrete time instances, by line segments (Figure 3-1).
In a database the trajectory of an object can be represented as a sequence of the form
〈(t1, ~v1), (t2, ~v2), . . . , (tn, ~vn)〉 where each ~vi represents the position vector of the object at
Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)
time instance ti. The arity of the vector describes the dimensions of the space. For flight
simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The
position of the moving objects is normally obtained in one of several ways: by sampling or
polling the object at discrete time instances, through the use of devices such as GPS, and so on.
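To make this representation concrete, a trajectory can be stored as a time-ordered list of (t, position) samples from which the polyline's segments are derived. The sketch below is illustrative; the `Trajectory` class and its field names are not part of the dissertation's implementation:

```python
from typing import List, Tuple

Vec = Tuple[float, ...]  # position vector; arity 2 for a car, 3 for a flight

class Trajectory:
    """Polyline approximation: samples (t_i, v_i), kept sorted by time."""
    def __init__(self, samples: List[Tuple[float, Vec]]):
        self.samples = sorted(samples, key=lambda s: s[0])

    def segments(self):
        """Consecutive sample pairs form the line segments of the polyline."""
        return list(zip(self.samples, self.samples[1:]))

# A 2-dimensional trajectory sampled at three time instances:
traj = Trajectory([(0.0, (0.0, 0.0)), (1.0, (1.0, 0.5)), (2.0, (2.0, 2.0))])
```

Each segment pairs a sample with its successor, so n samples yield n − 1 line segments.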
3.2.2 Closest Point of Approach (CPA) Problem
We are now ready to describe the CPA problem. Let CPA(p, q, d) over two straight-line
trajectories be evaluated as follows: if mindist denotes the distance between the two objects
at their closest point of approach, then we output true if mindist < d (the objects were
within distance d during their motion in space), and false otherwise. We refer to the
calculation of CPA(p, q, d) as the CPA problem.
The minimum distance mindist between two objects is the distance between the
object positions at their closest point of approach. It is straightforward to calculate
mindist once the CPA time tcpa, the time instance at which the objects reached their
closest distance, is known.
We now give an analytic solution to the CPA problem for a pair of objects on a
simple straight-line trajectory.
Calculating the CPA time tcpa. Figure 3-2 shows the trajectory of two objects p
and q in 2-dimensional space for the time period [tstart, tend]. The position of these objects
at any time instance t is given by p(t) and q(t). Let their positions at time t = 0 be p0
and q0 and let their velocity vectors per unit of time be u and v. The motion equations for
Figure 3-2. Closest Point of Approach Illustration
Figure 3-3. CPA Illustration with trajectories
these two objects are p(t) = p0 + tu; q(t) = q0 + tv. At any time instance t, the distance
between the two objects is given by d(t) = |p(t) − q(t)|.
Using basic calculus, one can find the time instance at which the distance d(t) is
minimum (when D(t) = d(t)² is a minimum). Solving for this time we obtain:

tcpa = −((p0 − q0) · (u − v)) / |u − v|²

Given this, mindist is given by |p(tcpa) − q(tcpa)|.
The distance calculation described above is applicable only between two
objects on a straight line trajectory. To calculate the distance between two objects on a
polyline trajectory, we apply the same basic technique. For trajectories consisting of a
chain of line-segments, we find the minimum distance by first determining the distance
between each pair of line-segments and then choosing the minimum distance.
As an example, consider Figure 3-3 which shows the trajectory of two objects in
2-dimensional space with time as the third dimension. Each object is represented by
an array that stores the chain of segments comprising the trajectory. The line-segments
are labeled by the array indices. To determine the qualifying pairs, we find the CPA
distance between the line segment pairs (p[1], q[1]), (p[1], q[2]), (p[1], q[3]), (p[2], q[1]),
(p[2], q[2]), (p[2], q[3]), (p[3], q[1]), (p[3], q[2]), and (p[3], q[3]), and take the minimum
distance among all evaluated pairs. The complete code for computing CPA(p,q,d) over
multi-segment trajectories is given as Algorithm 3-1.
Algorithm 1 CPA (Object p, Object q, distance d)
1: mindist = ∞
2: for (i = 1 to p.size) do
3:   for (j = 1 to q.size) do
4:     tmp = CPA Distance(p[i], q[j])
5:     if (tmp ≤ mindist) then
6:       mindist = tmp
7:     end if
8:   end for
9: end for
10: if (mindist ≤ d) then
11:   return true
12: end if
13: return false
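The analytic solution and the nested-loop predicate above can be sketched together in Python. In this sketch (all names are illustrative), a segment is a tuple (tstart, tend, p0, u) giving the position p(t) = p0 + (t − tstart)·u; clamping tcpa to the segments' common time interval, which the straight-line derivation glosses over, is handled explicitly:

```python
import math

def cpa_distance(s1, s2):
    """Minimum distance between two straight-line motions over their common
    time interval. A segment is (tstart, tend, p0, u) with position
    p(t) = p0 + (t - tstart) * u. Returns infinity when the segments do not
    overlap in time, since no CPA match is then possible."""
    t0, t1 = max(s1[0], s2[0]), min(s1[1], s2[1])
    if t0 > t1:
        return math.inf
    pos = lambda s, t: tuple(p + (t - s[0]) * v for p, v in zip(s[2], s[3]))
    w = tuple(a - b for a, b in zip(pos(s1, t0), pos(s2, t0)))  # relative position at t0
    r = tuple(a - b for a, b in zip(s1[3], s2[3]))              # relative velocity
    rr = sum(x * x for x in r)
    # t_cpa = t0 - (w . r) / |r|^2, clamped to the common interval [t0, t1]
    t_cpa = t0 if rr == 0 else min(max(t0 - sum(a * b for a, b in zip(w, r)) / rr, t0), t1)
    return math.dist(pos(s1, t_cpa), pos(s2, t_cpa))

def cpa(p, q, d):
    """Algorithm 1: true iff some segment pair of p and q comes within d."""
    mindist = min((cpa_distance(si, sj) for si in p for sj in q), default=math.inf)
    return mindist <= d
```

For two objects approaching head-on along the x axis with a constant y offset of 2, `cpa_distance` reports a minimum distance of 2 at their crossing time.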
In the next two Sections, we consider two obvious alternatives for computing the
CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and
Q, where CPA (p, q,d) evaluates to true. The first technique we describe makes use of an
underlying R-tree index structure to speed up join processing. The second methodology is
based on a simple plane-sweep.
3.3 Join Using Indexing Structures
Given numerous existing spatiotemporal indexing structures, it is natural to first
consider employing a suitable index to perform the join.
Though many indexing structures exist, unfortunately most are not suitable for the
CPA Join. For example, a large number of indexing structures like the TPR-tree [17],
REXP tree [77], TPR*-tree [78] have been developed to support predictive queries, where
the focus is on indexing the future position of an object. However, these index structures
are generally not suitable for the CPA Join, where access to the entire history is needed.
Indexing structures like MR-tree [26], MV3R-tree [27], HR-tree [28], HR+-tree [27]
are more relevant since they are geared towards answering time instance queries (in case
of MV3R-tree also short time-interval queries), where all objects alive at a certain time
instance are retrieved. The general idea behind these index structures is to maintain a
separate spatial index for each time instance. However, such indices are meant to store
discrete snapshots of an evolving spatial database, and are not ideal for use with the CPA Join
over continuous trajectories.
3.3.1 Trajectory Index Structures
More relevant are indexing structures specific to moving object trajectory histories
like the TB-tree, STR-tree [21] and SETI [18]. TB-trees emphasize trajectory preservation
since they are primarily designed to handle topological queries where access to the entire
trajectory is desired (segments belonging to the same trajectory are stored together). The
problem with TB-trees in the context of the CPA Join is that segments from different
trajectories that are close in space or time will be scattered across nodes. Thus, retrieving
segments in a given time window will require several random I/Os. In the same paper
[21], an STR-tree is introduced that attempts to balance spatial locality with
trajectory preservation. However, as the authors point out, STR-trees turn out to be a
weak compromise that performs no better than traditional 3D R-trees [20] or TB-trees.
More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space
statically into non-overlapping cells and uses a separate spatial index for each cell. SETI
might be a good candidate for CPA Join since it preserves spatial and temporal locality.
However, there are several reasons why SETI is not the most natural choice for a CPA
Join:
• It is not clear that SETI’s forest scales to a three-dimensional space. A 25 × 25 SETI grid in two dimensions becomes a sparse 25 × 25 × 25 grid with almost 20,000 cells in three dimensions.
• SETI’s grid structure is an interesting idea for addressing problems with high variance in object speeds (we will use a related idea for the adaptive plane-sweep algorithm described later). However, it is not clear how to size the grid for a given data set, and sizing it for a join seems even harder. It might very well be that relation R should have a different grid for R ⋈ S compared to R ⋈ T.
• For a CPA Join over a limited history, SETI has no way of pruning the search space, since every cell will have to be searched.
3.3.2 R-tree Based CPA Join
Given these caveats, perhaps the most natural choice for the CPA Join is the R-tree
[16]. The R-tree is a hierarchical, multi-dimensional index structure that is commonly
used to index spatial objects. The join problem has been studied extensively for R-trees
and several spatial join techniques exist [45, 46, 79] that leverage underlying R-tree
index structures to speed up join processing. Hence, our first inclination is to consider a
spatiotemporal join strategy that is based on R-trees. The basic idea is to index object
histories using R-trees and then perform a join over these indices.
The R-Tree Index
It is a very straightforward task to adapt the R-tree to index a history of moving
object trajectories. Assuming three spatial dimensions and a fourth temporal dimension,
the four-dimensional line segments making up each individual object trajectory are simply
treated as individual spatial objects and indexed directly by the R-tree. The R-tree and
its associated insertion or packing algorithms are used to group those line segments into
disk-page sized groups, based on proximity in their four-dimensional space. These pages
make up the leaf level of the tree. As in a standard R-tree, these leaf pages are indexed
by computing the minimum bounding rectangle that encloses the set of objects stored in
each leaf page. Those rectangles are in turn grouped into disk-page sized groups which are
themselves indexed. An R-tree index for 3 line segments moving through 2-dimensional
space is depicted in Figure 3-4.
Figure 3-4. Example of an R-tree
Basic CPA Join Algorithm Using R-Trees
Assuming that the two spatiotemporal relations to be joined are organized using
R-trees, we can use one of the standard R-tree distance joins as a basis for the CPA Join.
The common approach to joins using R-trees employs a carefully controlled, synchronized
traversal of the two R-trees to be joined. The pruning power of the R-tree index arises
from the fact that if two bounding rectangles R1 and R2 do not satisfy the join predicate,
then the join predicate cannot be satisfied by any two rectangles enclosed within R1 and
R2, respectively.
In a synchronized technique, both R-trees are traversed simultaneously, retrieving
object pairs that satisfy the join predicate. To begin with, the root nodes of both
R-trees are pushed into a queue. A pair of nodes from the queue is processed by pairing
up every entry of the first node with every entry in the second node to form the candidate
set for further expansion. Each pair in the candidate set that satisfies the join predicate is
pushed into the queue for subsequent processing. The strategy described leads to a BFS
pushed into the queue for subsequent processing. The strategy described leads to a BFS
(Breadth-First-Search) expansion of the trees. BFS-style traversal lends itself to global
optimization of the join processing steps [46] and works well in practice.
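The synchronized traversal can be sketched as follows. The `Node` class and `mbr_distance` helper below are hypothetical stand-ins for a real R-tree implementation, and the join predicate is a simple distance test between MBRs:

```python
from collections import deque

class Node:
    """Simplified R-tree node: mbr is a (low-corner, high-corner) pair;
    leaves hold data entries, internal nodes hold children."""
    def __init__(self, mbr, children=None, entries=None):
        self.mbr, self.children, self.entries = mbr, children or [], entries or []
    @property
    def is_leaf(self):
        return not self.children

def mbr_distance(a, b):
    """Minimum distance between two axis-aligned MBRs."""
    return sum(max(a[0][i] - b[1][i], b[0][i] - a[1][i], 0.0) ** 2
               for i in range(len(a[0]))) ** 0.5

def synchronized_join(root_p, root_q, d, leaf_pair_handler):
    """BFS-style synchronized traversal: a node pair is expanded only if its
    MBRs can contain points within distance d of one another."""
    queue = deque([(root_p, root_q)])
    while queue:
        n1, n2 = queue.popleft()
        if mbr_distance(n1.mbr, n2.mbr) > d:
            continue                          # pruning step: pair cannot qualify
        if n1.is_leaf and n2.is_leaf:
            leaf_pair_handler(n1, n2)         # run the exact CPA test on entry pairs
        else:
            # pair every entry of one node with every entry of the other
            for c1 in (n1.children or [n1]):
                for c2 in (n2.children or [n2]):
                    queue.append((c1, c2))
```

A production version would add the plane-sweep filtering and buffer-aware pair ordering discussed below; this sketch only shows the queue-driven BFS expansion.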
Figure 3-5. Heuristic to speed up distance computation
The distance routine is used in evaluating the join predicate to determine the distance
between two bounding rectangles associated with a pair of nodes. A node-pair qualifies
for further expansion if the distance between the pair is less than the limiting distance d
supplied by the query.
Heuristics to Improve the Basic Algorithm
The basic join algorithm can be improved by using several standard and
non-standard techniques for reducing the I/O and CPU costs of spatial joins. These
include:
• Using a plane-sweeping algorithm [45] to speed up the all-pairs distance computation when pairs of nodes are expanded and their children are checked for possible matches.
• Carefully ordering the processing of node pairs so that when each pair is considered, one or both of the nodes are in the buffer [46].
• Avoiding expensive distance computations by applying heuristic filters. Computing the distance between two 3-dimensional rectangles can be a very costly operation, since the closest points may lie at arbitrary positions on the faces of the rectangles. To speed this computation, the magnitudes of the diagonals of the two rectangles (d1 and d2) can be computed first. Next, we pick an arbitrary point from each of the rectangles (points P1 and P2) and compute the distance between them, called darbit. If darbit − d1 − d2 > djoin, then the two rectangles cannot contain any points as close as djoin to one another and the pair can be discarded, as shown in Figure 3-5. This provides for immediate dismissals with only three distance computations (or one, if the diagonal distances are precomputed and stored with each rectangle).
Figure 3-6. Issues with R-trees: fast moving object p joins with everyone
In addition, there are some obvious improvements to the algorithm that can be made
which are specific to the 4-dimensional CPA Join:
• The fourth dimension, time, can be used as an initial filter. If two MBRs or line segments do not overlap in time, then the pair cannot possibly be a candidate for a CPA match.
• Since time can be used to provide immediate dismissals without Euclidean distance calculations, it is given priority over the other attributes. For example, when a plane-sweep is performed to prune an all-pairs CPA distance computation, time is always chosen as the sweeping axis. The reason is that time will usually have the greatest pruning power of any attribute, since time-based matches must always be exact, regardless of the join distance.
• In our implementation of the CPA Join for R-trees, we make use of the STR packing algorithm [80] to build the trees. Because the potential pruning power of the time dimension is greatest, we ensure that the trees are well-organized with respect to time by choosing time as the first packing dimension.
Problem With R-tree CPA Join
Unfortunately, it turns out that in practice the R-tree can be ill-suited to the problem
of computing spatiotemporal joins over moving object histories. R-trees have a problem
handling databases with a high variance in object velocities. The reason for this is that
join algorithms which make use of R-trees rely on tight and well-behaved minimum
bounding rectangles to speed the processing of the join. When the positions of a set of
moving objects are sampled at periodic intervals, fast moving objects tend to produce
larger bounding rectangles than slow moving objects.
Figure 3-7. Progression of plane-sweep
One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects
on a 2-D plane for a given time period. A fast moving object such as p will be contained
in a very large MBR, while slower objects such as q will be contained in much smaller
MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR
associated with p can overlap many smaller MBRs, and each overlap will result in an
expensive distance computation (even if the objects do not travel close to one another).
Thus, any sort of variance in object velocities can adversely affect the performance of the
join.
3.4 Join Using Plane-Sweeping
The second technique considered is a join strategy based on a simple plane-sweep.
Plane-sweep is a powerful technique for solving proximity problems involving
geometric objects in a plane and has previously been proposed [49] as a way to efficiently
compute the spatial join operation.
3.4.1 Basic CPA Join using Plane-Sweeping
Plane-sweep is an excellent candidate for use with the CPA join because no matter
what distance threshold is given as input into the join, two objects must overlap in the
time dimension for there to be a potential CPA match. Thus, given two spatiotemporal
relations P and Q, we could easily base our implementation of the CPA join on a
plane-sweep along the time dimension.
We would begin a plane-sweep evaluation of the CPA join by first sorting the intervals
making up P and Q along the time dimension, as depicted in Figure 3-7. We then sweep
a vertical line along the time dimension. A sweepline data structure D is maintained
which keeps track of all line segments which are valid given the current position of the line
along the time dimension. As the sweepline progresses, D is updated with insertions (new
segments that became active) and deletions (segments whose time period has expired).
Segment pairs from both input relations that satisfy the join predicate are always present
in D, and they can be checked and reported during updates to D. Pseudo-code for the
algorithm is given below:
Algorithm 2 PlaneSweep (Relation P, Relation Q, distance d)
1: Form a single list L containing segments from P and Q sorted by tstart
2: Initialize sweepline data structure D
3: while not IsEmpty(L) do
4:   Segment top = popFront(L)
5:   Insert(D, top)
6:   Delete from D all segments s s.t. (s.tend < top.tstart) {remove segments that do not intersect the sweepline}
7:   Query(D, top, d) {report segments in D that are within distance d}
8: end while
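A minimal Python rendering of the sweep above, using a plain list for the sweepline structure D instead of a spatial index, and taking the proximity test as a caller-supplied predicate (both simplifications are ours, not the dissertation's):

```python
def plane_sweep(P, Q, d, within):
    """Sweep along the time dimension. Each segment is a dict with 'tstart',
    'tend', and whatever geometry the within(s1, s2, d) predicate needs; a
    'rel' tag ensures only cross-relation pairs are reported."""
    L = sorted([dict(s, rel='P') for s in P] + [dict(s, rel='Q') for s in Q],
               key=lambda s: s['tstart'])
    D, results = [], []
    for top in L:
        # drop segments that no longer intersect the sweepline
        D = [s for s in D if s['tend'] >= top['tstart']]
        # report segments currently on the sweepline that qualify against top
        for s in D:
            if s['rel'] != top['rel'] and within(s, top, d):
                results.append((s, top))
        D.append(top)
    return results
```

A real implementation would replace the list D with a quad-tree, oct-tree, or interval skip-list, and the predicate with the exact CPA distance test.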
In the case of the CPA join, assuming that all moving objects at any given moment
can be stored in main memory, any of a number of data structures can be used to
implement D, such as a quad- or oct-tree, or an interval skip-list [81]. The main
requirement is that the selected data structure make it easy to check the proximity
of objects in space.
3.4.2 Problem With The Basic Approach
Although the plane-sweep approach is simple, in practice it is usually too slow to
be useful for processing moving object queries. The problem has to do with how the
sweepline progression takes place. As the sweepline moves through the data space, it has
to stop momentarily at sample points (time instances at which object positions were
recorded) to process newly encountered segments into the data structure D. New segments
Figure 3-8. Layered Plane-Sweep
that are encountered at the sample point are added into the data structure and segments
in D that are no longer active are deleted from it.
Consequently, the sweepline pauses more often when objects with high sampling rates
are present, and the progress of the sweepline is heavily influenced by the sampling rates
of the underlying objects. For example, consider Figure 3-7 which shows the trajectory
of four objects in a given time period. In the case illustrated, object p2 controls the
progression of the sweepline. Observe that in the time-interval [tstart, tend], only new
segments from object p2 get added to D but expensive join computations are performed
each time with the same set of line segments.
The net result is that if the sampling rate of a data set is very high relative to the
amount of object movement in the data set, then processing a multi-gigabyte object
history using a simple plane-sweeping algorithm may take a prohibitively long time.
3.4.3 Layered Plane-Sweep
One way to address this problem is to reduce the number of segment level comparisons
by comparing the regions of movement of various objects at a coarser level. For example,
reconsider the CPA join depicted in Figure 3-7. If we were to replace the many oscillations
of object p2 with a single minimum bounding rectangle which enclosed all of those
oscillations from tstart to tend, we could then use that rectangle during the plane-sweep
as an initial approximation to the path of object p2. This would potentially save many
distance computations.
This idea can be taken to its natural conclusion by constructing a minimum bounding
box that encompasses the line-segments of each object. A plane-sweep is then performed
over the bounding boxes, and only qualifying boxes are expanded further. We refer to this
technique as the Layered Plane-Sweep approach since plane-sweep is performed at two
layers – one at a coarser level of bounding boxes and then at the finer level of individual
line segments.
One issue that must be considered is how much movement is to be summarized
within the bounding rectangle for each object. Since we would like to eliminate as many
comparisons as possible, one natural choice would be to let the available system memory
dictate how much movement is covered for each object. Given a fixed buffer size, the
algorithm will proceed as follows.
Algorithm 3 LayeredPlaneSweep(Relation P , Relation Q, distance d)
1: Segment s defined by [(xstart, xend), (ystart, yend), (zstart, zend), (tstart, tend)]
2: Assume a sorted list of object segments (by tstart) on disk
3: while there is still some unprocessed data do
4: Read in enough data from P and Q to fill the buffer
5: Let tstart be the first time tick which has not yet been processed by the plane-sweep
6: Let tend be the last time tick for which no data is still on disk
7: Bound the trajectory of every object present in the buffer by an MBR
8: Sort the MBRs along one of the spatial dimensions and then perform a plane-sweep along that dimension
9: Expand the qualifying MBR pairs to get the actual trajectory data (line segments)
10: Sort the line segments by tstart
11: Perform a final sweep along the time dimension to get the final result set
12: end while
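As a concrete illustration, the two-level filtering at the heart of the layered plane-sweep can be sketched in Python. This is a simplified sketch, not the implementation used in this chapter: the sweeps are replaced by nested loops for brevity, trajectories are 2-D, and `seg_dist`, which stands in for the segment-level CPA distance test, is supplied by the caller.

```python
from itertools import product

def mbr(segments):
    """Minimum bounding rectangle of a list of ((x1, y1), (x2, y2)) segments."""
    xs = [x for (x1, y1), (x2, y2) in segments for x in (x1, x2)]
    ys = [y for (x1, y1), (x2, y2) in segments for y in (y1, y2)]
    return (min(xs), min(ys), max(xs), max(ys))

def box_dist(a, b):
    """Minimum distance between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def layered_join(P, Q, d, seg_dist):
    """P, Q: dicts mapping object id -> list of trajectory segments.
    seg_dist: segment-level distance test (stands in for the CPA check)."""
    boxes_p = {p: mbr(s) for p, s in P.items()}
    boxes_q = {q: mbr(s) for q, s in Q.items()}
    result = []
    # Level 1: compare only the coarse bounding boxes.
    for p, q in product(P, Q):
        if box_dist(boxes_p[p], boxes_q[q]) > d:
            continue  # whole trajectories cannot come within d -- pruned
        # Level 2: expand qualifying pairs to individual segments.
        if any(seg_dist(sp, sq) <= d for sp in P[p] for sq in Q[q]):
            result.append((p, q))
    return result
```

The point of the sketch is the control flow: a cheap box-level test dismisses most pairs, and only the surviving pairs pay for segment-level distance computations.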
Figure 3-8 illustrates the idea. It depicts a snapshot of object trajectories starting
at some time instant tstart. Segments in the interval [tstart, tend] represent the maximum
that can be buffered in the available memory. A first-level plane-sweep is carried out over
the bounding boxes to prune pairs of objects that cannot qualify. Qualifying pairs are
expanded and a second-level plane-sweep is carried out over the individual line segments.
In the best case, there is an opportunity to process the entire data set through just three
comparisons at the MBR level.
Figure 3-9. Problem with using large granularities for bounding box approximation
3.5 Adaptive Plane-Sweeping
While the layered plane-sweep typically performs far better than the basic plane-sweep
algorithm, it may not always choose the proper level of granularity for the bounding box
approximations. This Section describes an adaptive strategy that carefully considers the
underlying object-interaction dynamics and adjusts the granularity dynamically in
response to the characteristics of the data.
3.5.1 Motivation
In the simple layered plane-sweep, the granularity for the bounding box approximation
is always dictated by the available system memory. The assumption is that pruning power
increases monotonically with increasing granularity. Unfortunately, this is not always the
case. As a motivating example, consider Figure 3-9. Assume available system memory
allows us to buffer all the line segments. In this case, the layered plane-sweep performs
no better than the basic plane-sweep, because all the object bounding boxes overlap with
one another and, as a result, no pruning is achieved at the first-level plane-sweep.
However, assume we had instead fixed the granularity to correspond to the time
period [tstart, ti], as depicted in Figure 3-10. In this case, none of the bounding boxes
overlap, and there are possibly many dismissals at the first level. Though less of the
buffer is processed initially, we are able to eliminate many of the segment-level distance
comparisons compared to a technique that bounds the entire time period, thereby
potentially increasing the efficiency of the algorithm. The entire buffer can then be
processed in a piece-by-piece fashion, as depicted in Figure 3-10. In general, the efficiency
of the layered plane-sweep is tied not to the granularity of the time interval that is
processed, but to the granularity that minimizes the number of distance comparisons.
3.5.2 Cost Associated With a Given Granularity
Since distance computations dominate the time required to compute a typical CPA
Join, the cost associated with a given granularity can be approximated as a function of the
number of distance comparisons that are needed to process the segments encompassed in
that granularity. Let nMBR be the number of distance computations at the box-level, let
nseg be the number of distance calculations at the segment-level, and let α be the fraction
of the time range in the buffer which is processed at once. Then the cost associated with
processing that fraction of the buffer can be estimated as:
costα = (nseg + nMBR) × (1/α)
This function reflects the fact that if we choose a very small value for α, we will have to
process many cut-points in order to process the entire buffer, which can increase the cost
of the join. As α shrinks, the algorithm becomes equivalent to the traditional plane-sweep.
On the other hand, choosing a very large value for α tends to increase (nseg + nMBR),
eventually yielding an algorithm which is equivalent to the simple, layered plane-sweep. In
practice, the optimal value for α lies somewhere in between the two extremes, and varies
from data set to data set.
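Reading the cost model as the per-pass comparison count scaled by the roughly 1/α passes needed to drain the buffer, it amounts to a one-line helper. The sketch below uses the text's notation; `n_mbr` and `n_seg` are the comparison counts observed (or estimated) at the candidate granularity.

```python
def cost(n_mbr, n_seg, alpha):
    """Estimated distance computations to process the whole buffer when a
    fraction alpha of its time range is handled per pass: roughly 1/alpha
    passes, each costing about (n_seg + n_mbr) comparisons."""
    assert 0.0 < alpha <= 1.0
    return (n_seg + n_mbr) / alpha
```

A small alpha keeps each pass cheap but multiplies the number of passes; a large alpha means one pass, but with inflated comparison counts.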
3.5.3 The Basic Adaptive Plane-Sweep
Figure 3-10. Adaptively varying the granularity
Given this cost function, it is easy to design a greedy plane-sweep algorithm that
repeatedly attempts to minimize costα in order to adapt to the underlying (and
potentially time-varying) characteristics of the data. At every iteration, the algorithm
simply chooses to process the fraction of the buffer that appears to minimize the overall
cost of the plane-sweep in terms of the expected number of distance computations. The
algorithm is given below:
Algorithm 4 AdaptivePlaneSweep(Relation P , Relation Q, distance d)
1: while there is still some unprocessed data do
2: Read in enough data from P and Q to fill the buffer
3: Let tstart be the first time tick which has not yet been processed by the plane-sweep
4: Let tend be the last time tick for which no data is still on disk
5: Choose α so as to minimize costα
6: Perform a layered plane-sweep from time tstart to tstart + α × (tend − tstart) {steps 5-9 of procedure LayeredPlaneSweep}
7: end while
Unfortunately, there are two obvious difficulties involved with actually implementing
the above algorithm:
• First, the cost costα associated with a given granularity is known only after the layered plane-sweep has been executed at that granularity.
• Second, even if we can compute costα easily, it is not obvious how we can compute costα for all values of α from 0 to 1 so as to minimize costα over all α.
These two issues are discussed in detail in the next two Sections.
3.5.4 Estimating Costα
This Section describes how to efficiently estimate costα for a given α using a simple,
online sampling algorithm reminiscent of the algorithm of Hellerstein and Haas [82].
At a high level, the idea is as follows. To estimate costα, we begin by constructing
bounding rectangles for all of the objects in P , considering their trajectories from time
tstart to tstart + α(tend − tstart). These rectangles are then inserted into an in-memory index, just as
if we were going to perform a layered plane-sweep. Next, we randomly choose an object q1
from Q, and construct a bounding box for its trajectory as well. This object is joined with
all of the objects in P by using the in-memory index to find all bounding boxes within
distance d of q1. Then:
• Let nMBR,q1 be the number of distance computations needed by the index to compute which objects from P have bounding rectangles within distance d of the bounding rectangle for q1, and
• Let nseg,q1 be the total number of distance computations that would have been needed to compute the CPA distance between q1 and every object p ∈ P whose bounding rectangle is within distance d of the bounding rectangle for q1 (this can be computed efficiently by performing a plane-sweep without actually performing the required distance computations).
Once nMBR,q1 and nseg,q1 have been computed for q1, the process can be repeated for a
second randomly selected object q2 ∈ Q, for a third object q3 and so on. A key observation
is that after m objects from Q have been processed, the value

µm = (1/m) · Σi=1..m (nMBR,qi + nseg,qi) · |Q|

represents an unbiased estimator for (nMBR + nseg) at α, where |Q| denotes the number of
data objects in Q.
In practice, however, we are not only interested in µm. We would also like to know at
all times just how accurate our estimate µm is, since at the point where we are satisfied
with our guess as to the real value of costα, we want to stop the process of estimating
costα and continue with the join.
Fortunately, the central limit theorem can easily be used to estimate the accuracy
of µm. Assuming sampling with replacement from Q, for large enough m the error of our
estimate will be normally distributed around (nMBR + nseg) with variance σ²m = σ²(Q)/m,
where σ²(Q) is defined as

σ²(Q) = (1/|Q|) · Σi=1..|Q| {(nMBR,qi + nseg,qi) · |Q| − (nMBR + nseg)}²

Since in practice we cannot know σ²(Q), it must be estimated via the expression

σ̂²(Qm) = (1/(m − 1)) · Σi=1..m {(nMBR,qi + nseg,qi) · |Q| − µm}²

(Qm denotes the sample of Q that is obtained after m objects have been randomly
retrieved from Q). Substituting into the expression for σ²m, we can treat µm as a normally
distributed random variable with variance σ̂²(Qm)/m.
In our implementation of the adaptive plane-sweep, we continue the sampling
process until our estimate for costα is accurate to within ±10% at 95% confidence. Since
95% of the standard normal distribution falls within two standard deviations of the mean,
this is equivalent to sampling until 2 · √(σ̂²(Qm)/m) is less than µm × 0.1.
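A minimal sketch of this estimator follows, with the per-object work abstracted into a caller-supplied function. Here `per_object_cost` is a hypothetical stand-in for the index probe and plane-sweep counting described above, and the z = 2 stopping rule mirrors the ±10% at 95% confidence criterion; the `min_m` floor and hard cap are illustrative safeguards, not from the dissertation.

```python
import random

def estimate_cost(Q_ids, per_object_cost, rel_err=0.1, z=2.0, min_m=30):
    """Online estimate of (n_MBR + n_seg) for one candidate granularity.
    per_object_cost(q) returns n_MBR_q + n_seg_q for a sampled object q.
    Sampling (with replacement) stops once the z-sigma error bound drops
    below rel_err * mean, or after a hard cap of 10*|Q| samples."""
    n = len(Q_ids)
    samples = []
    while True:
        q = random.choice(Q_ids)                # sample with replacement from Q
        samples.append(per_object_cost(q) * n)  # scale up by |Q| (unbiased)
        m = len(samples)
        mu = sum(samples) / m
        if m >= min_m:
            var = sum((x - mu) ** 2 for x in samples) / (m - 1)
            if (var / m) ** 0.5 * z <= rel_err * mu or m >= 10 * n:
                return mu
```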
3.5.5 Determining The Best Cost
We now address the second issue: how to compute costα for all values of α from 0 to
1 so as to minimize costα over all α.
Calculating costα for all possible values of α is prohibitively expensive and hence not
feasible in practice. Fortunately, we do not have to evaluate every value of α to determine
the best one. This is due to the following interesting fact: if we plot all possible values of
α against their respective associated costs, the resulting curve is not linear, but exhibits
a certain convexity. The region around the minimum of this convex curve represents a
sweet spot, and constitutes the feasible region for the best cost.
As an example, consider Figure 3-11, which shows the plot of the cost function for
various fractions α for one of the experimental data sets from Section 3.6.
Figure 3-11. Convexity of cost function illustration
Given this fact, we identify the feasible region by evaluating costαi for a small number k
of αi values. Given k, the number of allowed cut points, the fraction α1 can be determined
as follows:

α1 = r^(1/k) / r

where r = (tend − tstart) is the time range described by the buffer (the above formula
assumes that r is greater than one; if not, then the time range should be scaled accordingly).
In the general case, the fraction of the buffer considered by any αi (1 ≤ i ≤ k) is given by:

αi = (r · α1)^i / r
Note that since the growth rate of each subsequent αi is exponential, we can cover
the entire buffer with just a small k and still guarantee that we will consider some value
of αi that is within a factor of α1 of the optimal. After computing α1, α2, . . . , αk, we
successively evaluate these increasing buffer fractions and determine their associated
costs. From these k costs we determine the αi with the minimum cost.
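The geometric schedule of candidate granularities can be sketched as follows (names are illustrative; `cost_of` would be the sampling-based estimate of costα described in Section 3.5.4):

```python
def candidate_alphas(t_start, t_end, k):
    """Geometric schedule of k candidate granularities: alpha_i = r**(i/k) / r,
    so each candidate covers an exponentially longer prefix of the buffer
    and alpha_k = 1 covers the entire buffered time range."""
    r = t_end - t_start
    assert r > 1, "the text's formula assumes r > 1; otherwise rescale the range"
    return [r ** (i / k) / r for i in range(1, k + 1)]

def best_alpha(alphas, cost_of):
    """Evaluate the (estimated) cost of each candidate; keep the cheapest."""
    return min(alphas, key=cost_of)
```

For example, with r = 16 and k = 4 the candidates are 1/8, 1/4, 1/2, and 1 of the buffered time range.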
Note that if we choose α based on the evaluation of a small k, then it is possible that
the optimal choice of α may lie outside the feasible region. However, there is a simple
approach to solving this issue. After an initial evaluation of k granularities, consider
just the region starting before and ending after the best αi, and recursively reapply the
evaluation described above just in this region.
45
α3α2
α4 α5
α5
α3
tj
ti
tendtstart
α5
mincost
mincost
mincost
α3α2α1
α4
α4
α1
α1
α2
Figure 3-12. Iteratively evaluating k cut points
For instance, assume we chose αi after evaluation of k cut points in the time range r.
To further tune this αi, we consider the time range defined between the adjacent cut points
αi−1 and αi+1 and recursively apply cost estimation in this interval (i.e., evaluate k points
in the time range (tstart + αi−1 × r, tstart + αi+1 × r)). Figure 3-12 illustrates the idea.
This simple approach is very effective at considering a large number of choices of α.
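One level of this refinement can be sketched as follows. This is a simplified sketch: the candidates inside the sub-interval are spaced evenly here, whereas the chapter's schedule spaces them geometrically, and `cost_of` again stands in for the sampling-based cost estimate.

```python
def refine(alpha_lo, alpha_hi, k, cost_of, depth=1):
    """Recursively zoom in on the best cut point: evaluate k candidates
    inside (alpha_lo, alpha_hi), then repeat inside the interval between
    the neighbors of the winner until the allowed depth is exhausted."""
    step = (alpha_hi - alpha_lo) / (k + 1)
    cands = [alpha_lo + step * i for i in range(1, k + 1)]
    best = min(cands, key=cost_of)
    if depth <= 0:
        return best
    i = cands.index(best)
    lo = cands[i - 1] if i > 0 else alpha_lo      # adjacent cut point below
    hi = cands[i + 1] if i < k - 1 else alpha_hi  # adjacent cut point above
    return refine(lo, hi, k, cost_of, depth - 1)
```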
3.5.6 Speeding Up the Estimation
Restricting the number of candidate cut points can help keep the time required to
find a suitable value for α manageable. However, if the estimation is not implemented
carefully, the time required to consider the cost at each of the k possible time periods can
still be significant.
The most obvious method for estimating costα for each of the k granularities would
be to simply loop through each of the associated time periods. For each time period,
we would build bounding boxes around each of the trajectories of the objects in P , and
then sample objects from Q as described in Section 3.5 until the cost was estimated with
sufficient accuracy.
However, this simple algorithm results in a good deal of repeated work for each time
period, and can actually decrease the overall speed of the adaptive plane-sweep compared
to the layered plane-sweep. A more intelligent implementation can speed the optimization
process considerably.
In our implementation, we maintain a table of all the objects in P and Q, organized
on the ID of each object. Each entry in the table points to a linked list that contains a
chain of MBRs for the associated object. Each MBR in the list bounds the trajectory
of the object for one of the k time-periods considered during the optimization, and the
MBRs in each list are sorted from the coarsest of the k granularities to the finest. The
data structure is depicted in Figure 3-13.
Given this structure, we can estimate costα for each of the k values of α in
parallel, with only a small hit in performance associated with an increased value for k.
Any object pair (p ∈ P, q ∈ Q) that needs to be evaluated during the sampling process
described in Section 3.5 is first evaluated at the coarsest granularity corresponding to αk.
If the two MBRs are within distance d of one another, then the cost estimate for αk is
updated, and the evaluation is then repeated at the second coarsest granularity αk−1. If
there is again a match, then the cost estimate for αk−1 is updated as well. The process is
repeated until there is not a match. As soon as we find a granularity at which the MBRs
for p and q are not within a distance d of one another, we can stop the process,
because if the MBRs for p and q are not within distance d for the time period associated
with αi, then they cannot be within this distance for any time period αj where j < i.
The benefit of this approach is that in cases where the data are well-behaved and
the optimization process tends to choose a value for α that causes the entire buffer to be
processed at once, a quick check of the distance between the outer-most MBRs of p and q
is the only geometric computation needed to process p and q, no matter what value of k is
chosen.
The bounding box approximations themselves can be formed while the system buffer
is being filled with data from disk. As trajectory data are read from disk, we grow
the MBRs for each αi progressively. Since each αi represents a fraction of the buffer, the
updates to its MBR can be stopped as soon as that fraction of the buffer has been
filled. Similar logic can be used to shrink the MBRs when some fraction of the buffer is
consumed and to expand them when the buffer is refilled.
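The coarse-to-fine early-termination check can be sketched as follows (illustrative: `box_dist` is a caller-supplied minimum distance between boxes, and each chain is the per-object list of MBRs described above, ordered from αk down to α1):

```python
def matched_granularities(chain_p, chain_q, d, box_dist):
    """Evaluate an object pair against its MBR chains, sorted coarsest
    (alpha_k) to finest (alpha_1). Returns the number of granularities at
    which the pair is within distance d, stopping at the first miss: a
    finer-granularity MBR is contained in every coarser one, so once a
    coarser box pair is farther than d apart, every finer pair is too."""
    hits = 0
    for box_p, box_q in zip(chain_p, chain_q):
        if box_dist(box_p, box_q) > d:
            break  # no finer granularity can match; stop early
        hits += 1  # in the real optimizer, update this alpha's cost estimate
    return hits
```

In the well-behaved case described in the text, the very first (coarsest) comparison misses, and the pair is dismissed with a single geometric computation regardless of k.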
Figure 3-13. Speeding up the Optimizer
3.5.7 Putting It All Together
In our implementation of the adaptive plane-sweep, data are fetched in blocks
and stored in the system buffer. An optimizer routine is then called, which evaluates
k granularities and returns the granularity α with the minimum cost. Data in the
granularity chosen by the optimizer are then evaluated using the LayeredPlaneSweep
routine (described in Section 3.4). When the LayeredPlaneSweep routine returns,
the buffer is refilled and the process is repeated. The techniques described in the previous
Section are utilized to make the optimizer implementation fast and efficient.
3.6 Benchmarking
This section presents experimental results comparing the various methods discussed so
far for computing a spatiotemporal CPA Join: an R-tree, a simple plane-sweep, a layered
plane-sweep, and an adaptive plane-sweep with several parameter settings. The Section is
organized as follows. First, a description of the three three-dimensional temporal data sets
used to test the algorithms is given. This is followed by the actual experimental results
and a detailed discussion analyzing the experimental data.
3.6.1 Test Data Sets
The first two data sets that we use to test the various algorithms result from two
physics-based, N -body simulations. In both data sets, constituent records occupy 80B on
disk (80B is the storage required to record the object ID, time information, as well as the
position and motion of the object). Each data set is around 50 gigabytes in size.
Figure 3-14. Injection data set at time tick 2,650
The two data sets are as follows:
1. The Injection data set. This data set is the result of a simulation of the injection of two gasses into a chamber through two nozzles on opposite sides of the chamber, via the depression of pistons behind each of the nozzles. Each gas cloud is treated as one of the input relations to the CPA Join. In addition to heat energy transmitted to the gas particles via the depression of the pistons, the gas particles also have an attractive charge. The purpose of the join is to determine the speed of the reaction resulting from the injection of the gasses into the chamber, by determining the number of (near) collisions of the gas particles moving through the chamber. Both data sets consist of 100,000 particles, and the positions of the particles are sampled at 3,500 individual time ticks, resulting in two relations that are around 28 gigabytes in size each. During the first 2,500 time ticks, for the most part both gasses are simply compressed in their respective cylinders. After tick 2,500, the majority of the particles begin to be ejected from the two nozzles. A small sample of the particles in the data set is depicted in Figure 3-14, at time tick 2,650.
2. The Collision data set. This data set is the result of an N -body gravitational simulation of the collision of two small galaxies. Again, both galaxies contain around 100,000 star systems, and the positions of the systems in each galaxy are polled at 3,000 different time ticks. The size of the relations tracking each galaxy is around 24 gigabytes each. For the first 1,500 or so time ticks, the two galaxies merely approach one another. For the next thousand time ticks, there is an intense interaction as they pass through one another. During the last few hundred time ticks, there is less interaction as the two galaxies have largely gone through one another. The purpose of the CPA Join is to find pairs of star systems that approached closely enough to have a strong gravitational interaction. A small sample of the galaxies in the simulation is depicted in Figure 3-15, at time tick 1,500.
Figure 3-15. Collision data set at time tick 1,500
In addition, we test a third data set created using a simple, 3-dimensional random walk.
We call this the Synthetic data set (this data set was again about 50GB in size). The
speed of the various objects varies considerably during the walk. The purpose of including
this data set is to rigorously test the adaptability of the adaptive plane-sweep, by creating
a synthetic data set in which there are significant fluctuations in the amount of interaction
among objects as a function of time.
3.6.2 Methodology and Results
All experiments were conducted on a 2.4GHz Intel Xeon PC with 1GB of RAM. The
experimental data sets were each stored on an 80GB, 15,000 RPM Seagate SCSI disk.
For all three of the data sets, we tested an R-tree-based CPA Join (implemented as
described in Section 3.3; we used the STR R-tree packing algorithm [80] to construct an
R-tree for each input relation), a simple plane-sweep, and a layered plane-sweep
(implemented as described in Section 3.4).
We also tested the adaptive plane-sweep algorithm, implemented as described
in Section 3.5. For the adaptive plane-sweep, we also wanted to test the effect of the
two relevant parameter settings on the efficiency of the algorithm. These settings are
the number of cut points k considered at each level of the optimization performed by
the algorithm, as well as the number of recursive calls made to the optimizer. In our
experiments, we used k values of 5, 10, and 20, and we tested using either a single
recursive call or no recursive calls to the optimizer.
Figure 3-16. Injection data set experimental results
The results of our experiments are plotted in Figures 3-16 through 3-21.
Figures 3-16, 3-17, and 3-20 show the progress of the various algorithms as a function
of time, for each of the three data sets (only Figure 3-16 depicts the running time of the
adaptive plane-sweep making use of a recursive call to the optimizer). For the various
plane-sweep-based joins, the x-axis of the two plots shows the percentage of the join that
has been completed, while the y-axis shows the wall-clock time required to reach that
point in the completion of the join. For the R-tree-based join (which does not progress
through virtual time in a linear fashion) the x-axis shows the fraction of the MBR-MBR
pairs that have been evaluated at each particular wall-clock time instant. These values are
normalized so that they are comparable with the progress of the plane-sweep-based joins.
Figure 3-17. Collision data set experimental results
Figures 3-18, 3-19, and 3-21 show the buffer-size choices made by the adaptive
plane-sweeping algorithm using k = 20 and no recursive calls to the optimizer, as a
function of time for all three test data sets.
3.6.3 Discussion
On all three data sets, the R-tree was clearly the worst option. The R-tree indices
were not able to meaningfully restrict the number of leaf-level pairs that needed to be
expanded during join processing. This results in a huge number of segment pairs that
must be evaluated. It may have been possible to reduce this cost by using a smaller page
size (we used 64KB pages, a reasonable choice for a modern 15,000 RPM hard disk with a
fast sequential transfer rate), but reducing the page size is a double-edged sword. While it
may increase the pruning power in the index and thus reduce the number of comparisons,
it may also increase the number of random I/Os required to process the join, since there
will be more pages in the structure. Unfortunately, however, it is not possible to know
the optimal page size until after the index is created and the join has been run, a clear
weakness of the R-tree.
Figure 3-18. Buffer size choices for Injection data set
Figure 3-19. Buffer size choices for Collision data set
The standard plane-sweep and the layered plane-sweep performed comparably on the
three data sets we tested, and both far outperformed the R-tree. It is interesting to note
that the standard plane-sweep performed well when there was much interaction among
the input relations (when the gasses are expelled from the nozzles in the Injection data
set and when the two galaxies overlap in the Collision data set). During such periods
it makes sense to consider only very small time periods in order to reduce the number
of comparisons, leading to good efficiency for the standard plane-sweep. On the other
hand, during time periods when there was relatively little interaction between the input
relations, the layered plane-sweep performed far better because it was able to process large
time-periods at once. Even when the objects in the input relations have very long paths
during such periods, the input data were isolated enough that there tends to be little cost
associated with checking these paths for proximity during the first level of the layered
plane-sweep.
Figure 3-20. Synthetic data set experimental results
The adaptive plane-sweep was the best option by far for all three data sets,
and was able to smoothly transition from periods of low to high activity in the data
and back again, effectively picking the best granularity at which to compare the paths of
the objects in the two input relations.
From the graphs, we can see that the cost of performing the optimization causes the
adaptive approach to be slightly slower than the non-adaptive approach when optimization
is ineffective. In both data sets, this happens in the beginning, when the objects
are moving towards each other but are still far enough apart that no interaction takes
place. As expected, in both experiments, adaptivity begins to take effect when the objects
in the underlying data set start interacting. From Figures 3-18, 3-19, and 3-21 it can be
seen that the buffer-size choices made by the adaptive plane-sweep are very finely tuned to
the underlying object-interaction dynamics (decreasing with increasing interaction and
vice versa). In both the Injection and Collision data sets, the size of the buffer falls
dramatically just as the amount of interaction between the input relations increases. In
the Synthetic data set, the oscillations in buffer usage depicted in Figure 3-21 mimic
almost exactly the energy of the data as the objects perform their random walk.
The graphs also show the impact of varying the parameters to the adaptive
plane-sweep routine, namely, the number of cut points k considered at each level of
the optimization, and whether or not a chosen granularity is refined through recursive
calls to the optimizer. It is surprising to note that varying these parameters causes no
significant changes in the granularity choices made by the optimizer. The reason is that
with increasing interaction in the underlying data set, the optimizer has a preference
towards smaller granularities, and these granularities are naturally considered in more
detail due to the logarithmic way in which the search space is partitioned.
Another interesting observation is that the recursive call does not improve the
performance of the algorithm, for two reasons. First, since each invocation of the
optimizer is a separate attempt to find the best cut point in a different time range,
it is not possible to share work among the recursive calls. Second, it is likely that just
being in the feasible region, or at least a region close to it, is enough to enjoy significant
performance gains. Since the coarse first-level optimization already achieves that, further
optimization in terms of recursive calls to tune the chosen granularity does not seem to be
necessary.
In all of our experiments, k = 5 with no recursive call to the optimizer uniformly
gave the best performance. However, if the nature of the input data set is unknown
and the data may be extremely poorly behaved, then we believe a choice of k = 10
with one recursive call may be a safer, all-purpose choice. On one hand, the cost of
optimization will be increased, which may lead to a greater execution time in most cases
(our experiments showed about a 30% performance hit associated with using k = 10 and
one recursive call compared to k = 5). However, the benefit of this choice is that it is
highly unlikely that such a combination would miss the optimal buffer choice in a very
difficult scenario with a highly skewed data set.
3.7 Related Work
To our knowledge, the only prior work which has considered the problem of
computing joins over moving object histories is the work of Jeong et al. [51]. However,
their paper considers the basic problem at a high level. The algorithmic and implementation
issues addressed by our own work were not considered.
Figure 3-21. Buffer size choices for Synthetic data set
Though little work has been reported on spatiotemporal joins, there has been a
wealth of research in the area of spatial joins. The classical paper on spatial joins is due
to Brinkhoff, Kriegel, and Seeger [45] and is based on the R-tree index structure. An
improvement of this work was given by Huang et al. [46]. Hash-based spatial join strategies
have been suggested by Lo and Ravishankar [47], and Patel and DeWitt [48]. Arge et al.
[49] proposed a plane-sweep approach to address the spatial join problem in the absence of
underlying indexes.
Within the context of moving objects, research has been focused on two main areas:
predictive queries, and historical queries. Within this taxonomy, our work falls in the
latter category. In predictive queries, the focus is on the future position of the objects
and only a limited time window of the object positions need to be maintained. On the
other hand, for historical queries, the interest is on efficient retrieval of past history and
usually the index structure maintains the entire timeline of an object’s history. Due to
these divergent requirements, index structures designed for predictive queries are usually
not suitable for historical queries.
A number of index structures have been proposed to support predictive and historical
queries efficiently. These structures are generally geared towards efficiently answering
selection or window queries and do not study the problem of joins involving multi-gigabyte
object histories addressed by our work.
Index structures for historical queries include the 3D R-tree [20], spatiotemporal
R-trees and TB-trees (trajectory-bounding trees) [21], and linear quadtrees [22]. A technique
based on space partitioning is reported in [18]. For predictive queries, Saltenis et al. [17]
proposed the TPR-tree (time-parametrized R-tree) which indexes the current and
predicted future positions of moving point objects. They mention the sensitivity of
the bounding boxes in the R-tree to object velocities. An improvement of the TPR-tree
can be found in [78]. In [76], a framework to cover time-parametrized versions of spatial
queries by reducing them to the nearest-neighbor search problem has been suggested. In [23],
an indexing technique is proposed in which trajectories in a d-dimensional space are mapped
to points in a higher-dimensional space and then indexed. In [75], the authors propose a
framework called SINA in which continuous spatiotemporal queries are abstracted as a
spatial join involving moving objects and moving queries. An overview of different access
structures can be found in [83].
3.8 Summary
This chapter explored the challenging problem of computing joins over massive
moving object histories. We compared and evaluated obvious join strategies and
then described a novel join technique based on an extension to the plane-sweep. The
benchmarking results suggest that the proposed adaptive technique offers significant
benefits over the competing techniques.
CHAPTER 4
ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES
Sensor networks are steadily growing larger and more complex, as well as more
practical for real-world applications. It is not hard to imagine a time in the near future
where huge networks will be widely deployed. A key data mining challenge will be making
sense of all of the data that those sensors produce.
In this chapter, we study a novel variation of the classic entity resolution problem
that appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. For example, two records
from different data sources that have been loaded into a database may reference the names
“Joe Johnson” and “Joseph Johnson”. Entity resolution methodologies may be useful in
determining whether these records refer to the same person.
This problem also appears in sensor networks. A large number of sensors may all
produce data relating to the same object or event, and it becomes necessary to be able
to determine when this is the case. Unfortunately, the problem is exceptionally difficult
in a sensor application, for two primary reasons. First, sensor data is often not as rich
as classical, relational data, and often gives far fewer clues as to when two observations
correspond to the same object. The most extreme case is a simple motion sensor, which
will simply report a “yes” indicating that motion was sensed, along with a timestamp and
an approximate location. This provides very little information to make use of during the
resolution process. Second, there is the large number of data sources. The largest sensor
networks in existence today already contain on the order of one thousand sensors. This
goes far beyond what one might expect in a classical entity resolution application.
In this chapter, we consider a specific version of the entity resolution problem
that appears in sensor networks, where the goal is to monitor a spatial area in order
to understand the motion of objects through the area. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our
goal is to perform an accurate segmentation of all of the observations into sets, where each
set is associated with one object. Each set should also be annotated with the path of the
object through the area.
We develop a statistical, learning-based approach to solving the problem. We first
learn a model for the motion of the various objects through the spatial area, and then
associate sensor observations with the objects via a maximum likelihood procedure. The
major technical difficulty lies in using the sensor observations to learn the spatial motion
of the objects. A key aspect of our approach is that we make use of two different motion
models to develop a reasonable solution to this problem: a restricted motion model that is
easy to learn and yet is only applicable to smooth motion over small time periods, and a
more general model that takes as input the initial motion learned via the restricted model.
Some specific contributions of this work are as follows:
1. A unique expectation-maximization (EM) algorithm that is suitable for learning associations of spatiotemporal, moving object data is described. This algorithm allows us to recognize quadratic (fixed acceleration) motion in a large set of sensor observations.
2. We apply and extend the method of Bayesian filters for recognizing unrestricted motion to the case when a large number of interacting objects produce data, and it is not clear which observation corresponds to which object.
3. Experimental results show that the proposed method can accurately perform resolution over more than one hundred simultaneously moving objects, even when the number of moving objects is not known beforehand.
The remainder of this chapter is organized as follows: In the next section, we
state the problem formally and give an overview of our approach. We then describe
the generative model and define the PDFs for the restricted and unrestricted motion.
This is followed by a detailed description of the learning algorithms in section 4.4.
An experimental evaluation of the algorithms is given in section 4.5 followed by the
conclusion.
4.1 Problem Definition
Consider a large database of sensor observations associated with K objects moving
through some monitored area. Each observation is a potentially inaccurate ordered pair
of the form (x, t), where x is the position vector and t is the time instant of the observation.
Given a set of object IDs O = {o1, o2, ..., oK}, the entity resolution problem that we
consider consists of associating with each observation a label oi ∈ O such that all sensor
observations labeled oi were produced by the same object. That is, we are partitioning the
observations into K classes, where each class of observations represents the path of
a moving object on the field.
Figure 4-1. Mapping of a set of observations for linear motion
As an example, consider the set of observations {(2,8,0), (9,9,0), (4,11,1), (11,7,1),
(6,14,2), (13,5,2), (8,17,3), (15,3,3)} shown in Figure 4-1(a). Given that the underlying
motion is linear and K = 2, Figure 4-1(b) shows a mapping of the observations to
objects. Observations {(2,8,0), (4,11,1), (6,14,2), (8,17,3)} are associated with object 1, and
observations {(9,9,0), (11,7,1), (13,5,2), (15,3,3)} are associated with object 2. Though in
this case the classification was easy, the problem in general is hard due to a number of
factors, including:
• Paths traced by real-life objects tend to be irregular and often cannot be approximated with simple curves.
• The measurement process is not accurate and often introduces error into the observations, which needs to be taken into account in classification.
• Objects can cross paths, or track one another’s paths for relatively long time periods, complicating both the segmentation and the problem of figuring out how many objects are under observation.
4.2 Outline of Our Approach
In order to solve this problem in the general case, we make use of a model-oriented
approach. We model the uncertainty inherent in the production of sensor observations by
assuming that the observations are produced by a generative, random process.
The location of an object moving through space is expressed as a function of time. As
an object wanders through the data space, it triggers sensors that generate observations
in a cloud around it. Our model assumes that an object “generates” sensor observations
in a cloud around it by sampling repeatedly from a Gaussian (multidimensional normal)
probability density function (PDF) centered at the current location of the object, which
we denote by fobs. This randomized generative process nicely models the uncertainty
inherent in any sort of sensor-based system. As the object moves through the field, it
tends to trigger sensors that are located close to it, but the probability that a sensor is
triggered falls off the further the sensor is located from the object.
Given such a model, if we know the exact path of each object through the field, it
is a simple matter to perform the required partitioning of sensor observations by
making use of the principle of maximum likelihood estimation (MLE). Using MLE, each
observation is simply associated with the object that was most likely to have produced
it – that is, it is assigned to the object whose fobs function has the greatest value at the
location returned by the sensor.
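As an illustration, this MLE assignment step can be sketched in Python. This is a minimal sketch, not the dissertation's implementation: it assumes 2-D observations, an isotropic covariance (σ²·I in place of a full Σobs), and hypothetical helper names.

```python
import math

def gaussian_pdf(x, mean, var):
    """Isotropic 2-D Gaussian density with covariance var * I."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)

def mle_assign(observation, t, paths, var):
    """Assign an (x, t) observation to the object whose f_obs is largest.

    `paths` maps an object id to a function giving that object's (known)
    position at time t; `var` is the assumed observation-scatter variance.
    """
    scores = {oid: gaussian_pdf(observation, path(t), var)
              for oid, path in paths.items()}
    return max(scores, key=scores.get)
```

For instance, with the two linear paths of the Figure 4-1 example, the observation (6, 14) at t = 2 lies exactly on object 1's path and is assigned to object 1.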
Of course, the difficulty is that in order to make use of such a simple MLE, we must
first have access to the parameters defining fobs and the motion of the various objects
through the data space. This requires learning the parameters and the underlying motion
which can be very difficult – particularly so in our case, since we are unsure of the number
of objects K that are producing the sensor observations.
Figure 4-2. Object path (a) and quadratic fit for varying time ticks (b-d)
One of the key aspects of our approach is that to make the learning process feasible,
we rely on two separate motion models: a restricted motion model that is used for only
the first few time ticks in order to recognize the number and initial motion of the various
objects, and an unrestricted motion model that takes this initial motion as input and
allows for arbitrary object motion. Given this, the following describes our overall process
for grouping sensor observations into objects:
1. First, we determine K and learn the set of model parameters governing object motion under the restricted model for the first few time ticks.
2. Next, we use K as well as the object positions at the end of the first few time ticks as input into a learning algorithm for the remainder of the timeline. The goal here is to learn how objects move under the more general, unrestricted motion model.
3. Finally, once all object trajectories have been learned for the complete timeline, each sensor observation is assigned to the object that was most likely to have produced it.
Of course, this basic approach requires that we address two very important technical
questions:
1. What exactly is our model for how objects move through the data space, and how do objects probabilistically generate observations around them?
2. How do we “learn” the parameters associated with the model, in order to tailor the general motion model to the specific objects that we are trying to model?
In the next section, we consider the answer to the first question. In section 4.4, we
consider the answer to the second question.
4.3 Generative Model
In this section, we define the generative model that we make use of in order to
perform the classification.
The high-level goal is to define the PDF fobs(x, t | θ). For a particular object, fobs takes
as input a sensor observation x and a time t, and gives the probability that the object
in question would have triggered x at time t. fobs is parametrized on the parameter
set θ, which describes how the object in question moves through the data space, and how
it tends to scatter sensor observations around it.
Before we describe the restricted and general motion models formally, it is worthwhile
to explain the need for two separate models. The parameter set θ governing the motion
of the object through the data space is unknown, and must be inferred from the observed
data. This is a difficult task. The problem is compounded when the observed data are
a set of mixed sensor observations generated by an unknown number of objects. Given
the difficulty inherent in learning θ, we choose to make the initial classification problem
simpler by allowing only a very restricted type of quadratic motion characterized by
constant acceleration.
Furthermore, object paths tend to be complex only over a relatively long time period.
That is, motion seems unrestricted only when we take a long-term view. This is illustrated
in Figure 4-2, where the initial quadratic approximations (for time periods [0–3] and
[0–6]) are faithful to the object’s actual path. As the time period extends and the object
has a chance to change its acceleration, a simple quadratic fit is no longer appropriate
(Figure 4-2(d)).
We take advantage of the fact that a simple motion model may be reasonable for
a short time period, and learn the initial parameters of the generative process by using
a restricted motion model over a small portion δt of the time line. Once the initial
Figure 4-3. Object path in a sensor field (a) and sensor firings triggered by object motion(b)
parameters are learned, we can make use of the unrestricted model for the remainder of the
timeline, since there will be fewer unknowns and the computational complexity is greatly
reduced.
4.3.1 PDF for Restricted Motion
We will now describe the PDF for observations assuming a restricted motion model
that is valid for short time periods. In this model, the location of an object is expressed as
a function of time in the form of a quadratic equation. The restricted model assumes that
acceleration is constant. The position of an object at some time instant t is specified by
the parametric equation:

pt = a·t² + v·t + p0
where p0 represents the initial position of the object, v the initial velocity, and a the
constant acceleration.
We define the probability of an observation x at time t by the PDF:
fobs(x, t | θ) = fN(x | Σobs, pt)

where

fN(x | Σ, p) = ( 1 / (2π|Σ|^(1/2)) ) exp( −(1/2)(x − p)^T Σ^(−1) (x − p) )

is a Gaussian PDF that models the cloud of sensors triggered by the object at time t.
Figure 4-3 shows a typical scenario of how observations are generated. The parameter set
θ contains:
• The object’s initial position p0, initial velocity v, and acceleration a.
• The covariance matrix Σobs specifying how the object produces sensor readings around itself.
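The restricted-model density can be sketched as follows. This is a 2-D illustration assuming a diagonal covariance Σobs = σ²·I (the model itself allows an arbitrary covariance matrix); the function names are hypothetical.

```python
import math

def position(t, p0, v, a):
    """Constant-acceleration position: p_t = a*t^2 + v*t + p0 (componentwise)."""
    return tuple(ai * t * t + vi * t + pi for ai, vi, pi in zip(a, v, p0))

def f_obs(x, t, p0, v, a, sigma2):
    """Restricted-model observation density: a Gaussian centred at p_t.

    sigma2 * I stands in for Sigma_obs, so the density factorises per axis.
    """
    pt = position(t, p0, v, a)
    d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, pt))
    return math.exp(-d2 / (2.0 * sigma2)) / (2.0 * math.pi * sigma2)
```

The density peaks at the predicted position p_t and falls off with squared distance, matching the sensor-cloud picture of Figure 4-3.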
4.3.2 PDF for Unrestricted Motion
While the restricted motion model may be applicable for a reasonably small time
period, the fact is that accelerations do not remain constant for long. Thus, we make use
of a second PDF, providing for more irregular motion, that can be used over longer time
periods. In the more general PDF, at each time tick an object moves not in a nice, smooth
trajectory, but instead moves through the data space in a completely random fashion.
Given that the object’s position in space at time t− 1 is pt−1, the object’s position at time
t is simply pt−1 + N , where N is a multidimensional, normally distributed random variable
parameterized on the covariance matrix Σmot. This random motion provides for a much
more general model.
One result of using such a very general model is that there is no longer a simple
equation for pt. Instead, pt has to be modeled by a random variable Pt, which depends on
the random variable Pt−1 for the object’s position at time t − 1, which in turn depends on
the random variable Pt−2. Thus, the likelihood of observing pt must be specified
via a recursive PDF, where fmot(pt|θ) depends on fmot(pt−1|θ):
fmot(pt | θ) = ∫ fN(pt | Σmot, pt−1) · fmot(pt−1 | θ) dpt−1
As we will discuss in Section 4.4, the fact that an object’s position is not specified
precisely by the parameter set θ and the recursive nature of the PDF make this motion
model much more difficult to deal with, which is why this more general motion model is
not used throughout the timeline.
Given the PDF describing the distribution of an object’s location at time t, it is
then possible to give a PDF specifying the probability that we observe a sensor reading
corresponding to the object at time t:
fobs(x, t | θ) = ∫ fN(x | Σobs, pt) · fmot(pt | θ) dpt
Thus, for the more general PDF, the parameter set θ contains:
• The covariance matrix Σmot specifying the object’s random motion.
• The object’s initial position p0.
• The covariance matrix Σobs specifying how the object produces sensor readings around itself.
Before we can actually map observations to individual objects, we must be able
to learn the two underlying models. As we will discuss in detail subsequently, the term
“learn” has a different meaning for each of the two models.
In the case of the restricted model, “learning” consists of computing the parameter
set θ for each object, as well as determining the number of objects. Since this is a classical
parameter estimation problem, we will make use of an MLE framework that will be solved
by an EM algorithm.
In the case of the unrestricted model, determining the parameter set is easy, in the
sense that once the parameter set of the restricted model has been learned, θ for the
unrestricted model is already fully determined (see Section 4.3.2). However, this does not
mean that use of the unrestricted model is easy. Because the model allows for arbitrary
motion, fmot as described before is not very useful by itself. Thus, our “learning” of the
unrestricted model makes use of Bayesian methods to update and restrict fmot. As we
process the data, sensor observations from the database are used in Bayesian fashion to
update fmot in order to refect a more refined belief in the position of the object. This
updated fmot places less weight in portions of the data space that do not contain sensor
observations relevant to an object in question, and more weight in portions that do.
4.4 Learning the Restricted Model
We begin our discussion of how to learn the restricted model by assuming that the
number of objects K is known. We will address the extension to unknown K subsequently.
4.4.1 Expectation Maximization
Given a set of observations produced by a single object and the associated PDF
fobs(x, t | θ), the parameter set θ can be learned by applying a relatively simple MLE.
However, in our case, the observations come from a set of K unknown objects, where each
object potentially contributes some fraction α of the sample. Note that the individual α’s
need not be uniform, since an object moving in a dense field of sensors or a very large object
might produce more observations than an object moving in a sparser region. Given K
objects, the probability of an arbitrary observation x at time t is then given by:
p(x, t | Θ) = Σ_{j=1}^{K} αj · fobs(x, t | θj)

where Θ = {αj, θj | 1 ≤ j ≤ K} denotes the complete parameter set and αj represents the
fraction of data generated by the jth object, with the constraint Σ_{j=1}^{K} αj = 1.
Our goal is to learn the complete parameter set Θ. Applying MLE, we want to find a
Θ that maximizes the following likelihood function:
L(Θ | X) = ∏_{i=1}^{N} p(xi, ti | Θ)
where X = {(x1, t1), (x2, t2), · · · , (xN , tN)} is the set of observations from some initial
time period. As is standard practice, we instead try to maximize the log of the likelihood
since it makes the computations easier to handle:
log L(Θ | X) = log ∏_{i=1}^{N} p(xi, ti | Θ) = Σ_{i=1}^{N} log ( Σ_{j=1}^{K} αj · fobs(xi, ti | θj) )
Unfortunately, this maximization is difficult in general because we do not
know which observation was produced by which object. That is, if we had access to a
vector Y = {y1, y2, ..., yN}, where yi = j if the ith observation was generated by the jth
object, the maximization would be a relatively straightforward problem.
The fact that we lack access to Y can be addressed by making use of the EM
algorithm [84]. EM is an iterative algorithm that works by repeating the “E-Step”
and the “M-Step”. At all times, EM maintains a current guess as to the parameter set
Θ. In the E-Step, we compute the so-called “Q-function”, which is nothing more than the
expected value of the log-likelihood, taken with respect to all possible values of Y . The
probability of generating any given Y is computed using the current guess for Θ. This
removes the dependency on Y . The M-Step then updates Θ so as to maximize the value of
the resulting Q function. The process is repeated until there is little step-to-step change in
Θ.
In order to derive an EM algorithm for learning the restricted motion model, we must
first derive the Q function. In general, the Q function takes the form:
Q(Θ, Θ^g) = E[ log L(X, Y | Θ) | X, Θ^g ]
In our particular case, this can be expanded to:
Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{j=1}^{K} log( αj · p(xi, ti | θj) ) · Pj,i

where Θ^g = {αj^g, θj^g | 1 ≤ j ≤ K} represents our current guess for the various parameters
of the K objects, and Pj,i is the posterior probability that the ith observation came from the
jth object, given (by Bayes’ rule) by the formula:

Pj,i = P(j | xi, ti) = ( αj^g · p(xi, ti | θj^g) ) / ( Σ_{k=1}^{K} αk^g · p(xi, ti | θk^g) )
Once we have derived Q, we need to maximize Q with respect to Θ. Notice that we can
isolate the term containing αj from the term containing θj by rewriting the Q function as
follows:

Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{j=1}^{K} log(αj) · Pj,i + Σ_{i=1}^{N} Σ_{j=1}^{K} log( p(xi, ti | θj) ) · Pj,i
We can now maximize the above equation with respect to various parameters of interest.
This can be done using standard optimization methods [85]. Doing so results in the
following update rules for the parameter set θj for the jth object:
[ Σi Pj,i       Σi ti·Pj,i    Σi ti²·Pj,i ]   [ µj ]   [ Σi xi·Pj,i     ]
[ Σi ti·Pj,i    Σi ti²·Pj,i   Σi ti³·Pj,i ] · [ vj ] = [ Σi xi·ti·Pj,i  ]
[ Σi ti²·Pj,i   Σi ti³·Pj,i   Σi ti⁴·Pj,i ]   [ aj ]   [ Σi xi·ti²·Pj,i ]

Σj = ( Σi (xi − µj)(xi − µj)^T · Pj,i ) / ( Σi Pj,i ), and αj = (1/N) Σi Pj,i,

where each sum runs over i = 1, ..., N.
Given these equations, our final EM algorithm is given as Algorithm 6.
Algorithm 6 EM Algorithm
1: while Θ continues to improve do
2:   for each object j do
3:     for each observation i do
4:       Compute Pj,i
5:     end for
6:     Compute θj = (µj, vj, aj, αj, Σj) using Pj,i and the update rules given above
7:   end for
8: end while
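The E-step (computing the responsibilities Pj,i) and the αj update can be sketched as follows. This is a deliberately simplified, 1-D rendering under illustrative assumptions: scalar observations, a known shared variance in place of the per-object Σj, and hypothetical function names; a full M-step would additionally solve the normal equations above for (µj, vj, aj).

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(obs, params, alphas, var):
    """Posterior P[j][i] that observation i came from object j (Bayes' rule).

    obs: list of (x, t) pairs; params: per-object (p0, v, a) for the 1-D
    quadratic model p_t = a*t^2 + v*t + p0.
    """
    P = [[alphas[j] * gauss(x, a * t * t + v * t + p0, var) for x, t in obs]
         for j, (p0, v, a) in enumerate(params)]
    for i in range(len(obs)):  # normalise over objects, per observation
        z = sum(P[j][i] for j in range(len(params)))
        for j in range(len(params)):
            P[j][i] /= z
    return P

def m_step_alphas(P, n):
    """Mixture-weight update: alpha_j = (1/N) * sum_i P[j][i]."""
    return [sum(row) / n for row in P]
```

Iterating these two steps until Θ stops improving is exactly the loop of Algorithm 6.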
4.4.2 Learning K
So far we have assumed that the number of objects K is known. However, in practice,
we often have very little knowledge about K, thus requiring us to estimate it from the
observed data. The problem of choosing K can be viewed as the problem of selecting
the number of components of a mixture model that describes some observed data. The
problem has been well-studied, as it arises in many different areas of research, and a variety
of criteria have been proposed to solve the problem [86][87][88][89].
The basic idea behind the various techniques is as follows: Assume we have a model
for some observed data in the form of a parameter set ΘK = {θ1, ..., θK}. Further, assume
we have a cost function C(ΘK) to evaluate the cost of the model. In order to select the
model with the optimal number of components, we simply compute ΘK for a range of K
values and choose the one with the minimum cost:
K = argmin { C(ΘK) | Klow ≤ K ≤ Khigh }
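This selection rule amounts to a simple loop. The sketch below uses caller-supplied `fit_model` and `cost` callables (hypothetical names); in our setting, `fit_model` would run EM for a given K and `cost` would evaluate a criterion such as MML.

```python
def select_k(k_low, k_high, fit_model, cost):
    """Fit a model for each candidate K in [k_low, k_high] and keep the one
    of minimum cost.

    fit_model(k) returns a fitted parameter set Theta_K; cost(theta) scores it.
    Returns (best_k, best_theta).
    """
    best_k, best_theta, best_cost = None, None, float("inf")
    for k in range(k_low, k_high + 1):
        theta = fit_model(k)
        c = cost(theta)
        if c < best_cost:
            best_k, best_theta, best_cost = k, theta, c
    return best_k, best_theta
```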
The various techniques proposed in the literature can be distinguished by the cost
criterion they use to evaluate a model: AIC (Akaike’s Information Criterion), MDL
(Minimum Description Length) [88], MML (Minimum Message Length) [90], etc. For the
cost function, we make use of the Minimum Message Length (MML) criterion as it has
been shown to be competitive and even superior to other techniques [89]. MML is an
information theoretic criterion where different models are compared on the basis of how
well they can encode the observed data. The MML criterion nicely captures the tradeoff
between the number of components and model simplicity. The general formula [89] for the
MML criterion is given by:
C(ΘK) = −log h(ΘK) − log L(X | ΘK) + (1/2) log |I(ΘK)| + (c/2)(1 + 1/12)
where h(·) describes the prior probabilities of the various parameters, L(·) is the likelihood
of observing the data, and |I(ΘK)| is the determinant of the Fisher information matrix of the
observed data. For our specific case, we need a formulation that is applicable to Gaussian
distributions [87]:
C(ΘK) = (P/2) Σ_{j=1}^{K} log( N·αj / 12 ) + (K/2) log( N/12 ) + K(P + 1)/2 − log L(X | ΘK)

where P is the number of parameters per component and N is the number of observations.
One final issue that should be mentioned with respect to choosing K is computational
efficiency. It is clearly unacceptable to re-run the entire EM algorithm for every possible
K in order to minimize the MML criterion. A solution to this is the component-wise EM
framework [91]. In this variation of EM, a model is first learned with a very large K value.
Then, in an iterative fashion, poor components are pruned off and the model is re-adjusted
to incorporate any data that is no longer well-fit. For each resulting value of K, the MML
criterion is checked and the best model is chosen.
4.5 Learning Unrestricted Motion
Once the restricted model has been learned over a short portion of the timeline,
the next step is to use the learned parameters as a starting point in order to extend our
estimate of each object’s position to the remainder of the timeline.
The learning process for the unrestricted model is quite different than for the
restricted model. The restricted model makes use of a classical parameter estimation
framework, where the goal is to learn the parameter set governing the motion of the object.
In the unrestricted case, the values for the parameter set for fmot are fully defined before
the learning ever begins:
• An object’s initial position p0 at the time the unrestricted model becomes applicable can be computed directly from the parameters learned over the restricted model.
• Σobs can be taken directly from the restricted model, since it is one of the learned parameters.
• Σmot can be determined in a number of ways from the restricted model, such as by using an MLE over the object’s time-tick to time-tick motion for the trajectory learned under the restricted model.
Thus, rather than relying on classical parameter estimation, we instead make use
of Bayesian techniques [62] to update fmot to take into account the various sensor
observations. fmot defines a distribution over pt for every time-tick t, which can be
viewed as describing a belief in the object’s position at time-tick t. In a Bayesian fashion,
this belief (i.e., distribution) can be updated and made more accurate by making use of
the observed data. We will use the notation f_mot^t to denote fmot updated with all of the
information up until time-tick t. Such Bayesian techniques for modeling motion are often
referred to as Bayesian filters [64].
4.5.1 Applying a Particle Filter
The mathematics associated with using a large number of discrete sensor observations
to update fmot quickly becomes very difficult – particularly in the case of Gaussian motion
– resulting in an unwieldy multi-modal posterior distribution. To address this, we make
use of a common method for rendering Bayesian filters practical, called a particle filter
[57]. A particle filter simplifies the problem by representing f_mot^t by a set of discrete
“particles”, where each particle is a possible current position for the object. We denote
the set of particles associated with time-tick t as St, and the ith particle in this set is
St[i]. The ith particle has a non-negative weight wi attached to it, with the constraint
Σi wi = 1. Highly-weighted particles indicate that the object is more likely to be located
at or close to the particle’s position. Given St, f_mot^t(pt) simply returns wi if pt = St[i], and
0 otherwise.
The basic application of a particle filter to our problem is quite simple (though there
is a major complication that we will consider in the next subsection). To compute f_mot^t for
any time tick t, we use a recursive algorithm. For the base case t = 0, we have a single
particle located at p0, having weight one. Then, given a set of particles St−1 for time-tick
t − 1, the set St for time tick t is computed as given in Algorithm 8.
Algorithm 8 Sampling a particle cloud
1: for i = 1 to |St| do
2:   Roll a biased die so that the chance of rolling k is wk
3:   Sample from fN(Σmot, St−1[k])
4:   Add the result as the ith particle of St
5: end for
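This resampling step can be sketched in one function; a 1-D illustration assuming scalar particle positions and a scalar motion standard deviation in place of Σmot (names are hypothetical; `random.choices` plays the role of the biased die).

```python
import random

def advance_particles(particles, weights, sigma_mot):
    """One particle-filter propagation step (Algorithm 8, 1-D sketch).

    Resample particle positions with probability proportional to their
    weights, then perturb each draw with Gaussian random motion. The
    returned cloud is the (uniform-weight) prior for the next time tick.
    """
    drawn = random.choices(particles, weights=weights, k=len(particles))
    return [p + random.gauss(0.0, sigma_mot) for p in drawn]
```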
Essentially, this is nothing more than sampling from the distribution representing
the object’s position at time t − 1, and then adding the appropriate possible random
motion to each sample. At this point, all weights are uniform. This gives us a discrete
representation of the prior distribution for the object’s position at time-tick t. We use
f_mot^t' to denote this prior.
The next step is to use the various sensor observations to update the prior weights in
order to obtain f_mot^t. For each particle:

wi = Pr[pt = St[i]] = ( ∏_{j=1}^{N} fN(xj | Σobs, St[i]) ) / ( Σ_{k=1}^{|St|} ∏_{j=1}^{N} fN(xj | Σobs, St[k]) )
Given St, it is then an easy matter to define an updated version of fobs that
corresponds to f_mot^t:

f_obs^t(x) = Σ_{i=1}^{|St|} wi · fN(x | Σobs, St[i])
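This weight update can be sketched as follows, again in 1-D with a scalar observation standard deviation standing in for Σobs. The log-space product is an implementation detail (not part of the formula) that avoids numerical underflow when many observation densities are multiplied; the normalization constant of the Gaussian cancels in the final normalization and is omitted.

```python
import math

def update_weights(particles, observations, sigma_obs):
    """Re-weight a particle cloud against this tick's observations (1-D sketch).

    w_i is proportional to the product over observations of a Gaussian
    density centred at particle i, then normalised across the cloud.
    """
    logw = [sum(-(x - p) ** 2 / (2.0 * sigma_obs ** 2) for x in observations)
            for p in particles]
    m = max(logw)                      # stabilise before exponentiating
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)
    return [wi / z for wi in w]
```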
4.5.2 Handling Multiple Objects
Unfortunately, the simple filter described in the previous subsection cannot be
applied directly to our problem. Unlike in most applications of particle filters, we actually
have many objects producing sensor observations. As a result, it may be that a given
observation xj has nothing to do with the current object, which we denote by φ. As such,
this observation should not be used to update our belief in the position of φ.
To handle this, we need to modify the basic framework. Rather than handling each
and every xj in a uniform fashion when dealing with φ, we instead associate a belief
(represented as a probability) with every xj. This belief tells us the extent to which we
think that xj was in fact produced by φ, and is computed as follows:
π_{φ,j} = f_obs^t'(xj | φ) / Σ_{k=1}^{K} f_obs^t'(xj | k)

Note that f_obs^t' is the uniform-weighted version of f_obs^t computed using the particles
associated with time-tick t, before the particle weights have been updated. f_obs^t'(x | φ)
denotes evaluation of the f_obs^t' function that is specifically associated with object φ at point
x.
Given this, we now need to produce an alternative formula for wi that takes into
account the possibility that xj was not produced by φ. In the case of a single object, wi
is simply the probability that φ would produce xj given that the actual object location
is St[i]. In the case of multiple objects, wi is the probability that the entire collection of
objects would produce xj given that the actual location of φ is St[i].
To derive an expression for this probability, we first compute the likelihood that
another object (other than φ) would produce x, given all of our beliefs regarding which
object produced x:

f_obs^t'(x | ¬φ) = ( Σ_{k≠φ} π_{k,j} · f_obs^t'(x | k) ) / ( 1 − π_{φ,j} )
Then, we can apply Bayes’ rule to compute wi:

wi = ( ∏_{j=1}^{N} [ (1 − π_{φ,j}) · f_obs^t'(xj | ¬φ) + π_{φ,j} · fN(xj | Σobs, St[i]) ] ) /
     ( Σ_{k=1}^{|St|} ∏_{j=1}^{N} [ (1 − π_{φ,j}) · f_obs^t'(xj | ¬φ) + π_{φ,j} · fN(xj | Σobs, St[k]) ] )
This formula bears a bit of additional discussion. In the numerator, we simply take
the product of all of the likelihoods that are associated with each sensor observation, given
that object φ is actually present at location St[i]. The likelihood of observing xj under this
constraint is (1 − π_{φ,j}) · f_obs^t'(xj | ¬φ) + π_{φ,j} · fN(xj | Σobs, St[i]). The first of the two terms
that are summed in this expression is the likelihood that an object other than φ would
produce xj, multiplied by our belief that this is the case. The second of the terms is the
likelihood that φ (located at position St[i]) would produce xj, multiplied by our belief that
this is the case. The denominator in the expression is simply the normalization factor,
which is computed by summing the likelihood over all possible positions of object φ.
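The association beliefs π_{φ,j} and the per-observation likelihood term just discussed can be sketched as follows; a 1-D illustration with caller-supplied density callables standing in for f_obs^t' (all names hypothetical).

```python
def association_beliefs(x, object_densities):
    """Belief pi_k that observation x was produced by object k.

    object_densities[k](x) plays the role of f_obs^t'(x | k).
    """
    scores = [f(x) for f in object_densities]
    z = sum(scores)
    return [s / z for s in scores]

def per_observation_term(x, beliefs, object_densities, phi, particle_density):
    """The term (1 - pi_phi) * f(x|not phi) + pi_phi * f_N(x|Sigma_obs, S_t[i]).

    particle_density(x) stands in for the Gaussian centred at particle S_t[i].
    """
    others = sum(b * f(x)
                 for k, (b, f) in enumerate(zip(beliefs, object_densities))
                 if k != phi)
    # f(x | not phi) = sum_{k != phi} pi_k f(x|k) / (1 - pi_phi)
    not_phi = others / (1.0 - beliefs[phi]) if beliefs[phi] < 1.0 else 0.0
    return (1.0 - beliefs[phi]) * not_phi + beliefs[phi] * particle_density(x)
```

Multiplying this term over all observations, for each particle, and normalizing across the cloud yields the weights wi of the formula above.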
wi = Pr[pt = St[i]] = ( ∏_{j=1}^{N} fN(xj | Σobs, St[i]) ) / ( Σ_{k=1}^{|St|} ∏_{j=1}^{N} fN(xj | Σobs, St[k]) )
However, the method has a problem. This update formula is valid only
when the observations Xt come from a single object (i.e., K = 1). For multiple objects
(K > 1) we cannot use this formulation, since we would be allowing observations from
other objects to influence the weight update of samples for any given object. To update
the belief of the jth object, we should ideally make use of only the set of observations
Xjt ⊂ Xt attributed to it. Thus, we need a new strategy for updating the weights of
samples for a given object that takes into account the contribution of other objects
present in the field to the observation set. Our modified update strategy is explained
below.
4.5.3 Update Strategy for a Sample given Multiple Objects
For the purpose of this discussion, we consider K objects where each object is
represented by |St| samples, and N observations produced at time t represented by Xt.
First, we need some definitions.
Prior Probability of an Object: We denote by srcj the probability that some
observation xi came from object j. This is obtained via Bayes’ rule as follows:

srcj = p(j | xi) = p(xi | j) / Σ_{k=1}^{K} p(xi | k)
Probability of an Observation: We define the probability of an observation xi in
reference to some object j by the function fobs. There are two variations of this function.

For a given object position pt of object j, the function fobs(xi | j) is described by a
Gaussian PDF fN(·) parametrized on θj = (pt, Σobs):

fobs(xi | j) = ( 1 / (2π|Σobs|^(1/2)) ) exp( −(1/2)(xi − pt)^T Σobs^(−1) (xi − pt) )
where Σobs describes how observations are scattered around the path of object j.

For a given sample position s_m^k of object k, the function fobs(xi | s_m^k) describes the
probability that sample m of object k can trigger observation xi. In this case, fobs is
parametrized on θ = (s_m^k, Σsensor), where Σsensor describes the width of the region around
which a sensor is able to record observations.
Likelihood of an Observation: The likelihood of an observation x_i with respect to
some object j is simply:

L(j | x_i) = c · ∑_{j=1}^{K} src_j · f_obs(x_i | j)

where c is a marginalizing constant. The likelihood can be viewed as the possibility that the
jth object triggered the observation.
Weight of a Sample: Given the prior src_j, the PDF f_obs, and the likelihood L(j | x_i),
we can update the weight associated with some sample m of object k as follows:

w^k_m = p(s^k_m | x_i) = (∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_m)) / (∑_{l=1}^{|S_t|} [∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_l)])
The update equation for a set of observations X_t = (x_1, · · · , x_N) is simply:

w^k_m = p(s^k_m | X_t) = (∏_{i=1}^{N} (∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_m))) / (∑_{l=1}^{|S_t|} ∏_{i=1}^{N} (∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_l)))
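The multi-object weight update can be sketched in code. This is a hedged illustration only: the helper names `f_obs_obj` and `f_obs_sample` stand in for the two variations of f_obs defined earlier, and the function name is ours, not an interface from the actual implementation:

```python
def update_sample_weights(k, samples_k, observations, src, f_obs_obj, f_obs_sample):
    """Normalized weights for the samples of object k, given all observations.

    samples_k    : candidate positions s^k_m for object k
    observations : observations x_i at the current time tick
    src          : src[j], prior probability of each object j (len K)
    f_obs_obj    : f_obs(x, j), density of x around object j's position
    f_obs_sample : f_obs(x, s), density of x around a sample position s
    """
    K = len(src)
    raw = []
    for s in samples_k:
        w = 1.0
        for x in observations:
            # Contribution of the other objects, plus object k placed at sample s.
            others = sum(src[j] * f_obs_obj(x, j) for j in range(K) if j != k)
            w *= others + src[k] * f_obs_sample(x, s)
        raw.append(w)
    total = sum(raw)
    return [w / total for w in raw]
```

Samples near the observations receive larger normalized weights, while observations explained by other objects contribute a common background term that dampens their influence.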
4.5.4 Speeding Things Up
A close examination of the formulas from the previous subsection shows that their
evaluation requires considerable computation, especially if the number of particles per
object, the number of objects, and/or the number of sensor observations per time tick
is very large. However, a couple of simple tricks can alleviate a substantial portion of
the complexity. First, when computing each π_{φ,j} value, we can make use of the average
or centroid of object φ in order to compute f^t_obs, rather than considering each particle
separately. Thus, we approximate f^t_obs with f^t_obs(x) ≈ f_N(x | Σ_obs + Σ_part, μ), where
μ = ∑_{i=1}^{|S_t|} w_i · S_t[i] and Σ_part is the covariance matrix computed over the positions of each
particle in S_t.
Second, for any given object, on average only slightly more than 1/K of the
observations at a given time tick actually apply to it. This is because we are usually
quite sure which observation applies to which object; only for a few observations is
this in doubt. For a given object, those observations that do not apply to it will have very
low corresponding π values and will not affect w_i. Thus, when computing w_i, we first drop
any observation j for which π_{φ,j} does not exceed ε. This can be expected to achieve a
speedup factor of close to K.
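Both speedup tricks can be sketched compactly. The function names below are illustrative; the sketch assumes 2-D sample positions and an association table `pi[obj][i]` holding the π_{φ,j} values:

```python
def centroid_and_spread(samples, weights):
    """Weighted centroid mu = sum_i w_i * S_t[i] and the particle
    covariance Sigma_part over 2-D sample positions."""
    mx = sum(w * s[0] for w, s in zip(weights, samples))
    my = sum(w * s[1] for w, s in zip(weights, samples))
    cxx = sum(w * (s[0] - mx) ** 2 for w, s in zip(weights, samples))
    cyy = sum(w * (s[1] - my) ** 2 for w, s in zip(weights, samples))
    cxy = sum(w * (s[0] - mx) * (s[1] - my) for w, s in zip(weights, samples))
    return (mx, my), [[cxx, cxy], [cxy, cyy]]

def prune_observations(observations, pi, obj, eps=1e-3):
    """Drop observations whose association probability pi[obj][i] does not
    exceed eps; they cannot meaningfully affect obj's sample weights."""
    return [x for i, x in enumerate(observations) if pi[obj][i] > eps]
```

The centroid and Σ_part feed the f_N(x | Σ_obs + Σ_part, μ) approximation, while the pruning step realizes the expected factor-of-K speedup.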
4.6 Benchmarking
This section presents an experimental evaluation of the proposed algorithm. The goal
of the evaluation is to answer the following questions:
• How many objects is the algorithm able to classify effectively?
• Is the algorithm able to effectively classify object observations that span a long period of time?
• How accurate is the proposed algorithm in classifying observations into objects?
• Is there an advantage to using the particle filter step?
Methodology. The experiments were conducted over a simple, synthetic database.
This allows us to easily measure the sensitivity of the algorithm to varying data
characteristics and parameter settings. The database stores sensor observations from a set
of moving objects spanning some time interval.
The database is generated as follows: the various objects are initialized randomly
throughout a 2D field and allowed to perform a random walk through the field. As the
objects move through the field, their positions at various time ticks are recorded. At each
time tick, sensor observations are generated as a Gaussian cloud around the various
object locations. A snapshot of recorded observations from 40 objects is shown in Figure
4-4.
For various parameter settings, we measure the wall-clock execution time required
to classify all the observations and the classification accuracy of the algorithm. As is
standard practice in machine learning, classification accuracy is measured through
recall and precision. For a given object, recall denotes the fraction of the observations
actually produced by the object that the classifier assigns to it, and precision denotes the
fraction of the observations assigned to the object that were actually produced by it.
Ideally, recall and precision should be close to 1.
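Under these definitions, recall and precision per object can be computed from the set of observation ids the classifier assigned to the object and the set it actually produced; a minimal sketch (the function name and edge-case conventions are ours):

```python
def recall_precision(assigned, truth):
    """Per-object recall and precision.

    assigned : set of observation ids the classifier gave to the object
    truth    : set of ids of observations the object actually produced
    """
    hits = len(assigned & truth)
    recall = hits / len(truth) if truth else 1.0
    precision = hits / len(assigned) if assigned else 1.0
    return recall, precision
```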
Given this setup, we vary five parameters and measure their effect on execution time and
classification accuracy:
1. numObj: the number of unique objects that produced the observations
2. numTicks: the number of time ticks over which observations were collected
3. stdDev: the standard deviation of the average Gaussian sensor cloud
4. numObs: the average number of sensor firings for a given Gaussian sensor cloud
5. emTime: the portion of the initial time line over which EM was used
Five separate tests were conducted:
1. In the first test, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, numObs is set at 5, and emTime is fixed at 5. numObj is varied from 10 to 110 objects in increments of 30.

2. In the second test, numObj is fixed at 40, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and numObs is set at 5. The time interval over which observations were recorded, numTicks, is varied in increments of 25 up to 100 time ticks.

3. In the third test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and the average number of sensor firings generated per object at each time tick, numObs, is varied from 5 to 25 in increments of 5.
Figure 4-4. The baseline input set (10,000 observations)
Figure 4-5. The learned trajectories for the data of Figure 4-4
4. In the fourth test, numObj is fixed at 40, numTicks is fixed at 50, and numObs is set at 5. We then vary the spread of the Gaussian cloud, stdDev, from 2% to 10% of the width of the field.
numObj    10     40      70      110
Recall    1.0    0.91    0.76    0.69
Precision 1.0    0.92    0.92    0.93
Runtime   9 sec  38 sec  131 sec 378 sec
Table 4-1. Varying the number of objects and its effect on recall, precision and runtime.
numTicks 25 50 75 100
Recall 0.93 0.91 0.75 0.64
Precision 0.96 0.93 0.92 0.92
Runtime 21 sec 38 sec 59 sec 72 sec
Table 4-2. Varying the number of time ticks.
numObs 5 10 15 20
Recall 0.91 0.91 0.91 0.92
Precision 0.93 0.92 0.92 0.91
Runtime 38 sec 71 sec 102 sec 134 sec
Table 4-3. Varying the number of sensors fired.
stdDev 2% 5% 7% 10%
Recall 0.91 0.90 0.88 0.80
Precision 0.93 0.94 0.91 0.83
Runtime 38 sec 37 sec 37 sec 38 sec
Table 4-4. Varying the standard deviation of the Gaussian cloud.
5. In the final test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, and numObs is set at 5. emTime is varied from 5 to 20 time ticks in increments of 5.
All tests were carried out on a dual-core Pentium PC with 2 GB RAM. The tests were
run in two stages. First, the EM algorithm is run to get an initial estimate of the number
of objects and their starting location. The number of time ticks over which EM is applied
is controlled by the emTime parameter. Next, the estimates produced by EM are used to
bootstrap the particle filter phase of the algorithm, which tracks the individual objects
for the rest of the timeline. In a post-processing step, the recall and precision values are
computed. Each test was repeated five times and the results were averaged across the
runs.
emTime 5 10 15 20
Recall 0.91 0.88 0.87 0.83
Precision 0.92 0.95 0.94 0.96
Runtime 38 sec 38 sec 37 sec 37 sec
Table 4-5. Varying the number of time ticks where EM is applied.
Results. The results from the experiments are given in Tables 4-1 through 4-5. All
times are wall-clock running times in seconds. The smallest database processed consisted
of around 5000 observations from 40 objects over 25 time ticks. The largest data set
processed consisted of around 40,000 observations from 40 objects over 50 time ticks. Disk
I/O cost is limited to the initial loading of the data set into main memory. To
give a visual illustration, the actual plot of the sensor firings and the learned trajectories
is given in Figures 4-4 and 4-5 for the baseline configuration (numObj 40, numTicks 50,
numObs 5, stdDev 2%, emTime 5).
Discussion. There are several interesting findings. In general, the accuracy of the
algorithm suffers as we vary the parameter of interest from low to high. The algorithm
seems to be particularly sensitive to both the number of objects and the length of the time
interval over which observations are obtained.
As Table 4-1 demonstrates, the classification accuracy suffers as we increase the
number of objects considered by the algorithm. This is because with increasing number
of objects, spatial discrimination is greatly reduced. Objects with observation clouds that
are not well separated are grouped together as a single component by the EM stage of the
algorithm. This has the effect of reducing the total number of objects that is tracked by
the particle filter stage of the algorithm, and the observations from the untracked objects
contribute to a reduced recall.
A somewhat different issue arises when the length of the time interval is increased
(Table 4-2). When the time interval is increased, we increase the chance that the paths
traced by two arbitrary objects will collide. Whenever object paths overlap or intersect,
the individual particle filters tracking the objects can no longer perform any meaningful
discrimination between the objects. When this happens, the filters end up dividing the
observations among themselves in an arbitrary manner. A somewhat subtle issue arises
when two objects intersect briefly and then diverge. In this case, the individual particle
filters may end up swapping the objects. Similar factors are in play when we increase the
spread of the sensor firings around object paths (Table 4-4).
A somewhat surprising finding (Table 4-3) is that increasing the density of observations
does not seem to cause any noticeable improvement in classification accuracy, other than
increasing the run times. Finally, our use of the EM stage only to bootstrap the particle
filter phase is validated by the results shown in Table 4-5. If EM is used for more than a
few initial time ticks, the limitations of the restricted model employed by EM come into
play and result in poor estimates being fed to the filter stage.
4.7 Related Work
There is a wealth of database research on supporting analytics over object paths. This
includes trajectory indexing [18, 21, 92–94], queries over tracks [75, 76, 83] and clustering
paths [35, 37, 38, 95, 96]. However, little work exists in the database literature on how to
actually obtain the object paths in the first place.
The only prior work in the database literature closely related to the problem we address is
the work of Kubica et al. [36]. Given a set of asteroid observations, they consider the problem
of linking observations that correspond to the same underlying asteroid. Their approach
consists of building a forest of k-d trees [97], one for each time tick, and performing a
synchronized search of all the trees, with exchange of information among tree nodes to
guide the search towards feasible associations. They assume that each asteroid has at most
one observation at every time tick and consider only simple linear or quadratic motion
models.
Model-based approaches [85, 98, 99] have been previously employed in target
tracking to map observations into targets. The focus there is primarily on supporting real-time
tracking using simple motion models. In contrast to existing research, we focus on aiding
the ETL process in a warehouse context to support entity resolution and provide historical
aggregation of object movements.
4.8 Summary
This chapter described a novel entity resolution problem that arises in sensor
databases and then proposed a statistical learning based approach to solving the problem.
The learning was carried out in two stages: an EM algorithm applied over a small portion
of the data in the first stage to learn initial object patterns, followed by a particle-filter
based algorithm to track the individual objects. Experimental results confirm the utility of
the proposed approach.
CHAPTER 5
SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS
For nearly 20 years, database researchers have produced various incarnations of
probabilistic data models [100–106]. In these models, the relational model is extended so
that a single stored database actually specifies a distribution of possible database states,
where these possible states are also called possible worlds. In this sort of model, answering
a query is closely related to the problem of statistical inference. Given a query over the
database, the task is to infer some characteristic of the underlying distribution of possible
worlds. For example, the goal may be to infer whether the probability that a specific tuple
appears in the answer set of a query exceeds some user-specified p.
Along these lines, most of the existing work on probabilistic databases has focused
on providing exact solutions to various inference problems. For example, imagine that
one relation R1 has an attribute lname, where exactly one tuple t from R1 has the value
t.lname = ‘Smith’. The probability of t appearing in a given world is 0.2. t also has
another attribute t.SSN = 123456789, which is a foreign key into a second database table
R2. The probability of 123456789 appearing in R2 is 0.6. Then (assuming that there are
no other ’Smith’s in the database) the probability that ’Smith’ will appear in the output
of R1 ×R2 can be computed exactly as 0.2 × 0.6 = 0.12.
Unfortunately, probabilistic data models where tuples or attribute values can be
described using simple, discrete probability distributions may be of only limited utility in
the real world. If the goal is to build databases that can represent the sort of uncertainty
present in modern data management applications, it is very useful to handle complex,
continuous, multi-attribute distributions. For example, consider an application where
moving objects are automatically tracked—perhaps by video, magnetic, or seismic
sensors—and the observed tracks are stored in a database. The standard, modern method
for automatic tracking via electronic sensory input is the so-called “particle filter” [63],
which generates a complex, time-parameterized probabilistic mixture model for each
object that is tracked. If this mixture model is stored in a database, then it becomes
natural to ask questions such as “Find all of the tracks that entered area A from time tstart
to time tend with probability of greater than (p× 100)%.” Answering this question involves
somehow computing the mass of each object’s time-varying positional distribution that
intersects A during the given time range, and checking if it exceeds p.
For many applications, such a problem can be quite difficult—it may be that no
closed-form (and integratable) probability density function (PDF) is even available. For
example, Bayesian inference [62] is a popular method that is commonly proposed as a
way to infer unknown or uncertain characteristics of data—one standard application of
Bayesian inference is automatically guessing the topic of a document such as an email.
The so-called “posterior” distribution resulting from Bayesian inference often has no
closed form, cannot be integrated, and can only be sampled from using tools such as
Markov chain Monte Carlo (MCMC) methods [107].
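For readers unfamiliar with MCMC, a minimal random-walk Metropolis sampler illustrates the idea of drawing samples from a density known only up to a normalizing constant; this is a generic textbook sketch, not tied to any particular system:

```python
import math
import random

def metropolis(log_post, x0, steps, scale=1.0):
    """Minimal random-walk Metropolis sampler: produces (correlated)
    samples from a distribution specified only by its unnormalized
    log-density log_post."""
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(steps):
        cand = x + random.gauss(0.0, scale)   # symmetric proposal
        lc = log_post(cand)
        # Accept the candidate with probability min(1, exp(lc - lp)).
        if math.log(random.random()) < lc - lp:
            x, lp = cand, lc
        chain.append(x)
    return chain
```

Because only a log-density is needed, the same interface covers posteriors with no closed form, exactly the situation described above.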
Thus, in the most general case, an integratable PDF is unavailable, and the user
can only provide an implementation of a pseudo-random variable that can be used to
provide samples from the probability distribution that he or she wishes to attach to an
attribute or set of correlated attributes. By asking only for a pseudo-random generator,
we can handle both difficult cases (such as the Bayesian case) and simpler cases where
the underlying distribution is well-known and widely used (such as Gaussian, Poisson,
Gamma, Dirichlet, etc.) in a unified fashion. Myriad algorithms exist for generating
Monte Carlo samples in a computationally efficient manner [108]. For more details on
how a database system might support user-defined functions for generating the required
pseudo-random samples, we point to our earlier paper on the subject [109].
Our Contributions. If the user is asked only to supply pseudo-random attribute value
generators, it becomes necessary to develop new technologies that allow the database
system to integrate the unknown density function underlying a pseudo-random generator
over the space of database tuples accepted by a user-supplied query predicate. In this
chapter, I consider the problem of how the required computations can and should be
performed using Monte Carlo in a principled fashion. I propose very general algorithms
that can be used to estimate the probability that a selection predicate evaluates to true
over a probabilistic attribute or attributes, where the attributes are supplied only in the
form of a pseudo-random attribute value generator.
The specific technical contributions are as follows:
• I carefully consider the statistical theory relevant to applying Monte Carlo methods to decide whether a database object is “sufficiently accepted” by the query predicate. Unfortunately, it turns out that due to peculiarities specific to the application of Monte Carlo methods to probabilistic databases, even so-called “optimal” methods can perform quite poorly in practice.

• I devise a new statistical test for deciding whether a database object should be included in the result set when Monte Carlo is used. In practice, the test can be used to scan a database and determine which objects are accepted by the query using far fewer samples even than existing, “optimal” tests.

• I also consider the problem of indexing for relational selection predicates over a probabilistic database to facilitate fast evaluation using Monte Carlo techniques.
Chapter Organization. In Section 2, we define the basic problem of evaluating selection
predicates when probabilistic attribute values can only be obtained via Monte Carlo
methods, and consider the false positive and false negative problems associated with
testing whether a database object should be accepted. Section 3 describes a classical test
from statistics that is very relevant, called the sequential probability ratio test (SPRT).
Section 4 describes our own proposed test which makes use of the SPRT. Section 5
considers the problem of indexing for our test. Experiments are described in Section 6,
Section 7 covers related work, and Section 8 concludes the chapter.
5.1 Problem and Background
In this section, we first define the basic problem: relational selection in a probabilistic
database where uncertainty is represented via a black-box, possibly multi-dimensional
sampling function. While we limit the scope by considering only relational selection, we
note that join predicates are really nothing more than simple relational selection over a
cross product, and so joins can be handled in a similar fashion. We begin by discussing
the basic problem definition, and then give the reader a step-by-step tour through the
relevant statistical theory, which will provide the necessary background to discuss our own
technical contributions in the following sections.
5.1.1 Problem Definition
We consider generic selection queries of the form:

SELECT obj
FROM MYTABLE AS obj
WHERE pred(obj)
USING CONFIDENCE p
pred() is some (any) relational selection predicate over database object obj. pred() may
include references to probabilistic and non-probabilistic attributes, which may or may
not be correlated with one another. For an example of this type of query, consider the
following:

SELECT v.ID
FROM VEHICLES v
WHERE v.latitude BETWEEN 29.69.32 AND 29.69.38 AND
      v.longitude BETWEEN -82.35.12 AND -82.35.19
USING CONFIDENCE 0.98
This query will accept all vehicles falling in the specified rectangular region with
probability higher than 0.98.
In general, our assumption is that there is a function obj.GetInstance() which
supplies one random instance of the object obj (note that a “random instance” could
contain a mix of deterministic and non-deterministic attributes, where deterministic
attributes have the same value in every sample). In our example, latitude and longitude
could be supplied by sampling from a two-dimensional Gaussian distribution.
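A minimal sketch of such a GetInstance() interface, assuming an axis-aligned 2-D Gaussian for position (the Vehicle class and its fields are hypothetical, for illustration only):

```python
import random

class Vehicle:
    """Hypothetical database object: a deterministic ID plus an uncertain
    position modeled as an axis-aligned 2-D Gaussian."""
    def __init__(self, vid, mu_lat, mu_lon, sd_lat, sd_lon):
        self.ID = vid
        self.mu_lat, self.mu_lon = mu_lat, mu_lon
        self.sd_lat, self.sd_lon = sd_lat, sd_lon

    def GetInstance(self):
        # Deterministic attributes keep the same value in every sample;
        # probabilistic attributes are re-drawn on each call.
        return {"ID": self.ID,
                "latitude": random.gauss(self.mu_lat, self.sd_lat),
                "longitude": random.gauss(self.mu_lon, self.sd_lon)}
```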
5.1.2 The False Positive Problem
Algorithm 9 MonteCarlo (MYTABLE, p, n)
1: result = ∅
2: for each object obj in MYTABLE do
3:   k = 0
4:   for i = 1 to n do
5:     if pred(obj.GetInstance()) = true then
6:       k = k + 1
7:     end if
8:   end for
9:   if (k/n ≥ p) then
10:    result = result ∪ obj
11:  end if
12: end for
13: return result
Given this interface, one way to answer our basic selection query is to use Monte
Carlo methods to guess whether or not each object should be contained in the answer set,
as described in Algorithm 5-1.
For every database object, a number of random instances of the object are generated
by the algorithm; the selection predicate pred() is applied to each of them, and the object
is accepted or rejected depending upon how many times pred() evaluates to true.
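The loop just described can be sketched as a short routine. This assumes only the GetInstance() interface from the problem definition, and is a hedged sketch rather than the system's actual implementation:

```python
def monte_carlo_select(table, pred, p, n):
    """Naive Monte Carlo selection (no error control): accept obj when
    at least a fraction p of its n random instances satisfies pred."""
    result = []
    for obj in table:
        # Draw n instances and count predicate acceptances.
        k = sum(1 for _ in range(n) if pred(obj.GetInstance()))
        if k / n >= p:
            result.append(obj)
    return result
```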
While this algorithm is quite simple, the obvious problem is that there may be some
inaccuracy in the result. For example, imagine that p is .95, n is 1000, and for a given
object, the counter k ends up as 955. While it may be likely that the probability that
the object is accepted by pred() exceeds .95, this does not mean that the object should
necessarily be accepted; there is a possibility that the “real” chance that an object is
accepted by pred() is 94%, but we just got lucky, and 95.5% of the samples were accepted.
The chance of making such an error is intimately connected with the value n; the larger
the n, the lower the chance of making an error.
For this reason, we might modify our query slightly so that the USING clause also
includes a user-specified false positive error rate:
USING CONFIDENCE p
FALSE POSITIVE RATE α
To actually incorporate this error rate into our computation, it becomes necessary
to modify Algorithm 5-1 so that it implements a statistical hypothesis test to guard
against the error. For a given object obj, the inner loop of Algorithm 5-1 runs n
Bernoulli (true/false) random trials, and counts how many true results are observed.
There are many ways to accomplish this. For example, if the real probability that
pred(obj.GetInstance()) is true is p, then the number of observed true results
will follow a binomial distribution. For a given database object obj, we will use π as
shorthand for this probability; that is, π = Pr[pred(obj.GetInstance()) = true]. Using
the binomial distribution, we can set up a proper, statistical hypothesis test with two
competing hypotheses:
H0 : π < p versus H1 : π ≥ p
To do this, we use the fact that:
Pr[k ≥ c | π ≤ p] ≤ ∑_{k=c}^{n} binomial(p, n, k)
Thus, if we want the chance of erroneously including obj in the result set when it should
not be included (that is, when H0 is true) to be less than a user-supplied α, we should
accept obj only if the number of observed true results, k, meets or exceeds the value c,
where ∑_{k=c}^{n} binomial(p, n, k) does not exceed the user-supplied α. Thus, we can first
compute the smallest such c, and replace the last “if” statement (lines (9)-(11)) in the
pseudo-code for Algorithm 5-1 with:
if k ≥ c then
result = result ∪ obj
end if
Then, we can be sure that we are unlikely to erroneously include “incorrect”
results in the answer set.
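The critical value c can be computed directly from the binomial tail; a small sketch (the function names are ours):

```python
from math import comb

def binomial_tail(p, n, c):
    """Pr[k >= c] when k ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(p, n, alpha):
    """Smallest c whose upper tail does not exceed alpha; accepting an
    object only when k >= c bounds the false positive rate by alpha."""
    # The tail shrinks as c grows, so scan upward for the first c that fits.
    for c in range(n + 1):
        if binomial_tail(p, n, c) <= alpha:
            return c
    return n + 1  # no count k can trigger acceptance
```

For example, with p = 0.5, n = 10, and α = 0.05, an object must produce at least 9 true results to be accepted.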
5.1.3 The False Negative Problem
There is a key problem with this approach: it only guards against false positives,
and provides no protection against false negatives; using the terminology common in the
statistical literature, this approach provides no guarantees as to the “power” of the test.1
Fortunately, standard statistical methods make it easy to handle a lower bound of
the power of the test. Assume that we alter the query syntax so that the desired power is
specified directly in the query:
USING CONFIDENCE p
FALSE POSITIVE RATE α
FALSE NEGATIVE RATE β
Then, given some small value ε, we wish to choose from one of the two alternatives:2
H0 : π = p + ε versus H1 : π = p− ε
When evaluating a query, either H0 or H1 should be chosen for each database object obj,
subject to the constraint that:
• Pr[Accept(H0) | π ≤ p− ε] is less than α
• Pr[Accept(H1) | π ≥ p + ε] is less than β
1 In fact, this particular binomial test is quite weak compared to other possible tests.
2 We assume that ε is an internal system parameter that is not chosen directly by the user; this is an important point discussed in depth in Section 4. ε is set to be small enough that no reasonable user would care about the difference between p + ε and p − ε. Since most PDFs stored in a so-called “probabilistic database” are the result of an inexact modeling and inference process that introduces its own error and renders very high-precision query evaluation of somewhat dubious utility, ε should not be too small in practice. We expect that on the order of 10^{−4} might be reasonable.
If these two constraints are met, then when H0 is accepted we can put obj into the answer
set and be sure that the probability of incorrectly including obj is at most α, and when H1
is accepted we can safely leave obj out of the answer set and be sure that the probability
of incorrectly dismissing obj is at most β.
Fortunately, it is quite easy to do this using standard statistical machinery known as
the Neyman-Pearson test [110], or Neyman test for short. For a given database object obj,
the Neyman test chooses between H0 or H1 by analyzing a fixed sample of size n drawn
using GetInstance(). The test relies on a likelihood ratio test (LRT) that compares
the probabilities of observing the sample sequence under H0 and H1. It is named after
a theoretical result (the Neyman-Pearson lemma) that states that a test based on LRT
is the most powerful test of all possible tests for a fixed sample size n comparing the
two simple hypotheses (i.e. it is a uniformly-most-powerful test). Since the Neyman test
for the Bernoulli (yes/no) probability case is given in many textbooks on hypothesis
testing, we omit its exact definition here. Given an implementation of a Neyman test that
returns ACCEPT if H1 is selected, it is possible to replace lines (9) to (11) of Algorithm 5-1
with:
if (Neyman (obj, pred, p, ε, α, β) = ACCEPT ) then
result = result ∪ obj
end if
The resulting framework will then correctly control the false positive and false
negative rates associated with the underlying query evaluation.
5.2 The Sequential Probability Ratio Test (SPRT)
While the Neyman test is theoretically “optimal”, it is important to carefully consider
what the word “optimal” means in this context: it means that no other test can choose
between H0 and H1 for a given α and β pair in fewer samples—specifically, no other
test can do a better job when either H0 or H1 is true. The problem is that in a real,
probabilistic database there is little chance that either H0 or H1 is true: these two
Figure 5-1. The SPRT in action. The middle line is the LRT statistic
hypotheses relate to specific, infinitely-precise probability values p + ε and p − ε, when in
reality the true probability π is likely to be either greater than p + ε or less than p − ε,
but not exactly equal to either of them. In this case, the Neyman test will still be correct
in the sense that, while still respecting α and β, it will choose H0 if π < p − ε and H1 if
π > p + ε. However, the test is somewhat silly in this case, because it still requires just
as many samples as it would in the hard case where π is precisely equal to one of these
values.
To make this concrete, imagine that p = .95, and after 100 samples have been taken
from GetInstance(), absolutely none of them have been accepted by pred(), but the
Neyman algorithm has determined that in the worst case, we need 10^5 samples to choose between
H0 and H1. Even though there is a probability of at most (1 − .95)^{100} < 10^{−130} of observing
100 consecutive false values if π were at least 0.95, the test cannot terminate, meaning
that we must still take 99,900 more samples. In this extreme case we would like to be able
to realize that there is no chance that we will accept H1 and terminate early with a result
of H0. In fact, this extreme case may be quite common in a probabilistic database where p
will often be quite large and pred() highly selective.
Not surprisingly, this issue has been considered in detail by the statistics community,
and there is an entire subfield of work devoted to so-called “sequential” tests. The basis
for much of this work is Wald’s famous sequential probability ratio test [111], or SPRT
for short. The SPRT can be seen as a sequential version of the Neyman test. At each
iteration, the SPRT draws another sample from the underlying data distribution, and uses
it to update the value of a likelihood ratio statistic. If the statistic exceeds a certain upper
threshold, then H1 is accepted. If it ever fails to exceed a certain lower threshold, then H0
is accepted. If neither of these things happen, then at least one more iteration is required;
however, the SPRT is guaranteed to end (eventually).
Thus, over time, the likelihood ratio statistic can be seen as performing a random
walk between two moving “goalposts”. As soon as the value of the statistic falls outside
of the goalposts, a decision is reached and the test is ended. The process is illustrated in
Figure 5-1. This plot shows the SPRT for a specific case where π = .5, ε = .05, p = .3, and
α = β = 0.05. The x-axis of this plot shows the number of samples that have been taken,
while the wavy line in the middle is the current value of the LRT statistic. As soon as the
statistic exits either boundary, the test is ended.
The key benefit of this approach is that for very low values of π that are very far from
p, H0 is accepted quickly (H1 is accepted with a similar speed when π greatly exceeds
p). All of this is done while fully controlling for the multiple-hypothesis-testing (MHT)
problem: when the test statistic is checked repeatedly, then extreme care must be taken
with respect to α and β because there are many chances to erroneously accept H0 (or
H1), and so the effective or real α (or β) can be much higher than what would naively
be expected. Furthermore, like the Neyman test, the SPRT is also “optimal” in the sense
that on expectation, it requires no more samples than any other sequential test to choose
between H0 and H1, assuming that one of the two hypotheses is true.
Just like the Neyman test, the SPRT makes use of a likelihood ratio statistic. In the
Bernoulli case we study here, after numAcc samples that are accepted by pred() out of num
Algorithm 10 SPRT (obj, pred, p, ε, α, β)
1: mult = a/b
2: tot = 0
3: numAcc = 0
4: constUp = log((1 − β)/α) / b
5: constDown = log(β/(1 − α)) / b
6: while (constDown + tot < numAcc < constUp + tot) do
7:   sample = obj.GetInstance()
8:   if (pred(sample) = true) then
9:     numAcc = numAcc + 1
10:  end if
11:  tot = tot + mult
12: end while
13: if (numAcc ≥ constUp + tot) then
14:   decision = ACCEPT
15: else
16:   decision = REJECT
17: end if
18: return decision
total samples, this statistic is:

λ = numAcc · log((p + ε)/(p − ε)) + (num − numAcc) · log((1 − p − ε)/(1 − p + ε))

Given λ, the test continues as long as:

log(β/(1 − α)) < λ < log((1 − β)/α)

For simplicity, this can be re-worked a bit. Let:

a = log((1 − p + ε)/(1 − p − ε)),  b = log((p + ε)/(p − ε)) − log((1 − p − ε)/(1 − p + ε))

Then the test continues as long as:

numAcc < log((1 − β)/α)/b + num · (a/b)

and:

numAcc > log(β/(1 − α))/b + num · (a/b)

This leads directly to the pseudo-code for the basic SPRT algorithm, which can be inserted
into Algorithm 5-1 to produce a test that uses an adaptive sample size to choose between
H0 and H1. The pseudo-code is given as Algorithm 5-2.
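A runnable version of the test, following the re-worked inequalities above. The function name `sprt` and the `max_iter` safety cap are our additions; the original test simply loops until a boundary is crossed:

```python
import math

def sprt(gen, p, eps, alpha, beta, max_iter=10**6):
    """Wald's SPRT for Bernoulli draws: decide whether the unknown
    acceptance probability pi is at least p + eps (ACCEPT) or at most
    p - eps (REJECT), controlling error rates alpha and beta."""
    a = math.log((1 - p + eps) / (1 - p - eps))
    b = math.log((p + eps) / (p - eps)) - math.log((1 - p - eps) / (1 - p + eps))
    up = math.log((1 - beta) / alpha) / b      # upper goalpost intercept
    down = math.log(beta / (1 - alpha)) / b    # lower goalpost intercept
    num_acc, tot = 0, 0.0
    for _ in range(max_iter):
        if gen():                  # one draw of pred(obj.GetInstance())
            num_acc += 1
        tot += a / b               # both goalposts drift upward each sample
        if num_acc >= up + tot:
            return "ACCEPT"
        if num_acc <= down + tot:
            return "REJECT"
    return "REJECT"                # safety cap; not part of the original test
```

With π far below p the statistic exits the lower goalpost after only a handful of samples, which is exactly the early-termination behavior the fixed-sample Neyman test lacks.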
5.3 The End-Biased Test
In this section, we devise a new, simple sequential test called the end-biased test that
is specifically designed to work well for queries over a probabilistic database.
5.3.1 What’s Wrong With the SPRT?
To motivate our test, it is first necessary to consider why the SPRT and its existing,
close cousins may not be the best choice for use in a probabilistic database.
The SPRT and its variants (of which there are many—see the related work section)
are widely-used in practice. Unfortunately, there is a key reason why the SPRT as it
was originally defined is not a good choice for use in the innermost loop of a selection
algorithm over a probabilistic database: the existence of the “magic” constant ε.
In classical applications of the SPRT, ε (that is, the distance between H0 and H1)
is carefully chosen in an application-specific fashion by an expert who understands the
significance of the parameter and its effect on the cost of running the SPRT. For example,
a widget producer may wish to sample the widgets that his/her assembly line produces to
see if the unknown rate of defective widgets is acceptable by sampling from the widgets
that are produced by the line. In this setting, ε would be chosen so that there is an
economically significant difference between H0 and H1, while at the same time taking
into account the adverse effect of a small ε; a small ε is associated with a large number
of (expensive) samples. That is, p − ε is likely chosen so that the defect rate is so low
that it would be a waste of money and time to stop the production line and determine
the problem. On the other hand, p + ε is chosen at the point where production must
be stopped, because so many defective widgets are produced that the associated cost is
unacceptable. The widget producer understands that if the true rate of defective widgets
is between p − ε and p + ε, the SPRT may return either result, so there is a natural
inclination to shrink its value; however, he/she is also strongly motivated to make ε as
large as possible because she/he also understands that a small ε will require that more
widgets be sampled, which increases the cost of the quality control program.
Unfortunately, in the context of a probabilistic database, the existence of a user-defined
ε parameter with such a profound effect on the cost of the test is highly problematic. We
contrast this with the fairly intuitive nature of the parameter p. A user might choose
p = 0.95 if she/he wants only those objects that are “probably” accepted by pred().
She/he might choose p = 0.05 if she/he wants any object with even a slight chance of
being accepted by pred(). p may even be computed automatically in the case of a top-k
query. But what about ε? Without an in-depth understanding of the effect of ε on the
SPRT, the choice of ε is necessarily arbitrary. A user may wonder, “why not simply choose
ε = 10⁻⁵ to ensure that all results are highly accurate?” The reason is that this may
(or may not) have a very significant effect on the speed of query evaluation, depending
upon many factors that include the particular predicate chosen as well as the underlying
probabilistic model—but it is not acceptable to ask an end-user to understand and
account for this!
5.3.2 Removing the Magic Epsilon
According to basic statistical theory, it is impossible to do away with ε altogether.
Intuitively, no statistical hypothesis test can decide between two options that are almost
identical. Thus, our goal is to take the choice of ε away from the user, and simply ship
the database software with an ε that is small enough that no reasonable user would care
about the error induced (see the relevant discussion in Footnote 1). The problem with this
plan is that the classical SPRT may require a very large number of samples to terminate
with a small ε. For example, consider the following, simple test. We choose p = 0.5,
ε = 10⁻⁵, and π = .2, and run Algorithm 5-3: in this case, it turns out that more than ten
thousand samples are required for the test to terminate. For a one-million object database,
generating this many samples per object is probably unacceptable.
Figure 5-2. Two spatial queries over a database of objects with Gaussian uncertainty
The unique problem in the database context is that while H0 and H1 are very close
to one another (due to a tiny ε), in reality, π is typically very far from both p − ε and
p + ε; usually, it will be close to zero or one. For example, consider Figure 5-2, which
shows a simple spatial query over a database of objects whose positions are represented
as two-dimensional Gaussian density functions (depicted as ovals in the figure). For both
the more selective query at the left and the less selective query at the right, only the
few objects falling on the boundary of the query region would have π ≈ p ± ε for any
user-specified p ≠ 0, 1.
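The effect illustrated by Figure 5-2 is easy to reproduce with a small Monte Carlo sketch of our own (all names and parameters hypothetical): estimating π for Gaussian-positioned objects shows values pinned near zero or one except for objects straddling the query boundary.

```python
import random

def inclusion_prob(mu, sigma, box, n=20000, seed=1):
    """Monte Carlo estimate of pi = Pr[a 2-D Gaussian-positioned object
    falls inside the axis-aligned query box (x0, y0, x1, y1)]."""
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    hits = sum(1 for _ in range(n)
               if x0 <= rng.gauss(mu[0], sigma) <= x1
               and y0 <= rng.gauss(mu[1], sigma) <= y1)
    return hits / n

box = (0.0, 0.0, 10.0, 10.0)
print(inclusion_prob((5, 5), 0.5, box))    # deep inside the box: pi ≈ 1
print(inclusion_prob((20, 20), 0.5, box))  # far outside: pi ≈ 0
print(inclusion_prob((10, 5), 0.5, box))   # straddling the boundary: pi ≈ 0.5
```

Only the boundary object yields a π anywhere near a typical user-supplied p, which is exactly the situation the end-biased test is designed around.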
This creates a unique setup that is quite different from classic applications of the
SPRT and its variants. In fact, the SPRT itself is provably optimal for only π values lying
at p − ε and p + ε; but for those far from these boundaries (such as at zero and one), it
may do quite poorly. Many other “optimal” tests have been proposed in the statistical
literature, but few seem to be applicable to this rather unique application domain—see the
experimental section as well as the related work section for more details.
5.3.3 The End-Biased Algorithm
As a result, we propose our own sequential test that is specifically geared towards
operating in a database environment, where (a) ε is vanishingly small, (b) π for the typical
object is close to 0 or 1, and (c) only for a few objects is π ≈ p.
The algorithm that we develop is called the end-biased test. Unlike many of the tests
from statistics, it has no optimality properties, but by design it functions very well in
practice—an issue we consider experimentally in Section 6.
Figure 5-3. The sequence of SPRTs run by the end-biased test
To perform the end-biased test, we run a series of pairs of individual hypothesis tests.
In the first pair of tests, one SPRT is run right after another:
• The first SPRT tries to reject the object quickly, in just a few samples, if this is
possible. To do this, a standard SPRT is run to decide between H0 : π = p/2, and
H1 : π = p + ε. If the SPRT accepts H0, then obj is immediately rejected. However,
if the SPRT accepts H1, then a second test is run.

• The second test tries to accept the object quickly, again in just a few samples, if this
is possible. To do this, a standard SPRT is run to decide between H0 : π = p − ε,
and H1 : π = p + (1 − p)/2. If the SPRT accepts H1, then obj is immediately accepted.
However, if the SPRT accepts H0, then the object survives for another round of
testing.
The first pair of tests is set up so that the “region of indifference” (that is, the region
between H0 and H1 in each test) is very large. A large region of indifference tends to
speed the execution of the test. Intuitively, the reason for this is that it is much easier to
decide between two disparate values for p such as p = .1 and p = .9 than it is to decide
between close values such as p = .1 and p = 0.100001, because the latter two values
for p can explain any given observation almost equally well. Thus, the relatively large
indifference ranges used by the first pair of SPRT sub-tests in the end-biased test tend to
allow values below p/2 or above p + (1 − p)/2 to be accepted or rejected very quickly.
The drawback of using a large region of indifference is that if π falls within either
test’s region of indifference, then the test can produce an arbitrary result that is not
governed by the test’s false positive and false negative parameters. Fortunately, since we
choose the region of indifference so that it always falls entirely below p + ε in the rejection
case (or above p − ε in the accept case), this will not cause problems in terms of the
correctness of the test. For example, in the rejection case, if H1 is accepted for an object
whose π value happens to fall in the region of indifference, then we do not immediately
(incorrectly) accept the object as an actual query result—rather, we will then run the
second SPRT to determine if the object should actually be accepted. The real cost of an
erroneous H1 for an object in the region of indifference is that the object is not
immediately pruned, so we will need to do more work.
If an object is neither accepted nor rejected by the first pair of tests, then a second pair
of tests must be run. This time, however, the size of the region of indifference is shrunk by
a factor of 1/2 for both the rejection test and the acceptance test. This means that more
samples will probably be required to arrive at a result in either test—due to the fact that
H0 and H1 will be closer to one another—but it also means that fewer objects will have π
values that fall in either test’s region of indifference. Specifically, the third SPRT that is
run is used to determine possible rejection using H0 : π = 3p/4 versus H1 : π = p+ ε. If the
SPRT accepts H0, then obj is immediately rejected. However, if the third SPRT accepts
H1, then a second test for acceptance (the fourth test overall) is run. This test checks
H0 : π = p − ε against H1 : π = p + (1 − p)/4. If the SPRT accepts H1, then obj is accepted;
otherwise, a third pair of tests is run, and so on.
This process is repeated, shrinking the region of indifference each time, until one of
two things happens:
1. The process terminates with either an accept or a reject in some test, or;
2. The space of possible π values for which the process would not have terminated falls
strictly in the range from p − ε to p + ε. In this case, an arbitrary result can be
chosen.
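The schedule of hypothesis pairs this process produces can be enumerated directly (an illustrative sketch of ours; note that the stopping condition is equivalent to "continue while either indifference width exceeds ε"):

```python
def end_biased_schedule(p, eps):
    """Hypothesis pairs for each round of the end-biased test.
    Round i runs a rejection SPRT (H0: pi = p - rejIndLo, H1: pi = p + eps)
    and an acceptance SPRT (H0: pi = p - eps, H1: pi = p + accIndHi),
    halving both indifference widths after each round."""
    rej_lo, acc_hi = p / 2, (1 - p) / 2
    rounds = []
    while rej_lo > eps or acc_hi > eps:
        rounds.append({"reject": (p - rej_lo, p + eps),
                       "accept": (p - eps, p + acc_hi)})
        rej_lo /= 2
        acc_hi /= 2
    return rounds

sched = end_biased_schedule(p=0.8, eps=1e-5)
print(len(sched))           # number of rounds before the regions vanish
print(sched[0]["reject"])   # first rejection test: H0 at p/2 = 0.4
```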
The sequence of SPRT tests that are run is illustrated above in Figure 5-3. At each
iteration, the region of indifference shrinks, until it becomes vanishingly small and the
test terminates. Since a large initial region of indifference means that the first few tests
terminate quickly (but will only accept or reject large or small values of π), the test
is “end-biased”; that is, it is biased towards terminating early in those cases where π
is either small or large. For those values that are closer to p, more sub-tests and more
samples will be required—which is very different from classical tests such as the SPRT or
Neyman test, which try to optimize for the case when π is close to p± ε.
α, β, and the MHT problem. One thing that we have ignored thus far is how to
choose α′ and β′ (that is, the false negative and false positive rate of each individual SPRT
subtest) so that the overall, user-supplied α and β values are respected by the end-biased
test. This is a bit more difficult than it may seem to be at first glance: one significant
result of running a series of SPRTs is that it becomes imperative that we be very careful
not to accidentally accept or reject an object due to the fact that we are running multiple
hypothesis tests.
We begin our discussion by assuming that the limit on the number of pairs of tests
run is n; that is, there are n tests that can accept obj, and there are n tests that can reject
obj. We also note that in practice, the 2n tests are not run in sequence, but they are run
in parallel; this is done so that all of the tests can make use of the same samples, and thus
samples can be re-used and the total number of samples is minimized (see Algorithm 5-3
below and the accompanying discussion). Specifically, first we use obj.GetInstance () to
generate one sample from the underlying distribution, then we feed this sample to each
of the 2n tests. If any one of the n acceptance tests accepts the object, then the overall
end-biased test accepts the object; if any one of the n rejection tests rejects the object,
then the overall end-biased test rejects the object.
Given this setup, imagine that there is an object obj that should be accepted by the
end-biased test. We ask the question, “what is the probability β that we will falsely reject
obj?” This can be computed as:
β = Pr[∨_{i=1}^{n} (reject in rejection test i | no prior accept)]
In this expression, “no prior accept” means that no test for acceptance of obj terminated
with an accept before test i incorrectly rejected. We can then upper-bound β by simply
removing this clause:
β ≤ Pr[∨_{i=1}^{n} (reject in test i)]
The reason for this inequality is that by removing any restriction on the set of outcomes
accepted by the inner boolean expression, the probability that any event is accepted by
the expression can only increase. Furthermore, by Bonferroni’s inequality [112], we have:
β ≤ Pr[∨_{i=1}^{n} reject in test i] ≤ ∑_{i=1}^{n} Pr[reject in test i]
As a result, if we run each individual rejection test using a false reject rate of β′, we know
that:
β ≤ Pr[∨_{i=1}^{n} reject in test i] ≤ n × β′
Thus, by choosing β′ = β/n, we correctly bound the false negative rate of the overall
end-biased test. A similar argument holds for the false positive rate: by choosing a rate of
α′ = α/n for each individual test, we will correctly bound the overall false positive rate of
the test.
The Final Algorithm. Given all of these considerations, pseudo-code for the end-biased
test is given in Algorithm 5-3.
Algorithm 11 EndBiased (obj, pred, p, ε, α, β)
1: rejIndLo = p/2; accIndHi = (1 − p)/2
2: numTests = 0
3: while (p − rejIndLo < p − ε) or (p + accIndHi > p + ε) do /* first, count the number of tests */
4:   numTests++
5:   rejIndLo /= 2; accIndHi /= 2
6: end while
7: for i = 1 to numTests do /* now, set up the tests */
8:   rejSPRTs[i].Init (p − p × (1/2)^i, p + ε, α/numTests, β/numTests)
9:   accSPRTs[i].Init (p − ε, p + (1 − p) × (1/2)^i, α/numTests, β/numTests)
10: end for
11: while any test is still going do /* run them all */
12:   sam = pred(obj.GetInstance())
13:   for i = 1 to numTests do
14:     if rejSPRTs[i].AddSam (sam) == REJECT then
15:       return REJECT
16:     end if
17:     if accSPRTs[i].AddSam (sam) == ACCEPT then
18:       return ACCEPT
19:     end if
20:   end for
21: end while
22: return ACCEPT
This algorithm assumes two arrays of SPRTs, where the elements of each array
function just like the classic SPRT from Algorithm 5-3. The only difference is that the
various SPRTs are first initialized (via a call to Init) and then fed true/false results
one-at-a-time, via calls to AddSam()—that is, they do not operate independently. The array
rejSPRTs[] attempts to reject obj; the array accSPRTs[] attempts to accept obj.
For simplicity, in Algorithm 5-3, each sample is added to each and every SPRT in
turn. In practice, this can be implemented more efficiently in a way that produces a
statistically equivalent outcome. First, we run rejSPRTs[0] to completion; if this SPRT
does not reject, then accSPRTs[0] picks up where the first SPRT left off (using its final
count of accepted samples) and runs to completion. If this SPRT does not accept, then
rejSPRTs[1] picks up where the second one left off and also runs to completion. This is
repeated until any member of rejSPRTs[] rejects, any member of accSPRTs[] accepts, or
all SPRTs complete.
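The incremental Init()/AddSam() interface that the end-biased algorithm assumes can be sketched as a small Python class (our own illustration, reusing the constants a, b, constUp, and constDown derived earlier, generalized to arbitrary hypothesized values h0 < h1):

```python
import math
import random

class IncrementalSPRT:
    """An SPRT fed one boolean sample at a time, mirroring the
    Init()/AddSam() interface assumed by the end-biased algorithm.
    h0 < h1 are the hypothesized values of pi; add_sam() returns
    "accept", "reject", or None while still undecided."""
    def __init__(self, h0, h1, alpha, beta):
        a = math.log((1 - h0) / (1 - h1))
        b = math.log(h1 / h0) - math.log((1 - h1) / (1 - h0))
        self.mult = a / b
        self.const_up = math.log((1 - beta) / alpha) / b
        self.const_down = math.log(beta / (1 - alpha)) / b
        self.num_acc = 0.0
        self.tot = 0.0

    def add_sam(self, sam):
        if sam:
            self.num_acc += 1
        self.tot += self.mult
        if self.num_acc >= self.const_up + self.tot:
            return "accept"
        if self.num_acc <= self.const_down + self.tot:
            return "reject"
        return None

# A quick run: an object with true pi = 0.8, tested against H0: pi = 0.4
# versus H1: pi = 0.5, should be accepted.
random.seed(3)
test = IncrementalSPRT(h0=0.4, h1=0.5, alpha=0.01, beta=0.01)
res = None
while res is None:
    res = test.add_sam(random.random() < 0.8)
print(res)
```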
5.4 Indexing the End-Biased Test
The end-biased test can easily be used to answer a selection query over a database
table: apply the test to each database object, and add the object to the output set
if it is accepted. However, this can be costly if the underlying database is large. One
of the longest-studied problems in database systems is how to speed such selection
operations—particularly in the case of very selective queries—via indexing. Fortunately,
the end-biased test is amenable to indexing, which is the issue we consider in this section.
Specifically, we consider the problem of indexing for queries where the spatial location of
an object is represented by the user-defined sampling function GetInstance(), because
spatial and temporal data is one of the most obvious application areas for probabilistic
selection.
5.4.1 Overview
The basic idea behind our proposed indexing strategy is as follows:
1. First, during an off-line pre-computation phase, we obtain, from each database
object, a sequence of samples. Those samples (or at least a summary of the samples)
are stored within an index to facilitate fast evaluation of queries at a later time.

2. Then, when a user asks a query with a specific α, β, p, and a range predicate pred(),
the first step is to determine how many samples would need to be taken in order to
reject any given object by the first rejection SPRT in the end-biased test, if pred()
evaluated to false for each and every one of those samples. This quantity can be
computed as:

minSam = ⌊ log(β′/(1 − α′)) / log((1 − p − ε)/(1 − p + ε)) ⌋
Figure 5-4. Building the MBRs used to index the samples from the end-biased test.
3. Once minSam is obtained, the index is used to answer the question: “Which
database objects could possibly have one of the first minSam samples in its
pre-computed sequence accepted by pred()?” All such database objects are placed
within a candidate set C. All those objects not in C are implicitly rejected.

4. Finally, for each object within C, an end-biased test is run to see whether the object
is actually accepted by the query.
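Plugging representative numbers into the expression for minSam from step 2 shows how long a run of all-false samples is needed before the first rejection SPRT must reject (the helper name is ours):

```python
import math

def min_sam(p, eps, alpha_prime, beta_prime):
    """Number of consecutive pred() == false samples after which the
    first rejection SPRT is guaranteed to reject (the floor expression
    in step 2 above)."""
    return math.floor(math.log(beta_prime / (1 - alpha_prime)) /
                      math.log((1 - p - eps) / (1 - p + eps)))

# With eps = 1e-5, even modest per-test error rates demand a very long
# run of all-false samples before rejection is certain.
print(min_sam(p=0.5, eps=1e-5, alpha_prime=0.01, beta_prime=0.01))
```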
In the following few subsections, we discuss some of the details associated with each of
these steps.
5.4.2 Building the Index
The first issue we consider is how to construct the index for the pre-computed
samples. For each database object having d attributes that may be accessed by a spatial
range query, index construction results in a series of (d + 1)-dimensional MBRs (minimum
bounding rectangles) being inserted into a spatial index such as an R-Tree. Each MBR has
a lower and upper bound for each of the d probabilistic attributes to be indexed, as well as
a lower bound b′ and an upper bound b on a sample sequence number. Specifically, if an
MBR associated with object obj has a bound (b′, b) and rectangle R, this means the first b
pre-computed samples produced via calls to obj.GetInstance() all fell within R.
In addition, a pseudo-random seed value S is also stored along with the MBR.³ S is
the seed used to produce all obj.GetInstance() values starting at sample number
b′. Storing this pair is of key importance. As we describe subsequently, during query
evaluation S can be used to re-create some of the samples that were bounded within R.

Given this setup, to construct a series of MBRs for a given object, the following
method is used. For a given number of pre-computed samples m,⁴ we first store the pair
(S, b′) where S is the initial pseudo-random number seed and b′ = 1. We then use S to
obtain two initial samples and bound them using the rectangle R. After this initialization,
the following sequence of operations is repeated until m samples have been taken:
1. First, we obtain as many samples as are needed until a new sample is obtained that
cannot fit into R.

2. Let b be the current number of samples that have been obtained. Create a
(d + 1)-dimensional MBR using R along with the sequence number pair (b′, b − 1), and
insert this MBR along with the current S and the object identifier into the spatial
index.

3. Next, update R by expanding it so that it contains the new sample. Update S to be
the current random number seed, and set b′ = b.

4. Repeat from step (1).
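The construction loop above can be sketched in one dimension (a simplification of ours, with Python's generator state standing in for the stored seed S):

```python
import random

def build_mbrs(sample_fn, m, seed):
    """Build 1-D 'MBRs' (intervals) over a pseudo-random sample stream.
    Each emitted entry is ((lo, hi), (b_lo, b_hi), S): the interval, the
    sample-sequence range it covers, and the generator state S that can
    regenerate the stream starting at sample number b_lo."""
    rng = random.Random(seed)
    S = rng.getstate()                   # state that produces sample 1
    b_lo, lo, hi = 1, None, None
    mbrs = []
    for b in range(1, m + 1):
        state_before = rng.getstate()    # state that produces sample b
        x = sample_fn(rng)
        if lo is None:                   # first sample initializes R
            lo = hi = x
        elif not (lo <= x <= hi):        # sample escapes the current R
            if b > 2:                    # snapshot the old MBR first
                mbrs.append(((lo, hi), (b_lo, b - 1), S))
                S, b_lo = state_before, b
            lo, hi = min(lo, x), max(hi, x)
    mbrs.append(((lo, hi), (b_lo, m), S))   # final MBR covers the tail
    return mbrs

mbrs = build_mbrs(lambda r: r.random(), m=50, seed=42)
r = random.Random(42)
full = [r.random() for _ in range(50)]
# Replaying the stored state S regenerates the sample at position b_lo,
# which is what lets the query-time test "skip ahead".
(_, (b_lo, _), S) = mbrs[-1]
replay = random.Random()
replay.setstate(S)
print(replay.random() == full[b_lo - 1])   # True
```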
This process is illustrated pictorially above in Figure 5-4, for a series of one-dimensional
random values, up to b = 16. In this example, we begin by taking two samples during
initialization. We then keep sampling until the fifth sample, which is the first one
³ Since true randomness is difficult and expensive to create on a computer, virtually all applications using Monte Carlo methods make use of pseudo-random number generation [108]. To generate a pseudo-random number, a string of bits (called a seed) is first sent as an argument to a function that uses the bits to produce the next random value. As a side-effect of producing the new random value, the seed itself is updated. This updated seed value is then saved and used to produce the next random value at a later time.
⁴ m would typically be chosen to be just large enough so that, with any reasonable, user-supplied query parameters, it would always be possible to reject a database object where pred(obj.GetInstance()) evaluated to false m times in a row.
Figure 5-5. Using the index to speed the end-biased test
that does not fit into the initial MBR. This completes step (1) above. Then, a
two-dimensional MBR is created to bound the sample sequence range from 1 to 4, as
well as the set of pseudo-random values that have been observed. This MBR is inserted
(along with S) into the spatial index as MBR 1 (step (2)). Next, the fifth sample is used
to enlarge the MBR (step (3)). More samples are taken until it is found that the eighth
sample does not fit into the new MBR (back to step (1)). Then, MBR 2 is created to
cover the first seven samples as well as the sequence range from 5 to 7, and inserted into
the spatial index. The process is repeated until all m samples have been obtained. The
process can be summed up as follows: every time that a new sample forces the current
MBR to grow, a copy of the MBR is first inserted into the index, and then the MBR is
expanded to accommodate the sample.
5.4.3 Processing Queries
To use the resulting index to process a range query R encoded by the predicate
pred(), the minimum number of samples required to reject, minSam, is first computed as
described in Section 4.1. Then, a query Q is issued to the index searching for all MBRs
intersecting R as well as the sample sequence range from 1 to minSam. Due to the way
that the MBRs inserted into the index were constructed, we know that any database
object obj that does not intersect Q can immediately (and implicitly) be rejected, because
the MBR covering the first minSam samples from obj.GetInstance() did not intersect R.
However, we must still deal with those objects that did have an MBR intersecting
Q. For those objects, we run a modified version of the end-biased test that “skips ahead”
as far as possible in the sample sequence—the details of this modified test are not too
hard to imagine, and are left out for brevity. For a given intersecting object, we find
the MBR with the lowest sample sequence range that intersected Q. For example,
consider Figure 5-5; in this example, the MBRs from Object 1 intersect Q. We choose the
“earlier” of the two MBRs, which is the MBR covering the sample sequence range from
6 through 9. Let b′ be the low sample sequence value associated with this MBR, and let
S be the pseudo-random seed value associated with it. To run the modified end-biased
test, we use Algorithm 5-3, as well as the fact that none of the first b′ − 1 samples from
obj.GetInstance() could have been accepted by pred(). Thus, we initialize each of the 2n
SPRTs with b′ − 1 false samples, and start execution at the b′th sample. In this way, we
skip immediately to the first sample sequence number that was likely accepted by pred().
5.5 Experiments
In this section, we experimentally evaluate our proposed algorithms. Specific
questions we wish to answer are:
1. In a realistic environment where a selection predicate must be run over millions of
database objects, how well do standard methods from statistics perform?

2. Can our end-biased test improve upon methods from the statistical literature?

3. Does the proposed indexing framework effectively speed application of the end-biased
test?
Experimental Setup. In each of our experiments, we consider a simple, synthetic
database, which allows us to easily test the sensitivity of the various algorithms to
different data and query characteristics. This database consists of two-dimensional
Gaussians spread randomly throughout a field. For a number of different (query, database)
combinations, we measure the wall-clock execution time required to discover all of the
database objects that fall in a given, rectangular box with probability p. Since in all cases
the database size is large relative to the query result size, the false positive rate we allow is
generally much lower than the false negative rate. The reason is that a false positive rate
of 1% over a 10-million object database could theoretically result in 1 × 10⁵ false positives.
Thus, in all of our queries, we use a false positive rate of (number of database objects)⁻¹ and
a false negative rate of 10⁻², since the average result size is quite small, so a relatively high
false drop rate is acceptable. The ε value we use in our experiments is 10⁻⁵.
Given this setup, there are four different parameters that we vary to measure the
effect on query execution time:
1. dbSize: the size of the database, in terms of the number of objects.
2. stdDev: the standard deviation of an average Gaussian stored in the database,
along each dimension. Since this controls the spread of each object’s PDF, if the
value is small, then effectively the database objects are all very far apart from each
other. As the value grows, the objects effectively get closer to one another, until they
eventually overlap.
3. qSize: this is the size of the query box.
4. p: this is the user-supplied p value that is used to accept or reject objects.
We run four separate tests:
1. In the first test, stdDev is fixed at 10% of the width of the field, qSize along each
axis is fixed at 3% of the width of the field, and p is set at 0.8. Thus, many database
objects intersect each query, but likely none are accepted. dbSize is varied from 10⁶
to 3 × 10⁶ to 5 × 10⁶ to 10⁷.

2. In the second test, dbSize is fixed at 10⁷. stdDev is fixed at 1%, and p is 0.95.
qSize is varied from 0.3% to 1% to 3% to 10% along each axis. In the first case,
most database objects intersecting the query region are accepted; in the latter, none
are, since the object’s spread is much greater than the query region.

3. In the third test, dbSize is fixed at 10⁷, qSize is 3%, p = 0.8, and stdDev is varied
from 1% to 3% to 10%.

4. In the final test, dbSize is 10⁷, qSize is 3%, stdDev is 10%, and p is varied from 0.8
to 0.9 to 0.95. The first case is particularly difficult because while very few objects
are accepted, the spread of each object is so great that most are candidates for
acceptance.

Table 5-1. Running times over varying database sizes.
Method       10⁶       3 × 10⁶    5 × 10⁶    10⁷
SPRT         568 sec   1700 sec   2824 sec   5653 sec
Opt          2656 sec  8517 sec   14091 sec  26544 sec
End-biased   9 sec     24 sec     38 sec     76 sec
Indexed      1 sec     3 sec      7 sec      15 sec

Table 5-2. Running times over varying query sizes.
Method       0.3%      1%         3%         10%
SPRT         1423 sec  1420 sec   1427 sec   3265 sec
End-biased   76 sec    75 sec     75 sec     430 sec
Indexed      11 sec    4 sec      4 sec      962 sec

Table 5-3. Running times over varying object standard deviations.
Method       1%        3%         10%
SPRT         5734 sec  5608 sec   5690 sec
End-biased   116 sec   75 sec     75 sec
Indexed      107 sec   12 sec     15 sec

Table 5-4. Running times over varying confidence levels.
Method       0.8       0.9        0.95
SPRT         5672 sec  2869 sec   1436 sec
End-biased   75 sec    75 sec     75 sec
Indexed      14 sec    12 sec     13 sec
Each test is run several times, and results are averaged across all runs.
Methods Tested. For each of the above tests, we test four methods: the SPRT, an
alternative sequential test that is approximately, asymptotically optimal [113], the
end-biased test via sequential scan, and the end-biased test via indexing. In practice, we
found the optimal test to be so slow that it was only used for the first set of tests.
Results. The results are given in Tables 5-1 through 5-4. All times are wall-clock running
times in seconds. The raw data files for a database of size 10⁷ required about 500MB of
storage. The indexed, pre-sampled version of this data file requires around 7GB to store in
its entirety if 500 samples are used.
Discussion.
There are several interesting findings. First and foremost is the terrible relative
performance of the “optimal” sequential test, which was generally about five times slower
than Wald’s classic SPRT. The results are so bad that we removed this option after
the first set of experiments. Since we were quite curious about the poor performance,
we ran some additional, exploratory experiments using the optimal test and found
that it can be better than the SPRT, particularly in cases where H0 and H1 were far
apart. Unfortunately, in our application ε is chosen to be quite small and under such
circumstances the optimal test is quite useless. The poor results associated with this test
illustrate quite strongly how asymptotic statistical theory is often quite far away from
practice.
On the other hand, the end-biased test always far outperformed the SPRT, sometimes
by almost two orders of magnitude. This is perhaps not surprising given the fact that,
by design, the end-biased test can quickly reject the multitude of objects where π ≈ 0.
The spread between the two tests was particularly significant for the third set of
experiments, which tests the effect of the object standard deviation (or size) on the
running time. It was interesting that the end-biased test performed better with higher
standard deviation—as the object size increases, fewer objects are accepted by the query
box, which cannot encompass enough of the PDF. The end-biased test appears to be
particularly adept at rejecting such objects quickly. However, the SPRT performance
seemed to be invariant to the size of the object.
Another interesting finding was the sensitivity of the SPRT to the p parameter.
For lower confidence, the test was far more expensive. This is because as the confidence
is lowered, the actual probability that an object is in the query box gets closer to the
user-defined input parameter. As this happens, the SPRT has a harder time rejecting the
object.
The results regarding the index were informative as well. It is not surprising that
the indexed method was almost always the fastest of the four. For the 10 million record
database, it seems that the standard end-biased test bottoms out at a sequential scan
plus processing time of about 70 seconds. However, the indexed, end-biased method is able
to cut this baseline cost down to under ten seconds for the same database size—though
the time taken from query to query tended to vary a lot more for the index than for the other
methods.
It is interesting that in the cases where the regular end-biased test becomes more
expensive than its 70-second baseline (for example, consider the first column in Table 5-3),
the indexed version also suffers to almost the same extent. This is not surprising. The
reason for an increased cost for the un-indexed version is that a large number of objects
were encountered that required many samples. Perhaps there were even a few objects
that required an extreme number of samples—numbers in the billions happen occasionally
when π is very close to p. The indexed version is no better than the un-indexed version
in this case; it cannot dismiss such an expensive object outright using the index, and the
few, pre-computed, indexed samples it has access to are useless if millions of samples are
eventually required to accept or reject the object.
Perhaps the most interesting case with respect to the index is the fourth column of
Table 5-2, where the indexed end-biased test actually doubles the running time of the
un-indexed version. The explanation here seems to be that this particular query has
the largest result set size. The size is so large, in fact, that use of the index induces an
overhead and actually slows query evaluation—a phenomenon that is always possible in
database indexing.
5.6 Related Work
Since the SPRT was first proposed by Wald in 1947 [111], sequential statistics
have been widely studied. Wald’s original SPRT is proven to be optimal for values
lying exactly at H0 or H1; in other cases, it may perform poorly. Kiefer and Weiss
first raised the question of developing tests that are provably optimal at points other
than those covered by H0 and H1 [114]. However, in the general case, this problem has
not been solved, though there has been some success in solving the asymptotic case
where (α = β) → 0. Such a solution was first proposed by Lorden in 1976 [115] where
he showed that Kiefer-Weiss optimality can be achieved asymptotically under certain
conditions. Well-known follow-up work is due to Eisenberg [116], Huffman [113], and
Pavlov [117]. Work in this area continues to this day. However, a reasonable criticism
of much of this work is its focus on asymptotic optimality—particularly its focus on
applications having vanishingly small (and equal) values of α and β. It is unclear how
well such asymptotically optimal tests perform in practice, and the statistical research
literature provides surprisingly little in the way of guidance, which was our motivation
for undertaking this line of work. In our particular application, α and β are not equal (in
fact, they will most often differ by orders of magnitude), and it is unclear to us whether
practical and clearly superior alternatives to Wald’s original proposal exist in such a case.
In contrast to related work in pure and applied statistics, we seek no notion of optimality;
our goal is to design a test that is (a) correct, and (b) experimentally proven to work
well in the case where ε is vanishingly small, and yet more often than not, the “true”
probability of object inclusion is either zero or one.
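To make the sequential idea concrete, the following is a minimal sketch of Wald's classical SPRT [111] for a Bernoulli proportion, the test from which the end-biased procedure departs. The function name, parameter values, and the cutoff of one million samples are our own illustrative choices, not part of the dissertation's proposal:

```python
import math
import random

def wald_sprt(sample, p0, p1, alpha, beta, max_steps=1_000_000):
    """Wald's SPRT for H0: p = p0 vs. H1: p = p1 over Bernoulli draws.

    `sample` is a zero-argument callable returning 0 or 1.
    Returns "H0", "H1", or "undecided" if max_steps is exhausted.
    """
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    llr = 0.0                                # running log-likelihood ratio
    for _ in range(max_steps):
        if sample():
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr <= lower:
            return "H0"
        if llr >= upper:
            return "H1"
    return "undecided"

random.seed(42)
# Illustrative run: true inclusion probability 0.9, tested with
# p0 = 0.5 vs. p1 = 0.8 and asymmetric error rates, as in our setting.
decision = wald_sprt(lambda: random.random() < 0.9,
                     p0=0.5, p1=0.8, alpha=0.01, beta=0.001)
```

The two thresholds are Wald's standard approximations; they roughly bound the realized false positive and false negative rates by the nominal α and β, and the test uses constant memory since only the running log-likelihood ratio is retained.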
In the database literature, the paper that is closest to our indexing proposal is due to
Tao et al. [118]. They consider the problem of indexing spatial data where the position is
defined by a PDF. However, they assume that the PDF is everywhere finite and integrable.
The assumption of finiteness may be reasonable for many applications (since many
distributions, such as the Gaussian, fall off exponentially in density as distance from the
mean increases). However, integrability is a strong assumption, precluding, for example,
many distributions resulting from Bayesian inference [62] that can only be sampled from
using MCMC methods [107].
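The phrase "can only be sampled from using MCMC methods" can be illustrated with a minimal random-walk Metropolis sketch: the target density need only be evaluable up to a normalizing constant, so there is nothing to integrate (or index) directly. The function and the target below are hypothetical illustrations, not drawn from [107]:

```python
import math
import random

def metropolis_hastings(log_density, x0, steps, step_size=1.0):
    """Random-walk Metropolis sampler for a 1-D target known only up to
    a constant, e.g. a Bayesian posterior with no closed-form integral."""
    x, samples = x0, []
    for _ in range(steps):
        proposal = x + random.gauss(0.0, step_size)
        delta = log_density(proposal) - log_density(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if random.random() < math.exp(min(0.0, delta)):
            x = proposal
        samples.append(x)
    return samples

random.seed(0)
# Hypothetical target: a standard normal, supplied only via its
# unnormalized log-density.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, steps=20000)
mean = sum(draws) / len(draws)
```

An index that presumes closed-form integration of the PDF cannot be built over such a distribution; all one can do is draw samples like these, which is exactly the regime our Monte Carlo test targets.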
Most of the work in probabilistic databases is at least tangentially related to our own,
in that our goal is to represent uncertainty. We point the reader to Dalvi and Suciu for a
general treatment of the topic [119]. The paper most closely related to this work is due to
Jampani et al. [109] who propose a data model for uncertain data that is fundamentally
based upon Monte Carlo.
5.7 Summary
In this chapter, we have considered the problem of using Monte Carlo methods to
test which objects in a probabilistic database are accepted by a query predicate. Our two
primary contributions are (1) a new sequential test of whether or not an object is
accepted by the query predicate, one that strictly controls both the false positive and
false negative rates, and (2) an indexing methodology for the test.
The test was found to work quite well in practice, and the index is very successful in
speeding the test's application.
We close the chapter by pointing out that our goal was not to make a contribution to
statistical theory, and arguably, we have not! Most of the relevant statistical literature is
concerned with various definitions of optimality, and while our new test is correct, there
is no sense in which it is optimal. However, we believe that our new test is of practical
significance to the implementation of probabilistic databases. The experimental evidence
that it works well is strong, and there is also strong intuition behind the design of the
test. In practice, the new test outperforms an oft-cited “optimal” test from the relevant
statistical literature for database selection problems, and while a more appropriate test
may exist, we are unaware of a more suitable candidate for solving the problem at hand.
CHAPTER 6
CONCLUDING REMARKS
With the increasing popularity of tracking devices, and the decreasing cost of storage,
large spatiotemporal data collections are becoming increasingly commonplace.
Extending current database systems to support such collections requires the development
of new solutions. The work presented in this study represents a small step in that direction.
The main theme of this work was the development of scalable and efficient algorithms for
processing historical spatiotemporal data, particularly in a warehouse context. While
this work addresses some important research issues, it also opens new avenues for future
research. Some potential directions for further development include:
• The CPA-join focused on historical queries over two spatiotemporal relations. Extending the work to support predictive queries would be an interesting exercise. Unlike historical queries that span long time intervals, predictive queries are often interested in short time intervals. This could make the use of indexes potentially more attractive.
• The version of entity resolution considered in this work assumed simple binary sensors that provide limited information about the tracked objects. This limits the ability of the algorithms to discriminate between closely moving objects. The accuracy could be improved, however, by considering sensors that provide a richer feature set (such as sensors that provide additional color information). This would give the algorithms an additional dimension along which to differentiate the observation clouds.
• Finally, we focused only on answering probabilistic spatiotemporal selection queries using the end-biased test. However, statistical hypothesis testing, on which the end-biased test is built, is a basic technique used in many fields of science and engineering. Hence, the end-biased algorithm proposed in this work has potentially broad applicability beyond probabilistic databases.
REFERENCES
[1] R. Guting and M. Schneider, Moving Object Databases, Morgan Kaufmann, 2005.
[2] D. Papadias, D. Zhang, and G. Kollios, Advances in Spatial and Spatio-Temporal Data Management, Springer-Verlag, 2007.
[3] J. Schiller and A. Voisard, Location-Based Services, Morgan Kaufmann, 2004.
[4] Y. Zhang and O. Wolfson, Satellite-Based Information Services, Kluwer Academic Publishers, 2002.
[5] W. I. Grosky, A. Kansal, S. Nath, J. Liu, and F. Zhao, “SenseWeb: An infrastructure for shared sensing,” in IEEE Multimedia, 2007.
[6] Cover, “Mandate for change,” RFID Journal, 2004.
[7] G. Abdulla, T. Critchlow, and W. Arrighi, “Simulation data as data streams,” SIGMOD Record 33(1):89–94, 2004.
[8] N. Pelekis, B. Theodoulidis, I. Kopanakis, and Y. Theodoridis, “Literature review of spatio-temporal database models,” in The Knowledge Engineering Review, 2004.
[9] A. P. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, “Modeling and querying moving objects,” in ICDE, 1997.
[10] M. Erwig, R. Guting, M. Schneider, and M. Vazirgiannis, “A foundation for representing and querying moving objects,” in TODS, 2000.
[11] L. Forlizzi, R. H. Guting, E. Nardelli, and M. Schneider, “A data model and data structures for moving objects databases,” in SIGMOD, 2000.
[12] C. Parent, S. Spaccapietra, and E. Zimanyi, “Spatio-temporal conceptual models: Data structures + space + time,” in GIS, 1999.
[13] N. Tryfona, R. Price, and C. S. Jensen, “Conceptual models for spatio-temporal applications,” in The CHOROCHRONOS Approach, 2002.
[14] E. Tossebro, “Representing uncertainty in spatial and spatiotemporal databases,” PhD thesis, 2002.
[15] M. Erwig and M. Schneider, “STQL: A spatio-temporal query language,” in Mining Spatio-Temporal Information Systems, 2002.
[16] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984.
[17] S. Saltenis, C. Jensen, S. Leutenegger, and M. Lopez, “Indexing the positions of continuously moving objects,” in SIGMOD, 2000.
[18] P. Chakka, A. Everspaugh, and J. Patel, “Indexing large trajectory data sets with SETI,” in CIDR, 2003.
[19] J. Patel, Y. Chen, and P. Chakka, “STRIPES: An efficient index for predicted trajectories,” in SIGMOD, 2004.
[20] Y. Theodoridis, “Spatio-temporal indexing for large multimedia applications,” in IEEE Int’l Conference on Multimedia Computing and Systems, 1996.
[21] D. Pfoser, C. S. Jensen, and Y. Theodoridis, “Novel approaches to the indexing of moving object trajectories,” in VLDB, 2000.
[22] T. Tzouramanis, M. Vassilakopoulos, and Y. Manolopoulos, “Overlapping linear quadtrees: A spatio-temporal access method,” in Advances in GIS, 1998.
[23] G. Kollios, D. Gunopulos, and V. J. Tsotras, “Nearest neighbor queries in a mobile environment,” in Spatio-Temporal Database Management, 1999.
[24] Z. Song and N. Roussopoulos, “K-nearest neighbor search for moving query point,” in Symp. on Spatial and Temporal Databases, 2001.
[25] Z. Huang, H. Lu, B. Ooi, and A. Tung, “Continuous skyline queries for moving objects,” in TKDE, 2006.
[26] G. Kollios, D. Gunopulos, and V. J. Tsotras, “An improved R-tree indexing for temporal spatial databases,” in SDH, 1990.
[27] Y. Tao and D. Papadias, “MV3R-tree: A spatio-temporal access method for timestamp and interval queries,” in VLDB, 2001.
[28] M. A. Nascimento and J. R. O. Silva, “Towards historical R-trees,” in ACM SAC, 1998.
[29] G. Iwerks, H. Samet, and K. P. Smith, “Maintenance of spatial semijoin queries on moving points,” in VLDB, 2004.
[30] S. Arumugam and C. Jermaine, “Closest-point-of-approach join over moving object histories,” in ICDE, 2006.
[31] Y. Choi and C. Chung, “Selectivity estimation for spatio-temporal queries to moving objects,” in SIGMOD, 2002.
[32] M. Schneider, “Evaluation of spatio-temporal predicates on moving objects,” in ICDE, 2005.
[33] Y. Tao, J. Sun, D. Papadias, and G. Kollios, “Analysis of predictive spatio-temporal queries,” in TODS, 2003.
[34] J. Sun, Y. Tao, D. Papadias, and G. Kollios, “Spatio-temporal join selectivity,” in Information Systems, 2006.
[35] M. Vlachos, G. Kollios, and D. Gunopulos, “Discovering similar multidimensional trajectories,” in ICDE, 2002.
[36] J. Kubica, A. Moore, A. Connolly, and R. Jedicke, “A multiple tree algorithm for the efficient association of asteroid observations,” in KDD, 2005.
[37] S. Gaffney and P. Smyth, “Trajectory clustering with mixtures of regression models,” in KDD, 1999.
[38] Y. Li, J. Han, and J. Yang, “Clustering moving objects,” in KDD, 2004.
[39] J. Lee, J. Han, and K. Whang, “Trajectory clustering: A partition-and-group framework,” in SIGMOD, 2007.
[40] D. Guo, J. Chen, A. MacEachren, and K. Liao, “A visualization system for space-time and multivariate patterns,” in IEEE Transactions on Visualization and Computer Graphics, 2006.
[41] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, “Indexing spatio-temporal data warehouses,” in ICDE, 2002.
[42] N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. Cheung, “Mining, indexing, and querying historical spatiotemporal data,” in KDD, 2004.
[43] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias, “Spatio-temporal aggregation using sketches,” in ICDE, 2004.
[44] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, “Historical spatio-temporal aggregation,” in Trans. of Information Systems, 2005.
[45] T. Brinkhoff, H. P. Kriegel, and B. Seeger, “Efficient processing of spatial joins using R-trees,” in SIGMOD, 1993.
[46] Y. W. Huang, N. Jing, and E. A. Rundensteiner, “Spatial joins using R-trees: Breadth-first traversal with global optimizations,” in VLDB, 1997.
[47] M. Lo and C. V. Ravishankar, “Spatial hash joins,” in SIGMOD, 1996.
[48] J. Patel and D. DeWitt, “Partition based spatial-merge join,” in SIGMOD, 1996.
[49] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter, “Scalable sweeping-based spatial join,” in VLDB, 1998.
[50] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, Springer-Verlag, 2000.
[51] S. H. Jeong, N. W. Paton, A. Fernandes, and T. Griffiths, “An experimental performance evaluation of spatio-temporal join strategies,” in Transactions in GIS, 2004.
[52] W. Winkler, “Matching and record linkage,” in Business Survey Methods, 1995.
[53] M. Hernandez and S. Stolfo, “The merge/purge problem for large databases,” in SIGMOD, 1995.
[54] A. Monge and C. Elkan, “The field matching problem: Algorithms and applications,” in KDD, 1996.
[55] W. Cohen and J. Richman, “Learning to match and cluster large high-dimensional data sets for data integration,” in KDD, 2002.
[56] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association, Academic Press, 1988.
[57] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications, Artech House Publishers, 2004.
[58] D. B. Reid, “An algorithm for tracking multiple targets,” in IEEE Trans. Automat. Control, 1979.
[59] X. Li, “The PDF of nearest neighbor measurement and a probabilistic nearest neighbor filter for tracking in clutter,” in IEEE Control and Decision Conference, 1993.
[60] I. Cox and S. L. Hingorani, “An efficient implementation of Reid’s multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking,” in Intl. Conf. on Pattern Recognition, 1994.
[61] T. Song, D. Lee, and J. Ryu, “A probabilistic nearest neighbor filter algorithm for tracking in a clutter environment,” in Signal Processing, Elsevier Science, 2005.
[62] A. O’Hagan and J. J. Forster, Bayesian Inference, Volume 2B of Kendall’s Advanced Theory of Statistics, Arnold, second edition, 2004.
[63] A. Doucet, C. Andrieu, and S. Godsill, “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statistics and Computing, vol. 10, pp. 197–208, 2000.
[64] D. Fox, J. Hightower, L. Liao, D. Schulz, and G. Borriello, “Bayesian filtering for location estimation,” in IEEE Pervasive Computing, 2003.
[65] Z. Khan, T. Balch, and F. Dellaert, “An MCMC-based particle filter for multiple interacting targets,” in ECCV, 2004.
[66] S. Oh, S. Russell, and S. Sastry, “Markov chain Monte Carlo data association for general multiple-target tracking problems,” in IEEE Conf. on Decision and Control, 2004.
[67] O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, and G. Mendez, “Cost and imprecision in modeling the precision of moving objects,” in ICDE, 1998.
[68] D. Pfoser, “Capturing the uncertainty of moving objects,” in LNCS, 1999.
[69] J. H. Hosbond, S. Saltenis, and R. Ortfort, “Indexing uncertainty of continuously moving objects,” in IDEAS, 2003.
[70] G. Trajcevski, O. Wolfson, K. Hinrichs, and S. Chamberlain, “Managing uncertainty in moving object databases,” in TODS, 2004.
[71] R. Cheng, D. Kalashnikov, and S. Prabhakar, “Querying imprecise data in moving object environments,” in TKDE, 2004.
[72] Y. Tao, R. Cheng, and X. Xiao, “Indexing multidimensional uncertain data with arbitrary probability density functions,” in VLDB, 2005.
[73] H. Mokhtar and J. Su, “Universal trajectory queries on moving object databases,” in Mobile Data Management, 2004.
[74] D. Eberly, 3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics, Morgan Kaufmann, 2001.
[75] M. Mokbel, X. Xiong, and W. Aref, “SINA: Scalable incremental processing of continuous queries in spatio-temporal databases,” in SIGMOD, 2004.
[76] Y. Tao, “Time-parametrized queries in spatio-temporal databases,” in SIGMOD, 2004.
[77] S. Saltenis and C. Jensen, “Indexing of moving objects for location-based services,” in ICDE, 2002.
[78] Y. Tao, D. Papadias, and J. Sun, “The TPR*-tree: An optimized spatio-temporal access method for predictive queries,” in VLDB, 2003.
[79] O. Gunther, “Efficient computation of spatial joins,” in ICDE, 1993.
[80] S. Leutenegger and J. Edgington, “STR: A simple and efficient algorithm for R-tree packing,” in 13th Intl. Conf. on Data Engineering (ICDE), 1997.
[81] D. Mehta and S. Sahni, Handbook of Data Structures and Applications, Chapman and Hall, 2004.
[82] P. J. Haas and J. M. Hellerstein, “Ripple joins for online aggregation,” in SIGMOD, 1999.
[83] M. Nascimento and J. Silva, “Evaluation of access structures for discretely moving points,” in Int’l Workshop on Spatio-Temporal Database Management, 1999.
[84] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood estimation from incomplete data via the EM algorithm,” in Journ. Royal Statistical Society, 1977.
[85] J. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” Technical Report, Univ. of Berkeley, 1997.
[86] J. Banfield and A. Raftery, “Model-based Gaussian and non-Gaussian clustering,” in Biometrics, 1993.
[87] J. Oliver, R. Baxter, and C. Wallace, “Unsupervised learning using MML,” in ICML, 1996.
[88] M. Hansen and B. Yu, “Model selection and the principle of minimum description length,” in Journal of the American Statistical Association, 1998.
[89] M. Figueiredo and A. Jain, “Unsupervised learning of finite mixture models,” in IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.
[90] R. Baxter, “Minimum message length inference: Theory and applications,” PhD thesis, 1996.
[91] G. Celeux, S. Chretien, F. Forbes, and A. Mkhadri, “A component-wise EM algorithm for mixtures,” in Journ. of Computational and Graphical Statistics, 1999.
[92] D. Pfoser and C. Jensen, “Trajectory indexing using movement constraints,” in GeoInformatica, 2005.
[93] Y. Cai and R. Ng, “Indexing spatio-temporal trajectories with Chebyshev polynomials,” in SIGMOD, 2004.
[94] S. Rasetic and J. Sander, “A trajectory splitting model for efficient spatio-temporal indexing,” in VLDB, 2005.
[95] D. Chudova, S. Gaffney, E. Mjolsness, and P. Smyth, “Translation-invariant mixture models for curve clustering,” in KDD, 2003.
[96] H. Kriegel and M. Pfeifle, “Density-based clustering of uncertain data,” in KDD, 2005.
[97] J. L. Bentley, “K-d trees for semidynamic point sets,” in Annual Symposium on Computational Geometry, 1990.
[98] L. Frenkel and M. Feder, “Recursive expectation maximization algorithms for time-varying parameters with applications to multiple target tracking,” in IEEE Trans. Signal Processing, 1999.
[99] P. Chung, J. Bohme, and A. Hero, “Tracking of multiple moving sources using recursive EM algorithm,” in EURASIP Journal on Applied Signal Processing, 2005.
[100] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom, “Trio: A system for data, uncertainty, and lineage,” in VLDB, 2006.
[101] P. Andritsos, A. Fuxman, and R. J. Miller, “Clean answers over dirty databases: A probabilistic approach,” in ICDE, 2006, p. 30.
[102] L. Antova, C. Koch, and D. Olteanu, “MayBMS: Managing incomplete information with probabilistic world-set decompositions,” in ICDE, 2007, pp. 1479–1480.
[103] R. Cheng, S. Singh, and S. Prabhakar, “U-DBMS: A database system for managing constantly-evolving data,” in VLDB, 2005, pp. 1271–1274.
[104] N. N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” VLDB J., vol. 16, no. 4, pp. 523–544, 2007.
[105] N. Fuhr and T. Rolleke, “A probabilistic relational algebra for the integration of information retrieval and database systems,” ACM Trans. Inf. Syst., vol. 15, no. 1, pp. 32–66, 1997.
[106] R. Gupta and S. Sarawagi, “Creating probabilistic databases from information extraction models,” in VLDB, 2006, pp. 965–976.
[107] C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, second edition, 2004.
[108] J. E. Gentle, Random Number Generation and Monte Carlo Methods, Springer, second edition, 2003.
[109] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas, “MCDB: A Monte Carlo approach to managing uncertain data,” in SIGMOD, 2008.
[110] J. Neyman and E. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Phil. Trans. of the Royal Soc. of London, Series A, vol. 231, pp. 289–337, 1933.
[111] A. Wald, Sequential Analysis, Wiley, 1947.
[112] J. Galambos and I. Simonelli, Bonferroni-Type Inequalities with Applications, Springer-Verlag, 1996.
[113] M. Huffman, “An efficient approximate solution to the Kiefer-Weiss problem,” in The Annals of Statistics, 1983, vol. 11, pp. 306–316.
[114] J. Kiefer and L. Weiss, “Some properties of generalized sequential probability ratio tests,” in The Annals of Mathematical Statistics, 1957, vol. 28, pp. 57–74.
[115] G. Lorden, “2-SPRTs and the modified Kiefer-Weiss problem of minimizing an expected sample size,” in The Annals of Statistics, 1976, vol. 4, pp. 281–291.
[116] B. Eisenberg, “The asymptotic solution to the Kiefer-Weiss problem,” in Comm. Statistics C-Sequential Analysis, 1982, vol. 1, pp. 81–88.
[117] I. Pavlov, “Sequential procedure of testing composite hypotheses with application to the Kiefer-Weiss problem,” in Theory of Probability and Its Applications, 1991, vol. 35, pp. 280–292.
[118] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in VLDB, 2005, pp. 922–933.
[119] N. Dalvi and D. Suciu, “Management of probabilistic data: Foundations and challenges,” in PODS, 2007, pp. 1–12.
BIOGRAPHICAL SKETCH
Subramanian Arumugam is a member of the query processing team at the database
startup Greenplum. He is a recipient of the 2007 ACM SIGMOD Best Paper Award.
He received his bachelor's degree from the University of Madras in 2000. He obtained his
master's degree in computer engineering in 2003, and his PhD in computer engineering in
2008, both from the University of Florida.