EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
SUBRAMANIAN ARUMUGAM
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Subramanian Arumugam
To my parents.
ACKNOWLEDGMENTS
First of all, I would like to thank my advisor, Chris Jermaine. This dissertation would
not have been possible without his excellent mentoring and guidance through the years.
Chris is a terrific teacher, a critical thinker, and a passionate researcher. He has served
as a great role model and has helped me mature as a researcher. I cannot thank him
enough for that.
My thanks also go to Prof. Alin Dobra. Through the years, Alin has been a patient
listener and has helped me structure and refine my ideas countless times. His excitement
for research is contagious!
I would like to take this opportunity to mention my colleagues at the database
center: Amit, Florin, Fei, Luis, Mingxi and Ravi. I have had many hours of fun discussing
interesting problems with them. Special thanks go to my friends Manas, Srijit, Arun,
Shantanu, and Seema for making my stay in Gainesville all the more enjoyable.
Finally, I would like to thank my parents for being a source of constant support and
encouragement throughout my studies.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Research Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    1.2.1 Data Modeling and Database Design . . . . . . . . . . . . . . . . . 15
    1.2.2 Access Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    1.2.4 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    1.2.5 Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
    1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories . . . 18
    1.3.2 Entity Resolution in Spatiotemporal Databases . . . . . . . . . . . . 19
    1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases . . . 19
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Spatiotemporal Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Probabilistic Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES . 25
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.1 Moving Object Trajectories . . . . . . . . . . . . . . . . . . . . . . 27
    3.2.2 Closest Point of Approach (CPA) Problem . . . . . . . . . . . . . . 28
3.3 Join Using Indexing Structures . . . . . . . . . . . . . . . . . . . . . . . 30
    3.3.1 Trajectory Index Structures . . . . . . . . . . . . . . . . . . . . . 31
    3.3.2 R-tree Based CPA Join . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Join Using Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . 36
    3.4.1 Basic CPA Join using Plane-Sweeping . . . . . . . . . . . . . . . . 36
    3.4.2 Problem With The Basic Approach . . . . . . . . . . . . . . . . . . 37
    3.4.3 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Adaptive Plane-Sweeping . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.5.2 Cost Associated With a Given Granularity . . . . . . . . . . . . . . 41
    3.5.3 The Basic Adaptive Plane-Sweep . . . . . . . . . . . . . . . . . . . 41
    3.5.4 Estimating Costα . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
    3.5.5 Determining The Best Cost . . . . . . . . . . . . . . . . . . . . . . 44
    3.5.6 Speeding Up the Estimation . . . . . . . . . . . . . . . . . . . . . 46
    3.5.7 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.1 Test Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    3.6.2 Methodology and Results . . . . . . . . . . . . . . . . . . . . . . . 50
    3.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES . . . . . . . . 58
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Outline of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
    4.3.1 PDF for Restricted Motion . . . . . . . . . . . . . . . . . . . . . . 64
    4.3.2 PDF for Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . 65
4.4 Learning the Restricted Model . . . . . . . . . . . . . . . . . . . . . . . 66
    4.4.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 67
    4.4.2 Learning K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Learning Unrestricted Motion . . . . . . . . . . . . . . . . . . . . . . . . 71
    4.5.1 Applying a Particle Filter . . . . . . . . . . . . . . . . . . . . . . 72
    4.5.2 Handling Multiple Objects . . . . . . . . . . . . . . . . . . . . . . 73
    4.5.3 Update Strategy for a Sample given Multiple Objects . . . . . . . . 75
    4.5.4 Speeding Things Up . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS . . 84
5.1 Problem and Background . . . . . . . . . . . . . . . . . . . . . . . . . . 86
    5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.2 The False Positive Problem . . . . . . . . . . . . . . . . . . . . . . 87
    5.1.3 The False Negative Problem . . . . . . . . . . . . . . . . . . . . . 90
5.2 The Sequential Probability Ratio Test (SPRT) . . . . . . . . . . . . . . . 91
5.3 The End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
    5.3.1 What’s Wrong With the SPRT? . . . . . . . . . . . . . . . . . . . . 95
    5.3.2 Removing the Magic Epsilon . . . . . . . . . . . . . . . . . . . . . 96
    5.3.3 The End-Biased Algorithm . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Indexing the End-Biased Test . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
    5.4.2 Building the Index . . . . . . . . . . . . . . . . . . . . . . . . . . 104
    5.4.3 Processing Queries . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF TABLES
Table page
4-1 Varying the number of objects and its effect on recall, precision and runtime. . . 80
4-2 Varying the number of time ticks. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-3 Varying the number of sensors fired. . . . . . . . . . . . . . . . . . . . . . . . . 80
4-4 Varying the standard deviation of the Gaussian cloud. . . . . . . . . . . . . . . 80
4-5 Varying the number of time ticks where EM is applied. . . . . . . . . . . . . . . 81
5-1 Running times over varying database sizes. . . . . . . . . . . . . . . . . . . . . . 109
5-2 Running times over varying query sizes. . . . . . . . . . . . . . . . . . . . . . . 109
5-3 Running times over varying object standard deviations. . . . . . . . . . . . . . . 109
5-4 Running times over varying confidence levels. . . . . . . . . . . . . . . . . . . . 109
LIST OF FIGURES
Figure page
3-1 Trajectory of an object (a) and its polyline approximation (b) . . . . . . . . . . 28
3-2 Closest Point of Approach Illustration . . . . . . . . . . . . . . . . . . . . . . . 29
3-3 CPA Illustration with trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-4 Example of an R-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-5 Heuristic to speed up distance computation . . . . . . . . . . . . . . . . . . . . 34
3-6 Issues with R-trees: fast-moving object p joins with everyone . . . . . . . . . . 35
3-7 Progression of plane-sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-8 Layered Plane-Sweep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3-9 Problem with using large granularities for bounding box approximation . . . . . 40
3-10 Adaptively varying the granularity . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-11 Convexity of cost function illustration. . . . . . . . . . . . . . . . . . . . . . . . 45
3-12 Iteratively evaluating k cut points . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-13 Speeding up the Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-14 Injection data set at time tick 2,650 . . . . . . . . . . . . . . . . . . . . . . . . . 49
3-15 Collision data set at time tick 1,500 . . . . . . . . . . . . . . . . . . . . . . . . . 50
3-16 Injection data set experimental results . . . . . . . . . . . . . . . . . . . . . . 51
3-17 Collision data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 52
3-18 Buffer size choices for Injection data set . . . . . . . . . . . . . . . . . . . . . . 53
3-19 Buffer size choices for Collision data set . . . . . . . . . . . . . . . . . . . . . . 53
3-20 Synthetic data set experimental results . . . . . . . . . . . . . . . . . . . . . . . 54
3-21 Buffer size choices for Synthetic data set . . . . . . . . . . . . . . . . . . . . . . 56
4-1 Mapping of a set of observations for linear motion . . . . . . . . . . . . . . . . . 60
4-2 Object path (a) and quadratic fit for varying time ticks (b-d) . . . . . . . . . . . 62
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b) 64
4-4 The baseline input set (10,000 observations) . . . . . . . . . . . . . . . . . . . . 79
4-5 The learned trajectories for the data of Figure 4-4 . . . . . . . . . . . . . . . . . 79
5-1 The SPRT in action. The middle line is the LRT statistic . . . . . . . . . . . . . 92
5-2 Two spatial queries over a database of objects with Gaussian uncertainty . . . . 97
5-3 The sequence of SPRTs run by the end-biased test . . . . . . . . . . . . . . . . 98
5-4 Building the MBRs used to index the samples from the end-biased test. . . . . . 104
5-5 Using the index to speed the end-biased test . . . . . . . . . . . . . . . . . . . . 106
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT
By
Subramanian Arumugam
August 2008
Chair: Christopher Jermaine
Major: Computer Engineering
This work focuses on interesting data management problems that arise in the analysis,
modeling, and querying of large-scale spatiotemporal data. Such data naturally arise in the
context of many scientific and engineering applications that deal with physical processes
that evolve over time.
We first focus on the issue of scalable query processing in spatiotemporal databases.
In many applications that produce a large amount of data describing the paths of moving
objects, there is a need to ask questions about the interaction of objects over a long
recorded history. To aid such analysis, we consider the problem of computing joins over
moving object histories. The particular join studied is the “Closest-Point-Of-Approach”
join, which asks: Given a massive moving object history, which objects approached within
a distance d of one another?
Next, we study a novel variation of the classic entity resolution problem that
appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our goal is
to perform an accurate segmentation of all of the observations into sets, where each set is
associated with one object. Each set should also be annotated with the path of the object
through the area.
Finally, we consider the problem of answering selection queries in a spatiotemporal
database, in the presence of uncertainty incorporated through a probabilistic model.
We propose very general algorithms that can be used to estimate the probability that a
selection predicate evaluates to “true” over a probabilistic attribute or attributes, where
the attributes are supplied only in the form of a pseudo-random attribute value generator.
This enables the efficient evaluation of queries such as “Find all vehicles that are in close
proximity to one another with probability p at time t” using Monte Carlo statistical
methods.
CHAPTER 1
INTRODUCTION
This study is a step towards addressing some of the many issues faced in extending
current database technology to handle spatiotemporal data in a seamless and efficient
manner. This chapter motivates spatiotemporal data management and introduces
the reader to the main research issues. This is followed by a summary of the key
contributions.
1.1 Motivation
The last few years have seen significant interest in extending databases to support
spatiotemporal data as evidenced by the growing number of books, workshops, and
conferences devoted to this topic [1–4]. The advent of computational science and the
increasing use of wireless technology, sensors, and devices such as GPS has resulted in
numerous potential sources of spatiotemporal data. Large volumes of spatiotemporal data
are produced by many scientific, engineering, and business applications that track and
monitor moving objects. “Moving objects” may be people, vehicles, wildlife, products
in transit, or weather systems. Such applications often arise in the context of traffic
surveillance and monitoring, land use management in GIS, simulation in astrophysics,
climate monitoring in earth sciences, fleet management, multimedia animation, etc. The
increasing importance of spatiotemporal data can be attributed to the improved reliability
of tracking devices and their low cost, which has reduced the acquisition barrier for such
data. Tracking devices have been adopted in varying degrees in a number of scientific
and enterprise application domains. For instance, vehicles increasingly come equipped
with GPS devices which enable location-based services [3]. Sensors play an increasingly
important role in surveillance and monitoring of physical spaces [5]. Enterprises such
as Walmart, Target and organizations like the Department of Defense (DoD) plan to
track products in their supply chain through use of smart Radio Frequency Identification
(RFID) labels [6].
Extending modern database systems to support spatiotemporal data is challenging for
several reasons:
• Conventional databases are designed to manage static data, whereas spatiotemporal data describe spatial geometries that change continuously with time. This requires a unified approach to deal with aspects of spatiality and temporality.
• Current databases are designed to manage data that is precise. However, uncertainty is often an inherent property of spatiotemporal data due to the discretization of continuous movement and measurement errors. The fact that most spatiotemporal data sources (particularly polling and sampling-based schemes) provide only a discrete snapshot of continuous movement poses new problems for query processing. For example, consider a conventional database record that stores the fact “John Smith earns $200,000” and a spatiotemporal record that stores the fact “John Smith walks from point A to point B” in the form of a discretized ordered pair (A,B). In the former case, a query such as “What is the salary of John Smith?” involves dealing with precise data. On the other hand, a spatiotemporal query such as “Did John Smith walk through point C between A and B?” requires dealing with information that is often not known with certainty. Further compounding the problem is that even the recorded observations are only accurate to within a few decimal places. Thus, even queries such as “Identify all objects located at point A” may not return meaningful results unless allowed a certain margin for error.
• Due to the presence of the time dimension, spatiotemporal applications have the potential to produce a large amount of data. The sheer volume of data generated by spatiotemporal applications presents a computational and data management challenge. For instance, it is not uncommon for many scientific processes to produce spatiotemporal data on the order of terabytes or even petabytes [7]. Developing scalable algorithms to support query processing over tera- and petabyte-sized spatiotemporal data sets is a significant challenge.
• The semantics of many basic operations in a database changes in the presence of space and time. For instance, basic operations like joins typically employ equality predicates in a classic relational database, whereas equality is rare between two arbitrary spatiotemporal objects.
1.2 Research Landscape
Over the last decade, database researchers have begun to respond to the challenges
posed by spatiotemporal data. Most of the research effort is concentrated on supporting
either predictive or historical queries. Within this taxonomy, we can further distinguish
work based on whether it supports time-instance or time-interval queries.
In predictive queries, the focus is on the future position of the objects and only a
limited time window of the object positions needs to be maintained. On the other hand,
for historical queries, the interest is on efficient retrieval of past history and thus the
database needs to maintain the complete timeline of an object’s past locations. Due to
these divergent requirements, techniques developed for predictive queries are often not
suitable for historical queries.
What follows is a brief tour of the major research areas in spatiotemporal data
management. For a more complete treatment of this topic, the interested reader is
referred to [1, 3].
1.2.1 Data Modeling and Database Design
Early research focused on aspects of data modeling and database design for
spatiotemporal data [8]. Conventional data types employed in existing databases are
often not suitable for representing spatiotemporal data, which describe continuous time-varying
spatial geometries. Thus, there is a need for a spatiotemporal type system that can model
continuously moving data. Depending on whether the underlying spatial object has an
extent or not, abstractions have been developed to model a moving point, line, and region
in two- and three-dimensional space with time considered as the additional dimension
[8–11]. Similarly, early work has also focused on refining existing CASE tools to aid in the
design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and
UML present a non-temporal view of the world, and extensions to incorporate temporal
and spatial awareness have been investigated [12, 13].
Recently there has been interest in designing flexible type systems that can model
aspects of uncertainty associated with an object’s spatial location [14]. There has also
been active effort towards designing SQL language extensions for spatiotemporal data
types and operations [15].
1.2.2 Access Methods
Efficient processing of spatiotemporal queries requires developing new techniques
for query evaluation, providing suitable access structures and storage mechanisms, and
designing efficient algorithms for the implementation of spatiotemporal operators.
Developing efficient access structures for spatiotemporal databases is an important
area of research. A variety of spatiotemporal index structures have been developed to
support both predictive and historical selection queries, most based on generalizations
of the R-tree [16] that incorporate the time dimension. Indexing structures
designed to support predictive queries typically manage object movement within a small
time window and need to handle frequent updates to object locations. A popular choice
for such applications is the TPR-tree [17] and its many variants.
On the other hand, index structures designed to support historical queries need to
manage an object’s entire past movement trajectory (for this reason they can be viewed as
trajectory indexes). Depending on the time interval indexed, the sheer volume of data that
needs to be managed present significant technical challenges for overlap-allowing indexing
schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based
solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing
structures have been developed such as STRIPES [19], 3D R-trees [20], TB trees [21] and
linear quad trees [22]. Further, spatiotemporal extensions of several popular queries such
as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.
1.2.3 Query Processing
The development of efficient index structures has also led to a growing body of
research on different types of queries on spatiotemporal data, such as time-instant and
range queries [26–28], continuous queries, joins [29, 30], and their efficient evaluation
[31, 32]. In the same vein, there has also been some preliminary work on optimizing
spatiotemporal selection queries [33, 34].
Much of the work focuses specifically on indexing two-dimensional space and/or
supporting time-instance or short time-interval selection queries. Thus many indexing
structures often do not scale well for higher-dimensional spaces and have difficulty with
queries over long time intervals. Finally, historical data collections may be huge, and joins
over such data require new solutions, since the predicates involved are non-traditional (such as
closest point of approach, within, and sometimes-possibly-inside).
1.2.4 Data Analysis
Spatiotemporal data analysis allows us to obtain interesting insights from the stored
data collection. For instance:
• In a road network database, the history of movement of various objects can be used to understand traffic patterns.
• In aviation, the flight paths of various planes can be used in future path planning and in computing minimum separation constraints to avoid collisions.
• In wildlife management, one can understand animal migration patterns from the trajectories the animals trace.
• Pollutants can be traced to their source by studying the air flow patterns of aerosols stored as trajectories.
Research in this area focuses on extending traditional data mining techniques to the
analysis of large spatiotemporal data sets. Topics of interest include discovering similarities
among object trajectories [35], data classification and generalization [36], trajectory
clustering and rule mining [37–39], and supporting interactive visualization for browsing
large spatiotemporal collections [40].
1.2.5 Data Warehousing
Supporting data analysis also requires designing and maintaining large collections of
historical spatiotemporal data, which falls under the domain of data warehousing.
Conventional data warehouses are often designed around the goal of supporting
aggregate queries efficiently. However, the interesting queries in a spatiotemporal data
warehouse seek to discover the interaction patterns of moving objects and understand the
spatial and/or temporal relationships that exist between them. Facilitating such queries
in a scalable fashion over terabyte-sized spatiotemporal data warehouses is a significant
challenge. This requires extending traditional data mining techniques to the analysis
of large spatiotemporal data sets to discover spatial and temporal relationships, which
might exist at various levels of granularity involving complex data types. Research in
spatiotemporal data warehousing [41, 42] is relatively new and is focused on refining
existing multidimensional models to support continuous data and defining semantics for
spatiotemporal aggregation [43, 44].
1.3 Main Contributions
It is clear that extending modern database systems to support data management
and analysis of spatiotemporal data requires addressing issues that span almost the entire
breadth of database research. A full treatment of the various issues can be the subject of
numerous dissertations! To keep the scope of this dissertation manageable, I tackle three
important problems in spatiotemporal data management. The dissertation focuses on
data produced by moving objects, since “moving object” databases represent the most
common application domain for spatiotemporal databases [1]. The three specific problems
considered are described briefly in the following subsections.
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories
I first consider the scalability problem in computing joins over massive moving
object histories. In applications that produce a large amount of data describing the
paths of moving objects, there is a need to ask questions about the interaction of
objects over a long recorded history. This problem is becoming especially important
given the emergence of computational, simulation-based science (where simulations
of natural phenomena naturally produce massive databases containing data with
spatial and temporal characteristics), and the increased prevalence of tracking and
positioning devices such as RFID and GPS. The particular join that I study is the “CPA”
(Closest-Point-Of-Approach) join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another? I carefully consider several
obvious strategies for computing the answer to such a join, and then propose a novel,
adaptive join algorithm which naturally alters the way in which it computes the join in
response to the characteristics of the underlying data. A performance study over two
physics-based simulation data sets and a third, synthetic data set validates the utility of
my approach.
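To make the CPA predicate concrete: for a pair of objects that each move linearly over a short segment of time, the time and distance of closest approach have a closed form. The sketch below is illustrative only; the function and variable names are my own, not the implementation benchmarked in Chapter 3.

```python
def cpa_distance(p0, vp, q0, vq, t0=0.0, t1=1.0):
    """Minimum distance between two linearly moving points over [t0, t1].

    Each object follows x(t) = x0 + t * v on this segment; evaluating this
    primitive for every candidate pair is the core step of a CPA join.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w0 = [a - b for a, b in zip(p0, q0)]   # relative position at t = 0
    dv = [a - b for a, b in zip(vp, vq)]   # relative velocity
    denom = dot(dv, dv)
    # Parallel motion (or both stationary): the separation never changes.
    t_cpa = t0 if denom == 0.0 else -dot(w0, dv) / denom
    t_cpa = min(max(t_cpa, t0), t1)        # clamp to the segment's lifetime
    gap = [w + t_cpa * d for w, d in zip(w0, dv)]
    return t_cpa, dot(gap, gap) ** 0.5

# Two objects converging head-on along the x-axis meet at t = 5, so this
# pair satisfies the "within distance d" CPA predicate for any d >= 0:
t_min, d_min = cpa_distance([0, 0, 0], [1, 0, 0],
                            [10, 0, 0], [-1, 0, 0], t0=0.0, t1=10.0)
```

The join algorithms of Chapter 3 differ in how they prune pairs before this per-pair test, not in the test itself.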
1.3.2 Entity Resolution in Spatiotemporal Databases
Next, I consider the problem of entity resolution for a large database of spatiotemporal
sensor observations. The following scenario is assumed. At each time-tick, one or more of
a large number of sensors report back that they have sensed activity at or near a specific
spatial location. For example, a magnetic sensor may report that a large metal object has
passed by. The goal is to partition the sensor observations into a number of subsets so
that it is likely that all of the observations in a single subset are associated with the same
entity, or physical object. For example, all of the sensor observations in one partition may
correspond to a single vehicle driving across the area that is monitored. The dissertation
describes a two-phase, learning-based approach to solving this problem. In the first phase,
a quadratic motion model is used to produce an initial classification that is valid for a
short portion of the timeline. In the second phase, Bayesian methods are used to learn the
long-term, unrestricted motion of the underlying objects.
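The intuition behind the first phase, that observations belonging to one object should be well explained by a single smooth motion model over a short stretch of the timeline, can be illustrated with an ordinary least-squares quadratic fit. This is only a sketch of the idea; the names are illustrative, and the actual procedure is the EM-based algorithm developed in Chapter 4.

```python
import numpy as np

def fit_quadratic_path(times, positions):
    """Least-squares quadratic motion model x(t) = a + b*t + c*t^2,
    fit independently per spatial dimension.

    Returns the 3 x dims coefficient matrix and the RMS residual; a small
    residual suggests the observations are consistent with one object
    moving smoothly over a short portion of the timeline.
    """
    t = np.asarray(times, float)
    X = np.asarray(positions, float)       # shape (n_obs, dims)
    A = np.vander(t, 3, increasing=True)   # design matrix: columns 1, t, t^2
    coeffs, *_ = np.linalg.lstsq(A, X, rcond=None)
    rms = float(np.sqrt(np.mean((A @ coeffs - X) ** 2)))
    return coeffs, rms

# Observations sampled from a genuinely parabolic 2-D path fit almost
# exactly, so they would be classified as belonging to a single object:
t = np.linspace(0, 4, 9)
path = np.stack([2 + 3 * t, 1 + 0.5 * t ** 2], axis=1)
_, rms_good = fit_quadratic_path(t, path)
```

In the actual first phase, such models are fit softly to weighted subsets of the observations, with the weights and model parameters re-estimated iteratively.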
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases
Finally, I consider the problem of answering selection queries in the presence of
uncertainty incorporated through a probabilistic model. One way to facilitate the
representation of uncertainty in a spatiotemporal database is by allowing tuples to
have “probabilistic attributes” whose actual values are unknown, but are assumed
to be selected by sampling from a specified distribution. This can be supported by
including a few, pre-specified, common distributions in the database system when it is
shipped. However, to be truly general and extensible and support distributions that
cannot be represented explicitly or even integrated, it is necessary to provide an interface
that allows the user to specify arbitrary distributions by implementing a function that
produces pseudo-random samples from the desired distribution. Allowing a user to specify
uncertainty via arbitrary sampling functions creates several interesting technical challenges
during query evaluation. Specifically, evaluating time-instance selection queries such as
“Find all vehicles that are in close proximity to one another with probability p at time
t” requires the principled use of Monte Carlo statistical methods to determine whether
the query predicate holds. To support such queries, the thesis describes new methods
that draw heavily on the relevant statistical theory of sequential estimation. I also
consider the problem of indexing for the Monte Carlo algorithms, so that samples from the
pseudo-random attribute value generator can be pre-computed and stored in a structure in
order to answer subsequent queries quickly.
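The interface such a user-supplied generator must support, and the naive fixed-sample-size Monte Carlo check it enables, can be sketched as follows. All names here, and the Gaussian vehicle model, are illustrative assumptions; the sequential methods of Chapter 5 reach a decision with far fewer calls to the generator than this fixed-n estimator.

```python
import random

def predicate_probability(sample_fn, predicate, p, n=10_000, seed=0):
    """Naive fixed-n Monte Carlo check of whether Pr[predicate] >= p.

    sample_fn draws one realization of the probabilistic attribute(s);
    predicate tests the selection condition on that realization.
    """
    rng = random.Random(seed)
    hits = sum(predicate(sample_fn(rng)) for _ in range(n))
    return hits / n >= p

# Hypothetical vehicle whose reported position is Gaussian around (0, 0):
def vehicle_pos(rng):
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

# "Is the vehicle within distance 2 of the origin with probability 0.5?"
# True: roughly 86% of the 2-D standard Gaussian mass lies inside radius 2.
near_origin = lambda pos: pos[0] ** 2 + pos[1] ** 2 <= 4.0
result = predicate_probability(vehicle_pos, near_origin, p=0.5)
```

The difficulty the dissertation addresses is deciding, with statistical guarantees, when enough samples have been drawn, which a fixed n cannot do.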
Organization. The rest of this study is organized as follows. Chapter 2 provides
a survey of work related to the problems addressed in this thesis. Chapter 3 tackles the
scalability issue when processing join queries over massive spatiotemporal databases.
Chapter 4 describes an approach to handling the entity-resolution problem in cleaning
spatiotemporal data sources. Chapter 5 describes a simple and general approach to
answering selection queries over probabilistic spatiotemporal databases, that is,
spatiotemporal databases that incorporate uncertainty within a probabilistic model
framework. Chapter 6 concludes the dissertation by summarizing the contributions and
identifying potential directions for future work.
CHAPTER 2
BACKGROUND
This chapter provides a survey of literature related to the three problems addressed in
this dissertation.
2.1 Spatiotemporal Join
Though research in spatiotemporal joins is relatively new, the closely related problem
of processing joins over spatial objects has been extensively studied. The classical paper
in spatial joins is due to Brinkhoff et al. [45]. Their approach assumes the existence
of a hierarchical spatial index, such as an R-tree [16], on the underlying relations. The
join Brinkhoff et al. propose makes use of a carefully synchronized depth-first traversal of the
underlying indices to narrow down the candidate pairs. A breadth-first strategy with
several additional optimizations is considered by Huang et al. [46]. Lo and Ravishankar
[47] explore a non-index based approach to processing a spatial join. They consider how
to extend the traditional hash join algorithm to the spatial join problem and propose a
strategy based on a partitioning of the database objects into extent mapping hash buckets.
A similar idea, referred to as the partition-based spatial merge (PBSM), is considered
by Patel et al. [48]. Instead of partitioning the input data objects, they consider a grid
partitioning of the data space on to which objects are mapped. This idea is further
extended by Arge et al. [49], who propose a dynamic partitioning of the input space
into vertical strips. Their strategy avoids the data spill problem encountered by previous
approaches since the strips can be constructed such that they fit within the available main
memory.
A common theme among existing approaches is their use of the plane-sweep [50] as a
fast pruning technique. In the case of index-based algorithms, plane-sweep is used to filter
candidate node pairs enumerated from the traversal. Non-index-based algorithms make
use of the plane-sweep to construct candidate sets over partitions.
To our knowledge, the only prior work on spatiotemporal joins is due to Jeong et al.
[51]. However, they only consider spatiotemporal join techniques that are straightforward
extensions to traditional spatial join algorithms. Further, they limit their scope to
index-based algorithms for objects over limited time windows.
2.2 Entity Resolution
Research in entity resolution has a long history in databases [52–55] and has focused
mainly on integrating non-geometric, string-based data from noisy external sources. Closely
related to the work in this thesis is the large body of work on target tracking that exists
in fields as diverse as signal processing, robotics, and computer vision. The goal in target
tracking [56, 57] is to support the real-time monitoring and tracking of a set of moving
objects from noisy observations.
Various algorithms to classify observations among objects can be found in the
target tracking literature. They characterize the problem as one of data association (i.e.
associating observations with corresponding targets). A brief summary of the main ideas is
given below.
The seminal work is due to Reid [58], who proposed the multiple hypothesis tracking
(MHT) technique to solve the tracking problem. In the MHT approach, a set of hypotheses is
maintained with each hypothesis reflecting the belief on the location of an individual
target. When a new set of observations arrives, the hypotheses are updated. Hypotheses
with minimal support are deleted and additional hypotheses are created to reflect new
evidence. The main drawback of the approach is that the number of hypotheses can grow
exponentially over time. Though heuristic filters [59–61] can be used to bound the search
space, this limits the scalability of the algorithm.
Target tracking also has been studied using Bayesian approaches [62]. The Bayesian
approach views tracking as a state estimation problem. Given some initial state and a
set of observations, the goal is to predict the object's next state. An optimal solution to
the problem is given by the Bayes filter [63, 64]. The Bayes filter produces optimal estimates by
integrating over the complete set of observations. The formulation is often recursive and
involves complex integrals that are difficult to solve analytically. Hence, approximation
schemes such as particle filters [57] and sequential Monte Carlo techniques [63] are often
used in practice.
Recently, Markov Chain Monte Carlo (MCMC) [65, 66] techniques have been
proposed. MCMC techniques attempt to approximate the optimal Bayes filter for multiple
target tracking. MCMC-based methods employ sequential Monte Carlo sampling and have been shown to
perform better than existing sub-optimal approaches such as MHT for tracking objects in
highly cluttered environments.
A common theme among most of the research in target tracking is its focus on
accurate tracking and detection of objects in real time in highly cluttered environments
over relatively short time periods. In a data warehouse context, the ability of techniques
such as MCMC to make fine-grained distinctions makes them ideal candidates when
performing operations such as drilldown that involve analytics over small time windows.
However, their applicability to entity resolution in a data warehouse is limited. In such a
context, summarization and visualization of historical trajectories smoothed over long time
intervals is often more useful. The model-based approach considered in this work seems a
more suitable candidate for such tasks.
2.3 Probabilistic Databases
Uncertainty management in spatiotemporal databases is a relatively new area of
research. Earlier work has focused on aspects of modeling uncertainty and query language
support [9, 67].
In the context of query processing, one of the earliest papers in this area is the
paper by Pfoser et al. [68] where different sources of uncertainty are characterized and
a probability density function is used to model errors. Hosbond et al. [69] extended this
work by employing a hyper square uncertainty region, which expands over time to answer
queries using a TPR-tree.
Trajcevski et al. [70] study the problem from a modeling perspective. They model
trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries
over trajectories in both space and time. However, the approach does not specify how to
choose the dimensions of the cylindrical region which may have to change over time to
account for shrinking or expanding of the underlying uncertainty region.
Cheng et al. [71] describe algorithms for time instant queries (probabilistic range
and nearest neighbor) using an uncertainty model where a probability density function
(PDF) and an uncertain region are associated with each point object. Given a location in
the uncertain region, the PDF returns the probability of finding the object at that location.
A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle
time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic
process with a time-parametric uniform distribution.
CHAPTER 3
SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES
In applications that produce a large amount of data describing the paths of
moving objects, there is a need to ask questions about the interaction of objects
over a long recorded history. In this chapter, the problem of computing joins over
massive moving object histories is considered. The particular join studied is the
“Closest-Point-Of-Approach” join, which asks: Given a massive moving object history,
which objects approached within a distance d of one another?
3.1 Motivation
Applications that make use of spatial data frequently need to ask questions about
the interactions between spatial objects. A useful operation that enables one to answer
such questions is the spatial join. The spatial join is similar to the classical relational
join except that it is defined over two spatial relations based on a spatial predicate. The
objective of the join operation is to retrieve all object pairs that satisfy a spatial
relationship. One common class of predicates involves distance measures, where we are
interested in objects that were within a certain distance of each other. The query
“Find all restaurants within a distance of 10 miles of a hotel” is an example of a spatial
join.
For moving objects, the spatial join operation involves the evaluation of both a spatial
and a temporal predicate, and for this reason the join is referred to as a spatiotemporal
join. For example, consider the spatial relations PLANES and TANKS, where each relation
represents accumulated trajectory data of planes and tanks from a battlefield simulation.
The query “Find all planes that are within distance 10 miles of a tank” is an example of a
spatiotemporal join. The spatial predicate in this case restricts the distance (10 miles) and
the temporal predicate restricts the time period to the current time instance.
In the more general case, the spatiotemporal join is issued over a moving object
history, which contains all of the past locations of the objects stored in a database. For
example, consider the query “Find all pairs of planes that came within a distance of 1,000
feet of one another during their flight paths”. Since there is no restriction on the temporal predicate,
answering this query involves an evaluation of the spatial predicate at every time instance.
The amount of data to be processed can be overwhelming. For example, in a typical
flight, the flight data recorder stores about 7 MB of data, which records, among other
things, the position and time of the flight for every second of its operation. Given
that on average US Air Traffic Control handles around 30,000 flights in a single day,
archiving all of this data would result in around 200 GB of data accumulation
in a single day. For another example, it is not uncommon for scientific simulations to
output terabytes or even petabytes of spatiotemporal data (see Abdulla et al. [7] and the
references contained therein).
In this chapter, the spatiotemporal join problem is investigated for moving object
histories in three-dimensional space, with time considered as a fourth dimension. The
spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach
Join). By closest point of approach, we refer to the position at which two moving objects
attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of
the following type: “Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that
CPA-distance (p, q) ≤ d”. The goal is to retrieve all object pairs that are within a distance
d at their closest-point-of-approach.
Surprisingly, this problem has not been studied previously. The spatial join problem
has been well-studied for stationary objects in two- and three-dimensional space [45, 47–
49], however very little work related to spatiotemporal joins can be found in the literature.
There has been some work related to joins involving moving objects [75, 76] but the work
has been restricted to objects in a limited time window and does not consider the problem
of joining object histories that may be gigabytes or terabytes in size.
The contributions can be summarized as follows:
• Three spatiotemporal join strategies for data involving moving object histories are presented.
• Simple adaptations of existing spatial join processing algorithms, based on the R-tree structure and on a plane-sweeping algorithm, are explored for spatiotemporal histories.
• To address the problems associated with straightforward extensions of these techniques, a novel join strategy for moving objects based on an extension of the basic plane-sweeping algorithm is described.
• A rigorous evaluation and benchmarking of the alternatives is provided. The performance results suggest that significant speedups in execution time can be obtained with the adaptive plane-sweeping technique.
The rest of this chapter is organized as follows: In Section 3.2, the closest point
of approach problem is reviewed. In Sections 3.3 and 3.4, two obvious alternatives for
implementing the CPA Join – using R-trees and plane-sweeping – are described. In Section
3.5, a novel adaptive plane-sweeping technique that considerably outperforms competing
techniques is presented. Results from our benchmarking experiments are given in Section
3.6. Section 3.7 outlines related work.
3.2 Background
In this section, we discuss the motion of moving objects and give an intuitive
description of the CPA problem. This is followed by an analytic solution to the CPA
problem over a pair of points moving in a straight line.
3.2.1 Moving Object Trajectories
Trajectories describe the motion of objects in two- or three-dimensional space. Real-world
objects tend to have smooth trajectories and storing them for analysis often involves
approximation to a polyline. A polyline approximation of a trajectory connects object
positions, sampled at discrete time instances, by line segments (Figure 3-1).
In a database the trajectory of an object can be represented as a sequence of the form
〈(t1, ~v1), (t2, ~v2), . . . , (tn, ~vn)〉 where each ~vi represents the position vector of the object at
Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)
time instance ti. The arity of the vector describes the dimensions of the space. For flight
simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The
position of the moving objects is normally obtained in one of several ways: by sampling or
polling the object at discrete time instances, through the use of devices such as GPS, and so on.
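To make this representation concrete, a trajectory can be stored as a time-ordered list of (t, position) samples from which the polyline's segments are derived. The sketch below is illustrative; the `Trajectory` class and its field names are not part of the dissertation's implementation:

```python
from typing import List, Tuple

Vec = Tuple[float, ...]  # position vector; arity 2 for a car, 3 for a flight

class Trajectory:
    """Polyline approximation: samples (t_i, v_i), kept sorted by time."""
    def __init__(self, samples: List[Tuple[float, Vec]]):
        self.samples = sorted(samples, key=lambda s: s[0])

    def segments(self):
        """Consecutive sample pairs form the line segments of the polyline."""
        return list(zip(self.samples, self.samples[1:]))

# A 2-dimensional trajectory sampled at three time instances:
traj = Trajectory([(0.0, (0.0, 0.0)), (1.0, (1.0, 0.5)), (2.0, (2.0, 2.0))])
```

Each segment pairs a sample with its successor, so n samples yield n − 1 line segments.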
3.2.2 Closest Point of Approach (CPA) Problem
We are now ready to describe the CPA problem. Let CPA(p, q, d) over two straight-line
trajectories be evaluated as follows: if mindist denotes the distance between the two objects
at their closest point of approach, then we output true if mindist < d (the objects were
within distance d during their motion in space), and false otherwise. We refer to the
calculation of CPA(p, q, d) as the CPA problem.
The minimum distance mindist between two objects is the distance between the
object positions at their closest point of approach. It is straightforward to calculate
mindist once the CPA time tcpa, the time instance at which the objects reached their
closest distance, is known.
We now give an analytic solution to the CPA problem for a pair of objects on a
simple straight-line trajectory.
Calculating the CPA time tcpa. Figure 3-2 shows the trajectory of two objects p
and q in 2-dimensional space for the time period [tstart, tend]. The position of these objects
at any time instance t is given by p(t) and q(t). Let their positions at time t = 0 be p0
and q0 and let their velocity vectors per unit of time be u and v. The motion equations for
Figure 3-2. Closest Point of Approach Illustration
Figure 3-3. CPA Illustration with trajectories
these two objects are p(t) = p0 + tu; q(t) = q0 + tv. At any time instance t, the distance
between the two objects is given by d(t) = |p(t) − q(t)|.
Using basic calculus, one can find the time instance at which the distance d(t) is
minimum (when D(t) = d(t)² is a minimum). Solving for this time we obtain:

tcpa = −((p0 − q0) · (u − v)) / |u − v|²

Given this, mindist is given by |p(tcpa) − q(tcpa)|.
The distance calculation described above is applicable only between two
objects on a straight line trajectory. To calculate the distance between two objects on a
polyline trajectory, we apply the same basic technique. For trajectories consisting of a
chain of line-segments, we find the minimum distance by first determining the distance
between each pair of line-segments and then choosing the minimum distance.
As an example, consider Figure 3-3 which shows the trajectory of two objects in
2-dimensional space with time as the third dimension. Each object is represented by
an array that stores the chain of segments comprising the trajectory. The line-segments
are labeled by the array indices. To determine the qualifying pairs, we find the CPA
distance between the line segment pairs (p[1], q[1]), (p[1], q[2]), (p[1], q[3]), (p[2], q[1]),
(p[2], q[2]), (p[2], q[3]), (p[3], q[1]), (p[3], q[2]), and (p[3], q[3]), and take the minimum
distance among all evaluated pairs. The complete code for computing CPA(p,q,d) over
multi-segment trajectories is given as Algorithm 3-1.
Algorithm 1 CPA (Object p, Object q, distance d)
1: mindist = ∞
2: for (i = 1 to p.size) do
3:   for (j = 1 to q.size) do
4:     tmp = CPA Distance(p[i], q[j])
5:     if (tmp ≤ mindist) then
6:       mindist = tmp
7:     end if
8:   end for
9: end for
10: if (mindist ≤ d) then
11:   return true
12: end if
13: return false
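The analytic solution and the nested-loop predicate above can be sketched together in Python. In this sketch (all names are illustrative), a segment is a tuple (tstart, tend, p0, u) giving the position p(t) = p0 + (t − tstart)·u; clamping tcpa to the segments' common time interval, which the straight-line derivation glosses over, is handled explicitly:

```python
import math

def cpa_distance(s1, s2):
    """Minimum distance between two straight-line motions over their common
    time interval. A segment is (tstart, tend, p0, u) with position
    p(t) = p0 + (t - tstart) * u. Returns infinity when the segments do not
    overlap in time, since no CPA match is then possible."""
    t0, t1 = max(s1[0], s2[0]), min(s1[1], s2[1])
    if t0 > t1:
        return math.inf
    pos = lambda s, t: tuple(p + (t - s[0]) * v for p, v in zip(s[2], s[3]))
    w = tuple(a - b for a, b in zip(pos(s1, t0), pos(s2, t0)))  # relative position at t0
    r = tuple(a - b for a, b in zip(s1[3], s2[3]))              # relative velocity
    rr = sum(x * x for x in r)
    # t_cpa = t0 - (w . r) / |r|^2, clamped to the common interval [t0, t1]
    t_cpa = t0 if rr == 0 else min(max(t0 - sum(a * b for a, b in zip(w, r)) / rr, t0), t1)
    return math.dist(pos(s1, t_cpa), pos(s2, t_cpa))

def cpa(p, q, d):
    """Algorithm 1: true iff some segment pair of p and q comes within d."""
    mindist = min((cpa_distance(si, sj) for si in p for sj in q), default=math.inf)
    return mindist <= d
```

For two objects approaching head-on along the x axis with a constant y offset of 2, `cpa_distance` reports a minimum distance of 2 at their crossing time.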
In the next two Sections, we consider two obvious alternatives for computing the
CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and
Q, where CPA (p, q,d) evaluates to true. The first technique we describe makes use of an
underlying R-tree index structure to speed up join processing. The second methodology is
based on a simple plane-sweep.
3.3 Join Using Indexing Structures
Given numerous existing spatiotemporal indexing structures, it is natural to first
consider employing a suitable index to perform the join.
Though many indexing structures exist, unfortunately most are not suitable for the
CPA Join. For example, a large number of indexing structures like the TPR-tree [17],
REXP tree [77], TPR*-tree [78] have been developed to support predictive queries, where
the focus is on indexing the future position of an object. However, these index structures
are generally not suitable for the CPA Join, where access to the entire history is needed.
Indexing structures like MR-tree [26], MV3R-tree [27], HR-tree [28], HR+-tree [27]
are more relevant since they are geared towards answering time instance queries (in case
of MV3R-tree also short time-interval queries), where all objects alive at a certain time
instance are retrieved. The general idea behind these index structures is to maintain a
separate spatial index for each time instance. However, such indices are meant to store
discrete snapshots of an evolving spatial database, and are not ideal for use with the CPA Join
over continuous trajectories.
3.3.1 Trajectory Index Structures
More relevant are indexing structures specific to moving object trajectory histories
like the TB-tree, STR-tree [21] and SETI [18]. TB-trees emphasize trajectory preservation
since they are primarily designed to handle topological queries where access to the entire
trajectory is desired (segments belonging to the same trajectory are stored together). The
problem with TB-trees in the context of the CPA Join is that segments from different
trajectories that are close in space or time will be scattered across nodes. Thus, retrieving
segments in a given time window will require several random I/Os. In the same paper
[21], an STR-tree is introduced that attempts to balance spatial locality with
trajectory preservation. However, as the authors point out, STR-trees turn out to be a
weak compromise that performs no better than traditional 3D R-trees [20] or TB-trees.
More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space
statically into non-overlapping cells and uses a separate spatial index for each cell. SETI
might be a good candidate for CPA Join since it preserves spatial and temporal locality.
However, there are several reasons why SETI is not the most natural choice for a CPA
Join:
• It is not clear that SETI’s forest scales to a three-dimensional space. A 25 × 25 SETI grid in two dimensions becomes a sparse 25 × 25 × 25 grid with almost 20,000 cells in three dimensions.
• SETI’s grid structure is an interesting idea for addressing problems with high variance in object speeds (we will use a related idea for the adaptive plane-sweep algorithm described later). However, it is not clear how to size the grid for a given data set, and sizing it for a join seems even harder. It might very well be that relation R should have a different grid for R ⋈ S compared to R ⋈ T.
• For a CPA Join over a limited history, SETI has no way of pruning the search space, since every cell will have to be searched.
3.3.2 R-tree Based CPA Join
Given these caveats, perhaps the most natural choice for the CPA Join is the R-tree
[16]. The R-tree is a hierarchical, multi-dimensional index structure that is commonly
used to index spatial objects. The join problem has been studied extensively for R-trees
and several spatial join techniques exist [45, 46, 79] that leverage underlying R-tree
index structures to speed up join processing. Hence, our first inclination is to consider a
spatiotemporal join strategy that is based on R-trees. The basic idea is to index object
histories using R-trees and then perform a join over these indices.
The R-Tree Index
It is a very straightforward task to adapt the R-tree to index a history of moving
object trajectories. Assuming three spatial dimensions and a fourth temporal dimension,
the four-dimensional line segments making up each individual object trajectory are simply
treated as individual spatial objects and indexed directly by the R-tree. The R-tree and
its associated insertion or packing algorithms are used to group those line segments into
disk-page sized groups, based on proximity in their four-dimensional space. These pages
make up the leaf level of the tree. As in a standard R-tree, these leaf pages are indexed
by computing the minimum bounding rectangle that encloses the set of objects stored in
each leaf page. Those rectangles are in turn grouped into disk-page sized groups which are
themselves indexed. An R-tree index for 3 line segments moving through 2-dimensional
space is depicted in Figure 3-4.
Figure 3-4. Example of an R-tree
Basic CPA Join Algorithm Using R-Trees
Assuming that the two spatiotemporal relations to be joined are organized using
R-trees, we can use one of the standard R-tree distance joins as a basis for the CPA Join.
The common approach to joins using R-trees employs a carefully controlled, synchronized
traversal of the two R-trees to be joined. The pruning power of the R-tree index arises
from the fact that if two bounding rectangles R1 and R2 do not satisfy the join predicate,
then the join predicate cannot be satisfied by any two rectangles enclosed within R1 and
R2, respectively.
In a synchronized technique, both R-trees are traversed simultaneously, retrieving
object pairs that satisfy the join predicate. To begin with, the root nodes of both
R-trees are pushed into a queue. A pair of nodes from the queue is processed by pairing
up every entry of the first node with every entry in the second node to form the candidate
set for further expansion. Each pair in the candidate set that satisfies the join predicate is
pushed into the queue for subsequent processing. The strategy described leads to a BFS
pushed into the queue for subsequent processing. The strategy described leads to a BFS
(Breadth-First-Search) expansion of the trees. BFS-style traversal lends itself to global
optimization of the join processing steps [46] and works well in practice.
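The synchronized traversal can be sketched as follows. The `Node` class and `mbr_distance` helper below are hypothetical stand-ins for a real R-tree implementation, and the join predicate is a simple distance test between MBRs:

```python
from collections import deque

class Node:
    """Simplified R-tree node: mbr is a (low-corner, high-corner) pair;
    leaves hold data entries, internal nodes hold children."""
    def __init__(self, mbr, children=None, entries=None):
        self.mbr, self.children, self.entries = mbr, children or [], entries or []
    @property
    def is_leaf(self):
        return not self.children

def mbr_distance(a, b):
    """Minimum distance between two axis-aligned MBRs."""
    return sum(max(a[0][i] - b[1][i], b[0][i] - a[1][i], 0.0) ** 2
               for i in range(len(a[0]))) ** 0.5

def synchronized_join(root_p, root_q, d, leaf_pair_handler):
    """BFS-style synchronized traversal: a node pair is expanded only if its
    MBRs can contain points within distance d of one another."""
    queue = deque([(root_p, root_q)])
    while queue:
        n1, n2 = queue.popleft()
        if mbr_distance(n1.mbr, n2.mbr) > d:
            continue                          # pruning step: pair cannot qualify
        if n1.is_leaf and n2.is_leaf:
            leaf_pair_handler(n1, n2)         # run the exact CPA test on entry pairs
        else:
            # pair every entry of one node with every entry of the other
            for c1 in (n1.children or [n1]):
                for c2 in (n2.children or [n2]):
                    queue.append((c1, c2))
```

A production version would add the plane-sweep filtering and buffer-aware pair ordering discussed below; this sketch only shows the queue-driven BFS expansion.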
Figure 3-5. Heuristic to speed up distance computation
The distance routine is used in evaluating the join predicate to determine the distance
between two bounding rectangles associated with a pair of nodes. A node-pair qualifies
for further expansion if the distance between the pair is less than the limiting distance d
supplied by the query.
Heuristics to Improve the Basic Algorithm
The basic join algorithm can be improved by using several standard and
non-standard techniques for reducing the I/O and CPU costs of spatial joins. These
include:
• Using a plane-sweeping algorithm [45] to speed up the all-pairs distance computation when pairs of nodes are expanded and their children are checked for possible matches.
• Carefully ordering the processing of node pairs so that when each pair is considered, one or both of the nodes are in the buffer [46].
• Avoiding expensive distance computations by applying heuristic filters. Computing the distance between two 3-dimensional rectangles can be a very costly operation, since the closest points may lie at arbitrary positions on the faces of the rectangles. To speed this computation, the magnitudes of the diagonals of the two rectangles (d1 and d2) can be computed first. Next, we pick an arbitrary point from each of the rectangles (points P1 and P2) and compute the distance between them, called darbit. If darbit − d1 − d2 > djoin, then the two rectangles cannot contain any points as close as djoin to one another and the pair can be discarded, as shown in Figure 3-5. This provides for immediate dismissals with only three distance computations (or one, if the diagonal distances are precomputed and stored with each rectangle).
Figure 3-6. Issues with R-trees: fast moving object p joins with everyone
In addition, there are some obvious improvements to the algorithm that can be made
which are specific to the 4-dimensional CPA Join:
• The fourth dimension, time, can be used as an initial filter. If two MBRs or line segments do not overlap in time, then the pair cannot possibly be a candidate for a CPA match.
• Since time can be used to provide immediate dismissals without Euclidean distance calculations, it is given priority over the other attributes. For example, when a plane-sweep is performed to prune an all-pairs CPA distance computation, time is always chosen as the sweeping axis. The reason is that time will usually have the greatest pruning power of any attribute, since time-based matches must always be exact, regardless of the join distance.
• In our implementation of the CPA Join for R-trees, we make use of the STR packing algorithm [80] to build the trees. Because the potential pruning power of the time dimension is greatest, we ensure that the trees are well-organized with respect to time by choosing time as the first packing dimension.
Problem With R-tree CPA Join
Unfortunately, it turns out that in practice the R-tree can be ill-suited to the problem
of computing spatiotemporal joins over moving object histories. R-trees have a problem
handling databases with a high variance in object velocities. The reason for this is that
join algorithms which make use of R-trees rely on tight and well-behaved minimum
bounding rectangles to speed the processing of the join. When the positions of a set of
moving objects are sampled at periodic intervals, fast moving objects tend to produce
larger bounding rectangles than slow moving objects.
Figure 3-7. Progression of plane-sweep
One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects
on a 2-D plane for a given time period. A fast moving object such as p will be contained
in a very large MBR, while slower objects such as q will be contained in much smaller
MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR
associated with p can overlap many smaller MBRs, and each overlap will result in an
expensive distance computation (even if the objects do not travel close to one another).
Thus, any sort of variance in object velocities can adversely affect the performance of the
join.
3.4 Join Using Plane-Sweeping
The second technique considered is a join strategy based on a simple plane-sweep.
Plane-sweep is a powerful technique for solving proximity problems involving
geometric objects in a plane and has previously been proposed [49] as a way to efficiently
compute the spatial join operation.
3.4.1 Basic CPA Join using Plane-Sweeping
Plane-sweep is an excellent candidate for use with the CPA join because no matter
what distance threshold is given as input into the join, two objects must overlap in the
time dimension for there to be a potential CPA match. Thus, given two spatiotemporal
relations P and Q, we could easily base our implementation of the CPA join on a
plane-sweep along the time dimension.
We would begin a plane-sweep evaluation of the CPA join by first sorting the intervals
making up P and Q along the time dimension, as depicted in Figure 3-7. We then sweep
a vertical line along the time dimension. A sweepline data structure D is maintained
which keeps track of all line segments which are valid given the current position of the line
along the time dimension. As the sweepline progresses, D is updated with insertions (new
segments that became active) and deletions (segments whose time period has expired).
Segment pairs from both input relations that satisfy the join predicate are always present
in D, and they can be checked and reported during updates to D. Pseudo-code for the
algorithm is given below:
Algorithm 2 PlaneSweep (Relation P, Relation Q, distance d)
1: Form a single list L containing segments from P and Q sorted by tstart
2: Initialize sweepline data structure D
3: while not IsEmpty(L) do
4:   Segment top = popFront(L)
5:   Insert(D, top)
6:   Delete from D all segments s s.t. (s.tend < top.tstart) {remove segments that do not intersect the sweepline}
7:   Query(D, top, d) {report segments in D that are within distance d}
8: end while
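A minimal Python rendering of the sweep above, using a plain list for the sweepline structure D instead of a spatial index, and taking the proximity test as a caller-supplied predicate (both simplifications are ours, not the dissertation's):

```python
def plane_sweep(P, Q, d, within):
    """Sweep along the time dimension. Each segment is a dict with 'tstart',
    'tend', and whatever geometry the within(s1, s2, d) predicate needs; a
    'rel' tag ensures only cross-relation pairs are reported."""
    L = sorted([dict(s, rel='P') for s in P] + [dict(s, rel='Q') for s in Q],
               key=lambda s: s['tstart'])
    D, results = [], []
    for top in L:
        # drop segments that no longer intersect the sweepline
        D = [s for s in D if s['tend'] >= top['tstart']]
        # report segments currently on the sweepline that qualify against top
        for s in D:
            if s['rel'] != top['rel'] and within(s, top, d):
                results.append((s, top))
        D.append(top)
    return results
```

A real implementation would replace the list D with a quad-tree, oct-tree, or interval skip-list, and the predicate with the exact CPA distance test.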
In the case of the CPA join, assuming that all moving objects at any given moment
can be stored in main memory, any of a number of data structures can be used to
implement D, such as a quad- or oct-tree, or an interval skip-list [81]. The main
requirement is that the selected data structure make it easy to check the proximity
of objects in space.
3.4.2 Problem With The Basic Approach
Although the plane-sweep approach is simple, in practice it is usually too slow to
be useful for processing moving object queries. The problem has to do with how the
sweepline progression takes place. As the sweepline moves through the data space, it has
to stop momentarily at sample points (time instances at which object positions were
recorded) to process newly encountered segments into the data structure D. New segments
Figure 3-8. Layered Plane-Sweep
that are encountered at the sample point are added into the data structure and segments
in D that are no longer active are deleted from it.
Consequently, the sweepline pauses more often when objects with high sampling rates
are present, and the progress of the sweepline is heavily influenced by the sampling rates
of the underlying objects. For example, consider Figure 3-7 which shows the trajectory
of four objects in a given time period. In the case illustrated, object p2 controls the
progression of the sweepline. Observe that in the time-interval [tstart, tend], only new
segments from object p2 get added to D but expensive join computations are performed
each time with the same set of line segments.
The net result is that if the sampling rate of a data set is very high relative to the
amount of object movement in the data set, then processing a multi-gigabyte object
history using a simple plane-sweeping algorithm may take a prohibitively long time.
3.4.3 Layered Plane-Sweep
One way to address this problem is to reduce the number of segment level comparisons
by comparing the regions of movement of various objects at a coarser level. For example,
reconsider the CPA join depicted in Figure 3-7. If we were to replace the many oscillations
of object p2 with a single minimum bounding rectangle which enclosed all of those
oscillations from tstart to tend, we could then use that rectangle during the plane-sweep
as an initial approximation to the path of object p2. This would potentially save many
distance computations.
This idea can be taken to its natural conclusion by constructing a minimum bounding
box that encompasses the line-segments of each object. A plane-sweep is then performed
over the bounding boxes, and only qualifying boxes are expanded further. We refer to this
technique as the Layered Plane-Sweep approach since plane-sweep is performed at two
layers – one at a coarser level of bounding boxes and then at the finer level of individual
line segments.
One issue that must be considered is how much movement is to be summarized
within the bounding rectangle for each object. Since we would like to eliminate as many
comparisons as possible, one natural choice would be to let the available system memory
dictate how much movement is covered for each object. Given a fixed buffer size, the
algorithm will proceed as follows.
Algorithm 3 LayeredPlaneSweep(Relation P , Relation Q, distance d)
1: Segment s defined by [(xstart, xend), (ystart, yend), (zstart, zend), (tstart, tend)]
2: Assume a sorted list of object segments (by tstart) on disk
3: while there is still some unprocessed data do
4: Read in enough data from P and Q to fill the buffer
5: Let tstart be the first time tick which has not yet been processed by the plane-sweep
6: Let tend be the last time tick for which no data is still on disk
7: Bound the trajectory of every object present in the buffer by an MBR
8: Sort the MBRs along one of the spatial dimensions and then perform a plane-sweep along that dimension
9: Expand the qualifying MBR pairs to get the actual trajectory data (line segments)
10: Sort the line segments by tstart
11: Perform a final sweep along the time dimension to get the final result set
12: end while
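As a concrete illustration, the two-level filtering at the heart of the layered plane-sweep can be sketched in Python. This is a simplified sketch, not the implementation used in this chapter: the sweeps are replaced by nested loops for brevity, trajectories are 2-D, and `seg_dist`, which stands in for the segment-level CPA distance test, is supplied by the caller.

```python
from itertools import product

def mbr(segments):
    """Minimum bounding rectangle of a list of ((x1, y1), (x2, y2)) segments."""
    xs = [x for (x1, y1), (x2, y2) in segments for x in (x1, x2)]
    ys = [y for (x1, y1), (x2, y2) in segments for y in (y1, y2)]
    return (min(xs), min(ys), max(xs), max(ys))

def box_dist(a, b):
    """Minimum distance between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def layered_join(P, Q, d, seg_dist):
    """P, Q: dicts mapping object id -> list of trajectory segments.
    seg_dist: segment-level distance test (stands in for the CPA check)."""
    boxes_p = {p: mbr(s) for p, s in P.items()}
    boxes_q = {q: mbr(s) for q, s in Q.items()}
    result = []
    # Level 1: compare only the coarse bounding boxes.
    for p, q in product(P, Q):
        if box_dist(boxes_p[p], boxes_q[q]) > d:
            continue  # whole trajectories cannot come within d -- pruned
        # Level 2: expand qualifying pairs to individual segments.
        if any(seg_dist(sp, sq) <= d for sp in P[p] for sq in Q[q]):
            result.append((p, q))
    return result
```

The point of the sketch is the control flow: a cheap box-level test dismisses most pairs, and only the surviving pairs pay for segment-level distance computations.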
Figure 3-8 illustrates the idea. It depicts a snapshot of object trajectories starting
at some time instant tstart. Segments in the interval [tstart, tend] represent the maximum
that can be buffered in the available memory. A first-level plane-sweep is carried out over
the bounding boxes to prune pairs of objects that cannot qualify. Qualifying pairs are
expanded and a second-level plane-sweep is carried out over the individual line segments.
In the best case, there is an opportunity to process the entire data set through just three
comparisons at the MBR level.
Figure 3-9. Problem with using large granularities for bounding box approximation
3.5 Adaptive Plane-Sweeping
While the layered plane-sweep typically performs far better than the basic plane-sweep
algorithm, it may not always choose the proper level of granularity for the bounding box
approximations. This Section describes an adaptive strategy that carefully considers the
underlying object-interaction dynamics and adjusts the granularity dynamically in
response to the characteristics of the data.
3.5.1 Motivation
In the simple layered plane-sweep, the granularity for the bounding box approximation
is always dictated by the available system memory. The assumption is that pruning power
increases monotonically with increasing granularity. Unfortunately, this is not always the
case. As a motivating example, consider Figure 3-9. Assume available system memory
allows us to buffer all the line segments. In this case, the layered plane-sweep performs
no better than the basic plane-sweep, because all the object bounding boxes overlap with
one another and, as a result, no pruning is achieved at the first-level plane-sweep.
However, assume we had instead fixed the granularity to correspond to the time
period [tstart, ti], as depicted in Figure 3-10. In this case, none of the bounding boxes
overlap, and there are possibly many dismissals at the first level. Though less of the
buffer is processed initially, we are able to eliminate many of the segment-level distance
comparisons compared to a technique that bounds the entire time period, thereby
potentially increasing the efficiency of the algorithm. The entire buffer can then be
processed in a piece-by-piece fashion, as depicted in Figure 3-10. In general, the efficiency
of the layered plane-sweep is tied not to the granularity of the time interval that is
processed, but to the granularity that minimizes the number of distance comparisons.
3.5.2 Cost Associated With a Given Granularity
Since distance computations dominate the time required to compute a typical CPA
Join, the cost associated with a given granularity can be approximated as a function of the
number of distance comparisons that are needed to process the segments encompassed in
that granularity. Let nMBR be the number of distance computations at the box-level, let
nseg be the number of distance calculations at the segment-level, and let α be the fraction
of the time range in the buffer which is processed at once. Then the cost associated with
processing that fraction of the buffer can be estimated as:
costα = (nseg + nMBR) × (1/α)
This function reflects the fact that if we choose a very small value for α, we will have to
process many cut-points in order to process the entire buffer, which can increase the cost
of the join. As α shrinks, the algorithm becomes equivalent to the traditional plane-sweep.
On the other hand, choosing a very large value for α tends to increase (nseg + nMBR),
eventually yielding an algorithm which is equivalent to the simple, layered plane-sweep. In
practice, the optimal value for α lies somewhere in between the two extremes, and varies
from data set to data set.
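Reading the cost model as the per-pass comparison count scaled by the roughly 1/α passes needed to drain the buffer, it amounts to a one-line helper. The sketch below uses the text's notation; `n_mbr` and `n_seg` are the comparison counts observed (or estimated) at the candidate granularity.

```python
def cost(n_mbr, n_seg, alpha):
    """Estimated distance computations to process the whole buffer when a
    fraction alpha of its time range is handled per pass: roughly 1/alpha
    passes, each costing about (n_seg + n_mbr) comparisons."""
    assert 0.0 < alpha <= 1.0
    return (n_seg + n_mbr) / alpha
```

A small alpha keeps each pass cheap but multiplies the number of passes; a large alpha means one pass, but with inflated comparison counts.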
3.5.3 The Basic Adaptive Plane-Sweep
Figure 3-10. Adaptively varying the granularity
Given this cost function, it is easy to design a greedy plane-sweep algorithm that
repeatedly attempts to minimize costα in order to adapt to the underlying (and
potentially time-varying) characteristics of the data. At every iteration, the algorithm
simply chooses to process the fraction of the buffer that appears to minimize the overall
cost of the plane-sweep in terms of the expected number of distance computations. The
algorithm is given below:
Algorithm 4 AdaptivePlaneSweep(Relation P , Relation Q, distance d)
1: while there is still some unprocessed data do
2: Read in enough data from P and Q to fill the buffer
3: Let tstart be the first time tick which has not yet been processed by the plane-sweep
4: Let tend be the last time tick for which no data is still on disk
5: Choose α so as to minimize costα
6: Perform a layered plane-sweep from time tstart to tstart + α × (tend − tstart) {steps 5-9 of procedure LayeredPlaneSweep}
7: end while
Unfortunately, there are two obvious difficulties involved with actually implementing
the above algorithm:
• First, the cost costα associated with a given granularity is known only after the layered plane-sweep has been executed at that granularity.
• Second, even if we can compute costα easily, it is not obvious how we can compute costα for all values of α from 0 to 1 so as to minimize costα over all α.
These two issues are discussed in detail in the next two Sections.
3.5.4 Estimating Costα
This Section describes how to efficiently estimate costα for a given α using a simple,
online sampling algorithm reminiscent of the algorithm of Hellerstein and Haas [82].
At a high level, the idea is as follows. To estimate costα, we begin by constructing
bounding rectangles for all of the objects in P , considering their trajectories from time
tstart to tstart + α(tend − tstart). These rectangles are then inserted into an in-memory index, just as
if we were going to perform a layered plane-sweep. Next, we randomly choose an object q1
from Q, and construct a bounding box for its trajectory as well. This object is joined with
all of the objects in P by using the in-memory index to find all bounding boxes within
distance d of q1. Then:
• Let nMBR,q1 be the number of distance computations needed by the index to compute which objects from P have bounding rectangles within distance d of the bounding rectangle for q1, and
• Let nseg,q1 be the total number of distance computations that would have been needed to compute the CPA distance between q1 and every object p ∈ P whose bounding rectangle is within distance d of the bounding rectangle for q1 (this can be computed efficiently by performing a plane-sweep without actually performing the required distance computations).
Once nMBR,q1 and nseg,q1 have been computed for q1, the process can be repeated for a
second randomly selected object q2 ∈ Q, for a third object q3 and so on. A key observation
is that after m objects from Q have been processed, the value

µm = (1/m) · Σi=1..m (nMBR,qi + nseg,qi) · |Q|

represents an unbiased estimator for (nMBR + nseg) at α, where |Q| denotes the number of
data objects in Q.
In practice, however, we are not only interested in µm. We would also like to know at
all times just how accurate our estimate µm is, since at the point where we are satisfied
with our guess as to the real value of costα, we want to stop the process of estimating
costα and continue with the join.
Fortunately, the central limit theorem can easily be used to estimate the accuracy
of µm. Assuming sampling with replacement from Q, for large enough m the error of our
estimate will be normally distributed around (nMBR + nseg) with variance σ²m = σ²(Q)/m,
where σ²(Q) is defined as

σ²(Q) = (1/|Q|) · Σi=1..|Q| {(nMBR,qi + nseg,qi) · |Q| − (nMBR + nseg)}²

Since in practice we cannot know σ²(Q), it must be estimated via the expression

σ̂²(Qm) = (1/(m − 1)) · Σi=1..m {(nMBR,qi + nseg,qi) · |Q| − µm}²

(Qm denotes the sample of Q that is obtained after m objects have been randomly
retrieved from Q). Substituting into the expression for σ²m, we can treat µm as a normally
distributed random variable with variance σ̂²(Qm)/m.
In our implementation of the adaptive plane-sweep, we continue the sampling
process until our estimate for costα is accurate to within ±10% at 95% confidence. Since
95% of the standard normal distribution falls within two standard deviations of the mean,
this is equivalent to sampling until 2 · √(σ̂²(Qm)/m) is less than µm × 0.1.
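A minimal sketch of this estimator follows, with the per-object work abstracted into a caller-supplied function. Here `per_object_cost` is a hypothetical stand-in for the index probe and plane-sweep counting described above, and the z = 2 stopping rule mirrors the ±10% at 95% confidence criterion; the `min_m` floor and hard cap are illustrative safeguards, not from the dissertation.

```python
import random

def estimate_cost(Q_ids, per_object_cost, rel_err=0.1, z=2.0, min_m=30):
    """Online estimate of (n_MBR + n_seg) for one candidate granularity.
    per_object_cost(q) returns n_MBR_q + n_seg_q for a sampled object q.
    Sampling (with replacement) stops once the z-sigma error bound drops
    below rel_err * mean, or after a hard cap of 10*|Q| samples."""
    n = len(Q_ids)
    samples = []
    while True:
        q = random.choice(Q_ids)                # sample with replacement from Q
        samples.append(per_object_cost(q) * n)  # scale up by |Q| (unbiased)
        m = len(samples)
        mu = sum(samples) / m
        if m >= min_m:
            var = sum((x - mu) ** 2 for x in samples) / (m - 1)
            if (var / m) ** 0.5 * z <= rel_err * mu or m >= 10 * n:
                return mu
```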
3.5.5 Determining The Best Cost
We now address the second issue: how to compute costα for all values of α from 0 to
1 so as to minimize costα over all α.
Calculating costα for all possible values of α is prohibitively expensive and hence not
feasible in practice. Fortunately, we do not have to evaluate every value of α to determine
the best one. This is due to the following interesting fact: if we plot all possible values of
α against their respective associated costs, the resulting curve is not linear, but exhibits
a certain convexity. The region around the minimum of this convex curve represents a
sweet spot, and constitutes the feasible region for the best cost.
As an example, consider Figure 3-11, which shows the plot of the cost function for
various fractions α for one of the experimental data sets from Section 3.6.
Figure 3-11. Convexity of cost function illustration
Given this fact, we identify the feasible region by evaluating costαi for a small number k
of αi values. Given k, the number of allowed cut points, the fraction α1 can be determined
as follows:

α1 = r^(1/k) / r

where r = (tend − tstart) is the time range described by the buffer (the above formula
assumes that r is greater than one; if not, then the time range should be scaled accordingly).
In the general case, the fraction of the buffer considered by any αi (1 ≤ i ≤ k) is given by:

αi = (r · α1)^i / r
Note that since the growth rate of each subsequent αi is exponential, we can cover
the entire buffer with just a small k and still guarantee that we will consider some value
of αi that is within a factor of α1 of the optimal. After computing α1, α2, . . . , αk, we
successively evaluate these increasing buffer fractions and determine their associated
costs. From these k costs we determine the αi with the minimum cost.
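The geometric schedule of candidate granularities can be sketched as follows (names are illustrative; `cost_of` would be the sampling-based estimate of costα described in Section 3.5.4):

```python
def candidate_alphas(t_start, t_end, k):
    """Geometric schedule of k candidate granularities: alpha_i = r**(i/k) / r,
    so each candidate covers an exponentially longer prefix of the buffer
    and alpha_k = 1 covers the entire buffered time range."""
    r = t_end - t_start
    assert r > 1, "the text's formula assumes r > 1; otherwise rescale the range"
    return [r ** (i / k) / r for i in range(1, k + 1)]

def best_alpha(alphas, cost_of):
    """Evaluate the (estimated) cost of each candidate; keep the cheapest."""
    return min(alphas, key=cost_of)
```

For example, with r = 16 and k = 4 the candidates are 1/8, 1/4, 1/2, and 1 of the buffered time range.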
Note that if we choose α based on the evaluation of a small k, then it is possible that
the optimal choice of α may lie outside the feasible region. However, there is a simple
approach to solving this issue. After an initial evaluation of k granularities, consider
just the region starting before and ending after the best αi, and recursively reapply the
evaluation described above just in this region.
45
α3α2
α4 α5
α5
α3
tj
ti
tendtstart
α5
mincost
mincost
mincost
α3α2α1
α4
α4
α1
α1
α2
Figure 3-12. Iteratively evaluating k cut points
For instance, assume we chose αi after evaluation of k cut points in the time range r.
To further tune this αi, we consider the time range defined between the adjacent cut points
αi−1 and αi+1 and recursively apply cost estimation in this interval (i.e., evaluate k points
in the time range (tstart + αi−1 × r, tstart + αi+1 × r)). Figure 3-12 illustrates the idea.
This simple approach is very effective at considering a large number of choices of α.
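One level of this refinement can be sketched as follows. This is a simplified sketch: the candidates inside the sub-interval are spaced evenly here, whereas the chapter's schedule spaces them geometrically, and `cost_of` again stands in for the sampling-based cost estimate.

```python
def refine(alpha_lo, alpha_hi, k, cost_of, depth=1):
    """Recursively zoom in on the best cut point: evaluate k candidates
    inside (alpha_lo, alpha_hi), then repeat inside the interval between
    the neighbors of the winner until the allowed depth is exhausted."""
    step = (alpha_hi - alpha_lo) / (k + 1)
    cands = [alpha_lo + step * i for i in range(1, k + 1)]
    best = min(cands, key=cost_of)
    if depth <= 0:
        return best
    i = cands.index(best)
    lo = cands[i - 1] if i > 0 else alpha_lo      # adjacent cut point below
    hi = cands[i + 1] if i < k - 1 else alpha_hi  # adjacent cut point above
    return refine(lo, hi, k, cost_of, depth - 1)
```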
3.5.6 Speeding Up the Estimation
Restricting the number of candidate cut points can help keep the time required to
find a suitable value for α manageable. However, if the estimation is not implemented
carefully, the time required to consider the cost at each of the k possible time periods can
still be significant.
The most obvious method for estimating costα for each of the k granularities would
be to simply loop through each of the associated time periods. For each time period,
we would build bounding boxes around each of the trajectories of the objects in P , and
then sample objects from Q as described in Section 3.5 until the cost was estimated with
sufficient accuracy.
However, this simple algorithm results in a good deal of repeated work for each time
period, and can actually decrease the overall speed of the adaptive plane-sweep compared
to the layered plane-sweep. A more intelligent implementation can speed the optimization
process considerably.
In our implementation, we maintain a table of all the objects in P and Q, organized
on the ID of each object. Each entry in the table points to a linked list that contains a
chain of MBRs for the associated object. Each MBR in the list bounds the trajectory
of the object for one of the k time-periods considered during the optimization, and the
MBRs in each list are sorted from the coarsest of the k granularities to the finest. The
data structure is depicted in Figure 3-13.
Given this structure, we can estimate costα for each of the k values of α in
parallel, with only a small hit in performance associated with an increased value for k.
Any object pair (p ∈ P, q ∈ Q) that needs to be evaluated during the sampling process
described in Section 3.5 is first evaluated at the coarsest granularity corresponding to αk.
If the two MBRs are within distance d of one another, then the cost estimate for αk is
updated, and the evaluation is then repeated at the second coarsest granularity αk−1. If
there is again a match, then the cost estimate for αk−1 is updated as well. The process is
repeated until there is not a match. As soon as we find a granularity at which the MBRs
for p and q are not within a distance d of one another, we can stop the process,
because if the MBRs for p and q are not within distance d for the time period associated
with αi, then they cannot be within this distance for any time period αj where j < i.
The benefit of this approach is that in cases where the data are well-behaved and
the optimization process tends to choose a value for α that causes the entire buffer to be
processed at once, a quick check of the distance between the outer-most MBRs of p and q
is the only geometric computation needed to process p and q, no matter what value of k is
chosen.
The bounding box approximations themselves can be formed while the system buffer
is being filled with data from disk. As trajectory data are read from disk, we grow
the MBRs for each αi progressively. Since each αi represents a fraction of the buffer, the
updates to its MBR can be stopped as soon as that fraction of the buffer has been
filled. Similar logic can be used to shrink the MBRs when some fraction of the buffer is
consumed and to expand them when the buffer is refilled.
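The coarse-to-fine early-termination check can be sketched as follows (illustrative: `box_dist` is a caller-supplied minimum distance between boxes, and each chain is the per-object list of MBRs described above, ordered from αk down to α1):

```python
def matched_granularities(chain_p, chain_q, d, box_dist):
    """Evaluate an object pair against its MBR chains, sorted coarsest
    (alpha_k) to finest (alpha_1). Returns the number of granularities at
    which the pair is within distance d, stopping at the first miss: a
    finer-granularity MBR is contained in every coarser one, so once a
    coarser box pair is farther than d apart, every finer pair is too."""
    hits = 0
    for box_p, box_q in zip(chain_p, chain_q):
        if box_dist(box_p, box_q) > d:
            break  # no finer granularity can match; stop early
        hits += 1  # in the real optimizer, update this alpha's cost estimate
    return hits
```

In the well-behaved case described in the text, the very first (coarsest) comparison misses, and the pair is dismissed with a single geometric computation regardless of k.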
Figure 3-13. Speeding up the Optimizer
3.5.7 Putting It All Together
In our implementation of the adaptive plane-sweep, data are fetched in blocks
and stored in the system buffer. An optimizer routine is then called, which evaluates
k granularities and returns the granularity α with the minimum cost. Data in the
granularity chosen by the optimizer are then evaluated using the LayeredPlaneSweep
routine (described in Section 3.4). When the LayeredPlaneSweep routine returns,
the buffer is refilled and the process is repeated. The techniques described in the previous
Section are utilized to make the optimizer implementation fast and efficient.
3.6 Benchmarking
This section presents experimental results comparing the various methods discussed so
far for computing a spatiotemporal CPA Join: an R-tree, a simple plane-sweep, a layered
plane-sweep, and an adaptive plane-sweep with several parameter settings. The Section is
organized as follows. First, a description of the three three-dimensional temporal data sets
used to test the algorithms is given. This is followed by the actual experimental results
and a detailed discussion analyzing the experimental data.
3.6.1 Test Data Sets
The first two data sets that we use to test the various algorithms result from two
physics-based, N -body simulations. In both data sets, constituent records occupy 80B on
disk (80B is the storage required to record the object ID, time information, as well as the
position and motion of the object). Each data set is around 50 gigabytes in size.
Figure 3-14. Injection data set at time tick 2,650
The two data sets are as follows:
1. The Injection data set. This data set is the result of a simulation of the injection of two gasses into a chamber through two nozzles on opposite sides of the chamber, via the depression of pistons behind each of the nozzles. Each gas cloud is treated as one of the input relations to the CPA Join. In addition to heat energy transmitted to the gas particles via the depression of the pistons, the gas particles also have an attractive charge. The purpose of the join is to determine the speed of the reaction resulting from the injection of the gasses into the chamber, by determining the number of (near) collisions of the gas particles moving through the chamber. Both data sets consist of 100,000 particles, and the positions of the particles are sampled at 3,500 individual time ticks, resulting in two relations that are around 28 gigabytes in size each. During the first 2,500 time ticks, for the most part both gasses are simply compressed in their respective cylinders. After tick 2,500, the majority of the particles begin to be ejected from the two nozzles. A small sample of the particles in the data set is depicted in Figure 3-14, at time tick 2,650.
2. The Collision data set. This data set is the result of an N -body gravitational simulation of the collision of two small galaxies. Again, both galaxies contain around 100,000 star systems, and the positions of the systems in each galaxy are polled at 3,000 different time ticks. The size of the relations tracking each galaxy is around 24 gigabytes each. For the first 1,500 or so time ticks, the two galaxies merely approach one another. For the next thousand time ticks, there is an intense interaction as they pass through one another. During the last few hundred time ticks, there is less interaction as the two galaxies have largely gone through one another. The purpose of the CPA Join is to find pairs of star systems that approached closely enough to have a strong gravitational interaction. A small sample of the galaxies in the simulation is depicted in Figure 3-15, at time tick 1,500.
Figure 3-15. Collision data set at time tick 1,500
In addition, we test a third data set created using a simple, 3-dimensional random walk.
We call this the Synthetic data set (this data set was again about 50GB in size). The
speed of the various objects varies considerably during the walk. The purpose of including
this data set is to rigorously test the adaptability of the adaptive plane-sweep, by creating
a synthetic data set in which there are significant fluctuations in the amount of interaction
among objects as a function of time.
3.6.2 Methodology and Results
All experiments were conducted on a 2.4GHz Intel Xeon PC with 1GB of RAM. The
experimental data sets were each stored on an 80GB, 15,000 RPM Seagate SCSI disk.
For all three of the data sets, we tested an R-tree-based CPA Join (implemented as
described in Section 3.3; we used the STR R-tree packing algorithm [80] to construct an
R-tree for each input relation), a simple plane-sweep, and a layered plane-sweep
(implemented as described in Section 3.4).
We also tested the adaptive plane-sweep algorithm, implemented as described
in Section 3.5. For the adaptive plane-sweep, we also wanted to test the effect of the
two relevant parameter settings on the efficiency of the algorithm. These settings are
the number of cut points k considered at each level of the optimization performed by
the algorithm, as well as the number of recursive calls made to the optimizer. In our
experiments, we used k values of 5, 10, and 20, and we tested using either a single
recursive call or no recursive calls to the optimizer.
Figure 3-16. Injection data set experimental results
The results of our experiments are plotted in Figures 3-16 through 3-21.
Figures 3-16, 3-17, and 3-20 show the progress of the various algorithms as a function
of time, for each of the three data sets (only Figure 3-16 depicts the running time of the
adaptive plane-sweep making use of a recursive call to the optimizer). For the various
plane-sweep-based joins, the x-axis of the two plots shows the percentage of the join that
has been completed, while the y-axis shows the wall-clock time required to reach that
point in the completion of the join. For the R-tree-based join (which does not progress
through virtual time in a linear fashion) the x-axis shows the fraction of the MBR-MBR
pairs that have been evaluated at each particular wall-clock time instant. These values are
normalized so that they are comparable with the progress of the plane-sweep-based joins.
Figure 3-17. Collision data set experimental results
Figures 3-18, 3-19, and 3-21 show the buffer-size choices made by the adaptive
plane-sweeping algorithm using k = 20 and no recursive calls to the optimizer, as a
function of time for all three test data sets.
3.6.3 Discussion
On all three data sets, the R-tree was clearly the worst option. The R-tree indices
were not able to meaningfully restrict the number of leaf-level pairs that needed to be
expanded during join processing. This results in a huge number of segment pairs that
must be evaluated. It may have been possible to reduce this cost by using a smaller page
size (we used 64KB pages, a reasonable choice for a modern 15,000 RPM hard disk with a
fast sequential transfer rate), but reducing the page size is a double-edged sword. While it
may increase the pruning power in the index and thus reduce the number of comparisons,
it may also increase the number of random I/Os required to process the join, since there
will be more pages in the structure. Unfortunately, however, it is not possible to know
the optimal page size until after the index is created and the join has been run, a clear
weakness of the R-tree.
Figure 3-18. Buffer size choices for Injection data set
Figure 3-19. Buffer size choices for Collision data set
The standard plane-sweep and the layered plane-sweep performed comparably on the
three data sets we tested, and both far outperformed the R-tree. It is interesting to note
that the standard plane-sweep performed well when there was much interaction among
the input relations (when the gasses are expelled from the nozzles in the Injection data
set and when the two galaxies overlap in the Collision data set). During such periods
it makes sense to consider only very small time periods in order to reduce the number
of comparisons, leading to good efficiency for the standard plane-sweep. On the other
hand, during time periods when there was relatively little interaction between the input
relations, the layered plane-sweep performed far better because it was able to process large
time-periods at once. Even when the objects in the input relations have very long paths
during such periods, the input data were isolated enough that there tends to be little cost
associated with checking these paths for proximity during the first level of the layered
plane-sweep.
Figure 3-20. Synthetic data set experimental results
The adaptive plane-sweep was the best option by far for all three data sets,
and was able to smoothly transition from periods of low to high activity in the data
and back again, effectively picking the best granularity at which to compare the paths of
the objects in the two input relations.
From the graphs, we can see that the cost of performing the optimization causes the
adaptive approach to be slightly slower than the non-adaptive approach when optimization
is ineffective. In both data sets, this happens in the beginning, when the objects
are moving towards each other but are still far enough apart that no interaction takes
place. As expected, in both experiments, adaptivity begins to take effect when the objects
in the underlying data set start interacting. From Figures 3-18, 3-19, and 3-21 it can be
seen that the buffer-size choices made by the adaptive plane-sweep are very finely tuned to
the underlying object-interaction dynamics (decreasing with increasing interaction and
vice versa). In both the Injection and Collision data sets, the size of the buffer falls
dramatically just as the amount of interaction between the input relations increases. In
the Synthetic data set, the oscillations in buffer usage depicted in Figure 3-21 mimic
almost exactly the energy of the data as the objects perform their random walk.
The graphs also show the impact of varying the parameters to the adaptive
plane-sweep routine, namely, the number of cut points k considered at each level of
the optimization, and whether or not a chosen granularity is refined through recursive
calls to the optimizer. It is surprising to note that varying these parameters causes no
significant changes in the granularity choices made by the optimizer. The reason is that
with increasing interaction in the underlying data set, the optimizer has a preference
towards smaller granularities, and these granularities are naturally considered in more
detail due to the logarithmic way in which the search space is partitioned.
Another interesting observation is that the recursive call does not improve the
performance of the algorithm, for two reasons. First, since each invocation of the
optimizer is a separate attempt to find the best cut point in a different time range,
it is not possible to share work among the recursive calls. Second, it is likely that just
being in the feasible region, or at least a region close to it, is enough to enjoy significant
performance gains. Since the coarse first-level optimization already achieves that, further
optimization in terms of recursive calls to tune the chosen granularity does not seem to be
necessary.
In all of our experiments, k = 5 with no recursive call to the optimizer uniformly
gave the best performance. However, if the nature of the input data set is unknown
and the data may be extremely poorly behaved, then we believe a choice of k = 10
with one recursive call may be a safer, all-purpose choice. On one hand, the cost of
optimization will be increased, which may lead to a greater execution time in most cases
(our experiments showed about a 30% performance hit associated with using k = 10 and
one recursive call compared to k = 5). However, the benefit of this choice is that it is
highly unlikely that such a combination would miss the optimal buffer choice in a very
difficult scenario with a highly skewed data set.
3.7 Related Work
To our knowledge, the only prior work which has considered the problem of
computing joins over moving object histories is the work of Jeong et al. [51]. However,
their paper considers the basic problem at a high level. The algorithmic and implementation
issues addressed by our own work were not considered.
Figure 3-21. Buffer size choices for Synthetic data set
Though little work has been reported on spatiotemporal joins, there has been a
wealth of research in the area of spatial joins. The classical paper on spatial joins is due
to Brinkhoff, Kriegel, and Seeger [45] and is based on the R-tree index structure. An
improvement of this work was given by Huang et al. [46]. Hash-based spatial join strategies
have been suggested by Lo and Ravishankar [47], and Patel and DeWitt [48]. Arge et al.
[49] proposed a plane-sweep approach to address the spatial join problem in the absence of
underlying indexes.
Within the context of moving objects, research has been focused on two main areas:
predictive queries, and historical queries. Within this taxonomy, our work falls in the
latter category. In predictive queries, the focus is on the future position of the objects
and only a limited time window of the object positions need to be maintained. On the
other hand, for historical queries, the interest is on efficient retrieval of past history and
usually the index structure maintains the entire timeline of an object’s history. Due to
these divergent requirements, index structures designed for predictive queries are usually
not suitable for historical queries.
A number of index structures have been proposed to support predictive and historical
queries efficiently. These structures are generally geared towards efficiently answering
selection or window queries and do not study the problem of joins involving multi-gigabyte
object histories addressed by our work.
Index structures for historical queries include the 3D R-tree [20], spatiotemporal
R-trees and TB-trees (trajectory-bounding trees) [21], and linear quadtrees [22]. A technique
based on space partitioning is reported in [18]. For predictive queries, Saltenis et al. [17]
proposed the TPR-tree (time-parametrized R-tree) which indexes the current and
predicted future positions of moving point objects. They mention the sensitivity of
the bounding boxes in the R-tree to object velocities. An improvement of the TPR-tree
can be found in [78]. In [76], a framework to cover time-parametrized versions of spatial
queries by reducing them to the nearest-neighbor search problem has been suggested. In [23],
an indexing technique is proposed in which trajectories in a d-dimensional space are mapped
to points in a higher-dimensional space and then indexed. In [75], the authors propose a
framework called SINA in which continuous spatiotemporal queries are abstracted as a
spatial join involving moving objects and moving queries. An overview of different access
structures can be found in [83].
3.8 Summary
This chapter explored the challenging problem of computing joins over massive
moving object histories. We compared and evaluated obvious join strategies and
then described a novel join technique based on an extension to the plane-sweep. The
benchmarking results suggest that the proposed adaptive technique offers significant
benefits over the competing techniques.
CHAPTER 4
ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES
Sensor networks are steadily growing larger and more complex, as well as more
practical for real-world applications. It is not hard to imagine a time in the near future
where huge networks will be widely deployed. A key data mining challenge will be making
sense of all of the data that those sensors produce.
In this chapter, we study a novel variation of the classic entity resolution problem
that appears in sensor network applications. In entity resolution, the goal is to determine
whether or not various bits of data pertain to the same object. For example, two records
from different data sources that have been loaded into a database may reference the names
“Joe Johnson” and “Joseph Johnson”. Entity resolution methodologies may be useful in
determining whether these records refer to the same person.
This problem also appears in sensor networks. A large number of sensors may all
produce data relating to the same object or event, and it becomes necessary to be able
to determine when this is the case. Unfortunately, the problem is exceptionally difficult
in a sensor application, for two primary reasons. First, sensor data is often not as rich
as classical, relational data, and often gives far fewer clues as to when two observations
correspond to the same object. The most extreme case is a simple motion sensor, which
will simply report a “yes” indicating that motion was sensed, along with a timestamp and
an approximate location. This provides very little information to make use of during the
resolution process. Second, there is the large number of data sources. The largest sensor
networks in existence today already contain on the order of one thousand sensors. This
goes far beyond what one might expect in a classical entity resolution application.
In this chapter, we consider a specific version of the entity resolution problem
that appears in sensor networks, where the goal is to monitor a spatial area in order
to understand the motion of objects through the area. Given a large database of
spatiotemporal sensor observations that consist of (location, timestamp) pairs, our
goal is to perform an accurate segmentation of all of the observations into sets, where each
set is associated with one object. Each set should also be annotated with the path of the
object through the area.
We develop a statistical, learning-based approach to solving the problem. We first
learn a model for the motion of the various objects through the spatial area, and then
associate sensor observations with the objects via a maximum likelihood procedure. The
major technical difficulty lies in using the sensor observations to learn the spatial motion
of the objects. A key aspect of our approach is that we make use of two different motion
models to develop a reasonable solution to this problem: a restricted motion model that is
easy to learn and yet is only applicable to smooth motion over small time periods, and a
more general model that takes as input the initial motion learned via the restricted model.
Some specific contributions of this work are as follows:
1. A unique expectation-maximization (EM) algorithm that is suitable for learning associations of spatiotemporal, moving object data is described. This algorithm allows us to recognize quadratic (fixed acceleration) motion in a large set of sensor observations.
2. We apply and extend the method of Bayesian filters for recognizing unrestricted motion to the case when a large number of interacting objects produce data, and it is not clear which observation corresponds to which object.
3. Experimental results show that the proposed method can accurately perform resolution over more than one hundred simultaneously moving objects, even when the number of moving objects is not known beforehand.
The remainder of this chapter is organized as follows: In the next section, we
state the problem formally and give an overview of our approach. We then describe
the generative model and define the PDFs for the restricted and unrestricted motion.
This is followed by a detailed description of the learning algorithms in section 4.4.
An experimental evaluation of the algorithms is given in section 4.5 followed by the
conclusion.
4.1 Problem Definition
Consider a large database of sensor observations associated with K objects moving
through some monitored area. Each observation is a potentially inaccurate ordered pair
of the form (x, t), where x is the position vector and t is the time instant of the observation.
Given a set of object IDs O = {o1, o2, ..., oK}, the entity resolution problem that we
consider consists of associating with each observation a label oi ∈ O such that all sensor
observations labeled oi were produced by the same object. That is, we are partitioning the
observations into K classes, where each class of observations represents the path of
a moving object on the field.
Figure 4-1. Mapping of a set of observations for linear motion
As an example, consider the set of observations {(2,8,0), (9,9,0), (4,11,1), (11,7,1),
(6,14,2), (13,5,2), (8,17,3), (15,3,3)} shown in Figure 4-1(a). Given that the underlying
motion is linear and K = 2, Figure 4-1(b) shows a mapping of the observations to
objects. Observations {(2,8,0), (4,11,1), (6,14,2), (8,17,3)} are associated with object 1, and
observations {(9,9,0), (11,7,1), (13,5,2), (15,3,3)} are associated with object 2. Though in
this case the classification was easy, the problem in general is hard due to a number of
factors, including:
• Paths traced by real-life objects tend to be irregular and often cannot be approximated with simple curves.
• The measurement process is not accurate and often introduces error into the observations, which needs to be taken into account in classification.
• Objects can cross paths, or track one another’s paths for relatively long time periods, complicating both the segmentation and the problem of figuring out how many objects are under observation.
4.2 Outline of Our Approach
In order to solve this problem in the general case, we make use of a model-oriented
approach. We model the uncertainty inherent in the production of sensor observations by
assuming that the observations are produced by a generative, random process.
The location of an object moving through space is expressed as a function of time. As
an object wanders through the data space, it triggers sensors that generate observations
in a cloud around it. Our model assumes that an object “generates” sensor observations
in a cloud around it by sampling repeatedly from a Gaussian (multidimensional normal)
probability density function (PDF) centered at the current location of the object, which
we denote by fobs. This randomized generative process nicely models the uncertainty
inherent in any sort of sensor-based system. As the object moves through the field, it
tends to trigger sensors that are located close to it, but the probability that a sensor is
triggered falls off the further the sensor is located from the object.
Given such a model, if we know the exact path of each object through the field, it
is a simple matter to perform the required partitioning of sensor observations by
making use of the principle of maximum likelihood estimation (MLE). Using MLE, each
observation is simply associated with the object that was most likely to have produced
it – that is, it is assigned to the object whose fobs function has the greatest value at the
location returned by the sensor.
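As an illustration, this MLE assignment step can be sketched in Python. This is a minimal sketch, not the dissertation's implementation: it assumes 2-D observations, an isotropic covariance (σ²·I in place of a full Σobs), and hypothetical helper names.

```python
import math

def gaussian_pdf(x, mean, var):
    """Isotropic 2-D Gaussian density with covariance var * I."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)

def mle_assign(observation, t, paths, var):
    """Assign an (x, t) observation to the object whose f_obs is largest.

    `paths` maps an object id to a function giving that object's (known)
    position at time t; `var` is the assumed observation-scatter variance.
    """
    scores = {oid: gaussian_pdf(observation, path(t), var)
              for oid, path in paths.items()}
    return max(scores, key=scores.get)
```

For instance, with the two linear paths of the Figure 4-1 example, the observation (6, 14) at t = 2 lies exactly on object 1's path and is assigned to object 1.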
Of course, the difficulty is that in order to make use of such a simple MLE, we must
first have access to the parameters defining fobs and the motion of the various objects
through the data space. This requires learning the parameters and the underlying motion
which can be very difficult – particularly so in our case, since we are unsure of the number
of objects K that are producing the sensor observations.
Figure 4-2. Object path (a) and quadratic fit for varying time ticks (b-d)
One of the key aspects of our approach is that to make the learning process feasible,
we rely on two separate motion models: a restricted motion model that is used for only
the first few time ticks in order to recognize the number and initial motion of the various
objects, and an unrestricted motion model that takes this initial motion as input and
allows for arbitrary object motion. Given this, the following describes our overall process
for grouping sensor observations into objects:
1. First, we determine K and learn the set of model parameters governing object motion under the restricted model for the first few time ticks.
2. Next, we use K as well as the object positions at the end of the first few time ticks as input into a learning algorithm for the remainder of the timeline. The goal here is to learn how objects move under the more general, unrestricted motion model.
3. Finally, once all object trajectories have been learned for the complete timeline, each sensor observation is assigned to the object that was most likely to have produced it.
Of course, this basic approach requires that we address two very important technical
questions:
1. What exactly is our model for how objects move through the data space, and how do objects probabilistically generate observations around them?
2. How do we “learn” the parameters associated with the model, in order to tailor the general motion model to the specific objects that we are trying to model?
In the next section, we consider the answer to the first question. In section 4.4, we
consider the answer to the second question.
4.3 Generative Model
In this section, we define the generative model that we make use of in order to
perform the classification.
The high-level goal is to define the PDF fobs(x, t | θ). For a particular object, fobs takes
as input a sensor observation x and a time t, and gives the probability that the object
in question would have triggered x at time t. fobs is parametrized on the parameter
set θ, which describes how the object in question moves through the data space, and how
it tends to scatter sensor observations around it.
Before we describe the restricted and general motion models formally, it is worthwhile
to explain the need for two separate models. The parameter set θ governing the motion
of the object through the data space is unknown, and must be inferred from the observed
data. This is a difficult task. The problem is compounded when the observed data are
a set of mixed sensor observations generated by an unknown number of objects. Given
the difficulty inherent in learning θ, we choose to make the initial classification problem
simpler by allowing only a very restricted type of quadratic motion characterized by
constant acceleration.
Furthermore, object paths tend to be complex only over a relatively long time period.
That is, motion seems unrestricted only when we take a long-term view. This is illustrated
in Figure 4-2, where the initial quadratic approximations (for time periods [0–3] and
[0–6]) are faithful to the object’s actual path. As the time period extends and the object
has a chance to change its acceleration, a simple quadratic fit is no longer appropriate
(Figure 4-2(d)).
We take advantage of the fact that a simple motion model may be reasonable for
a short time period, and learn the initial parameters of the generative process by using
a restricted motion model over a small portion δt of the time line. Once the initial
Figure 4-3. Object path in a sensor field (a) and sensor firings triggered by object motion(b)
parameters are learned, we can make use of the unrestricted model for the remainder of the
timeline, since there will be fewer unknowns and the computational complexity is greatly
reduced.
4.3.1 PDF for Restricted Motion
We will now describe the PDF for observations assuming a restricted motion model
that is valid for short time periods. In this model, the location of an object is expressed as
a function of time in the form of a quadratic equation. The restricted model assumes that
acceleration is constant. The position of an object at some time instant t is specified by
the parametric equation:

pt = a·t² + v·t + p0
where p0 represents the initial position of the object, v the initial velocity, and a the
constant acceleration.
We define the probability of an observation x at time t by the PDF:
fobs(x, t | θ) = fN(x | Σobs, pt)

where

fN(x | Σ, p) = ( 1 / (2π|Σ|^(1/2)) ) exp( −(1/2)(x − p)^T Σ^(−1) (x − p) )

is a Gaussian PDF that models the cloud of sensors triggered by the object at time t.
Figure 4-3 shows a typical scenario of how observations are generated. The parameter set
θ contains:
• The object’s initial position p0, initial velocity v, and acceleration a.
• The covariance matrix Σobs specifying how the object produces sensor readings around itself.
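The restricted-model density can be sketched as follows. This is a 2-D illustration assuming a diagonal covariance Σobs = σ²·I (the model itself allows an arbitrary covariance matrix); the function names are hypothetical.

```python
import math

def position(t, p0, v, a):
    """Constant-acceleration position: p_t = a*t^2 + v*t + p0 (componentwise)."""
    return tuple(ai * t * t + vi * t + pi for ai, vi, pi in zip(a, v, p0))

def f_obs(x, t, p0, v, a, sigma2):
    """Restricted-model observation density: a Gaussian centred at p_t.

    sigma2 * I stands in for Sigma_obs, so the density factorises per axis.
    """
    pt = position(t, p0, v, a)
    d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, pt))
    return math.exp(-d2 / (2.0 * sigma2)) / (2.0 * math.pi * sigma2)
```

The density peaks at the predicted position p_t and falls off with squared distance, matching the sensor-cloud picture of Figure 4-3.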
4.3.2 PDF for Unrestricted Motion
While the restricted motion model may be applicable for a reasonably small time
period, the fact is that accelerations do not remain constant for long. Thus, we make use
of a second PDF, providing for more irregular motion, that can be used over longer time
periods. In the more general PDF, at each time tick an object moves not in a nice, smooth
trajectory, but instead moves through the data space in a completely random fashion.
Given that the object’s position in space at time t− 1 is pt−1, the object’s position at time
t is simply pt−1 + N , where N is a multidimensional, normally distributed random variable
parameterized on the covariance matrix Σmot. This random motion provides for a much
more general model.
One result of using such a very general model is that there is no longer a simple
equation for pt. Instead, pt has to be modeled by a random variable Pt, which depends on
the random variable Pt−1 for the object’s position at time t − 1, which in turn depends on
the random variable Pt−2. Thus, the likelihood of observing pt must be specified
via a recursive PDF, where fmot(pt|θ) depends on fmot(pt−1|θ):
fmot(pt | θ) = ∫ fN(pt | Σmot, pt−1) · fmot(pt−1 | θ) dpt−1
As we will discuss in Section 4.4, the fact that an object’s position is not specified
precisely by the parameter set θ and the recursive nature of the PDF make this motion
model much more difficult to deal with, which is why this more general motion model is
not used throughout the timeline.
Given the PDF describing the distribution of an object’s location at time t, it is
then possible to give a PDF specifying the probability that we observe a sensor reading
corresponding to the object at time t:
fobs(x, t | θ) = ∫ fN(x | Σobs, pt) · fmot(pt | θ) dpt
Thus, for the more general PDF, the parameter set θ contains:
• The covariance matrix Σmot specifying the object’s random motion.
• The object’s initial position p0.
• The covariance matrix Σobs specifying how the object produces sensor readings around itself.
Before we can actually map observations to individual objects, we must be able
to learn the two underlying models. As we will discuss in detail subsequently, the term
“learn” has a different meaning for each of the two models.
In the case of the restricted model, “learning” consists of computing the parameter
set θ for each object, as well as determining the number of objects. Since this is a classical
parameter estimation problem, we will make use of an MLE framework that will be solved
by an EM algorithm.
In the case of the unrestricted model, determining the parameter set is easy, in the
sense that once the parameter set of the restricted model has been learned, θ for the
unrestricted model is already fully determined (see Section 4.3.2). However, this does not
mean that use of the unrestricted model is easy. Because the model allows for arbitrary
motion, fmot as described before is not very useful by itself. Thus, our “learning” of the
unrestricted model makes use of Bayesian methods to update and restrict fmot. As we
process the data, sensor observations from the database are used in Bayesian fashion to
update fmot in order to refect a more refined belief in the position of the object. This
updated fmot places less weight in portions of the data space that do not contain sensor
observations relevant to an object in question, and more weight in portions that do.
4.4 Learning the Restricted Model
We begin our discussion of how to learn the restricted model by assuming that the
number of objects K is known. We will address the extension to unknown K subsequently.
4.4.1 Expectation Maximization
Given a set of observations produced by a single object and the associated PDF
fobs(x, t | θ), the parameter set θ can be learned by applying a relatively simple MLE.
However, in our case, the observations come from a set of K unknown objects, where each
object potentially contributes some fraction α of the sample. Note that the individual α’s
need not be uniform, since an object moving in a dense field of sensors or a very large object
might produce more observations than an object moving in a sparser region. Given K
objects, the probability of an arbitrary observation x at time t is then given by:
p(x, t | Θ) = Σ_{j=1}^{K} αj · fobs(x, t | θj)

where Θ = {αj, θj | 1 ≤ j ≤ K} denotes the complete parameter set and αj represents the
fraction of data generated by the jth object, with the constraint Σ_{j=1}^{K} αj = 1.
Our goal is to learn the complete parameter set Θ. Applying MLE, we want to find a
Θ that maximizes the following likelihood function:
L(Θ | X) = ∏_{i=1}^{N} p(xi, ti | Θ)
where X = {(x1, t1), (x2, t2), · · · , (xN , tN)} is the set of observations from some initial
time period. As is standard practice, we instead try to maximize the log of the likelihood
since it makes the computations easier to handle:
log L(Θ | X) = log ∏_{i=1}^{N} p(xi, ti | Θ) = Σ_{i=1}^{N} log ( Σ_{j=1}^{K} αj · fobs(xi, ti | θj) )
Unfortunately, this maximization is difficult in general because we do not
know which observation was produced by which object. That is, if we had access to a
vector Y = {y1, y2, ..., yN}, where yi = j if the ith observation was generated by the jth
object, the maximization would be a relatively straightforward problem.
The fact that we lack access to Y can be addressed by making use of the EM
algorithm [84]. EM is an iterative algorithm that works by repeating the “E-Step”
and the “M-Step”. At all times, EM maintains a current guess as to the parameter set
Θ. In the E-Step, we compute the so-called “Q-function”, which is nothing more than the
expected value of the log-likelihood, taken with respect to all possible values of Y . The
probability of generating any given Y is computed using the current guess for Θ. This
removes the dependency on Y . The M-Step then updates Θ so as to maximize the value of
the resulting Q function. The process is repeated until there is little step-to-step change in
Θ.
In order to derive an EM algorithm for learning the restricted motion model, we must
first derive the Q function. In general, the Q function takes the form:
Q(Θ, Θ^g) = E[ log L(X, Y | Θ) | X, Θ^g ]
In our particular case, this can be expanded to:
Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{j=1}^{K} log( αj · p(xi, ti | θj) ) · Pj,i

where Θ^g = {αj^g, θj^g | 1 ≤ j ≤ K} represents our current guess for the various parameters
of the K objects, and Pj,i is the posterior probability that the ith observation came from the
jth object, given (by Bayes’ rule) by the formula:

Pj,i = P(j | xi, ti) = ( αj^g · p(xi, ti | θj^g) ) / ( Σ_{k=1}^{K} αk^g · p(xi, ti | θk^g) )
Once we have derived Q, we need to maximize Q with respect to Θ. Notice that we can
isolate the term containing αj from the term containing θj by rewriting the Q function as
follows:

Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{j=1}^{K} log(αj) · Pj,i + Σ_{i=1}^{N} Σ_{j=1}^{K} log( p(xi, ti | θj) ) · Pj,i
We can now maximize the above equation with respect to various parameters of interest.
This can be done using standard optimization methods [85]. Doing so results in the
following update rules for the parameter set θj for the jth object:
[ Σi Pj,i       Σi ti·Pj,i    Σi ti²·Pj,i ]   [ µj ]   [ Σi xi·Pj,i     ]
[ Σi ti·Pj,i    Σi ti²·Pj,i   Σi ti³·Pj,i ] · [ vj ] = [ Σi xi·ti·Pj,i  ]
[ Σi ti²·Pj,i   Σi ti³·Pj,i   Σi ti⁴·Pj,i ]   [ aj ]   [ Σi xi·ti²·Pj,i ]

Σj = ( Σi (xi − µj)(xi − µj)^T · Pj,i ) / ( Σi Pj,i ), and αj = (1/N) Σi Pj,i,

where each sum runs over i = 1, ..., N.
Given these equations, our final EM algorithm is given as Algorithm 6.
Algorithm 6 EM Algorithm
1: while Θ continues to improve do
2:   for each object j do
3:     for each observation i do
4:       Compute Pj,i
5:     end for
6:     Compute θj = (µj, vj, aj, αj, Σj) using Pj,i and the update rules given above
7:   end for
8: end while
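The E-step (computing the responsibilities Pj,i) and the αj update can be sketched as follows. This is a deliberately simplified, 1-D rendering under illustrative assumptions: scalar observations, a known shared variance in place of the per-object Σj, and hypothetical function names; a full M-step would additionally solve the normal equations above for (µj, vj, aj).

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(obs, params, alphas, var):
    """Posterior P[j][i] that observation i came from object j (Bayes' rule).

    obs: list of (x, t) pairs; params: per-object (p0, v, a) for the 1-D
    quadratic model p_t = a*t^2 + v*t + p0.
    """
    P = [[alphas[j] * gauss(x, a * t * t + v * t + p0, var) for x, t in obs]
         for j, (p0, v, a) in enumerate(params)]
    for i in range(len(obs)):  # normalise over objects, per observation
        z = sum(P[j][i] for j in range(len(params)))
        for j in range(len(params)):
            P[j][i] /= z
    return P

def m_step_alphas(P, n):
    """Mixture-weight update: alpha_j = (1/N) * sum_i P[j][i]."""
    return [sum(row) / n for row in P]
```

Iterating these two steps until Θ stops improving is exactly the loop of Algorithm 6.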
4.4.2 Learning K
So far we have assumed that the number of objects K is known. However, in practice,
we often have very little knowledge about K, thus requiring us to estimate it from the
observed data. The problem of choosing K can be viewed as the problem of selecting
the number of components of a mixture model that describes some observed data. The
problem has been well-studied, as it arises in many different areas of research, and a variety
of criteria have been proposed to solve the problem [86][87][88][89].
The basic idea behind the various techniques is as follows: Assume we have a model
for some observed data in the form of a parameter set ΘK = {θ1, ..., θK}. Further, assume
we have a cost function C(ΘK) to evaluate the cost of the model. In order to select the
model with the optimal number of components, we simply compute ΘK for a range of K
values and choose the one with the minimum cost:
K = argmin { C(ΘK) | Klow ≤ K ≤ Khigh }
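This selection rule amounts to a simple loop. The sketch below uses caller-supplied `fit_model` and `cost` callables (hypothetical names); in our setting, `fit_model` would run EM for a given K and `cost` would evaluate a criterion such as MML.

```python
def select_k(k_low, k_high, fit_model, cost):
    """Fit a model for each candidate K in [k_low, k_high] and keep the one
    of minimum cost.

    fit_model(k) returns a fitted parameter set Theta_K; cost(theta) scores it.
    Returns (best_k, best_theta).
    """
    best_k, best_theta, best_cost = None, None, float("inf")
    for k in range(k_low, k_high + 1):
        theta = fit_model(k)
        c = cost(theta)
        if c < best_cost:
            best_k, best_theta, best_cost = k, theta, c
    return best_k, best_theta
```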
The various techniques proposed in the literature can be distinguished by the cost
criterion they use to evaluate a model: AIC (Akaike’s Information Criterion), MDL
(Minimum Description Length) [88], MML (Minimum Message Length) [90], etc. For the
cost function, we make use of the Minimum Message Length (MML) criterion as it has
been shown to be competitive and even superior to other techniques [89]. MML is an
information theoretic criterion where different models are compared on the basis of how
well they can encode the observed data. The MML criterion nicely captures the tradeoff
between the number of components and model simplicity. The general formula [89] for the
MML criterion is given by:
C(ΘK) = −log h(ΘK) − log L(X | ΘK) + (1/2) log |I(ΘK)| + (c/2)(1 + 1/12)
where h(·) describes the prior probabilities of the various parameters, L(·) is the likelihood
of observing the data, and |I(ΘK)| is the determinant of the Fisher information matrix of the
observed data. For our specific case, we need a formulation that is applicable to Gaussian
distributions [87]:
C(ΘK) = (P/2) Σ_{j=1}^{K} log( N·αj / 12 ) + (K/2) log( N/12 ) + K(P + 1)/2 − log L(X | ΘK)

where P is the number of parameters per component and N is the number of observations.
One final issue that should be mentioned with respect to choosing K is computational
efficiency. It is clearly unacceptable to re-run the entire EM algorithm for every possible
K in order to minimize the MML criterion. A solution to this is the component-wise EM
framework [91]. In this variation of EM, a model is first learned with a very large K value.
Then, in an iterative fashion, poor components are pruned off and the model is re-adjusted
to incorporate any data that is no longer well-fit. For each resulting value of K, the MML
criterion is checked and the best model is chosen.
4.5 Learning Unrestricted Motion
Once the restricted model has been learned over a short portion of the timeline,
the next step is to use the learned parameters as a starting point in order to extend our
estimate of each object’s position to the remainder of the timeline.
The learning process for the unrestricted model is quite different than for the
restricted model. The restricted model makes use of a classical parameter estimation
framework, where the goal is to learn the parameter set governing the motion of the object.
In the unrestricted case, the values for the parameter set for fmot are fully defined before
the learning ever begins:
• An object’s initial position p0 at the time the unrestricted model becomes applicable can be computed directly from the parameters learned over the restricted model.
• Σobs can be taken directly from the restricted model, since it is one of the learned parameters.
• Σmot can be determined in a number of ways from the restricted model, such as by using an MLE over the object’s time-tick to time-tick motion for the trajectory learned under the restricted model.
Thus, rather than relying on classical parameter estimation, we instead make use
of Bayesian techniques [62] to update fmot to take into account the various sensor
observations. fmot defines a distribution over pt for every time-tick t, which can be
viewed as describing a belief in the object’s position at time-tick t. In a Bayesian fashion,
this belief (i.e., distribution) can be updated and made more accurate by making use of
the observed data. We will use the notation f_mot^t to denote fmot updated with all of the
information up until time-tick t. Such Bayesian techniques for modeling motion are often
referred to as Bayesian filters [64].
4.5.1 Applying a Particle Filter
The mathematics associated with using a large number of discrete sensor observations
to update fmot quickly becomes very difficult – particularly in the case of Gaussian motion
– resulting in an unwieldy multi-modal posterior distribution. To address this, we make
use of a common method for rendering Bayesian filters practical, called a particle filter
[57]. A particle filter simplifies the problem by representing f_mot^t by a set of discrete
“particles”, where each particle is a possible current position for the object. We denote
the set of particles associated with time-tick t as St, and the ith particle in this set is
St[i]. The ith particle has a non-negative weight wi attached to it, with the constraint
Σi wi = 1. Highly-weighted particles indicate that the object is more likely to be located
at or close to the particle’s position. Given St, f_mot^t(pt) simply returns wi if pt = St[i], and
0 otherwise.
The basic application of a particle filter to our problem is quite simple (though there
is a major complication that we will consider in the next subsection). To compute f_mot^t for
any time tick t, we use a recursive algorithm. For the base case t = 0, we have a single
particle located at p0, having weight one. Then, given a set of particles St−1 for time-tick
t − 1, the set St for time tick t is computed as given in Algorithm 8.
Algorithm 8 Sampling a particle cloud
1: for i = 1 to |St| do
2:   Roll a biased die so that the chance of rolling k is wk
3:   Sample from fN(Σmot, St−1[k])
4:   Add the result as the ith particle of St
5: end for
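This resampling step can be sketched in one function; a 1-D illustration assuming scalar particle positions and a scalar motion standard deviation in place of Σmot (names are hypothetical; `random.choices` plays the role of the biased die).

```python
import random

def advance_particles(particles, weights, sigma_mot):
    """One particle-filter propagation step (Algorithm 8, 1-D sketch).

    Resample particle positions with probability proportional to their
    weights, then perturb each draw with Gaussian random motion. The
    returned cloud is the (uniform-weight) prior for the next time tick.
    """
    drawn = random.choices(particles, weights=weights, k=len(particles))
    return [p + random.gauss(0.0, sigma_mot) for p in drawn]
```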
Essentially, this is nothing more than sampling from the distribution representing
the object’s position at time t − 1, and then adding the appropriate possible random
motion to each sample. At this point, all weights are uniform. This gives us a discrete
representation of the prior distribution for the object’s position at time-tick t. We use
f_mot^t' to denote this prior.
The next step is to use the various sensor observations to update the prior weights in
order to obtain f_mot^t. For each particle:

wi = Pr[pt = St[i]] = ( ∏_{j=1}^{N} fN(xj | Σobs, St[i]) ) / ( Σ_{k=1}^{|St|} ∏_{j=1}^{N} fN(xj | Σobs, St[k]) )
Given St, it is then an easy matter to define an updated version of fobs that
corresponds to f_mot^t:

f_obs^t(x) = Σ_{i=1}^{|St|} wi · fN(x | Σobs, St[i])
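This weight update can be sketched as follows, again in 1-D with a scalar observation standard deviation standing in for Σobs. The log-space product is an implementation detail (not part of the formula) that avoids numerical underflow when many observation densities are multiplied; the normalization constant of the Gaussian cancels in the final normalization and is omitted.

```python
import math

def update_weights(particles, observations, sigma_obs):
    """Re-weight a particle cloud against this tick's observations (1-D sketch).

    w_i is proportional to the product over observations of a Gaussian
    density centred at particle i, then normalised across the cloud.
    """
    logw = [sum(-(x - p) ** 2 / (2.0 * sigma_obs ** 2) for x in observations)
            for p in particles]
    m = max(logw)                      # stabilise before exponentiating
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)
    return [wi / z for wi in w]
```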
4.5.2 Handling Multiple Objects
Unfortunately, the simple filter described in the previous subsection cannot be
applied directly to our problem. Unlike in most applications of particle filters, we actually
have many objects producing sensor observations. As a result, it may be that a given
observation xj has nothing to do with the current object, which we denote by φ. As such,
this observation should not be used to update our belief in the position of φ.
To handle this, we need to modify the basic framework. Rather than handling each
and every xj in a uniform fashion when dealing with φ, we instead associate a belief
(represented as a probability) with every xj. This belief tells us the extent to which we
think that xj was in fact produced by φ, and is computed as follows:
π_{φ,j} = f_obs^t'(xj | φ) / Σ_{k=1}^{K} f_obs^t'(xj | k)

Note that f_obs^t' is the uniform-weighted version of f_obs^t computed using the particles
associated with time-tick t, before the particle weights have been updated. f_obs^t'(x | φ)
denotes evaluation of the f_obs^t' function that is specifically associated with object φ at point
x.
Given this, we now need to produce an alternative formula for wi that takes into
account the possibility that xj was not produced by φ. In the case of a single object, wi
is simply the probability that φ would produce xj given that the actual object location
is St[i]. In the case of multiple objects, wi is the probability that the entire collection of
objects would produce xj given that the actual location of φ is St[i].
To derive an expression for this probability, we first compute the likelihood that
another object (other than φ) would produce x, given all of our beliefs regarding which
object produced x:

f_obs^t'(x | ¬φ) = ( Σ_{k≠φ} π_{k,j} · f_obs^t'(x | k) ) / ( 1 − π_{φ,j} )
Then, we can apply Bayes’ rule to compute wi:

wi = ( ∏_{j=1}^{N} [ (1 − π_{φ,j}) · f_obs^t'(xj | ¬φ) + π_{φ,j} · fN(xj | Σobs, St[i]) ] ) /
     ( Σ_{k=1}^{|St|} ∏_{j=1}^{N} [ (1 − π_{φ,j}) · f_obs^t'(xj | ¬φ) + π_{φ,j} · fN(xj | Σobs, St[k]) ] )
This formula bears a bit of additional discussion. In the numerator, we simply take
the product of all of the likelihoods that are associated with each sensor observation, given
that object φ is actually present at location St[i]. The likelihood of observing xj under this
constraint is (1 − π_{φ,j}) · f_obs^t'(xj | ¬φ) + π_{φ,j} · fN(xj | Σobs, St[i]). The first of the two terms
that are summed in this expression is the likelihood that an object other than φ would
produce xj, multiplied by our belief that this is the case. The second of the terms is the
likelihood that φ (located at position St[i]) would produce xj, multiplied by our belief that
this is the case. The denominator in the expression is simply the normalization factor,
which is computed by summing the likelihood over all possible positions of object φ.
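The association beliefs π_{φ,j} and the per-observation likelihood term just discussed can be sketched as follows; a 1-D illustration with caller-supplied density callables standing in for f_obs^t' (all names hypothetical).

```python
def association_beliefs(x, object_densities):
    """Belief pi_k that observation x was produced by object k.

    object_densities[k](x) plays the role of f_obs^t'(x | k).
    """
    scores = [f(x) for f in object_densities]
    z = sum(scores)
    return [s / z for s in scores]

def per_observation_term(x, beliefs, object_densities, phi, particle_density):
    """The term (1 - pi_phi) * f(x|not phi) + pi_phi * f_N(x|Sigma_obs, S_t[i]).

    particle_density(x) stands in for the Gaussian centred at particle S_t[i].
    """
    others = sum(b * f(x)
                 for k, (b, f) in enumerate(zip(beliefs, object_densities))
                 if k != phi)
    # f(x | not phi) = sum_{k != phi} pi_k f(x|k) / (1 - pi_phi)
    not_phi = others / (1.0 - beliefs[phi]) if beliefs[phi] < 1.0 else 0.0
    return (1.0 - beliefs[phi]) * not_phi + beliefs[phi] * particle_density(x)
```

Multiplying this term over all observations, for each particle, and normalizing across the cloud yields the weights wi of the formula above.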
wi = Pr[pt = St[i]] = ( ∏_{j=1}^{N} fN(xj | Σobs, St[i]) ) / ( Σ_{k=1}^{|St|} ∏_{j=1}^{N} fN(xj | Σobs, St[k]) )
However, the method has a problem. This update formula is valid only
when the observations Xt come from a single object (i.e., K = 1). For multiple objects
(K > 1) we cannot use this formulation, since we would be allowing observations from
other objects to influence the weight update of samples for any given object. To update
the belief of the jth object, we should ideally make use of only the set of observations
Xjt ⊂ Xt attributed to it. Thus, we need a new strategy for updating the weights of
samples for a given object that takes into account the contribution of other objects
present in the field to the observation set. Our modified update strategy is explained
below.
4.5.3 Update Strategy for a Sample given Multiple Objects
For the purpose of this discussion, we consider K objects where each object is
represented by |St| samples, and N observations produced at time t represented by Xt.
First, we need some definitions.
Prior Probability of an Object: We denote by srcj the probability that some
observation xi came from object j. This is obtained via Bayes’ rule as follows:

srcj = p(j | xi) = p(xi | j) / Σ_{k=1}^{K} p(xi | k)
Probability of an Observation: We define the probability of an observation xi in
reference to some object j by the function fobs. There are two variations of this function.

For a given object position pt of object j, the function fobs(xi | j) is described by a
Gaussian PDF fN(·) parametrized on θj = (pt, Σobs):

fobs(xi | j) = ( 1 / (2π|Σobs|^(1/2)) ) exp( −(1/2)(xi − pt)^T Σobs^(−1) (xi − pt) )
where Σobs describes how observations are scattered around the path of object j.

For a given sample position s_m^k of object k, the function fobs(xi | s_m^k) describes the
probability that sample m of object k can trigger observation xi. In this case, fobs is
parametrized on θ = (s_m^k, Σsensor), where Σsensor describes the width of the region around
which a sensor is able to record observations.
Likelihood of an Observation: The likelihood of an observation x_i with respect to
some object j is simply:

L(j | x_i) = c · ∑_{j=1}^{K} src_j · f_obs(x_i | j)

where c is a marginalizing constant. The likelihood can be viewed as the possibility that the
jth object triggered the observation.
Weight of a Sample: Given the prior src_j, the PDF f_obs, and the likelihood L(j | x_i),
we can update the weight associated with some sample m of object k as follows:

w^k_m = p(s^k_m | x_i) = (∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_m)) / (∑_{l=1}^{|S_t|} [∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_l)])
The update equation for a set of observations X_t = (x_1, · · · , x_N) is simply:

w^k_m = p(s^k_m | X_t) = (∏_{i=1}^{N} (∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_m))) / (∑_{l=1}^{|S_t|} ∏_{i=1}^{N} (∑_{j≠k} (src_j · f_obs(x_i | j)) + src_k · f_obs(x_i | s^k_l)))
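The multi-object weight update can be sketched in code. This is a hedged illustration only: the helper names `f_obs_obj` and `f_obs_sample` stand in for the two variations of f_obs defined earlier, and the function name is ours, not an interface from the actual implementation:

```python
def update_sample_weights(k, samples_k, observations, src, f_obs_obj, f_obs_sample):
    """Normalized weights for the samples of object k, given all observations.

    samples_k    : candidate positions s^k_m for object k
    observations : observations x_i at the current time tick
    src          : src[j], prior probability of each object j (len K)
    f_obs_obj    : f_obs(x, j), density of x around object j's position
    f_obs_sample : f_obs(x, s), density of x around a sample position s
    """
    K = len(src)
    raw = []
    for s in samples_k:
        w = 1.0
        for x in observations:
            # Contribution of the other objects, plus object k placed at sample s.
            others = sum(src[j] * f_obs_obj(x, j) for j in range(K) if j != k)
            w *= others + src[k] * f_obs_sample(x, s)
        raw.append(w)
    total = sum(raw)
    return [w / total for w in raw]
```

Samples near the observations receive larger normalized weights, while observations explained by other objects contribute a common background term that dampens their influence.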
4.5.4 Speeding Things Up
A close examination of the formulas from the previous subsection shows that their
evaluation requires considerable computation, especially if the number of particles per
object, the number of objects, and/or the number of sensor observations per time tick
is very large. However, a couple of simple tricks can alleviate a substantial portion of
the complexity. First, when computing each π_{φ,j} value, we can make use of the average
or centroid of object φ in order to compute f^t_obs, rather than considering each particle
separately. Thus, we approximate f^t_obs with f^t_obs(x) ≈ f_N(x | Σ_obs + Σ_part, μ), where
μ = ∑_{i=1}^{|S_t|} w_i · S_t[i] and Σ_part is the covariance matrix computed over the positions of each
particle in S_t.
Second, for any given object, on average only slightly more than 1/K of the
observations at a given time tick actually apply to it. This is because we are usually
quite sure which observation applies to which object; only for a few observations is
this in doubt. For a given object, those observations that do not apply to it will have very
low corresponding π values and will not affect w_i. Thus, when computing w_i, we first drop
any observation j for which π_{φ,j} does not exceed ε. This can be expected to achieve a
speedup factor of close to K.
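Both speedup tricks can be sketched compactly. The function names below are illustrative; the sketch assumes 2-D sample positions and an association table `pi[obj][i]` holding the π_{φ,j} values:

```python
def centroid_and_spread(samples, weights):
    """Weighted centroid mu = sum_i w_i * S_t[i] and the particle
    covariance Sigma_part over 2-D sample positions."""
    mx = sum(w * s[0] for w, s in zip(weights, samples))
    my = sum(w * s[1] for w, s in zip(weights, samples))
    cxx = sum(w * (s[0] - mx) ** 2 for w, s in zip(weights, samples))
    cyy = sum(w * (s[1] - my) ** 2 for w, s in zip(weights, samples))
    cxy = sum(w * (s[0] - mx) * (s[1] - my) for w, s in zip(weights, samples))
    return (mx, my), [[cxx, cxy], [cxy, cyy]]

def prune_observations(observations, pi, obj, eps=1e-3):
    """Drop observations whose association probability pi[obj][i] does not
    exceed eps; they cannot meaningfully affect obj's sample weights."""
    return [x for i, x in enumerate(observations) if pi[obj][i] > eps]
```

The centroid and Σ_part feed the f_N(x | Σ_obs + Σ_part, μ) approximation, while the pruning step realizes the expected factor-of-K speedup.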
4.6 Benchmarking
This section presents an experimental evaluation of the proposed algorithm. The goal
of the evaluation is to answer the following questions:
• How many objects is the algorithm able to classify effectively?
• Is the algorithm able to effectively classify object observations that span a long period of time?
• How accurate is the proposed algorithm in classifying observations into objects?
• Is there an advantage to using the particle filter step?
Methodology. The experiments were conducted over a simple, synthetic database.
This allows us to easily measure the sensitivity of the algorithm to varying data
characteristics and parameter settings. The database stores sensor observations from a set
of moving objects spanning some time interval.
The database is generated as follows: the various objects are initialized randomly
throughout a 2D field and allowed to perform a random walk through the field. As the
objects move through the field, their positions at various time ticks are recorded. At each
time tick, sensor observations are generated as a Gaussian cloud around the various
object locations. A snapshot of recorded observations from 40 objects is shown in Figure
4-4.
For various parameter settings, we measure the wall-clock execution time required
to classify all the observations and the classification accuracy of the algorithm. As is
standard practice in machine learning, classification accuracy is measured through
recall and precision. For a given object, recall denotes the fraction of the observations
actually produced by the object that the classifier assigns to it, and precision denotes the
fraction of the observations assigned to the object that were actually produced by it.
Ideally, recall and precision should be close to 1.
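Under these definitions, recall and precision per object can be computed from the set of observation ids the classifier assigned to the object and the set it actually produced; a minimal sketch (the function name and edge-case conventions are ours):

```python
def recall_precision(assigned, truth):
    """Per-object recall and precision.

    assigned : set of observation ids the classifier gave to the object
    truth    : set of ids of observations the object actually produced
    """
    hits = len(assigned & truth)
    recall = hits / len(truth) if truth else 1.0
    precision = hits / len(assigned) if assigned else 1.0
    return recall, precision
```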
Given this setup, we vary five parameters and measure their effect on execution time and
classification accuracy:
1. numObj: the number of unique objects that produced the observations
2. numTicks: the number of time ticks over which observations were collected
3. stdDev: the standard deviation of the average Gaussian sensor cloud
4. numObs: the average number of sensor firings for a given Gaussian sensor cloud
5. emTime: the portion of the initial time line over which EM was used
Five separate tests were conducted:
1. In the first test, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, numObs is set at 5, and emTime is fixed at 5. numObj is varied from 10 to 110 objects in increments of 30.

2. In the second test, numObj is fixed at 40, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and numObs is set at 5. The time interval over which observations were recorded, numTicks, is varied in increments of 25 up to 100 time ticks.

3. In the third test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, emTime is fixed at 5, and the average number of sensor firings generated per object at each time tick, numObs, is varied from 5 to 25 in increments of 5.
Figure 4-4. The baseline input set (10,000 observations)
Figure 4-5. The learned trajectories for the data of Figure 4-4
4. In the fourth test, numObj is fixed at 40, numTicks is fixed at 50, and numObs is set at 5. We then vary the spread of the Gaussian cloud, stdDev, from 2% to 10% of the width of the field.
numObj    10     40      70      110
Recall    1.0    0.91    0.76    0.69
Precision 1.0    0.92    0.92    0.93
Runtime   9 sec  38 sec  131 sec 378 sec
Table 4-1. Varying the number of objects and its effect on recall, precision and runtime.
numTicks 25 50 75 100
Recall 0.93 0.91 0.75 0.64
Precision 0.96 0.93 0.92 0.92
Runtime 21 sec 38 sec 59 sec 72 sec
Table 4-2. Varying the number of time ticks.
numObs 5 10 15 20
Recall 0.91 0.91 0.91 0.92
Precision 0.93 0.92 0.92 0.91
Runtime 38 sec 71 sec 102 sec 134 sec
Table 4-3. Varying the number of sensors fired.
stdDev 2% 5% 7% 10%
Recall 0.91 0.90 0.88 0.80
Precision 0.93 0.94 0.91 0.83
Runtime 38 sec 37 sec 37 sec 38 sec
Table 4-4. Varying the standard deviation of the Gaussian cloud.
5. In the final test, numObj is fixed at 40, numTicks is fixed at 50, stdDev is fixed at 2% of the width of the field, and numObs is set at 5. emTime is varied from 5 to 20 time ticks in increments of 5.
All tests were carried out on a dual-core Pentium PC with 2 GB RAM. The tests were
run in two stages. First, the EM algorithm is run to get an initial estimate of the number
of objects and their starting location. The number of time ticks over which EM is applied
is controlled by the emTime parameter. Next, the estimates produced by EM are used to
bootstrap the particle filter phase of the algorithm, which tracks the individual objects
for the rest of the timeline. In a post-processing step, the recall and precision values are
computed. Each test was repeated five times and the results were averaged across the
runs.
emTime 5 10 15 20
Recall 0.91 0.88 0.87 0.83
Precision 0.92 0.95 0.94 0.96
Runtime 38 sec 38 sec 37 sec 37 sec
Table 4-5. Varying the number of time ticks where EM is applied.
Results. The results from the experiments are given in Tables 4-1 through 4-5. All
times are wall-clock running times in seconds. The smallest database processed consisted
of around 5000 observations from 40 objects over 25 time ticks. The largest data set
processed consisted of around 40,000 observations from 40 objects over 50 time ticks. Disk
I/O cost is limited to the initial loading of the data set into main memory. To
give a visual illustration, the actual plot of the sensor firings and the learned trajectories
is given in Figures 4-4 and 4-5 for the baseline configuration (numObj 40, numTicks 50,
numObs 5, stdDev 2%, emTime 5).
Discussion. There are several interesting findings. In general, the accuracy of the
algorithm suffers as we vary the parameter of interest from low to high. The algorithm
seems to be particularly sensitive to both the number of objects and the length of the time
interval over which observations are obtained.
As Table 4-1 demonstrates, the classification accuracy suffers as we increase the
number of objects considered by the algorithm. This is because with increasing number
of objects, spatial discrimination is greatly reduced. Objects with observation clouds that
are not well separated are grouped together as a single component by the EM stage of the
algorithm. This has the effect of reducing the total number of objects that is tracked by
the particle filter stage of the algorithm, and the observations from the untracked objects
contribute to a reduced recall.
A somewhat different issue arises when the length of the time interval is increased
(Table 4-2). When the time interval is increased, we increase the chance that the paths
traced by two arbitrary objects will collide. Whenever object paths overlap or intersect,
the individual particle filters tracking the objects can no longer perform any meaningful
discrimination between the objects. When this happens, the filters end up dividing the
observations among themselves in an arbitrary manner. A somewhat subtle issue arises
when two objects intersect briefly and then diverge. In this case, the individual particle
filters may end up swapping the objects. Similar factors are in play when we increase the
spread of the sensor firings around object paths (Table 4-4).
A somewhat surprising finding (Table 4-3) is that increasing the density of observations
does not seem to cause any noticeable improvement in classification accuracy, other than
increasing the run times. Finally, our use of the EM stage only to bootstrap the particle
filter phase is validated by the results shown in Table 4-5. If EM is used for more than a
few initial time ticks, the limitations of the restricted model employed by EM come into
play and result in poor estimates being fed to the filter stage.
4.7 Related Work
There is a wealth of database research on supporting analytics over object paths. This
includes trajectory indexing [18, 21, 92–94], queries over tracks [75, 76, 83] and clustering
paths [35, 37, 38, 95, 96]. However, little work exists in the database literature on how to
actually obtain the object paths in the first place.
The only prior work in the database literature closely related to the problem we address is
the work of Kubica et al. [36]. Given a set of asteroid observations, they consider the problem
of linking observations that correspond to the same underlying asteroid. Their approach
consists of building a forest of k-d trees [97], one for each time tick, and performing a
synchronized search of all the trees, with exchange of information among tree nodes to
guide the search towards feasible associations. They assume that each asteroid has at most
one observation at every time tick and consider only simple linear or quadratic motion
models.
Model-based approaches [85, 98, 99] have been previously employed in target
tracking to map observations into targets. The focus there is primarily on supporting real-time
tracking using simple motion models. In contrast to existing research, we focus on aiding
the ETL process in a warehouse context to support entity resolution and provide historical
aggregation of object movements.
4.8 Summary
This chapter described a novel entity resolution problem that arises in sensor
databases and then proposed a statistical learning based approach to solving the problem.
The learning was carried out in two stages: an EM algorithm applied over a small portion
of the data in the first stage to learn initial object patterns, followed by a particle-filter
based algorithm to track the individual objects. Experimental results confirm the utility of
the proposed approach.
CHAPTER 5
SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS
For nearly 20 years, database researchers have produced various incarnations of
probabilistic data models [100–106]. In these models, the relational model is extended so
that a single stored database actually specifies a distribution of possible database states,
where these possible states are also called possible worlds. In this sort of model, answering
a query is closely related to the problem of statistical inference. Given a query over the
database, the task is to infer some characteristic of the underlying distribution of possible
worlds. For example, the goal may be to infer whether the probability that a specific tuple
appears in the answer set of a query exceeds some user-specified p.
Along these lines, most of the existing work on probabilistic databases has focused
on providing exact solutions to various inference problems. For example, imagine that
one relation R1 has an attribute lname, where exactly one tuple t from R1 has the value
t.lname = ‘Smith’. The probability of t appearing in a given world is 0.2. t also has
another attribute t.SSN = 123456789, which is a foreign key into a second database table
R2. The probability of 123456789 appearing in R2 is 0.6. Then (assuming that there are
no other ’Smith’s in the database) the probability that ’Smith’ will appear in the output
of R1 ×R2 can be computed exactly as 0.2 × 0.6 = 0.12.
Unfortunately, probabilistic data models where tuples or attribute values can be
described using simple, discrete probability distributions may be of only limited utility in
the real world. If the goal is to build databases that can represent the sort of uncertainty
present in modern data management applications, it is very useful to handle complex,
continuous, multi-attribute distributions. For example, consider an application where
moving objects are automatically tracked—perhaps by video, magnetic, or seismic
sensors—and the observed tracks are stored in a database. The standard, modern method
for automatic tracking via electronic sensory input is the so-called “particle filter” [63],
which generates a complex, time-parameterized probabilistic mixture model for each
object that is tracked. If this mixture model is stored in a database, then it becomes
natural to ask questions such as “Find all of the tracks that entered area A from time tstart
to time tend with probability of greater than (p× 100)%.” Answering this question involves
somehow computing the mass of each object’s time-varying positional distribution that
intersects A during the given time range, and checking if it exceeds p.
For many applications, such a problem can be quite difficult—it may be that no
closed-form (and integratable) probability density function (PDF) is even available. For
example, Bayesian inference [62] is a popular method that is commonly proposed as a
way to infer unknown or uncertain characteristics of data—one standard application of
Bayesian inference is automatically guessing the topic of a document such as an email.
The so-called “posterior” distribution resulting from Bayesian inference often has no
closed form, cannot be integrated, and can only be sampled from using tools such as
Markov chain Monte Carlo (MCMC) methods [107].
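For readers unfamiliar with MCMC, a minimal random-walk Metropolis sampler illustrates the idea of drawing samples from a density known only up to a normalizing constant; this is a generic textbook sketch, not tied to any particular system:

```python
import math
import random

def metropolis(log_post, x0, steps, scale=1.0):
    """Minimal random-walk Metropolis sampler: produces (correlated)
    samples from a distribution specified only by its unnormalized
    log-density log_post."""
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(steps):
        cand = x + random.gauss(0.0, scale)   # symmetric proposal
        lc = log_post(cand)
        # Accept the candidate with probability min(1, exp(lc - lp)).
        if math.log(random.random()) < lc - lp:
            x, lp = cand, lc
        chain.append(x)
    return chain
```

Because only a log-density is needed, the same interface covers posteriors with no closed form, exactly the situation described above.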
Thus, in the most general case, an integratable PDF is unavailable, and the user
can only provide an implementation of a pseudo-random variable that can be used to
provide samples from the probability distribution that he or she wishes to attach to an
attribute or set of correlated attributes. By asking only for a pseudo-random generator,
we can handle both difficult cases (such as the Bayesian case) and simpler cases where
the underlying distribution is well-known and widely used (such as Gaussian, Poisson,
Gamma, Dirichlet, etc.) in a unified fashion. Myriad algorithms exist for generating
Monte Carlo samples in a computationally efficient manner [108]. For more details on
how a database system might support user-defined functions for generating the required
pseudo-random samples, we point to our earlier paper on the subject [109].
Our Contributions. If the user is asked only to supply pseudo-random attribute value
generators, it becomes necessary to develop new technologies that allow the database
system to integrate the unknown density function underlying a pseudo-random generator
over the space of database tuples accepted by a user-supplied query predicate. In this
chapter, I consider the problem of how the required computations can and should be
performed using Monte Carlo in a principled fashion. I propose very general algorithms
that can be used to estimate the probability that a selection predicate evaluates to true
over a probabilistic attribute or attributes, where the attributes are supplied only in the
form of a pseudo-random attribute value generator.
The specific technical contributions are as follows:
• I carefully consider the statistical theory relevant to applying Monte Carlo methods to decide whether a database object is “sufficiently accepted” by the query predicate. Unfortunately, it turns out that due to peculiarities specific to the application of Monte Carlo methods to probabilistic databases, even so-called “optimal” methods can perform quite poorly in practice.

• I devise a new statistical test for deciding whether a database object should be included in the result set when Monte Carlo is used. In practice, the test can be used to scan a database and determine which objects are accepted by the query using far fewer samples even than existing, “optimal” tests.

• I also consider the problem of indexing for relational selection predicates over a probabilistic database to facilitate fast evaluation using Monte Carlo techniques.
Chapter Organization. In Section 2, we define the basic problem of evaluating selection
predicates when probabilistic attribute values can only be obtained via Monte Carlo
methods, and consider the false positive and false negative problems associated with
testing whether a database object should be accepted. Section 3 describes a classical test
from statistics that is very relevant, called the sequential probability ratio test (SPRT).
Section 4 describes our own proposed test which makes use of the SPRT. Section 5
considers the problem of indexing for our test. Experiments are described in Section 6,
Section 7 covers related work, and Section 8 concludes the chapter.
5.1 Problem and Background
In this section, we first define the basic problem: relational selection in a probabilistic
database where uncertainty is represented via a black-box, possibly multi-dimensional
sampling function. While we limit the scope by considering only relational selection, we
note that join predicates are really nothing more than simple relational selection over a
cross product, and so joins can be handled in a similar fashion. We begin by discussing
the basic problem definition, and then give the reader a step-by-step tour through the
relevant statistical theory, which will provide the necessary background to discuss our own
technical contributions in the following sections.
5.1.1 Problem Definition
We consider generic selection queries of the form:

SELECT obj
FROM MYTABLE AS obj
WHERE pred(obj)
USING CONFIDENCE p
pred() is some (any) relational selection predicate over database object obj. pred() may
include references to probabilistic and non-probabilistic attributes, which may or may
not be correlated with one another. For an example of this type of query, consider the
following:

SELECT v.ID
FROM VEHICLES v
WHERE v.latitude BETWEEN 29.69.32 AND 29.69.38 AND
      v.longitude BETWEEN -82.35.12 AND -82.35.19
USING CONFIDENCE 0.98
This query will accept all vehicles falling in the specified rectangular region with
probability higher than 0.98.
In general, our assumption is that there is a function obj.GetInstance() which
supplies one random instance of the object obj (note that a “random instance” could
contain a mix of deterministic and non-deterministic attributes, where deterministic
attributes have the same value in every sample). In our example, latitude and longitude
could be supplied by sampling from a two-dimensional Gaussian distribution.
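A minimal sketch of such a GetInstance() interface, assuming an axis-aligned 2-D Gaussian for position (the Vehicle class and its fields are hypothetical, for illustration only):

```python
import random

class Vehicle:
    """Hypothetical database object: a deterministic ID plus an uncertain
    position modeled as an axis-aligned 2-D Gaussian."""
    def __init__(self, vid, mu_lat, mu_lon, sd_lat, sd_lon):
        self.ID = vid
        self.mu_lat, self.mu_lon = mu_lat, mu_lon
        self.sd_lat, self.sd_lon = sd_lat, sd_lon

    def GetInstance(self):
        # Deterministic attributes keep the same value in every sample;
        # probabilistic attributes are re-drawn on each call.
        return {"ID": self.ID,
                "latitude": random.gauss(self.mu_lat, self.sd_lat),
                "longitude": random.gauss(self.mu_lon, self.sd_lon)}
```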
5.1.2 The False Positive Problem
Algorithm 9 MonteCarlo (MYTABLE, p, n)
1: result = ∅
2: for each object obj in MYTABLE do
3:   k = 0
4:   for i = 1 to n do
5:     if pred(obj.GetInstance()) = true then
6:       k = k + 1
7:     end if
8:   end for
9:   if (k/n ≥ p) then
10:    result = result ∪ obj
11:  end if
12: end for
13: return result
Given this interface, one way to answer our basic selection query is to use Monte
Carlo methods to guess whether or not each object should be contained in the answer set,
as described in Algorithm 5-1.
For every database object, a number of random instances of the object are generated
by the algorithm; the selection predicate pred() is applied to each of them, and the object
is accepted or rejected depending upon how many times pred() evaluates to true.
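The loop just described can be sketched as a short routine. This assumes only the GetInstance() interface from the problem definition, and is a hedged sketch rather than the system's actual implementation:

```python
def monte_carlo_select(table, pred, p, n):
    """Naive Monte Carlo selection (no error control): accept obj when
    at least a fraction p of its n random instances satisfies pred."""
    result = []
    for obj in table:
        # Draw n instances and count predicate acceptances.
        k = sum(1 for _ in range(n) if pred(obj.GetInstance()))
        if k / n >= p:
            result.append(obj)
    return result
```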
While this algorithm is quite simple, the obvious problem is that there may be some
inaccuracy in the result. For example, imagine that p is .95, n is 1000, and for a given
object, the counter k ends up as 955. While it may be likely that the probability that
the object is accepted by pred() exceeds .95, this does not mean that the object should
necessarily be accepted; there is a possibility that the “real” chance that an object is
accepted by pred() is 94%, but we just got lucky, and 95.5% of the samples were accepted.
The chance of making such an error is intimately connected with the value n; the larger
the n, the lower the chance of making an error.
For this reason, we might modify our query slightly so that the USING clause also
includes a user-specified false positive error rate:
USING CONFIDENCE p
FALSE POSITIVE RATE α
To actually incorporate this error rate into our computation, it becomes necessary
to modify Algorithm 5-1 so that it implements a statistical hypothesis test to guard
against the error. For a given object obj, the inner loop of Algorithm 5-1 runs n
Bernoulli (true/false) random trials, and counts how many true results are observed.
There are many ways to accomplish this. For example, if the real probability that
pred(obj.GetInstance()) is true is p, then the number of observed true results
will follow a binomial distribution. For a given database object obj, we will use π as
shorthand for this probability; that is, π = Pr[pred(obj.GetInstance()) = true]. Using
the binomial distribution, we can set up a proper, statistical hypothesis test with two
competing hypotheses:
H0 : π < p versus H1 : π ≥ p
To do this, we use the fact that:
Pr[k ≥ c | π ≤ p] ≤ ∑_{k=c}^{n} binomial(p, n, k)
Thus, if we want the chance of erroneously including obj in the result set when it should
not be included (that is, when H0 is true) to be less than a user-supplied α, we should
accept obj only if the number of observed true results, k, meets or exceeds the value c,
where ∑_{k=c}^{n} binomial(p, n, k) does not exceed the user-supplied α. Thus, we can first
compute the smallest such c, and replace the last “if” statement (lines (9)-(11)) in the
pseudo-code for Algorithm 5-1 with:
if k ≥ c then
result = result ∪ obj
end if
Then, we can be sure that we are unlikely to erroneously include “incorrect”
results in the answer set.
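The critical value c can be computed directly from the binomial tail; a small sketch (the function names are ours):

```python
from math import comb

def binomial_tail(p, n, c):
    """Pr[k >= c] when k ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(p, n, alpha):
    """Smallest c whose upper tail does not exceed alpha; accepting an
    object only when k >= c bounds the false positive rate by alpha."""
    # The tail shrinks as c grows, so scan upward for the first c that fits.
    for c in range(n + 1):
        if binomial_tail(p, n, c) <= alpha:
            return c
    return n + 1  # no count k can trigger acceptance
```

For example, with p = 0.5, n = 10, and α = 0.05, an object must produce at least 9 true results to be accepted.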
5.1.3 The False Negative Problem
There is a key problem with this approach: it only guards against false positives,
and provides no protection against false negatives; using the terminology common in the
statistical literature, this approach provides no guarantees as to the “power” of the test.1
Fortunately, standard statistical methods make it easy to handle a lower bound of
the power of the test. Assume that we alter the query syntax so that the desired power is
specified directly in the query:
USING CONFIDENCE p
FALSE POSITIVE RATE α
FALSE NEGATIVE RATE β
Then, given some small value ε, we wish to choose from one of the two alternatives:2
H0 : π = p + ε versus H1 : π = p− ε
When evaluating a query, either H0 or H1 should be chosen for each database object obj,
subject to the constraint that:
• Pr[Accept(H0) | π ≤ p− ε] is less than α
• Pr[Accept(H1) | π ≥ p + ε] is less than β
1 In fact, this particular binomial test is quite weak compared to other possible tests.
2 We assume that ε is an internal system parameter that is not chosen directly by the user; this is an important point discussed in depth in Section 4. ε is set to be small enough that no reasonable user would care about the difference between p + ε and p − ε. Since most PDFs stored in a so-called “probabilistic database” are the result of an inexact modeling and inference process that introduces its own error and renders very high-precision query evaluation of somewhat dubious utility, ε should not be too small in practice. We expect that on the order of 10^{−4} might be reasonable.
If these two constraints are met, then when H0 is accepted we can put obj into the answer
set and be sure that the probability of incorrectly including obj is at most α, and when H1
is accepted we can safely leave obj out of the answer set and be sure that the probability
of incorrectly dismissing obj is at most β.
Fortunately, it is quite easy to do this using standard statistical machinery known as
the Neyman-Pearson test [110], or Neyman test for short. For a given database object obj,
the Neyman test chooses between H0 or H1 by analyzing a fixed sample of size n drawn
using GetInstance(). The test relies on a likelihood ratio test (LRT) that compares
the probabilities of observing the sample sequence under H0 and H1. It is named after
a theoretical result (the Neyman-Pearson lemma) that states that a test based on LRT
is the most powerful test of all possible tests for a fixed sample size n comparing the
two simple hypotheses (i.e. it is a uniformly-most-powerful test). Since the Neyman test
for the Bernoulli (yes/no) probability case is given in many textbooks on hypothesis
testing, we omit its exact definition here. Given an implementation of a Neyman test that
returns ACCEPT if H1 is selected, it is possible to replace lines (9) to (11) of Algorithm 5-1
with:
if (Neyman (obj, pred, p, ε, α, β) = ACCEPT ) then
result = result ∪ obj
end if
The resulting framework will then correctly control the false positive and false
negative rates associated with the underlying query evaluation.
5.2 The Sequential Probability Ratio Test (SPRT)
While the Neyman test is theoretically “optimal”, it is important to carefully consider
what the word “optimal” means in this context: it means that no other test can choose
between H0 and H1 for a given α and β pair in fewer samples—specifically, no other
test can do a better job when either H0 or H1 is true. The problem is that in a real,
probabilistic database there is little chance that either H0 or H1 is true: these two
Figure 5-1. The SPRT in action. The middle line is the LRT statistic
hypotheses relate to specific, infinitely-precise probability values p + ε and p − ε, when in
reality the true probability π is likely to be either greater than p + ε or less than p − ε,
but not exactly equal to either of them. In this case, the Neyman test will still be correct
in the sense that, while still respecting α and β, it will choose H0 if π < p − ε and H1 if
π > p + ε. However, the test is somewhat silly in this case, because it still requires just
as many samples as it would in the hard case where π is precisely equal to one of these
values.
To make this concrete, imagine that p = .95, and after 100 samples have been taken
from GetInstance(), absolutely none of them have been accepted by pred(), but the
Neyman algorithm has determined that in the worst case, we need 10^5 samples to choose between
H0 and H1. Even though there is a probability of at most (1 − .95)^{100} < 10^{−130} of observing
100 consecutive false values if π were at least 0.95, the test cannot terminate, meaning
that we must still take 99,900 more samples. In this extreme case we would like to be able
to realize that there is no chance that we will accept H1 and terminate early with a result
of H0. In fact, this extreme case may be quite common in a probabilistic database where p
will often be quite large and pred() highly selective.
Not surprisingly, this issue has been considered in detail by the statistics community,
and there is an entire subfield of work devoted to so-called “sequential” tests. The basis
for much of this work is Wald’s famous sequential probability ratio test [111], or SPRT
for short. The SPRT can be seen as a sequential version of the Neyman test. At each
iteration, the SPRT draws another sample from the underlying data distribution, and uses
it to update the value of a likelihood ratio statistic. If the statistic exceeds a certain upper
threshold, then H1 is accepted. If it ever fails to exceed a certain lower threshold, then H0
is accepted. If neither of these things happen, then at least one more iteration is required;
however, the SPRT is guaranteed to end (eventually).
Thus, over time, the likelihood ratio statistic can be seen as performing a random
walk between two moving “goalposts”. As soon as the value of the statistic falls outside
of the goalposts, a decision is reached and the test is ended. The process is illustrated in
Figure 5-1. This plot shows the SPRT for a specific case where π = .5, ε = .05, p = .3, and
α = β = 0.05. The x-axis of this plot shows the number of samples that have been taken,
while the wavy line in the middle is the current value of the LRT statistic. As soon as the
statistic exits either boundary, the test is ended.
The key benefit of this approach is that for very low values of π that are very far from
p, H0 is accepted quickly (H1 is accepted with a similar speed when π greatly exceeds
p). All of this is done while fully controlling for the multiple-hypothesis-testing (MHT)
problem: when the test statistic is checked repeatedly, then extreme care must be taken
with respect to α and β because there are many chances to erroneously accept H0 (or
H1), and so the effective or real α (or β) can be much higher than what would naively
be expected. Furthermore, like the Neyman test, the SPRT is also “optimal” in the sense
that on expectation, it requires no more samples than any other sequential test to choose
between H0 and H1, assuming that one of the two hypotheses is true.
Just like the Neyman test, the SPRT makes use of a likelihood ratio statistic. In the
Bernoulli case we study here, after numAcc samples that are accepted by pred() out of num
Algorithm 10 SPRT (obj, pred, p, ε, α, β)
1: mult = a/b
2: tot = 0
3: numAcc = 0
4: constUp = log((1 − β)/α) / b
5: constDown = log(β/(1 − α)) / b
6: while (constDown + tot < numAcc < constUp + tot) do
7:   sample = obj.GetInstance()
8:   if (pred(sample) = true) then
9:     numAcc = numAcc + 1
10:  end if
11:  tot = tot + mult
12: end while
13: if (numAcc ≥ constUp + tot) then
14:   decision = ACCEPT
15: else
16:   decision = REJECT
17: end if
18: return decision
total samples, this statistic is:

λ = numAcc · log((p + ε)/(p − ε)) + (num − numAcc) · log((1 − p − ε)/(1 − p + ε))

Given λ, the test continues as long as:

log(β/(1 − α)) < λ < log((1 − β)/α)

For simplicity, this can be re-worked a bit. Let:

a = log((1 − p + ε)/(1 − p − ε)),  b = log((p + ε)/(p − ε)) − log((1 − p − ε)/(1 − p + ε))

Then the test continues as long as:

numAcc < log((1 − β)/α)/b + num · (a/b)

and:

numAcc > log(β/(1 − α))/b + num · (a/b)

This leads directly to the pseudo-code for the basic SPRT algorithm, which can be inserted
into Algorithm 5-1 to produce a test that uses an adaptive sample size to choose between
H0 and H1. The pseudo-code is given as Algorithm 5-2.
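A runnable version of the test, following the re-worked inequalities above. The function name `sprt` and the `max_iter` safety cap are our additions; the original test simply loops until a boundary is crossed:

```python
import math

def sprt(gen, p, eps, alpha, beta, max_iter=10**6):
    """Wald's SPRT for Bernoulli draws: decide whether the unknown
    acceptance probability pi is at least p + eps (ACCEPT) or at most
    p - eps (REJECT), controlling error rates alpha and beta."""
    a = math.log((1 - p + eps) / (1 - p - eps))
    b = math.log((p + eps) / (p - eps)) - math.log((1 - p - eps) / (1 - p + eps))
    up = math.log((1 - beta) / alpha) / b      # upper goalpost intercept
    down = math.log(beta / (1 - alpha)) / b    # lower goalpost intercept
    num_acc, tot = 0, 0.0
    for _ in range(max_iter):
        if gen():                  # one draw of pred(obj.GetInstance())
            num_acc += 1
        tot += a / b               # both goalposts drift upward each sample
        if num_acc >= up + tot:
            return "ACCEPT"
        if num_acc <= down + tot:
            return "REJECT"
    return "REJECT"                # safety cap; not part of the original test
```

With π far below p the statistic exits the lower goalpost after only a handful of samples, which is exactly the early-termination behavior the fixed-sample Neyman test lacks.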
5.3 The End-Biased Test
In this section, we devise a new, simple sequential test called the end-biased test that
is specifically designed to work well for queries over a probabilistic database.
5.3.1 What’s Wrong With the SPRT?
To motivate our test, it is first necessary to consider why the SPRT and its existing,
close cousins may not be the best choice for use in a probabilistic database.
The SPRT and its variants (of which there are many—see the related work section)
are widely-used in practice. Unfortunately, there is a key reason why the SPRT as it
was originally defined is not a good choice for use in the innermost loop of a selection
algorithm over a probabilistic database: the existence of the “magic” constant ε.
In classical applications of the SPRT, ε (that is, the distance between H0 and H1)
is carefully chosen in an application-specific fashion by an expert who understands the
significance of the parameter and its effect on the cost of running the SPRT. For example,
a widget producer may wish to sample the widgets that his/her assembly line produces to
see if the unknown rate of defective widgets is acceptable by sampling from the widgets
that are produced by the line. In this setting, ε would be chosen so that there is an
economically significant difference between H0 and H1, while at the same time taking
into account the adverse effect of a small ε; a small ε is associated with a large number
of (expensive) samples. That is, p − ε is likely chosen so that the defect rate is so low
that it would be a waste of money and time to stop the production line and determine
the problem. On the other hand, p + ε is chosen at the point where production must
be stopped, because so many defective widgets are produced that the associated cost is
unacceptable. The widget producer understands that if the true rate of defective widgets
is between p − ε and p + ε, the SPRT may return either result, so there is a natural
inclination to shrink its value; however, he/she is also strongly motivated to make ε as
large as possible because she/he also understands that a small ε will require that more
widgets be sampled, which increases the cost of the quality control program.
Unfortunately, in the context of a probabilistic database, the existence of a user-defined
ε parameter with such a profound effect on the cost of the test is highly problematic. We
contrast this with the fairly intuitive nature of the parameter p. A user might choose
p = 0.95 if she/he wants only those objects that are “probably” accepted by pred().
She/he might choose p = 0.05 if she/he wants any object with even a slight chance of
being accepted by pred(). p may even be computed automatically in the case of a top-k
query. But what about ε? Without an in-depth understanding of the effect of ε on the
SPRT, the choice of ε is necessarily arbitrary. A user may wonder, “why not simply choose
ε = 10⁻⁵ to ensure that all results are highly accurate?” The reason is that this may
(or may not) have a very significant effect on the speed of query evaluation, depending
upon many factors that include the particular predicate chosen as well as the underlying
probabilistic model—but it is not acceptable to ask an end-user to understand and
account for this!
5.3.2 Removing the Magic Epsilon
According to basic statistical theory, it is impossible to do away with ε altogether.
Intuitively, no statistical hypothesis test can decide between two options that are almost
identical. Thus, our goal is to take the choice of ε away from the user, and simply ship
the database software with an ε that is small enough that no reasonable user would care
about the error induced (see the relevant discussion in Footnote 1). The problem with this
plan is that the classical SPRT may require a very large number of samples to terminate
with a small ε. For example, consider the following, simple test. We choose p = 0.5,
ε = 10⁻⁵, and π = .2, and run Algorithm 5-3: in this case, it turns out that more than ten
thousand samples are required for the test to terminate. For a one-million object database,
generating this many samples per object is probably unacceptable.
Figure 5-2. Two spatial queries over a database of objects with Gaussian uncertainty
The unique problem in the database context is that while H0 and H1 are very close
to one another (due to a tiny ε), in reality, π is typically very far from both p − ε and
p + ε; usually, it will be close to zero or one. For example, consider Figure 5-2, which
shows a simple spatial query over a database of objects whose positions are represented
as two-dimensional Gaussian density functions (depicted as ovals in the figure). For both
the more selective query at the left and the less selective query at the right, only the
few objects falling on the boundary of the query region would have π ≈ p ± ε for any
user-specified p ≠ 0, 1.
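The effect illustrated by Figure 5-2 is easy to reproduce with a small Monte Carlo sketch of our own (all names and parameters hypothetical): estimating π for Gaussian-positioned objects shows values pinned near zero or one except for objects straddling the query boundary.

```python
import random

def inclusion_prob(mu, sigma, box, n=20000, seed=1):
    """Monte Carlo estimate of pi = Pr[a 2-D Gaussian-positioned object
    falls inside the axis-aligned query box (x0, y0, x1, y1)]."""
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    hits = sum(1 for _ in range(n)
               if x0 <= rng.gauss(mu[0], sigma) <= x1
               and y0 <= rng.gauss(mu[1], sigma) <= y1)
    return hits / n

box = (0.0, 0.0, 10.0, 10.0)
print(inclusion_prob((5, 5), 0.5, box))    # deep inside the box: pi ≈ 1
print(inclusion_prob((20, 20), 0.5, box))  # far outside: pi ≈ 0
print(inclusion_prob((10, 5), 0.5, box))   # straddling the boundary: pi ≈ 0.5
```

Only the boundary object yields a π anywhere near a typical user-supplied p, which is exactly the situation the end-biased test is designed around.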
This creates a unique setup that is quite different from classic applications of the
SPRT and its variants. In fact, the SPRT itself is provably optimal for only π values lying
at p − ε and p + ε; but for those far from these boundaries (such as at zero and one), it
may do quite poorly. Many other “optimal” tests have been proposed in the statistical
literature, but few seem to be applicable to this rather unique application domain—see the
experimental section as well as the related work section for more details.
5.3.3 The End-Biased Algorithm
As a result, we propose our own sequential test that is specifically geared towards
operating in a database environment, where (a) ε is vanishingly small, (b) π for the typical
object is close to 0 or 1, and (c) only for a few objects is π ≈ p.
The algorithm that we develop is called the end-biased test. Unlike many of the tests
from statistics, it has no optimality properties, but by design it functions very well in
practice—an issue we consider experimentally in Section 6.
Figure 5-3. The sequence of SPRTs run by the end-biased test
To perform the end-biased test, we run a series of pairs of individual hypothesis tests.
In the first pair of tests, one SPRT is run right after another:
• The first SPRT tries to reject the object quickly, in just a few samples, if this is
possible. To do this, a standard SPRT is run to decide between H0 : π = p/2, and
H1 : π = p + ε. If the SPRT accepts H0, then obj is immediately rejected. However,
if the SPRT accepts H1, then a second test is run.

• The second test tries to accept the object quickly, again in just a few samples, if this
is possible. To do this, a standard SPRT is run to decide between H0 : π = p − ε,
and H1 : π = p + (1 − p)/2. If the SPRT accepts H1, then obj is immediately accepted.
However, if the SPRT accepts H0, then the object survives for another round of
testing.
The first pair of tests is set up so that the “region of indifference” (that is, the region
between H0 and H1 in each test) is very large. A large region of indifference tends to
speed the execution of the test. Intuitively, the reason for this is that it is much easier to
decide between two disparate values for p such as p = .1 and p = .9 than it is to decide
between close values such as p = .1 and p = 0.100001, because the latter two values
for p can explain any given observation almost equally well. Thus, the relatively large
indifference ranges used by the first pair of SPRT sub-tests in the end-biased test tend to
allow values below p/2 or above p + (1 − p)/2 to be accepted or rejected very quickly.
The drawback of using a large region of indifference is that if π falls within either
test’s region of indifference, then the test can produce an arbitrary result that is not
governed by the test’s false positive and false negative parameters. Fortunately, since we
choose the region of indifference so that it always falls entirely below p + ε in the rejection
case (or above p − ε in the accept case), this will not cause problems in terms of the
correctness of the test. For example, in the rejection case, if H1 is accepted for an object
whose π value happens to fall in the region of indifference, then we do not immediately
(incorrectly) accept the object as an actual query result—rather, we will then run the
second SPRT to determine if the object should actually be accepted. The real cost of an
erroneous H1 for an object in the region of indifference is that the object is not
immediately pruned, so we will need to do more work.
If an object is neither accepted nor rejected by the first pair of tests, then a second pair
of tests must be run. This time, however, the size of the region of indifference is shrunk by
a factor of 1/2 for both the rejection test and the acceptance test. This means that more
samples will probably be required to arrive at a result in either test—due to the fact that
H0 and H1 will be closer to one another—but it also means that fewer objects will have π
values that fall in either test’s region of indifference. Specifically, the third SPRT that is
run is used to determine possible rejection using H0 : π = 3p/4 versus H1 : π = p+ ε. If the
SPRT accepts H0, then obj is immediately rejected. However, if the third SPRT accepts
H1, then a second test for acceptance (the fourth test overall) is run. This test checks
H0 : π = p − ε against H1 : π = p + (1 − p)/4. If the SPRT accepts H1, then obj is accepted;
otherwise, a third pair of tests is run, and so on.
This process is repeated, shrinking the region of indifference each time, until one of
two things happens:
1. The process terminates with either an accept or a reject in some test, or;
2. The space of possible π values for which the process would not have terminated falls
strictly in the range from p − ε to p + ε. In this case, an arbitrary result can be
chosen.
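The schedule of hypothesis pairs this process produces can be enumerated directly (an illustrative sketch of ours; note that the stopping condition is equivalent to "continue while either indifference width exceeds ε"):

```python
def end_biased_schedule(p, eps):
    """Hypothesis pairs for each round of the end-biased test.
    Round i runs a rejection SPRT (H0: pi = p - rejIndLo, H1: pi = p + eps)
    and an acceptance SPRT (H0: pi = p - eps, H1: pi = p + accIndHi),
    halving both indifference widths after each round."""
    rej_lo, acc_hi = p / 2, (1 - p) / 2
    rounds = []
    while rej_lo > eps or acc_hi > eps:
        rounds.append({"reject": (p - rej_lo, p + eps),
                       "accept": (p - eps, p + acc_hi)})
        rej_lo /= 2
        acc_hi /= 2
    return rounds

sched = end_biased_schedule(p=0.8, eps=1e-5)
print(len(sched))           # number of rounds before the regions vanish
print(sched[0]["reject"])   # first rejection test: H0 at p/2 = 0.4
```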
The sequence of SPRT tests that are run is illustrated above in Figure 5-3. At each
iteration, the region of indifference shrinks, until it becomes vanishingly small and the
test terminates. Since a large initial region of indifference means that the first few tests
terminate quickly (but will only accept or reject large or small values of π), the test
is “end-biased”; that is, it is biased towards terminating early in those cases where π
is either small or large. For those values that are closer to p, more sub-tests and more
samples will be required—which is very different from classical tests such as the SPRT or
Neyman test, which try to optimize for the case when π is close to p± ε.
α, β, and the MHT problem. One thing that we have ignored thus far is how to
choose α′ and β′ (that is, the false negative and false positive rate of each individual SPRT
subtest) so that the overall, user-supplied α and β values are respected by the end-biased
test. This is a bit more difficult than it may seem to be at first glance: one significant
result of running a series of SPRTs is that it becomes imperative that we be very careful
not to accidentally accept or reject an object due to the fact that we are running multiple
hypothesis tests.
We begin our discussion by assuming that the limit on the number of pairs of tests
run is n; that is, there are n tests that can accept obj, and there are n tests that can reject
obj. We also note that in practice, the 2n tests are not run in sequence, but they are run
in parallel; this is done so that all of the tests can make use of the same samples, and thus
samples can be re-used and the total number of samples is minimized (see Algorithm 5-3
below and the accompanying discussion). Specifically, first we use obj.GetInstance () to
generate one sample from the underlying distribution, then we feed this sample to each
of the 2n tests. If any one of the n acceptance tests accepts the object, then the overall
end-biased test accepts the object; if any one of the n rejection tests rejects the object,
then the overall end-biased test rejects the object.
Given this setup, imagine that there is an object obj that should be accepted by the
end-biased test. We ask the question, “what is the probability β that we will falsely reject
obj?” This can be computed as:
β = Pr[∨_{i=1}^{n} (reject in rejection test i | no prior accept)]
In this expression, “no prior accept” means that no test for acceptance of obj terminated
with an accept before test i incorrectly rejected. We can then upper-bound β by simply
removing this clause:
β ≤ Pr[∨_{i=1}^{n} (reject in test i)]
The reason for this inequality is that by removing any restriction on the set of outcomes
accepted by the inner boolean expression, the probability that any event is accepted by
the expression can only increase. Furthermore, by Bonferroni’s inequality [112], we have:
β ≤ Pr[∨_{i=1}^{n} reject in test i] ≤ ∑_{i=1}^{n} Pr[reject in test i]
As a result, if we run each individual rejection test using a false reject rate of β′, we know
that:
β ≤ Pr[∨_{i=1}^{n} reject in test i] ≤ n × β′
Thus, by choosing β′ = β/n, we correctly bound the false negative rate of the overall
end-biased test. A similar argument holds for the false positive rate: by choosing a rate of
α′ = α/n for each individual test, we will correctly bound the overall false positive rate of
the test.
The Final Algorithm. Given all of these considerations, pseudo-code for the end-biased
test is given in Algorithm 5-3.
Algorithm 11 EndBiased (obj, pred, p, ε, α, β)
1: rejIndLo = p/2; accIndHi = (1 − p)/2
2: numTests = 0
3: while (p − rejIndLo < p − ε) or (p + accIndHi > p + ε) do /* first, count the number of tests */
4:   numTests++
5:   rejIndLo /= 2; accIndHi /= 2
6: end while
7: for i = 1 to numTests do /* now, set up the tests */
8:   rejSPRTs[i].Init (p − p × (1/2)^i, p + ε, α/numTests, β/numTests)
9:   accSPRTs[i].Init (p − ε, p + (1 − p) × (1/2)^i, α/numTests, β/numTests)
10: end for
11: while any test is still going do /* run them all */
12:   sam = pred(obj.GetInstance())
13:   for i = 1 to numTests do
14:     if rejSPRTs[i].AddSam (sam) == REJECT then
15:       return REJECT
16:     end if
17:     if accSPRTs[i].AddSam (sam) == ACCEPT then
18:       return ACCEPT
19:     end if
20:   end for
21: end while
22: return ACCEPT
This algorithm assumes two arrays of SPRTs, where the elements of each array
function just like the classic SPRT from Algorithm 5-3. The only difference is that the
various SPRTs are first initialized (via a call to Init) and then fed true/false results
one-at-a-time, via calls to AddSam()—that is, they do not operate independently. The array
rejSPRTs[] attempts to reject obj; the array accSPRTs[] attempts to accept obj.
For simplicity, in Algorithm 5-3, each sample is added to each and every SPRT in
turn. In practice, this can be implemented more efficiently in a way that produces a
statistically equivalent outcome. First, we run rejSPRTs[0] to completion; if this SPRT
does not reject, then accSPRTs[0] picks up where the first SPRT left off (using its final
count of accepted samples) and runs to completion. If this SPRT does not accept, then
rejSPRTs[1] picks up where the second one left off and also runs to completion. This is
repeated until any member of rejSPRTs[] rejects, any member of accSPRTs[] accepts, or
all SPRTs complete.
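The incremental Init()/AddSam() interface that the end-biased algorithm assumes can be sketched as a small Python class (our own illustration, reusing the constants a, b, constUp, and constDown derived earlier, generalized to arbitrary hypothesized values h0 < h1):

```python
import math
import random

class IncrementalSPRT:
    """An SPRT fed one boolean sample at a time, mirroring the
    Init()/AddSam() interface assumed by the end-biased algorithm.
    h0 < h1 are the hypothesized values of pi; add_sam() returns
    "accept", "reject", or None while still undecided."""
    def __init__(self, h0, h1, alpha, beta):
        a = math.log((1 - h0) / (1 - h1))
        b = math.log(h1 / h0) - math.log((1 - h1) / (1 - h0))
        self.mult = a / b
        self.const_up = math.log((1 - beta) / alpha) / b
        self.const_down = math.log(beta / (1 - alpha)) / b
        self.num_acc = 0.0
        self.tot = 0.0

    def add_sam(self, sam):
        if sam:
            self.num_acc += 1
        self.tot += self.mult
        if self.num_acc >= self.const_up + self.tot:
            return "accept"
        if self.num_acc <= self.const_down + self.tot:
            return "reject"
        return None

# A quick run: an object with true pi = 0.8, tested against H0: pi = 0.4
# versus H1: pi = 0.5, should be accepted.
random.seed(3)
test = IncrementalSPRT(h0=0.4, h1=0.5, alpha=0.01, beta=0.01)
res = None
while res is None:
    res = test.add_sam(random.random() < 0.8)
print(res)
```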
5.4 Indexing the End-Biased Test
The end-biased test can easily be used to answer a selection query over a database
table: apply the test to each database object, and add the object to the output set
if it is accepted. However, this can be costly if the underlying database is large. One
of the longest-studied problems in database systems is how to speed such selection
operations—particularly in the case of very selective queries—via indexing. Fortunately,
the end-biased test is amenable to indexing, which is the issue we consider in this section.
Specifically, we consider the problem of indexing for queries where the spatial location of
an object is represented by the user-defined sampling function GetInstance(), because
spatial and temporal data is one of the most obvious application areas for probabilistic
selection.
5.4.1 Overview
The basic idea behind our proposed indexing strategy is as follows:
1. First, during an off-line pre-computation phase, we obtain, from each database
object, a sequence of samples. Those samples (or at least a summary of the samples)
are stored within an index to facilitate fast evaluation of queries at a later time.

2. Then, when a user asks a query with a specific α, β, p, and a range predicate pred(),
the first step is to determine how many samples would need to be taken in order to
reject any given object by the first rejection SPRT in the end-biased test, if pred()
evaluated to false for each and every one of those samples. This quantity can be
computed as:

minSam = ⌊ log(β′/(1 − α′)) / log((1 − p − ε)/(1 − p + ε)) ⌋
Figure 5-4. Building the MBRs used to index the samples from the end-biased test.
3. Once minSam is obtained, the index is used to answer the question: “Which
database objects could possibly have one of the first minSam samples in its
pre-computed sequence accepted by pred()?” All such database objects are placed
within a candidate set C. All those objects not in C are implicitly rejected.

4. Finally, for each object within C, an end-biased test is run to see whether the object
is actually accepted by the query.
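Plugging representative numbers into the expression for minSam from step 2 shows how long a run of all-false samples is needed before the first rejection SPRT must reject (the helper name is ours):

```python
import math

def min_sam(p, eps, alpha_prime, beta_prime):
    """Number of consecutive pred() == false samples after which the
    first rejection SPRT is guaranteed to reject (the floor expression
    in step 2 above)."""
    return math.floor(math.log(beta_prime / (1 - alpha_prime)) /
                      math.log((1 - p - eps) / (1 - p + eps)))

# With eps = 1e-5, even modest per-test error rates demand a very long
# run of all-false samples before rejection is certain.
print(min_sam(p=0.5, eps=1e-5, alpha_prime=0.01, beta_prime=0.01))
```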
In the following few subsections, we discuss some of the details associated with each of
these steps.
5.4.2 Building the Index
The first issue we consider is how to construct the index for the pre-computed
samples. For each database object having d attributes that may be accessed by a spatial
range query, index construction results in a series of (d + 1)-dimensional MBRs (minimum
bounding rectangles) being inserted into a spatial index such as an R-Tree. Each MBR has
a lower and upper bound for each of the d probabilistic attributes to be indexed, as well as
a lower bound b′ and an upper bound b on a sample sequence number. Specifically, if an
MBR associated with object obj has a bound (b′, b) and rectangle R, this means the first b
pre-computed samples produced via calls to obj.GetInstance() all fell within R.
In addition, a pseudo-random seed value S is also stored along with the MBR.³ S is
the seed used to produce all obj.GetInstance() values starting at sample number
b′. Storing this pair is of key importance. As we describe subsequently, during query
evaluation S can be used to re-create some of the samples that were bounded within R.

Given this setup, to construct a series of MBRs for a given object, the following
method is used. For a given number of pre-computed samples m,⁴ we first store the pair
(S, b′) where S is the initial pseudo-random number seed and b′ = 1. We then use S to
obtain two initial samples and bound them using the rectangle R. After this initialization,
the following sequence of operations is repeated until m samples have been taken:
1. First, we obtain as many samples as are needed until a new sample is obtained that
cannot fit into R.

2. Let b be the current number of samples that have been obtained. Create a
(d + 1)-dimensional MBR using R along with the sequence number pair (b′, b − 1), and
insert this MBR along with the current S and the object identifier into the spatial
index.

3. Next, update R by expanding it so that it contains the new sample. Update S to be
the current random number seed, and set b′ = b.

4. Repeat from step (1).
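The construction loop above can be sketched in one dimension (a simplification of ours, with Python's generator state standing in for the stored seed S):

```python
import random

def build_mbrs(sample_fn, m, seed):
    """Build 1-D 'MBRs' (intervals) over a pseudo-random sample stream.
    Each emitted entry is ((lo, hi), (b_lo, b_hi), S): the interval, the
    sample-sequence range it covers, and the generator state S that can
    regenerate the stream starting at sample number b_lo."""
    rng = random.Random(seed)
    S = rng.getstate()                   # state that produces sample 1
    b_lo, lo, hi = 1, None, None
    mbrs = []
    for b in range(1, m + 1):
        state_before = rng.getstate()    # state that produces sample b
        x = sample_fn(rng)
        if lo is None:                   # first sample initializes R
            lo = hi = x
        elif not (lo <= x <= hi):        # sample escapes the current R
            if b > 2:                    # snapshot the old MBR first
                mbrs.append(((lo, hi), (b_lo, b - 1), S))
                S, b_lo = state_before, b
            lo, hi = min(lo, x), max(hi, x)
    mbrs.append(((lo, hi), (b_lo, m), S))   # final MBR covers the tail
    return mbrs

mbrs = build_mbrs(lambda r: r.random(), m=50, seed=42)
r = random.Random(42)
full = [r.random() for _ in range(50)]
# Replaying the stored state S regenerates the sample at position b_lo,
# which is what lets the query-time test "skip ahead".
(_, (b_lo, _), S) = mbrs[-1]
replay = random.Random()
replay.setstate(S)
print(replay.random() == full[b_lo - 1])   # True
```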
This process is illustrated pictorially above in Figure 5-4, for a series of one-dimensional
random values, up to b = 16. In this example, we begin by taking two samples during
initialization. We then keep sampling until the fifth sample, which is the first one
³ Since true randomness is difficult and expensive to create on a computer, virtually all applications using Monte Carlo methods make use of pseudo-random number generation [108]. To generate a pseudo-random number, a string of bits (called a seed) is first sent as an argument to a function that uses the bits to produce the next random value. As a side-effect of producing the new random value, the seed itself is updated. This updated seed value is then saved and used to produce the next random value at a later time.
⁴ m would typically be chosen to be just large enough so that, with any reasonable, user-supplied query parameters, it would always be possible to reject a database object where pred(obj.GetInstance()) evaluated to false m times in a row.
Figure 5-5. Using the index to speed the end-biased test
that does not fit into the initial MBR. This completes step (1) above. Then, a
two-dimensional MBR is created to bound the sample sequence range from 1 to 4, as
well as the set of pseudo-random values that have been observed. This MBR is inserted
(along with S) into the spatial index as MBR 1 (step (2)). Next, the fifth sample is used
to enlarge the MBR (step (3)). More samples are taken until it is found that the eighth
sample does not fit into the new MBR (back to step (1)). Then, MBR 2 is created to
cover the first seven samples as well as the sequence range from 5 to 7, and inserted into
the spatial index. The process is repeated until all m samples have been obtained. The
process can be summed up as follows: every time that a new sample forces the current
MBR to grow, a copy of the MBR is first inserted into the index, and then the MBR is
expanded to accommodate the sample.
5.4.3 Processing Queries
To use the resulting index to process a range query R encoded by the predicate
pred(), the minimum number of samples required to reject, minSam, is first computed as
described in Section 4.1. Then, a query Q is issued to the index searching for all MBRs
intersecting R as well as the sample sequence range from 1 to minSam. Due to the way
that the MBRs inserted into the index were constructed, we know that any database
object obj that does not intersect Q can immediately (and implicitly) be rejected, because
the MBR covering the first minSam samples from obj.GetInstance() did not intersect R.
However, we must still deal with those objects that did have an MBR intersecting
Q. For those objects, we run a modified version of the end-biased test that “skips ahead”
as far as possible in the sample sequence—the details of this modified test are not too
hard to imagine, and are left out for brevity. For a given intersecting object, we find
the MBR with the lowest sample sequence range that intersected Q. For example,
consider Figure 5-5; in this example, the MBRs from Object 1 intersect Q. We choose the
“earlier” of the two MBRs, which is the MBR covering the sample sequence range from
6 through 9. Let b′ be the low sample sequence value associated with this MBR, and let
S be the pseudo-random seed value associated with it. To run the modified end-biased
test, we use Algorithm 5-3, as well as the fact that none of the first b′ − 1 samples from
obj.GetInstance() could have been accepted by pred(). Thus, we initialize each of the 2n
SPRTs with b′ − 1 false samples, and start execution at the b′th sample. In this way, we
skip immediately to the first sample sequence number that was likely accepted by pred().
5.5 Experiments
In this section, we experimentally evaluate our proposed algorithms. Specific
questions we wish to answer are:
1. In a realistic environment where a selection predicate must be run over millions of
database objects, how well do standard methods from statistics perform?

2. Can our end-biased test improve upon methods from the statistical literature?

3. Does the proposed indexing framework effectively speed application of the end-biased
test?
Experimental Setup. In each of our experiments, we consider a simple, synthetic
database, which allows us to easily test the sensitivity of the various algorithms to
different data and query characteristics. This database consists of two-dimensional
Gaussians spread randomly throughout a field. For a number of different (query, database)
combinations, we measure the wall-clock execution time required to discover all of the
database objects that fall in a given, rectangular box with probability p. Since in all cases
the database size is large relative to the query result size, the false positive rate we allow is
generally much lower than the false negative rate. The reason is that a false positive rate
of 1% over a 10-million object database could theoretically result in 1 × 10⁵ false positives.
Thus, in all of our queries, we use a false positive rate of (number of database objects)⁻¹ and
a false negative rate of 10⁻², since the average result size is quite small, so a relatively high
false drop rate is acceptable. The ε value we use in our experiments is 10⁻⁵.
Given this setup, there are four different parameters that we vary to measure the
effect on query execution time:
1. dbSize: the size of the database, in terms of the number of objects.
2. stdDev: the standard deviation of an average Gaussian stored in the database,
along each dimension. Since this controls the spread of each object’s PDF, if the
value is small, then effectively the database objects are all very far apart from each
other. As the value grows, the objects effectively get closer to one another, until they
eventually overlap.
3. qSize: this is the size of the query box.
4. p: this is the user-supplied p value that is used to accept or reject objects.
We run four separate tests:
1. In the first test, stdDev is fixed at 10% of the width of the field, qSize along each
axis is fixed at 3% of the width of the field, and p is set at 0.8. Thus, many database
objects intersect each query, but likely none are accepted. dbSize is varied from 10⁶
to 3 × 10⁶ to 5 × 10⁶ to 10⁷.

2. In the second test, dbSize is fixed at 10⁷. stdDev is fixed at 1%, and p is 0.95.
qSize is varied from 0.3% to 1% to 3% to 10% along each axis. In the first case,
most database objects intersecting the query region are accepted; in the latter, none
are, since the object’s spread is much greater than the query region.

3. In the third test, dbSize is fixed at 10⁷, qSize is 3%, p = 0.8, and stdDev is varied
from 1% to 3% to 10%.

4. In the final test, dbSize is 10⁷, qSize is 3%, stdDev is 10%, and p is varied from 0.8
to 0.9 to 0.95. The first case is particularly difficult because while very few objects
are accepted, the spread of each object is so great that most are candidates for
acceptance.

Table 5-1. Running times over varying database sizes.
Method       10⁶       3 × 10⁶    5 × 10⁶    10⁷
SPRT         568 sec   1700 sec   2824 sec   5653 sec
Opt          2656 sec  8517 sec   14091 sec  26544 sec
End-biased   9 sec     24 sec     38 sec     76 sec
Indexed      1 sec     3 sec      7 sec      15 sec

Table 5-2. Running times over varying query sizes.
Method       0.3%      1%         3%         10%
SPRT         1423 sec  1420 sec   1427 sec   3265 sec
End-biased   76 sec    75 sec     75 sec     430 sec
Indexed      11 sec    4 sec      4 sec      962 sec

Table 5-3. Running times over varying object standard deviations.
Method       1%        3%         10%
SPRT         5734 sec  5608 sec   5690 sec
End-biased   116 sec   75 sec     75 sec
Indexed      107 sec   12 sec     15 sec

Table 5-4. Running times over varying confidence levels.
Method       0.8       0.9        0.95
SPRT         5672 sec  2869 sec   1436 sec
End-biased   75 sec    75 sec     75 sec
Indexed      14 sec    12 sec     13 sec
Each test is run several times, and results are averaged across all runs.
Methods Tested. For each of the above tests, we test four methods: the SPRT, an
alternative sequential test that is approximately, asymptotically optimal [113], the
end-biased test via sequential scan, and the end-biased test via indexing. In practice, we
found the optimal test to be so slow that it was only used for the first set of tests.
Results. The results are given in Tables 5-1 through 5-4. All times are wall-clock running
times in seconds. The raw data files for a database of size 10⁷ required about 500MB of
storage. The indexed, pre-sampled version of this data file requires around 7GB to store in
its entirety if 500 samples are used.
Discussion.
There are several interesting findings. First and foremost is the terrible relative
performance of the “optimal” sequential test, which was generally about five times slower
than Wald’s classic SPRT. The results are so bad that we removed this option after
the first set of experiments. Since we were quite curious about the poor performance,
we ran some additional, exploratory experiments using the optimal test and found
that it can be better than the SPRT, particularly in cases where H0 and H1 were far
apart. Unfortunately, in our application ε is chosen to be quite small and under such
circumstances the optimal test is quite useless. The poor results associated with this test
illustrate quite strongly how asymptotic statistical theory is often quite far away from
practice.
On the other hand, the end-biased test always far outperformed the SPRT, sometimes
by almost two orders of magnitude. This is perhaps not surprising given the fact that,
by design, the end-biased test can quickly reject the multitude of objects where π ≈ 0.
The spread between the two tests was particularly significant for the third set of
experiments, which tests the effect of the object standard deviation (or size) on the
running time. It was interesting that the end-biased test performed better with higher
standard deviation—as the object size increases, fewer objects are accepted by the query
box, which cannot encompass enough of the PDF. The end-biased test appears to be
particularly adept at rejecting such objects quickly. However, the SPRT performance
seemed to be invariant to the size of the object.
Another interesting finding was the sensitivity of the SPRT to the p parameter.
For lower confidence, the test was far more expensive. This is because as the confidence
is lowered, the actual probability that an object is in the query box gets closer to the
user-defined input parameter. As this happens, the SPRT has a harder time rejecting the
object.
The results regarding the index were informative as well. It is not surprising that
the indexed method was almost always the fastest of the four. For the 10 million record
database, it seems that the standard end-biased test bottoms out at a sequential scan
plus processing time of about 70 seconds. However, the indexed, end-biased method is able
to cut this baseline cost down to under ten seconds for the same database size—though
the time taken from query to query tended to vary a lot more for the index than for the other
methods.
It is interesting that in the cases where the regular end-biased test becomes more
expensive than its 70-second baseline (for example, consider the first column in Table 5-3),
the indexed version also suffers to almost the same extent. This is not surprising. The
reason for an increased cost for the un-indexed version is that a large number of objects
were encountered that required many samples. Perhaps there were even a few objects
that required an extreme number of samples—numbers in the billions happen occasionally
when π is very close to p. The indexed version is no better than the un-indexed version
in this case; it cannot dismiss such an expensive object outright using the index, and the
few, pre-computed, indexed samples it has access to are useless if millions of samples are
eventually required to accept or reject the object.
Perhaps the most interesting case with respect to the index is the fourth column of
Table 5-2, where the indexed end-biased test actually doubles the running time of the
un-indexed version. The explanation here seems to be that this particular query has
the largest result set size. The size is so large, in fact, that use of the index induces an
overhead and actually slows query evaluation—a phenomenon that is always possible in
database indexing.
5.6 Related Work
Since the SPRT was first proposed by Wald in 1947 [111], sequential statistics
have been widely studied. Wald’s original SPRT is proven to be optimal for values
lying exactly at H0 or H1; in other cases, it may perform poorly. Kiefer and Weiss
first raised the question of developing tests that are provably optimal at points other
than those covered by H0 and H1 [114]. However, in the general case, this problem has
not been solved, though there has been some success in solving the asymptotic case
where (α = β) → 0. Such a solution was first proposed by Lorden in 1976 [115] where
he showed that Kiefer-Weiss optimality can be achieved asymptotically under certain
conditions. Well-known follow-up work is due to Eisenberg [116], Huffman [113], and
Pavlov [117]. Work in this area continues to this day. However, a reasonable criticism
of much of this work is its focus on asymptotic optimality—particularly its focus on
applications having vanishingly small (and equal) values of α and β. It is unclear how
well such asymptotically optimal tests perform in practice, and the statistical research
literature provides surprisingly little in the way of guidance, which was our motivation
for undertaking this line of work. In our particular application, α and β are not equal (in
fact, they will most often differ by orders of magnitude), and it is unclear to us whether
practical and clearly superior alternatives to Wald’s original proposal exist in such a case.
In contrast to related work in pure and applied statistics, we seek no notion of optimality;
our goal is to design a test that is (a) correct, and (b) experimentally proven to work
well in the case where ε is vanishingly small, and yet more often than not, the “true”
probability of object inclusion is either zero or one.
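To make the sequential idea concrete, the following is a minimal sketch of Wald's classical SPRT [111] for a Bernoulli proportion, the test from which the end-biased procedure departs. The function name, parameter values, and the cutoff of one million samples are our own illustrative choices, not part of the dissertation's proposal:

```python
import math
import random

def wald_sprt(sample, p0, p1, alpha, beta, max_steps=1_000_000):
    """Wald's SPRT for H0: p = p0 vs. H1: p = p1 over Bernoulli draws.

    `sample` is a zero-argument callable returning 0 or 1.
    Returns "H0", "H1", or "undecided" if max_steps is exhausted.
    """
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    llr = 0.0                                # running log-likelihood ratio
    for _ in range(max_steps):
        if sample():
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr <= lower:
            return "H0"
        if llr >= upper:
            return "H1"
    return "undecided"

random.seed(42)
# Illustrative run: true inclusion probability 0.9, tested with
# p0 = 0.5 vs. p1 = 0.8 and asymmetric error rates, as in our setting.
decision = wald_sprt(lambda: random.random() < 0.9,
                     p0=0.5, p1=0.8, alpha=0.01, beta=0.001)
```

The two thresholds are Wald's standard approximations; they roughly bound the realized false positive and false negative rates by the nominal α and β, and the test uses constant memory since only the running log-likelihood ratio is retained.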
In the database literature, the paper that is closest to our indexing proposal is due to
Tao et al. [118]. They consider the problem of indexing spatial data where the position is
defined by a PDF. However, they assume that the PDF is everywhere finite and integrable.
The assumption of finiteness may be reasonable for many applications (since many
distributions, such as the Gaussian, fall off exponentially in density as distance from the
mean increases). However, integrability is a strong assumption, precluding, for example,
many distributions resulting from Bayesian inference [62] that can only be sampled from
using MCMC methods [107].
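The phrase "can only be sampled from using MCMC methods" can be illustrated with a minimal random-walk Metropolis sketch: the target density need only be evaluable up to a normalizing constant, so there is nothing to integrate (or index) directly. The function and the target below are hypothetical illustrations, not drawn from [107]:

```python
import math
import random

def metropolis_hastings(log_density, x0, steps, step_size=1.0):
    """Random-walk Metropolis sampler for a 1-D target known only up to
    a constant, e.g. a Bayesian posterior with no closed-form integral."""
    x, samples = x0, []
    for _ in range(steps):
        proposal = x + random.gauss(0.0, step_size)
        delta = log_density(proposal) - log_density(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if random.random() < math.exp(min(0.0, delta)):
            x = proposal
        samples.append(x)
    return samples

random.seed(0)
# Hypothetical target: a standard normal, supplied only via its
# unnormalized log-density.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, steps=20000)
mean = sum(draws) / len(draws)
```

An index that presumes closed-form integration of the PDF cannot be built over such a distribution; all one can do is draw samples like these, which is exactly the regime our Monte Carlo test targets.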
Most of the work in probabilistic databases is at least tangentially related to our own,
in that our goal is to represent uncertainty. We point the reader to Dalvi and Suciu for a
general treatment of the topic [119]. The paper most closely related to this work is due to
Jampani et al. [109] who propose a data model for uncertain data that is fundamentally
based upon Monte Carlo.
5.7 Summary
In this chapter, we have considered the problem of using Monte Carlo methods to
test which objects in a probabilistic database are accepted by a query predicate. Our two
primary contributions are (1) a new sequential test of whether or not an object is
accepted by the query predicate, one that strictly controls both the false positive and
false negative rates, and (2) an indexing methodology for the test.
The test was found to work quite well in practice, and the index is very successful in
speeding the test's application.
We close the chapter by pointing out that our goal was not to make a contribution to
statistical theory, and arguably, we have not! Most of the relevant statistical literature is
concerned with various definitions of optimality, and while our new test is correct, there
is no sense in which it is optimal. However, we believe that our new test is of practical
significance to the implementation of probabilistic databases. The experimental evidence
that it works well is strong, and there is also strong intuition behind the design of the
test. In practice, the new test outperforms an oft-cited “optimal” test from the relevant
statistical literature for database selection problems, and while a more appropriate test
may exist, we are unaware of a more suitable candidate for solving the problem at hand.
CHAPTER 6
CONCLUDING REMARKS
With the increasing popularity of tracking devices, and the decreasing cost of storage,
large spatiotemporal data collections are becoming increasingly commonplace.
Extending current database systems to support such collections requires the development
of new solutions. The work presented in this study represents a small step in that direction.
The main theme of this work was the development of scalable and efficient algorithms for
processing historical spatiotemporal data, particularly in a warehouse context. While
this work addresses some important research issues, it also opens new avenues for future
research. Some potential directions for further development include:
• The CPA-join focused on historical queries over two spatiotemporal relations. Extending the work to support predictive queries would be an interesting exercise. Unlike historical queries that span long time intervals, predictive queries are often interested in short time intervals. This could make the use of indexes potentially more attractive.
• The version of entity resolution considered in this work assumed simple binary sensors that provide limited information about the tracked objects. This limits the ability of the algorithms to discriminate between closely moving objects. The accuracy could be improved, however, by considering sensors that provide a richer feature set (such as sensors that provide additional color information). This would give the algorithms an additional dimension along which to differentiate the observation clouds.
• Finally, we focused only on answering probabilistic spatiotemporal selection queries using the end-biased test. However, statistical hypothesis testing, on which the end-biased test is built, is a basic technique used in many fields of science and engineering. Hence, the end-biased algorithm proposed in this work has potentially broad applicability beyond probabilistic databases.
REFERENCES
[1] R. Guting and M. Schneider, Moving Object Databases, Morgan Kaufmann, 2005.
[2] D. Papadias, D. Zhang, and G. Kollios, Advances in Spatial and Spatio-Temporal Data Management, Springer-Verlag, 2007.
[3] J. Schiller and A. Voisard, Location-Based Services, Morgan Kaufmann, 2004.
[4] Y. Zhang and O. Wolfson, Satellite-Based Information Services, Kluwer Academic Publishers, 2002.
[5] W. I. Grosky, A. Kansal, S. Nath, J. Liu, and F. Zhao, “SenseWeb: An infrastructure for shared sensing,” in IEEE Multimedia, 2007.
[6] Cover, “Mandate for change,” RFID Journal, 2004.
[7] G. Abdulla, T. Critchlow, and W. Arrighi, “Simulation data as data streams,” SIGMOD Record 33(1):89–94, 2004.
[8] N. Pelekis, B. Theodoulidis, I. Kopanakis, and Y. Theodoridis, “Literature review of spatio-temporal database models,” in The Knowledge Engineering Review, 2004.
[9] A. P. Sistla, O. Wolfson, S. Chamberlain, and S. Dao, “Modeling and querying moving objects,” in ICDE, 1997.
[10] M. Erwig, R. Guting, M. Schneider, and M. Vazirgiannis, “A foundation for representing and querying moving objects,” in TODS, 2000.
[11] L. Forlizzi, R. H. Guting, E. Nardelli, and M. Schneider, “A data model and data structures for moving objects databases,” in SIGMOD, 2000.
[12] C. Parent, S. Spaccapietra, and E. Zimanyi, “Spatio-temporal conceptual models: Data structures + space + time,” in GIS, 1999.
[13] N. Tryfona, R. Price, and C. S. Jensen, “Conceptual models for spatio-temporal applications,” in The CHOROCHRONOS Approach, 2002.
[14] E. Tossebro, “Representing uncertainty in spatial and spatiotemporal databases,” PhD thesis, 2002.
[15] M. Erwig and M. Schneider, “STQL: A spatio-temporal query language,” in Mining Spatio-Temporal Information Systems, 2002.
[16] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984.
[17] S. Saltenis, C. Jensen, S. Leutenegger, and M. Lopez, “Indexing the positions of continuously moving objects,” in SIGMOD, 2000.
[18] P. Chakka, A. Everspaugh, and J. Patel, “Indexing large trajectory data sets with SETI,” in CIDR, 2003.
[19] J. Patel, Y. Chen, and P. Chakka, “STRIPES: An efficient index for predicted trajectories,” in SIGMOD, 2004.
[20] Y. Theodoridis, “Spatio-temporal indexing for large multimedia applications,” in IEEE Int’l Conference on Multimedia Computing and Systems, 1996.
[21] D. Pfoser, C. S. Jensen, and Y. Theodoridis, “Novel approaches to the indexing of moving object trajectories,” in VLDB, 2000.
[22] T. Tzouramanis, M. Vassilakopoulos, and Y. Manolopoulos, “Overlapping linear quadtrees: A spatio-temporal access method,” in Advances in GIS, 1998.
[23] G. Kollios, D. Gunopulos, and V. J. Tsotras, “Nearest neighbor queries in a mobile environment,” in Spatio-Temporal Database Management, 1999.
[24] Z. Song and N. Roussopoulos, “K-nearest neighbor search for moving query point,” in Symp. on Spatial and Temporal Databases, 2001.
[25] Z. Huang, H. Lu, B. Ooi, and A. Tung, “Continuous skyline queries for moving objects,” in TKDE, 2006.
[26] G. Kollios, D. Gunopulos, and V. J. Tsotras, “An improved R-tree indexing for temporal spatial databases,” in SDH, 1990.
[27] Y. Tao and D. Papadias, “MV3R-tree: A spatio-temporal access method for timestamp and interval queries,” in VLDB, 2001.
[28] M. A. Nascimento and J. R. O. Silva, “Towards historical R-trees,” in ACM SAC, 1998.
[29] G. Iwerks, H. Samet, and K. P. Smith, “Maintenance of spatial semijoin queries on moving points,” in VLDB, 2004.
[30] S. Arumugam and C. Jermaine, “Closest-point-of-approach join over moving object histories,” in ICDE, 2006.
[31] Y. Choi and C. Chung, “Selectivity estimation for spatio-temporal queries to moving objects,” in SIGMOD, 2002.
[32] M. Schneider, “Evaluation of spatio-temporal predicates on moving objects,” in ICDE, 2005.
[33] Y. Tao, J. Sun, D. Papadias, and G. Kollios, “Analysis of predictive spatio-temporal queries,” in TODS, 2003.
[34] J. Sun, Y. Tao, D. Papadias, and G. Kollios, “Spatio-temporal join selectivity,” in Information Systems, 2006.
[35] M. Vlachos, G. Kollios, and D. Gunopulos, “Discovering similar multidimensional trajectories,” in ICDE, 2002.
[36] J. Kubica, A. Moore, A. Connolly, and R. Jedicke, “A multiple tree algorithm for the efficient association of asteroid observations,” in KDD, 2005.
[37] S. Gaffney and P. Smyth, “Trajectory clustering with mixtures of regression models,” in KDD, 1999.
[38] Y. Li, J. Han, and J. Yang, “Clustering moving objects,” in KDD, 2004.
[39] J. Lee, J. Han, and K. Whang, “Trajectory clustering: A partition-and-group framework,” in SIGMOD, 2007.
[40] D. Guo, J. Chen, A. MacEachren, and K. Liao, “A visualization system for space-time and multivariate patterns,” in IEEE Transactions on Visualization and Computer Graphics, 2006.
[41] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, “Indexing spatio-temporal data warehouses,” in ICDE, 2002.
[42] N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. Cheung, “Mining, indexing, and querying historical spatiotemporal data,” in KDD, 2004.
[43] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias, “Spatio-temporal aggregation using sketches,” in ICDE, 2004.
[44] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang, “Historical spatio-temporal aggregation,” in Trans. of Information Systems, 2005.
[45] T. Brinkhoff, H. P. Kriegel, and B. Seeger, “Efficient processing of spatial joins using R-trees,” in SIGMOD, 1993.
[46] Y. W. Huang, N. Jing, and E. A. Rundensteiner, “Spatial joins using R-trees: Breadth-first traversal with global optimizations,” in VLDB, 1997.
[47] M. Lo and C. V. Ravishankar, “Spatial hash joins,” in SIGMOD, 1996.
[48] J. Patel and D. DeWitt, “Partition based spatial-merge join,” in SIGMOD, 1996.
[49] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter, “Scalable sweeping-based spatial join,” in VLDB, 1998.
[50] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, Springer-Verlag, 2000.
[51] S. H. Jeong, N. W. Paton, A. Fernandes, and T. Griffiths, “An experimental performance evaluation of spatio-temporal join strategies,” in Transactions in GIS, 2004.
[52] W. Winkler, “Matching and record linkage,” in Business Survey Methods, 1995.
[53] M. Hernandez and S. Stolfo, “The merge/purge problem for large databases,” in SIGMOD, 1995.
[54] A. Monge and C. Elkan, “The field matching problem: Algorithms and applications,” in KDD, 1996.
[55] W. Cohen and J. Richman, “Learning to match and cluster large high-dimensional data sets for data integration,” in KDD, 2002.
[56] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association, Academic Press, 1988.
[57] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications, Artech House Publishers, 2004.
[58] D. B. Reid, “An algorithm for tracking multiple targets,” in IEEE Trans. Automat. Control, 1979.
[59] X. Li, “The PDF of nearest neighbor measurement and a probabilistic nearest neighbor filter for tracking in clutter,” in IEEE Control and Decision Conference, 1993.
[60] I. Cox and S. L. Hingorani, “An efficient implementation of Reid’s multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking,” in Intl. Conf. on Pattern Recognition, 1994.
[61] T. Song, D. Lee, and J. Ryu, “A probabilistic nearest neighbor filter algorithm for tracking in a clutter environment,” in Signal Processing, Elsevier Science, 2005.
[62] A. O’Hagan and J. J. Forster, Bayesian Inference, Volume 2B of Kendall’s Advanced Theory of Statistics, Arnold, second edition, 2004.
[63] A. Doucet, C. Andrieu, and S. Godsill, “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statistics and Computing, vol. 10, pp. 197–208, 2000.
[64] D. Fox, J. Hightower, L. Liao, D. Schulz, and G. Borriello, “Bayesian filtering for location estimation,” in IEEE Pervasive Computing, 2003.
[65] Z. Khan, T. Balch, and F. Dellaert, “An MCMC-based particle filter for multiple interacting targets,” in ECCV, 2004.
[66] S. Oh, S. Russell, and S. Sastry, “Markov chain Monte Carlo data association for general multiple-target tracking problems,” in IEEE Conf. on Decision and Control, 2004.
[67] O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, and G. Mendez, “Cost and imprecision in modeling the precision of moving objects,” in ICDE, 1998.
[68] D. Pfoser, “Capturing the uncertainty of moving objects,” in LNCS, 1999.
[69] J. H. Hosbond, S. Saltenis, and R. Ortfort, “Indexing uncertainty of continuously moving objects,” in IDEAS, 2003.
[70] G. Trajcevski, O. Wolfson, K. Hinrichs, and S. Chamberlain, “Managing uncertainty in moving object databases,” in TODS, 2004.
[71] R. Cheng, D. Kalashnikov, and S. Prabhakar, “Querying imprecise data in moving object environments,” in TKDE, 2004.
[72] Y. Tao, R. Cheng, and X. Xiao, “Indexing multidimensional uncertain data with arbitrary probability density functions,” in VLDB, 2005.
[73] H. Mokhtar and J. Su, “Universal trajectory queries on moving object databases,” in Mobile Data Management, 2004.
[74] D. Eberly, 3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics, Morgan Kaufmann, 2001.
[75] M. Mokbel, X. Xiong, and W. Aref, “SINA: Scalable incremental processing of continuous queries in spatio-temporal databases,” in SIGMOD, 2004.
[76] Y. Tao, “Time-parametrized queries in spatio-temporal databases,” in SIGMOD, 2004.
[77] S. Saltenis and C. Jensen, “Indexing of moving objects for location-based services,” in ICDE, 2002.
[78] Y. Tao, D. Papadias, and J. Sun, “The TPR*-tree: An optimized spatio-temporal access method for predictive queries,” in VLDB, 2003.
[79] O. Gunther, “Efficient computation of spatial joins,” in ICDE, 1993.
[80] S. Leutenegger and J. Edgington, “STR: A simple and efficient algorithm for R-tree packing,” in 13th Intl. Conf. on Data Engineering (ICDE), 1997.
[81] D. Mehta and S. Sahni, Handbook of Data Structures and Applications, Chapman and Hall, 2004.
[82] P. J. Haas and J. M. Hellerstein, “Ripple joins for online aggregation,” in SIGMOD, 1999.
[83] M. Nascimento and J. Silva, “Evaluation of access structures for discretely moving points,” in Int’l Workshop on Spatio-Temporal Database Management, 1999.
[84] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood estimation from incomplete data via the EM algorithm,” in Journ. Royal Statistical Society, 1977.
[85] J. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” Technical Report, Univ. of Berkeley, 1997.
[86] J. Banfield and A. Raftery, “Model-based Gaussian and non-Gaussian clustering,” in Biometrics, 1993.
[87] J. Oliver, R. Baxter, and C. Wallace, “Unsupervised learning using MML,” in ICML, 1996.
[88] M. Hansen and B. Yu, “Model selection and the principle of minimum description length,” in Journal of the American Statistical Association, 1998.
[89] M. Figueiredo and A. Jain, “Unsupervised learning of finite mixture models,” in IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.
[90] R. Baxter, “Minimum message length inference: Theory and applications,” PhD thesis, 1996.
[91] G. Celeux, S. Chretien, F. Forbes, and A. Mkhadri, “A component-wise EM algorithm for mixtures,” in Journ. of Computational and Graphical Statistics, 1999.
[92] D. Pfoser and C. Jensen, “Trajectory indexing using movement constraints,” in GeoInformatica, 2005.
[93] Y. Cai and R. Ng, “Indexing spatio-temporal trajectories with Chebyshev polynomials,” in SIGMOD, 2004.
[94] S. Rasetic and J. Sander, “A trajectory splitting model for efficient spatio-temporal indexing,” in VLDB, 2005.
[95] D. Chudova, S. Gaffney, E. Mjolsness, and P. Smyth, “Translation-invariant mixture models for curve clustering,” in KDD, 2003.
[96] H. Kriegel and M. Pfeifle, “Density-based clustering of uncertain data,” in KDD, 2005.
[97] J. L. Bentley, “K-d trees for semidynamic point sets,” in Annual Symposium on Computational Geometry, 1990.
[98] L. Frenkel and M. Feder, “Recursive expectation maximization algorithms for time-varying parameters with applications to multiple target tracking,” in IEEE Trans. Signal Processing, 1999.
[99] P. Chung, J. Bohme, and A. Hero, “Tracking of multiple moving sources using recursive EM algorithm,” in EURASIP Journal on Applied Signal Processing, 2005.
[100] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom, “Trio: A system for data, uncertainty, and lineage,” in VLDB, 2006.
[101] P. Andritsos, A. Fuxman, and R. J. Miller, “Clean answers over dirty databases: A probabilistic approach,” in ICDE, 2006, p. 30.
[102] L. Antova, C. Koch, and D. Olteanu, “MayBMS: Managing incomplete information with probabilistic world-set decompositions,” in ICDE, 2007, pp. 1479–1480.
[103] R. Cheng, S. Singh, and S. Prabhakar, “U-DBMS: A database system for managing constantly-evolving data,” in VLDB, 2005, pp. 1271–1274.
[104] N. N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” VLDB J., vol. 16, no. 4, pp. 523–544, 2007.
[105] N. Fuhr and T. Rolleke, “A probabilistic relational algebra for the integration of information retrieval and database systems,” ACM Trans. Inf. Syst., vol. 15, no. 1, pp. 32–66, 1997.
[106] R. Gupta and S. Sarawagi, “Creating probabilistic databases from information extraction models,” in VLDB, 2006, pp. 965–976.
[107] C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, second edition, 2004.
[108] J. E. Gentle, Random Number Generation and Monte Carlo Methods, Springer, second edition, 2003.
[109] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas, “MCDB: A Monte Carlo approach to managing uncertain data,” in SIGMOD, 2008.
[110] J. Neyman and E. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Phil. Trans. of the Royal Soc. of London, Series A, vol. 231, pp. 289–337, 1933.
[111] A. Wald, Sequential Analysis, Wiley, 1947.
[112] J. Galambos and I. Simonelli, Bonferroni-Type Inequalities with Applications, Springer-Verlag, 1996.
[113] M. Huffman, “An efficient approximate solution to the Kiefer-Weiss problem,” in The Annals of Statistics, 1983, vol. 11, pp. 306–316.
[114] J. Kiefer and L. Weiss, “Some properties of generalized sequential probability ratio tests,” in The Annals of Mathematical Statistics, 1957, vol. 28, pp. 57–74.
[115] G. Lorden, “2-SPRTs and the modified Kiefer-Weiss problem of minimizing an expected sample size,” in The Annals of Statistics, 1976, vol. 4, pp. 281–291.
[116] B. Eisenberg, “The asymptotic solution to the Kiefer-Weiss problem,” in Comm. Statistics C-Sequential Analysis, 1982, vol. 1, pp. 81–88.
[117] I. Pavlov, “Sequential procedure of testing composite hypotheses with application to the Kiefer-Weiss problem,” in Theory of Probability and Its Applications, 1991, vol. 35, pp. 280–292.
[118] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, “Indexing multi-dimensional uncertain data with arbitrary probability density functions,” in VLDB, 2005, pp. 922–933.
[119] N. Dalvi and D. Suciu, “Management of probabilistic data: Foundations and challenges,” in PODS, 2007, pp. 1–12.
BIOGRAPHICAL SKETCH
Subramanian Arumugam is a member of the query processing team at the database
startup Greenplum. He is a recipient of the 2007 ACM SIGMOD Best Paper Award.
He received his bachelor's degree from the University of Madras in 2000. He obtained his
master's degree in computer engineering in 2003, and his PhD in computer engineering in
2008, both from the University of Florida.