
© 2008 Subramanian Arumugam


    To my parents.


    ACKNOWLEDGMENTS

First of all, I would like to thank my advisor, Chris Jermaine. This dissertation would not have been possible had it not been for his excellent mentoring and guidance through the years. Chris is a terrific teacher, a critical thinker, and a passionate researcher. He has served as a great role model and has helped me mature as a researcher. I cannot thank him enough for that.

My thanks also go to Prof. Alin Dobra. Through the years, Alin has been a patient listener and has helped me structure and refine my ideas countless times. His excitement for research is contagious!

I would like to take this opportunity to mention my colleagues at the database center: Amit, Florin, Fei, Luis, Mingxi, and Ravi. I have had many hours of fun discussing interesting problems with them. Special thanks go to my friends Manas, Srijit, Arun, Shantanu, and Seema for making my stay in Gainesville all the more enjoyable.

Finally, I would like to thank my parents for being a source of constant support and encouragement throughout my studies.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
1.1 Motivation
1.2 Research Landscape
1.2.1 Data Modeling and Database Design
1.2.2 Access Methods
1.2.3 Query Processing
1.2.4 Data Analysis
1.2.5 Data Warehousing
1.3 Main Contributions
1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories
1.3.2 Entity Resolution in Spatiotemporal Databases
1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases

2 BACKGROUND
2.1 Spatiotemporal Join
2.2 Entity Resolution
2.3 Probabilistic Databases

3 SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES
3.1 Motivation
3.2 Background
3.2.1 Moving Object Trajectories
3.2.2 Closest Point of Approach (CPA) Problem
3.3 Join Using Indexing Structures
3.3.1 Trajectory Index Structures
3.3.2 R-tree Based CPA Join
3.4 Join Using Plane-Sweeping
3.4.1 Basic CPA Join Using Plane-Sweeping
3.4.2 Problem with the Basic Approach
3.4.3 Layered Plane-Sweep
3.5 Adaptive Plane-Sweeping
3.5.1 Motivation
3.5.2 Cost Associated with a Given Granularity
3.5.3 The Basic Adaptive Plane-Sweep
3.5.4 Estimating Cost
3.5.5 Determining the Best Cost
3.5.6 Speeding Up the Estimation
3.5.7 Putting It All Together
3.6 Benchmarking
3.6.1 Test Data Sets
3.6.2 Methodology and Results
3.6.3 Discussion
3.7 Related Work
3.8 Summary

4 ENTITY RESOLUTION IN SPATIOTEMPORAL DATABASES
4.1 Problem Definition
4.2 Outline of Our Approach
4.3 Generative Model
4.3.1 PDF for Restricted Motion
4.3.2 PDF for Unrestricted Motion
4.4 Learning the Restricted Model
4.4.1 Expectation Maximization
4.4.2 Learning K
4.5 Learning Unrestricted Motion
4.5.1 Applying a Particle Filter
4.5.2 Handling Multiple Objects
4.5.3 Update Strategy for a Sample Given Multiple Objects
4.5.4 Speeding Things Up
4.6 Benchmarking
4.7 Related Work
4.8 Summary

5 SELECTION OVER PROBABILISTIC SPATIOTEMPORAL RELATIONS
5.1 Problem and Background
5.1.1 Problem Definition
5.1.2 The False Positive Problem
5.1.3 The False Negative Problem
5.2 The Sequential Probability Ratio Test (SPRT)
5.3 The End-Biased Test
5.3.1 What's Wrong with the SPRT?
5.3.2 Removing the Magic Epsilon
5.3.3 The End-Biased Algorithm
5.4 Indexing the End-Biased Test
5.4.1 Overview
5.4.2 Building the Index
5.4.3 Processing Queries
5.5 Experiments
5.6 Related Work
5.7 Summary

6 CONCLUDING REMARKS

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

4-1 Varying the number of objects and its effect on recall, precision, and runtime
4-2 Varying the number of time ticks
4-3 Varying the number of sensors fired
4-4 Varying the standard deviation of the Gaussian cloud
4-5 Varying the number of time ticks where EM is applied
5-1 Running times over varying database sizes
5-2 Running times over varying query sizes
5-3 Running times over varying object standard deviations
5-4 Running times over varying confidence levels


LIST OF FIGURES

3-1 Trajectory of an object (a) and its polyline approximation (b)
3-2 Closest Point of Approach illustration
3-3 CPA illustration with trajectories
3-4 Example of an R-tree
3-5 Heuristic to speed up distance computation
3-6 Issues with R-trees: fast moving object p joins with everyone
3-7 Progression of plane-sweep
3-8 Layered plane-sweep
3-9 Problem with using large granularities for bounding box approximation
3-10 Adaptively varying the granularity
3-11 Convexity of cost function illustration
3-12 Iteratively evaluating k cut points
3-13 Speeding up the optimizer
3-14 Injection data set at time tick 2,650
3-15 Collision data set at time tick 1,500
3-16 Injection data set experimental results
3-17 Collision data set experimental results
3-18 Buffer size choices for Injection data set
3-19 Buffer size choices for Collision data set
3-20 Synthetic data set experimental results
3-21 Buffer size choices for Synthetic data set
4-1 Mapping of a set of observations for linear motion
4-2 Object path (a) and quadratic fit for varying time ticks (b-d)
4-3 Object path in a sensor field (a) and sensor firings triggered by object motion (b)
4-4 The baseline input set (10,000 observations)
4-5 The learned trajectories for the data of Figure 4-4
5-1 The SPRT in action. The middle line is the LRT statistic
5-2 Two spatial queries over a database of objects with Gaussian uncertainty
5-3 The sequence of SPRTs run by the end-biased test
5-4 Building the MBRs used to index the samples from the end-biased test
5-5 Using the index to speed the end-biased test


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

    EFFICIENT ALGORITHMS FOR SPATIOTEMPORAL DATA MANAGEMENT

    By

    Subramanian Arumugam

    August 2008

Chair: Christopher Jermaine
Major: Computer Engineering

    This work focuses on interesting data management problems that arise in the analysis,

modeling, and querying of large-scale spatiotemporal data. Such data naturally arise in the

    context of many scientific and engineering applications that deal with physical processes

    that evolve over time.

    We first focus on the issue of scalable query processing in spatiotemporal databases.

    In many applications that produce a large amount of data describing the paths of moving

    objects, there is a need to ask questions about the interaction of objects over a long

    recorded history. To aid such analysis, we consider the problem of computing joins over

    moving object histories. The particular join studied is the Closest-Point-Of-Approach

    join, which asks: Given a massive moving object history, which objects approached within

    a distance d of one another?

    Next, we study a novel variation of the classic entity resolution problem that

    appears in sensor network applications. In entity resolution, the goal is to determine

    whether or not various bits of data pertain to the same object. Given a large database of

    spatiotemporal sensor observations that consist of (location, timestamp) pairs, our goal is

    to perform an accurate segmentation of all of the observations into sets, where each set is

    associated with one object. Each set should also be annotated with the path of the object

    through the area.


    Finally, we consider the problem of answering selection queries in a spatiotemporal

    database, in the presence of uncertainty incorporated through a probabilistic model.

    We propose very general algorithms that can be used to estimate the probability that a

    selection predicate evaluates to true over a probabilistic attribute or attributes, where

    the attributes are supplied only in the form of a pseudo-random attribute value generator.

This enables the efficient evaluation of queries such as "Find all vehicles that are in close proximity to one another with probability p at time t" using Monte Carlo statistical

    methods.

CHAPTER 1
INTRODUCTION

1.1 Motivation

    Extending modern database systems to support spatiotemporal data is challenging for

    several reasons:

Conventional databases are designed to manage static data, whereas spatiotemporal data describe spatial geometries that change continuously with time. This requires a unified approach to deal with aspects of spatiality and temporality.

Current databases are designed to manage data that is precise. However, uncertainty is often an inherent property of spatiotemporal data, due to the discretization of continuous movement and to measurement errors. The fact that most spatiotemporal data sources (particularly polling- and sampling-based schemes) provide only a discrete snapshot of continuous movement poses new problems for query processing. For example, consider a conventional database record that stores the fact "John Smith earns $200,000" and a spatiotemporal record that stores the fact "John Smith walks from point A to point B" in the form of a discretized ordered pair (A, B). In the former case, a query such as "What is the salary of John Smith?" involves dealing with precise data. On the other hand, a spatiotemporal query such as "Did John Smith walk through point C between A and B?" requires dealing with information that is often not known with certainty. Further compounding the problem is that even the recorded observations are only accurate to within a few decimal places. Thus, even queries such as "Identify all objects located at point A" may not return meaningful results unless allowed a certain margin for error.

Due to the presence of the time dimension, spatiotemporal applications have the potential to produce a large amount of data. The sheer volume of data generated by spatiotemporal applications presents a computational and data management challenge. For instance, it is not uncommon for many scientific processes to produce spatiotemporal data on the order of terabytes or even petabytes [7]. Developing scalable algorithms to support query processing over tera- and petabyte-sized spatiotemporal data sets is a significant challenge.

The semantics of many basic operations in a database change in the presence of space and time. For instance, basic operations like joins typically employ equality predicates in a classic relational database, whereas equality is rare between two arbitrary spatiotemporal objects.

    1.2 Research Landscape

Over the last decade, database researchers have begun to respond to the challenges posed by spatiotemporal data. Most of the research effort is concentrated on supporting either predictive or historical queries. Within this taxonomy, we can further distinguish work based on whether it supports time-instance or time-interval queries.


In predictive queries, the focus is on the future position of the objects, and only a limited time window of the object positions needs to be maintained. On the other hand, for historical queries, the interest is in efficient retrieval of past history, and thus the database needs to maintain the complete timeline of an object's past locations. Due to these divergent requirements, techniques developed for predictive queries are often not suitable for historical queries.

    What follows is a brief tour of the major research areas in spatiotemporal data

    management. For a more complete treatment of this topic, the interested reader is

referred to [1,3].

    1.2.1 Data Modeling and Database Design

    Early research focused on aspects of data modeling and database design for

    spatiotemporal data [8]. Conventional data types employed in existing databases are

    often not suitable to represent spatiotemporal data which describe continuous time-varying

    spatial geometries. Thus, there is a need for a spatiotemporal type system that can model

    continuously moving data. Depending on whether the underlying spatial object has an

    extent or not, abstractions have been developed to model a moving point, line, and region

    in two- and three-dimensional space with time considered as the additional dimension

    [811]. Similarly, early work has also focused on refining existing CASE tools to aid in the

    design of spatiotemporal databases. Existing conceptual tools such as ER diagrams and

    UML present a non-temporal view of the world and extensions to incorporate temporal

and spatial awareness have been investigated [12,13].

    Recently there has been interest in designing flexible type systems that can model

aspects of uncertainty associated with an object's spatial location [14]. There has also been an active effort towards designing SQL language extensions for spatiotemporal data

    types and operations [15].


    1.2.2 Access Methods

    Efficient processing of spatiotemporal queries requires developing new techniques

    for query evaluation, providing suitable access structures and storage mechanisms, and

    designing efficient algorithms for the implementation of spatiotemporal operators.

    Developing efficient access structures for spatiotemporal databases is an important

    area of research. A variety of spatiotemporal index structures have been developed to

support both predictive and historical queries, most based on generalizations of the R-tree [16] to incorporate the time dimension. Indexing structures

    designed to support predictive queries typically manage object movement within a small

    time window and need to handle frequent updates to object locations. A popular choice

    for such applications is the TPR-tree [17] and its many variants.

    On the other hand, index structures designed to support historical queries need to

manage an object's entire past movement trajectory (for this reason they can be viewed as

    trajectory indexes). Depending on the time interval indexed, the sheer volume of data that

needs to be managed presents significant technical challenges for overlap-allowing indexing

    schemes such as R-trees [16]. Thus, there has been interest in revisiting grid/cell-based

    solutions that do not allow overlap, such as SETI [18]. Several tree-based indexing

    structures have been developed such as STRIPES [19], 3D R-trees [20], TB trees [21] and

    linear quad trees [22]. Further, spatiotemporal extensions of several popular queries such

    as nearest-neighbor [23], top-k [24], and skyline [25] have been developed.

    1.2.3 Query Processing

    The development of efficient index structures has also led to a growing body of

research on different types of queries on spatiotemporal data, such as time-instant and range queries [26–28], continuous queries, joins [29,30], and their efficient evaluation

[31,32]. In the same vein, there has also been some preliminary work on optimizing spatiotemporal selection queries [33,34].


    Much of the work focuses specifically on indexing two-dimensional space and/or

supporting time-instance or short time-interval selection queries. Thus, many indexing structures do not scale well to higher-dimensional spaces and have difficulty with queries over long time intervals. Finally, historical data collections may be huge, and joins over such data require new solutions, since the predicates involved are non-traditional (such as closest point of approach, within, and sometimes-possibly-inside).

    1.2.4 Data Analysis

    Spatiotemporal data analysis allows us to obtain interesting insights from the stored

    data collection. For instance:

In a road network database, the history of movement of various objects can be used to understand traffic patterns.

In aviation, the flight paths of various planes can be used in future path planning and in computing minimum separation constraints to avoid collisions.

In wildlife management, one can understand animal migration patterns from the trajectories traced by the animals.

Pollutants can be traced to their source by studying the air flow patterns of aerosols stored as trajectories.

    Research in this area focuses on extending traditional data mining techniques to the

analysis of large spatiotemporal data sets. Topics of interest include discovering similarities among object trajectories [35], data classification and generalization [36], trajectory clustering and rule mining [37–39], and supporting interactive visualization for browsing

    large spatiotemporal collections [40].

    1.2.5 Data Warehousing

Supporting data analysis also requires designing and maintaining large collections of historical spatiotemporal data, which falls under the domain of data warehousing.

    Conventional data warehouses are often designed around the goal of supporting

    aggregate queries efficiently. However, the interesting queries in a spatiotemporal data

    warehouse seek to discover the interaction patterns of moving objects and understand the


    spatial and/or temporal relationships that exist between them. Facilitating such queries

    in a scalable fashion over terabyte-sized spatiotemporal data warehouses is a significant

    challenge. This requires extending traditional data mining techniques to the analysis

    of large spatiotemporal data sets to discover spatial and temporal relationships, which

    might exist at various levels of granularity involving complex data types. Research in

    spatiotemporal data warehousing [41,42] is relatively new and is focused on refining

    existing multidimensional models to support continuous data and defining semantics for

    spatiotemporal aggregation [43,44].

    1.3 Main Contributions

It is clear that extending modern database systems to support data management and analysis of spatiotemporal data requires addressing issues that span almost the entire breadth of database research. A full treatment of the various issues could be the subject of numerous dissertations! To keep the scope of this dissertation manageable, I tackle three

    important problems in spatiotemporal data management. The dissertation focuses on

    data produced by moving objects, since moving object databases represent the most

    common application domain for spatiotemporal databases [1]. The three specific problems

    considered are described briefly in the following subsections.

    1.3.1 Scalable Join Processing over Massive Spatiotemporal Histories

    I first consider the scalability problem in computing joins over massive moving

    object histories. In applications that produce a large amount of data describing the

    paths of moving objects, there is a need to ask questions about the interaction of

    objects over a long recorded history. This problem is becoming especially important

given the emergence of computational, simulation-based science (where simulations of natural phenomena naturally produce massive databases containing data with

    spatial and temporal characteristics), and the increased prevalence of tracking and

    positioning devices such as RFID and GPS. The particular join that I study is the CPA

    (Closest-Point-Of-Approach) join, which asks: Given a massive moving object history,


    which objects approached within a distance d of one another? I carefully consider several

    obvious strategies for computing the answer to such a join, and then propose a novel,

    adaptive join algorithm which naturally alters the way in which it computes the join in

    response to the characteristics of the underlying data. A performance study over two

    physics-based simulation data sets and a third, synthetic data set validates the utility of

    my approach.

    1.3.2 Entity Resolution in Spatiotemporal Databases

Next, I consider the problem of entity resolution for a large database of spatiotemporal

    sensor observations. The following scenario is assumed. At each time-tick, one or more of

    a large number of sensors report back that they have sensed activity at or near a specific

    spatial location. For example, a magnetic sensor may report that a large metal object has

    passed by. The goal is to partition the sensor observations into a number of subsets so

    that it is likely that all of the observations in a single subset are associated with the same

    entity, or physical object. For example, all of the sensor observations in one partition may

correspond to a single vehicle driving across the area that is monitored. The dissertation

    describes a two-phase, learning-based approach to solving this problem. In the first phase,

    a quadratic motion model is used to produce an initial classification that is valid for a

    short portion of the timeline. In the second phase, Bayesian methods are used to learn the

    long-term, unrestricted motion of the underlying objects.

    1.3.3 Selection Queries over Probabilistic Spatiotemporal Databases

    Finally, I consider the problem of answering selection queries in the presence of

    uncertainty incorporated through a probabilistic model. One way to facilitate the

representation of uncertainty in a spatiotemporal database is by allowing tuples to have probabilistic attributes whose actual values are unknown, but are assumed

    to be selected by sampling from a specified distribution. This can be supported by

    including a few, pre-specified, common distributions in the database system when it is

    shipped. However, to be truly general and extensible and support distributions that


    cannot be represented explicitly or even integrated, it is necessary to provide an interface

    that allows the user to specify arbitrary distributions by implementing a function that

    produces pseudo-random samples from the desired distribution. Allowing a user to specify

    uncertainty via arbitrary sampling functions creates several interesting technical challenges

during query evaluation. Specifically, evaluating time-instance selection queries such as "Find all vehicles that are in close proximity to one another with probability p at time t" requires the principled use of Monte Carlo statistical methods to determine whether

    the query predicate holds. To support such queries, the thesis describes new methods

that draw heavily on the relevant statistical theory of sequential estimation. I also

    consider the problem of indexing for the Monte Carlo algorithms, so that samples from the

    pseudo-random attribute value generator can be pre-computed and stored in a structure in

    order to answer subsequent queries quickly.

    Organization. The rest of this study is organized as follows. Chapter 2 provides

    a survey of work related to the problems addressed in this thesis. Chapter 3 tackles the

    scalability issue when processing join queries over massive spatiotemporal databases.

    Chapter 4 describes an approach to handling the entity-resolution problem in cleaning

    spatiotemporal data sources. Chapter 5 describes a simple and general approach to

answering selection queries over spatiotemporal databases that incorporate uncertainty within a probabilistic model framework. Chapter 6 concludes the dissertation by summarizing the contributions and

    identifying potential directions for future work.

CHAPTER 2
BACKGROUND

2.1 Spatiotemporal Join

    To our knowledge, the only prior work on spatiotemporal joins is due to Jeong et al.

    [51]. However, they only consider spatiotemporal join techniques that are straightforward

    extensions to traditional spatial join algorithms. Further, they limit their scope to

    index-based algorithms for objects over limited time windows.

    2.2 Entity Resolution

Research in entity resolution has a long history in databases [52–55] and has focused mainly on integrating non-geometric, string-based data from noisy external sources. Closely

    related to the work in this thesis is the large body of work on target tracking that exists

    in fields as diverse as signal processing, robotics, and computer vision. The goal in target

    tracking [56,57] is to support the real-time monitoring and tracking of a set of moving

    objects from noisy observations.

    Various algorithms to classify observations among objects can be found in the

    target tracking literature. They characterize the problem as one of data association (i.e.

    associating observations with corresponding targets). A brief summary of the main ideas is

    given below.

The seminal work is due to Reid [58], who proposed a multiple hypothesis technique (MHT) to solve the tracking problem. In the MHT approach, a set of hypotheses is maintained, with each hypothesis reflecting a belief about the location of an individual target. When a new set of observations arrives, the hypotheses are updated. Hypotheses with minimal support are deleted, and additional hypotheses are created to reflect new evidence. The main drawback of the approach is that the number of hypotheses can grow exponentially over time. Though heuristic filters [59–61] can be used to bound the search space, this limits the scalability of the algorithm.

Target tracking has also been studied using Bayesian approaches [62]. The Bayesian

    approach views tracking as a state estimation problem. Given some initial state and a

set of observations, the goal is to predict the object's next state. An optimal solution to the problem is given by the Bayes filter [63,64]. Bayes filters produce optimal estimates by


    integrating over the complete set of observations. The formulation is often recursive and

    involves complex integrals that are difficult to solve analytically. Hence, approximation

    schemes such as particle filters [57] and sequential Monte Carlo techniques [63] are often

    used in practice.

Recently, Markov chain Monte Carlo (MCMC) techniques [65,66] have been proposed that attempt to approximate the optimal Bayes filter for multiple-target tracking. MCMC-based methods employ sequential Monte Carlo sampling and have been shown to perform better than existing sub-optimal approaches such as MHT for tracking objects in

    highly cluttered environments.

    A common theme among most of the research in target tracking is its focus on

    accurate tracking and detection of objects in real time in highly cluttered environments

over relatively short time periods. In a data warehouse context, the ability of techniques such as MCMC to make fine-grained distinctions makes them ideal candidates when performing operations such as drilldown that involve analytics over small time windows. Their applicability to entity resolution in a data warehouse is limited, however. In such a context, summarization and visualization of historical trajectories smoothed over long time intervals is often more useful. The model-based approach considered in this work seems a more suitable candidate for such tasks.

    2.3 Probabilistic Databases

    Uncertainty management in spatiotemporal databases is a relatively new area of

    research. Earlier work has focused on aspects of modeling uncertainty and query language

    support [9,67].

In the context of query processing, one of the earliest papers in this area is the paper by Pfoser et al. [68], in which different sources of uncertainty are characterized and

    a probability density function is used to model errors. Hosbond et al. [69] extended this

work by employing a hyper-square uncertainty region, which expands over time, to answer

    queries using a TPR-tree.


Trajcevski et al. [70] study the problem from a modeling perspective. They model

    trajectories by a cylindrical volume in 3D and outline semantics of fuzzy selection queries

    over trajectories in both space and time. However, the approach does not specify how to

choose the dimensions of the cylindrical region, which may have to change over time to account for shrinking or expansion of the underlying uncertainty region.

Cheng et al. [71] describe algorithms for time-instant queries (probabilistic range and nearest neighbor) using an uncertainty model in which a probability density function (PDF) and an uncertainty region are associated with each point object. Given a location in the uncertainty region, the PDF returns the probability of finding the object at that location.

    A similar idea is used by Tao et al. [72] to answer queries in spatial databases. To handle

    time interval queries, Mokhtar et al. [73] represent uncertain trajectories as a stochastic

    process with a time-parametric uniform distribution.


CHAPTER 3
SCALABLE JOIN PROCESSING OVER SPATIOTEMPORAL HISTORIES

    In applications that produce a large amount of data describing the paths of

    moving objects, there is a need to ask questions about the interaction of objects

    over a long recorded history. In this chapter, the problem of computing joins over

    massive moving object histories is considered. The particular join studied is the

    Closest-Point-Of-Approach join, which asks: Given a massive moving object history,

    which objects approached within a distance d of one another?

    3.1 Motivation

In applications which make use of spatial data, it is frequently of interest to ask questions about the interaction between spatial objects. A useful operation that enables one to answer such questions is the spatial join operation. The spatial join is similar to the classical relational join except that it is defined over two spatial relations based on a spatial predicate. The objective of the join operation is to retrieve all object pairs that satisfy a spatial relationship. One common predicate involves distance measures, where we are interested in objects that were within a certain distance of each other. The query "Find all restaurants within a distance of 10 miles from a hotel" is an example of a spatial join.

    For moving objects, the spatial join operation involves the evaluation of both a spatial

    and a temporal predicate and for this reason the join is referred to as a spatiotemporal

    join. For example, consider the spatial relations PLANESandTANKS, where each relation

    represents accumulated trajectory data of planes and tanks from a battlefield simulation.

The query "Find all planes that are within distance 10 miles of a tank" is an example of a

    spatiotemporal join. The spatial predicate in this case restricts the distance (10 miles) and

    the temporal predicate restricts the time period to the current time instance.

    In the more general case, the spatiotemporal join is issued over a moving object

    history, which contains all of the past locations of the objects stored in a database. For


example, consider the query "Find all pairs of planes that came within a distance of 1,000 feet during their flight path." Since there is no restriction on the temporal predicate, answering this query involves an evaluation of the spatial predicate at every time instance.

The amount of data to be processed can be overwhelming. For example, in a typical flight, the flight data recorder stores about 7 MB of data, which records, among other things, the position and time of the flight for every second of its operation. Given that, on average, US Air Traffic Control handles around 30,000 flights in a single day, if all of this data were archived, it would result in roughly 200 GB of data accumulation just

    for a single day. For another example, it is not uncommon for scientific simulations to

output terabytes or even petabytes of spatiotemporal data (see Abdulla et al. [7] and the

    references contained therein).

In this chapter, the spatiotemporal join problem for moving object histories in three-dimensional space, with time considered as the fourth dimension, is investigated. The spatiotemporal join operation considered is the CPA Join (Closest-Point-Of-Approach Join). By closest point of approach, we refer to a position at which two moving objects attain their closest possible distance [74]. Formally, in a CPA Join, we answer queries of the following type: find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that CPA-distance(p, q) ≤ d. The goal is to retrieve all object pairs that are within a distance d at their closest point of approach.

Surprisingly, this problem has not been studied previously. The spatial join problem has been well studied for stationary objects in two- and three-dimensional space [45,47–49]; however, very little work related to spatiotemporal joins can be found in the literature. There has been some work related to joins involving moving objects [75,76], but the work has been restricted to objects in a limited time window and does not consider the problem of joining object histories that may be gigabytes or terabytes in size.


The contributions can be summarized as follows:

Three spatiotemporal join strategies for data involving moving object histories are presented.

Simple adaptations of existing spatial join processing algorithms, based on the R-tree structure and on a plane-sweeping algorithm, are explored for spatiotemporal histories.

To address the problems associated with straightforward extensions of these techniques, a novel join strategy for moving objects based on an extension of the basic plane-sweeping algorithm is described.

A rigorous evaluation and benchmarking of the alternatives is provided. The performance results suggest that we can obtain significant speedups in execution time with the adaptive plane-sweeping technique.

The rest of this chapter is organized as follows: In Section 3.2, the closest point of approach problem is reviewed. In Sections 3.3 and 3.4, two obvious alternatives for implementing the CPA join, using R-trees and plane-sweeping, are described. In Section 3.5, a novel adaptive plane-sweeping technique that considerably outperforms competing techniques is presented. Results from our benchmarking experiments are given in Section 3.6. Section 3.7 outlines related work.

    3.2 Background

In this section, we discuss the motion of moving objects and give an intuitive

    description of the CPA problem. This is followed by an analytic solution to the CPA

    problem over a pair of points moving in a straight line.

    3.2.1 Moving Object Trajectories

Trajectories describe the motion of objects in two- or three-dimensional space. Real-world objects tend to have smooth trajectories, and storing them for analysis often involves approximation by a polyline. A polyline approximation of a trajectory connects object positions, sampled at discrete time instances, by line segments (Figure 3-1).

In a database, the trajectory of an object can be represented as a sequence of the form (t1, v1), (t2, v2), ..., (tn, vn), where each vi represents the position vector of the object at time instance ti.


Figure 3-1. Trajectory of an object (a) and its polyline approximation (b)

The arity of the vector describes the dimension of the space. For flight simulation data, the arity would be 3, whereas for a moving car, the arity would be 2. The positions of the moving objects are normally obtained in one of several ways, such as by sampling or polling the objects at discrete time instances through the use of devices like GPS.
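To make this representation concrete, the following minimal Python sketch (hypothetical names, not the schema used in the dissertation) stores a trajectory as a time-ordered list of (t, v) samples and enumerates the line segments of its polyline approximation:

    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        """Polyline approximation of a moving object's motion."""
        # Time-ordered samples [(t1, v1), (t2, v2), ...]; each vi is a position
        # tuple whose arity matches the dimension of the space (2 or 3).
        samples: list

        def segments(self):
            """Yield the straight-line segments connecting successive samples."""
            for (t0, v0), (t1, v1) in zip(self.samples, self.samples[1:]):
                yield (t0, t1, v0, v1)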

    3.2.2 Closest Point of Approach (CPA) Problem

We are now ready to describe the CPA problem. Let CPA(p, q, d) over two straight-line trajectories be evaluated as follows: assuming the minimum distance between the two objects during their motion is given by mindist, we output true if mindist < d (the objects were within distance d during their motion in space), and false otherwise. We refer to the calculation of CPA(p, q, d) as the CPA problem.

The minimum distance mindist between two objects is the distance between the object positions at their closest point of approach. It is straightforward to calculate mindist once the CPA time tcpa, the time instance at which the objects reach their closest distance, is known.

    We now give an analytic solution to the CPA problem for a pair of objects on a

    simple straight-line trajectory.

Calculating the CPA time tcpa. Figure 3-2 shows the trajectories of two objects p and q in 2-dimensional space over the time period [tstart, tend]. The positions of these objects at any time instance t are given by p(t) and q(t). Let their positions at time t = 0 be p0 and q0, and let their velocity vectors per unit of time be u and v. The motion equations for these two objects are p(t) = p0 + tu and q(t) = q0 + tv. At any time instance t, the distance between the two objects is given by d(t) = |p(t) - q(t)|.

Figure 3-2. Closest Point of Approach illustration

Figure 3-3. CPA illustration with trajectories

Using basic calculus, one can find the time instance at which the distance d(t) is minimum (when D(t) = d(t)^2 is a minimum). Solving for this time, we obtain

    tcpa = -((p0 - q0) · (u - v)) / |u - v|^2

Given this, mindist is given by |p(tcpa) - q(tcpa)|.
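As an illustration of these formulas, here is a minimal Python sketch (hypothetical helper names, not the dissertation's implementation) that computes tcpa and mindist for two linearly moving points; clamping tcpa to a supplied time interval lets the same routine handle a single pair of line segments:

    import math

    def cpa_time(p0, q0, u, v):
        """tcpa = -((p0 - q0) . (u - v)) / |u - v|^2 for two moving points.
        p0, q0: positions at t = 0; u, v: velocity vectors per unit of time."""
        dv = [a - b for a, b in zip(u, v)]        # relative velocity u - v
        dv2 = sum(c * c for c in dv)              # |u - v|^2
        if dv2 == 0.0:                            # equal velocities: distance is constant
            return 0.0
        w0 = [a - b for a, b in zip(p0, q0)]      # initial offset p0 - q0
        return -sum(a * b for a, b in zip(w0, dv)) / dv2

    def cpa_distance(p0, q0, u, v, t_start, t_end):
        """mindist over [t_start, t_end]: evaluate d(t) at tcpa clamped to the interval."""
        t = min(max(cpa_time(p0, q0, u, v), t_start), t_end)
        p = [a + t * b for a, b in zip(p0, u)]    # p(t) = p0 + t*u
        q = [a + t * b for a, b in zip(q0, v)]    # q(t) = q0 + t*v
        return math.dist(p, q)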

The distance calculation described above is applicable only between two objects on straight-line trajectories. To calculate the distance between two objects on polyline trajectories, we apply the same basic technique. For trajectories consisting of a chain of line segments, we find the minimum distance by first determining the distance between each pair of line segments and then choosing the minimum over all pairs.

As an example, consider Figure 3-3, which shows the trajectories of two objects in 2-dimensional space with time as the third dimension. Each object is represented by


an array that stores the chain of segments comprising the trajectory. The line segments are labeled by the array indices. To determine the qualifying pairs, we find the CPA distance between the line segment pairs (p[1], q[1]), (p[1], q[2]), (p[1], q[3]), (p[2], q[1]), (p[2], q[2]), (p[2], q[3]), (p[3], q[1]), (p[3], q[2]), (p[3], q[3]) and return the pair with the minimum distance among all evaluated pairs. The complete code for computing CPA(p, q, d) over multi-segment trajectories is given as Algorithm 3-1.

Algorithm 3-1 CPA(Object p, Object q, distance d)
    mindist = ∞
    for i = 1 to p.size do
        for j = 1 to q.size do
            tmp = CPA_Distance(p[i], q[j])
            if tmp < mindist then
                mindist = tmp
            end if
        end for
    end for
    if mindist ≤ d then
        return true
    end if
    return false
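A runnable Python counterpart of Algorithm 3-1 is sketched below, assuming the cpa_distance helper from the earlier sketch and a hypothetical segment representation (t_start, t_end, pos_at_t0, velocity); it also skips segment pairs with no temporal overlap, anticipating the time filter discussed later in this chapter:

    def cpa_join_predicate(p_segments, q_segments, d):
        """True if two polyline trajectories came within distance d.
        Each segment is (t_start, t_end, pos_at_t0, velocity)."""
        mindist = float("inf")
        for (ts_p, te_p, p0, u) in p_segments:
            for (ts_q, te_q, q0, v) in q_segments:
                t_lo, t_hi = max(ts_p, ts_q), min(te_p, te_q)
                if t_lo > t_hi:    # no temporal overlap: cannot be a CPA match
                    continue
                mindist = min(mindist, cpa_distance(p0, q0, u, v, t_lo, t_hi))
        return mindist <= d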

In the next two sections, we consider two obvious alternatives for computing the CPA Join, where we wish to discover all pairs of objects (p, q) from two relations P and Q, where CPA(p, q, d) evaluates to true. The first technique we describe makes use of an

    Q, where CPA(p, q,d)evaluates to true. The first technique we describe makes use of an

    underlying R-tree index structure to speed up join processing. The second methodology is

    based on a simple plane-sweep.

    3.3 Join Using Indexing Structures

    Given numerous existing spatiotemporal indexing structures, it is natural to first

    consider employing a suitable index to perform the join.

    Though many indexing structures exist, unfortunately most are not suitable for the

    CPA Join. For example, a large number of indexing structures like the TPR-tree [17],

    REXP tree [77], TPR*-tree [78] have been developed to support predictive queries, where


    the focus is on indexing the future position of an object. However, these index structures

are generally not suitable for the CPA Join, where access to the entire history is needed.

    Indexing structures like MR-tree [26], MV3R-tree [27], HR-tree [28], HR+-tree [27]

are more relevant since they are geared towards answering time-instance queries (in the case of the MV3R-tree, also short time-interval queries), where all objects alive at a certain time

    instance are retrieved. The general idea behind these index structures is to maintain a

    separate spatial index for each time instance. However, such indices are meant to store

discrete snapshots of an evolving spatial database, and are not ideal for use with the CPA Join

    over continuous trajectories.

    3.3.1 Trajectory Index Structures

    More relevant are indexing structures specific to moving object trajectory histories

    like the TB-tree, STR-tree [21] and SETI [18]. TB-trees emphasize trajectory preservation

since they are primarily designed to handle topological queries where access to the entire

    trajectory is desired (segments belonging to the same trajectory are stored together). The

    problem with TB-trees in the context of the CPA Join is that segments from different

    trajectories that are close in space or time will be scattered across nodes. Thus, retrieving

    segments in a given time window will require several random I/Os. In the same paper

[21], an STR-tree is introduced that attempts to somewhat balance spatial locality with trajectory preservation. However, as the authors point out, STR-trees turn out to be a weak compromise that does not perform better than traditional 3D R-trees [20] or TB-trees.

    More appropriate to the CPA Join is SETI [18]. SETI partitions two-dimensional space

    statically into non-overlapping cells and uses a separate spatial index for each cell. SETI

might be a good candidate for the CPA Join since it preserves spatial and temporal locality. However, there are several reasons why SETI is not the most natural choice for a CPA

    Join:

It is not clear that SETI's forest scales to three-dimensional space. A 25 × 25 SETI grid in two dimensions becomes a sparse 25 × 25 × 25 grid with almost 20,000 cells in three dimensions.


SETI's grid structure is an interesting idea for addressing problems with high variance in object speeds (we will use a related idea for the adaptive plane-sweep algorithm described later). However, it is not clear how to size the grid for a given data set, and sizing it for a join seems even harder. It might very well be that relation R should have a different grid for R ⋈ S compared to R ⋈ T.

For a CPA Join over a limited history, SETI has no way of pruning the search space, since every cell will have to be searched.

    3.3.2 R-tree Based CPA Join

    Given these caveats, perhaps the most natural choice for the CPA Join is the R-tree

[16]. The R-tree is a hierarchical, multi-dimensional index structure that is commonly used to index spatial objects. The join problem has been studied extensively for R-trees, and several spatial join techniques exist [45,46,79] that leverage underlying R-tree index structures to speed up join processing. Hence, our first inclination is to consider a

    spatiotemporal join strategy that is based on R-trees. The basic idea is to index object

    histories using R-trees and then perform a join over these indices.

    The R-Tree Index

    It is a very straightforward task to adapt the R-tree to index a history of moving

    object trajectories. Assuming three spatial dimensions and a fourth temporal dimension,

    the four-dimensional line segments making up each individual object trajectory are simply

    treated as individual spatial objects and indexed directly by the R-tree. The R-tree and

    its associated insertion or packing algorithms are used to group those line segments into

    disk-page sized groups, based on proximity in their four-dimensional space. These pages

    make up the leaf level of the tree. As in a standard R-tree, these leaf pages are indexed

    by computing the minimum bounding rectangle that encloses the set of objects stored in

    each leaf page. Those rectangles are in turn grouped into disk-page sized groups which are

    themselves indexed. An R-tree index for 3 line segments moving through 2-dimensional

    space is depicted in Figure 3-4.


Figure 3-4. Example of an R-tree

    Basic CPA Join Algorithm Using R-Trees

    Assuming that the two spatiotemporal relations to be joined are organized using

    R-trees, we can use one of the standard R-tree distance joins as a basis for the CPA Join.

The common approach to joins using R-trees employs a carefully controlled, synchronized traversal of the two R-trees to be joined. The pruning power of the R-tree index arises from the fact that if two bounding rectangles R1 and R2 do not satisfy the join predicate, then the join predicate cannot be satisfied by any two bounding rectangles enclosed within R1 and R2.

In a synchronized technique, both R-trees are traversed simultaneously, retrieving object pairs that satisfy the join predicate. To begin with, the root nodes of both R-trees are pushed into a queue. A pair of nodes from the queue is processed by pairing up every entry of the first node with every entry of the second node to form the candidate set for further expansion. Each pair in the candidate set that satisfies the join predicate is pushed into the queue for subsequent processing. The strategy described leads to a BFS (breadth-first search) expansion of the trees. BFS-style traversal lends itself to global optimization of the join processing steps [46] and works well in practice.
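The synchronized traversal can be sketched as follows (a minimal Python sketch under simplifying assumptions, not the dissertation's implementation: hypothetical node objects expose is_leaf and entries, both trees have equal height, and the caller supplies a qualifies predicate over bounding rectangles):

    from collections import deque

    def synchronized_join(root_p, root_q, qualifies):
        """BFS-style synchronized traversal of two R-trees."""
        results = []
        queue = deque([(root_p, root_q)])
        while queue:
            n1, n2 = queue.popleft()
            for e1 in n1.entries:
                for e2 in n2.entries:
                    if not qualifies(e1.mbr, e2.mbr):
                        continue              # prune: nothing enclosed can match
                    if n1.is_leaf and n2.is_leaf:
                        results.append((e1.object, e2.object))
                    else:
                        queue.append((e1.child, e2.child))
        return results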


Figure 3-5. Heuristic to speed up distance computation

    The distance routine is used in evaluating the join predicate to determine the distance

    between two bounding rectangles associated with a pair of nodes. A node-pair qualifies

    for further expansion if the distance between the pair is less than the limiting distance d

    supplied by the query.

    Heuristics to Improve the Basic Algorithm

The basic join algorithm can be improved in several ways by using standard and non-standard techniques for reducing the I/O and CPU costs of spatial joins. These include:

Using a plane-sweeping algorithm [45] to speed up the all-pairs distance computation when pairs of nodes are expanded and their children are checked for possible matches.

Carefully ordering the processing of node pairs so that, when each pair is considered, one or both of the nodes are in the buffer [46].

Avoiding expensive distance computations by applying heuristic filters (a sketch of this filter follows the list). Computing the distance between two 3-dimensional rectangles can be a very costly operation, since the closest points may lie at arbitrary positions on the faces of the rectangles. To speed this computation, the magnitudes of the diagonals of the two rectangles (d1 and d2) can be computed first. Next, we pick an arbitrary point from each of the rectangles (points P1 and P2) and compute the distance between them, called darbit. If darbit - d1 - d2 > djoin, then the two rectangles cannot contain any points as close as djoin to one another, and the pair can be discarded, as shown in Figure 3-5. This provides for immediate dismissals with only three distance computations (or one, if the diagonal distances are precomputed and stored with each rectangle).
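The third heuristic might be coded as follows (a sketch with hypothetical names; the diagonal magnitudes can be precomputed and stored with each rectangle):

    import math

    def cannot_match(point1, diag1, point2, diag2, d_join):
        """Quick dismissal: point1/point2 are arbitrary points inside the two
        rectangles; diag1/diag2 are the magnitudes of their diagonals."""
        d_arbit = math.dist(point1, point2)
        # If the arbitrary-point distance minus both diagonals still exceeds
        # d_join, no pair of points in the two rectangles can be within d_join.
        return d_arbit - diag1 - diag2 > d_join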


Figure 3-6. Issues with R-trees: fast moving object p joins with everyone

    In addition, there are some obvious improvements to the algorithm that can be made

    which are specific to the 4-dimensional CPA Join:

• The fourth dimension, time, can be used as an initial filter. If two MBRs or line segments do not overlap in time, then the pair cannot possibly be a candidate for a CPA match. (A minimal version of this check is sketched after this list.)

• Since time can be used to provide immediate dismissals without Euclidean distance calculations, it is given priority over the other attributes. For example, when a plane-sweep is performed to prune an all-pairs CPA distance computation, time is always chosen as the sweeping axis. The reason is that time will usually have the greatest pruning power of any attribute, since time-based matches must always be exact, regardless of the join distance.

• In our implementation of the CPA Join for R-trees, we make use of the STR packing algorithm [80] to build the trees. Because the potential pruning power of the time dimension is greatest, we ensure that the trees are well-organized with respect to time by choosing time as the first packing dimension.
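A minimal version of the temporal filter follows; the t_start and t_end field names are assumptions for the sketch.

```python
def time_overlap(s1, s2):
    # Exact-match temporal filter: if two segments (or MBRs) do not overlap
    # in time, the pair cannot be a CPA candidate, no matter how small the
    # join distance is, so this check is applied before any Euclidean work.
    return s1.t_start <= s2.t_end and s2.t_start <= s1.t_end
```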

    Problem With R-tree CPA Join

Unfortunately, it turns out that in practice the R-tree can be ill-suited to the problem of computing spatiotemporal joins over moving object histories. R-trees have trouble handling databases with a high variance in object velocities. The reason is that join algorithms which make use of R-trees rely on tight and well-behaved minimum bounding rectangles to speed the processing of the join. When the positions of a set of moving objects are sampled at periodic intervals, fast-moving objects tend to produce larger bounding rectangles than slow-moving objects.


Figure 3-7. Progression of plane-sweep

One such scenario is depicted in Figure 3-6, which shows the paths of a set of objects on a 2-D plane for a given time period. A fast-moving object such as p will be contained in a very large MBR, while slower objects such as q will be contained in much smaller MBRs. When a spatial join is computed over R-trees storing these MBRs, the MBR associated with p can overlap many smaller MBRs, and each overlap will result in an expensive distance computation (even if the objects never travel close to one another). Thus, any sort of variance in object velocities can adversely affect the performance of the join.

    3.4 Join Using Plane-Sweeping

The second technique that we consider is a join strategy based on a simple plane-sweep. Plane-sweep is a powerful technique for solving proximity problems involving geometric objects in a plane, and it has previously been proposed [49] as a way to efficiently compute the spatial join operation.

    3.4.1 Basic CPA Join using Plane-Sweeping

Plane-sweep is an excellent candidate for use with the CPA Join because, no matter what distance threshold is given as input to the join, two objects must overlap in the time dimension for there to be a potential CPA match. Thus, given two spatiotemporal relations P and Q, we can base our implementation of the CPA Join on a plane-sweep along the time dimension.


    We would begin a plane-sweep evaluation of the CPA join by first sorting the intervals

    making up P and Q along the time dimension, as depicted in Figure 3-7. We then sweep

    a vertical line along the time dimension. A sweepline data structure D is maintained

    which keeps track of all line segments which are valid given the current position of the line

along the time dimension. As the sweepline progresses, D is updated with insertions (new segments that become active) and deletions (segments whose time period has expired).

    Segment pairs from both input relations that satisfy the join predicate are always present

    in D, and they can be checked and reported during updates to D. Pseudo-code for the

    algorithm is given below:

Algorithm 2 PlaneSweep(Relation P, Relation Q, distance d)
1: Form a single list L containing segments from P and Q sorted by tstart
2: Initialize sweepline data structure D
3: while not IsEmpty(L) do
4:   Segment top = popFront(L)
5:   Insert(D, top)
6:   Delete from D all segments s s.t. (s.tend < top.tstart) {remove segments that do not intersect the sweepline}
7:   Query(D, top, d) {report segments in D that are within distance d of top}
8: end while

In the case of the CPA Join, assuming that all moving objects at any given moment can be stored in main memory, any of a number of data structures can be used to implement D, such as a quad- or oct-tree, or an interval skip-list [81]. The main requirement is that the selected data structure makes it easy to check the proximity of objects in space.

    3.4.2 Problem With The Basic Approach

Although the plane-sweep approach is simple, in practice it is usually too slow to be useful for processing moving object queries. The problem has to do with how the sweepline progression takes place. As the sweepline moves through the data space, it has to stop momentarily at sample points (time instances at which object positions were recorded) to process newly encountered segments into the data structure D.


Figure 3-8. Layered Plane-Sweep

New segments that are encountered at a sample point are added to the data structure, and segments in D that are no longer active are deleted from it.

Consequently, the sweepline pauses more often when objects with high sampling rates are present, and the progress of the sweepline is heavily influenced by the sampling rates of the underlying objects. For example, consider Figure 3-7, which shows the trajectories of four objects over a given time period. In the case illustrated, object p2 controls the progression of the sweepline. Observe that in the time interval [tstart, tend], only new segments from object p2 are added to D, but expensive join computations are performed each time against the same set of line segments.

    The net result is that if the sampling rate of a data set is very high relative to the

    amount of object movement in the data set, then processing a multi-gigabyte object

    history using a simple plane-sweeping algorithm may take a prohibitively long time.

    3.4.3 Layered Plane-Sweep

One way to address this problem is to reduce the number of segment-level comparisons by comparing the regions of movement of the various objects at a coarser level. For example, reconsider the CPA Join depicted in Figure 3-7. If we were to replace the many oscillations of object p2 with a single minimum bounding rectangle enclosing all of those oscillations from tstart to tend, we could then use that rectangle during the plane-sweep


as an initial approximation to the path of object p2. This would potentially save many distance computations.

This idea can be taken to its natural conclusion by constructing a minimum bounding box that encompasses the line segments of each object. A plane-sweep is then performed over the bounding boxes, and only qualifying boxes are expanded further. We refer to this technique as the Layered Plane-Sweep approach, since the plane-sweep is performed at two layers: first at the coarser level of bounding boxes, and then at the finer level of individual line segments.

    One issue that must be considered is how much movement is to be summarized

    within the bounding rectangle for each object. Since we would like to eliminate as many

    comparisons as possible, one natural choice would be to let the available system memory

    dictate how much movement is covered for each object. Given a fixed buffer size, the

    algorithm will proceed as follows.

Algorithm 3 LayeredPlaneSweep(Relation P, Relation Q, distance d)
1: Segments are defined by [(xstart, xend), (ystart, yend), (zstart, zend), (tstart, tend)]
2: Assume a list of object segments sorted by tstart on disk
3: while there is still some unprocessed data do
4:   Read in enough data from P and Q to fill the buffer
5:   Let tstart be the first time tick which has not yet been processed by the plane-sweep
6:   Let tend be the last time tick for which no data is still on disk
7:   Bound the trajectory of every object present in the buffer by an MBR
8:   Sort the MBRs along one of the spatial dimensions and then perform a plane-sweep along that dimension
9:   Expand the qualifying MBR pairs to get the actual trajectory data (line segments)
10:  Sort the line segments by tstart
11:  Perform a final sweep along the time dimension to get the final result set
12: end while

Figure 3-8 illustrates the idea. It depicts a snapshot of object trajectories starting at some time instant tstart. Segments in the interval [tstart, tend] represent the maximum that can be buffered in the available memory. A first-level plane-sweep is carried out over the bounding boxes to prune away pairs that cannot match. Qualifying objects are expanded, and a second-level plane-sweep is carried out over the individual line segments.


Figure 3-9. Problem with using large granularities for bounding box approximation

In the best case, there is an opportunity to process the entire data set with just three comparisons at the MBR level.
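The following sketch shows one buffer-sized pass of the layered idea. For brevity, the coarse filter is written as an all-pairs loop over per-object boxes rather than a true sweep, and seg_level_join stands in for the segment-level sweep of Algorithm 2; the segment layout is an assumption.

```python
def bounding_box(segments):
    # axis-aligned spatial box over a list of ((x,y,z,t), (x,y,z,t)) segments
    pts = [p for seg in segments for p in seg]
    lo = tuple(min(p[i] for p in pts) for i in range(3))
    hi = tuple(max(p[i] for p in pts) for i in range(3))
    return lo, hi

def box_distance(b1, b2):
    # minimum distance between two axis-aligned boxes (0 if they overlap)
    gap2 = 0.0
    for i in range(3):
        gap = max(b1[0][i] - b2[1][i], b2[0][i] - b1[1][i], 0.0)
        gap2 += gap * gap
    return gap2 ** 0.5

def layered_sweep_pass(buffered_p, buffered_q, d, seg_level_join):
    """One pass of the layered plane-sweep over buffered trajectories.

    buffered_p / buffered_q map an object id to the list of its buffered
    segments. Only object pairs whose coarse boxes are within d are
    expanded and handed to the exact segment-level join.
    """
    boxes_p = {oid: bounding_box(s) for oid, s in buffered_p.items()}
    boxes_q = {oid: bounding_box(s) for oid, s in buffered_q.items()}
    for pid, pbox in boxes_p.items():
        for qid, qbox in boxes_q.items():
            if box_distance(pbox, qbox) <= d:        # first-level filter
                yield from seg_level_join(buffered_p[pid],
                                          buffered_q[qid], d)
```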

    3.5 Adaptive Plane-Sweeping

While the layered plane-sweep typically performs far better than the basic plane-sweeping algorithm, it may not always choose the proper level of granularity for the bounding box approximations. This Section describes an adaptive strategy that takes into account the underlying object interaction dynamics and adjusts the granularity dynamically in response to the characteristics of the data.

    3.5.1 Motivation

    In the simple layered plane-sweep, the granularity for the bounding box approximation

    is always dictated by the available system memory. The assumption is that pruning power

    increases monotonically with increasing granularity. Unfortunately, this is not always the

    case. As a motivating example, consider Figure 3-9. Assume available system memory

    allows us to buffer all the line segments. In this case, the layered plane-sweep performs

no better than the basic plane-sweep, because all of the object bounding boxes overlap one another, and as a result no pruning is achieved by the first-level plane-sweep.

However, assume we had instead fixed the granularity to correspond to the time period [tstart, ti], as depicted in Figure 3-10. In this case, none of the bounding boxes


overlap, and there are possibly many dismissals at the first level. Though less of the buffer is processed initially, we are able to eliminate many of the segment-level distance comparisons compared to a technique that bounds the entire time period, thereby potentially increasing the efficiency of the algorithm. The entire buffer can then be processed in a piece-by-piece fashion, as depicted in Figure 3-10. In general, the efficiency of the layered plane-sweep is tied not to the granularity of the time interval that is processed, but to the granularity that minimizes the number of distance comparisons.

    3.5.2 Cost Associated With a Given Granularity

Since distance computations dominate the time required to compute a typical CPA Join, the cost associated with a given granularity can be approximated as a function of the number of distance comparisons needed to process the segments encompassed by that granularity. Let nMBR be the number of distance computations at the box level, let nseg be the number of distance calculations at the segment level, and let α be the fraction of the time range in the buffer which is processed at once. Then the cost associated with processing that fraction of the buffer can be estimated as:

costα = (nseg + nMBR) · (1/α)

This function reflects the fact that if we choose a very small value for α, we will have to process many cut-points in order to process the entire buffer, which can increase the cost of the join. As α shrinks, the algorithm becomes equivalent to the traditional plane-sweep. On the other hand, choosing a very large value for α tends to increase (nseg + nMBR), eventually yielding an algorithm which is equivalent to the simple, layered plane-sweep. In practice, the optimal value for α lies somewhere between the two extremes, and varies from data set to data set.
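As a worked illustration of the trade-off, the cost function can be evaluated directly; the counts below are made-up numbers, not measurements from the experiments.

```python
def cost(alpha, n_mbr, n_seg):
    # cost_alpha = (n_seg + n_MBR) * (1/alpha): the comparisons incurred by
    # one pass, scaled by the ~1/alpha passes needed to drain the buffer
    return (n_seg + n_mbr) / alpha

# Hypothetical counts: a coarse granularity needs only one pass but its
# boxes prune poorly; a fine granularity prunes well but repeats passes.
print(cost(1.0, n_mbr=5_000, n_seg=2_000_000))   # 2,005,000 comparisons
print(cost(0.1, n_mbr=5_000, n_seg=40_000))      # 450,000 comparisons
```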

    3.5.3 The Basic Adaptive Plane-Sweep

Given this cost function, it is easy to design a greedy plane-sweep algorithm that repeatedly attempts to minimize costα in order to adapt to the underlying (and potentially time-varying) characteristics of the data.


Figure 3-10. Adaptively varying the granularity

At every iteration, the algorithm simply chooses to process the fraction of the buffer that appears to minimize the overall cost of the plane-sweep in terms of the expected number of distance computations. The algorithm is given below:

Algorithm 4 AdaptivePlaneSweep(Relation P, Relation Q, distance d)
1: while there is still some unprocessed data do
2:   Read in enough data from P and Q to fill the buffer
3:   Let tstart be the first time tick which has not yet been processed by the plane-sweep
4:   Let tend be the last time tick for which no data is still on disk
5:   Choose α so as to minimize costα
6:   Perform a layered plane-sweep from time tstart to tstart + α(tend − tstart) {steps 5-9 of procedure LayeredPlaneSweep}
7: end while

Unfortunately, there are two obvious difficulties involved in actually implementing the above algorithm:

• First, the cost costα associated with a given granularity is known only after the layered plane-sweep has been executed at that granularity.

• Second, even if we can compute costα easily, it is not obvious how we can compute costα for all values of α from 0 to 1 so as to minimize the cost over all α.

    These two issues are discussed in detail in the next two Sections.


    3.5.4 Estimating Cost

This Section describes how to efficiently estimate costα for a given α using a simple, online sampling algorithm reminiscent of the algorithm of Hellerstein and Haas [82].

At a high level, the idea is as follows. To estimate costα, we begin by constructing bounding rectangles for all of the objects in P, considering their trajectories from time tstart to tstart + α(tend − tstart). These rectangles are then inserted into an in-memory index, just as if we were going to perform a layered plane-sweep. Next, we randomly choose an object q1 from Q and construct a bounding box for its trajectory as well. This object is joined with all of the objects in P by using the in-memory index to find all bounding boxes within distance d of q1. Then:

• Let nMBR,q1 be the number of distance computations needed by the index to compute which objects from P have bounding rectangles within distance d of the bounding rectangle for q1, and

• Let nseg,q1 be the total number of distance computations that would have been needed to compute the CPA distance between q1 and every object p ∈ P whose bounding rectangle is within distance d of the bounding rectangle for q1 (this can be computed efficiently by performing a plane-sweep without actually performing the required distance computations).

Once nMBR,q1 and nseg,q1 have been computed for q1, the process can be repeated for a second randomly selected object q2 ∈ Q, for a third object q3, and so on. A key observation is that after m objects from Q have been processed, the value

Ȳm = (1/m) · Σ(i=1..m) (nMBR,qi + nseg,qi) · |Q|

represents an unbiased estimator for (nMBR + nseg) at α, where |Q| denotes the number of data objects in Q.

In practice, however, we are not only interested in Ȳm. We would also like to know at all times just how accurate our estimate Ȳm is, since at the point where we are satisfied with our guess as to the real value of costα, we want to stop the estimation process and continue with the join.


Fortunately, the central limit theorem can easily be used to estimate the accuracy of Ȳm. Assuming sampling with replacement from Q, for large enough m the error of our estimate will be normally distributed around (nMBR + nseg) with variance

σ²m = (1/m) · σ²(Q),

where σ²(Q) is defined as

σ²(Q) = (1/|Q|) · Σ(i=1..|Q|) {(nMBR,qi + nseg,qi) · |Q| − (nMBR + nseg)}²

Since in practice we cannot know σ²(Q), it must be estimated via the expression

σ̂²(Qm) = (1/(m − 1)) · Σ(i=1..m) {(nMBR,qi + nseg,qi) · |Q| − Ȳm}²

(Qm denotes the sample of Q that is obtained after m objects have been randomly retrieved from Q). Substituting into the expression for σ²m, we can treat Ȳm as a normally distributed random variable with variance σ̂²(Qm)/m.

In our implementation of the adaptive plane-sweep, we continue the sampling process until our estimate for costα is accurate to within 10% at 95% confidence. Since 95% of the standard normal distribution falls within two standard deviations of the mean, this is equivalent to sampling until 2 · sqrt(σ̂²(Qm)/m) is less than Ȳm · 0.1.
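A sketch of this stopping rule follows; trial_counts(q, alpha) is a hypothetical stand-in for the index probe and segment-level counting described above, and the minimum sample size of 30 is our own assumption for letting the CLT take hold.

```python
import math
import random

def estimate_cost(alpha, objects_q, trial_counts, rel_err=0.1, z=2.0):
    """Online estimate of cost_alpha via sampling from Q (sketch)."""
    n = len(objects_q)
    samples = []
    while True:
        q = random.choice(objects_q)             # sample with replacement
        n_mbr_q, n_seg_q = trial_counts(q, alpha)
        samples.append((n_mbr_q + n_seg_q) * n)  # scale up by |Q|
        m = len(samples)
        mean = sum(samples) / m
        if m >= 30 and mean > 0:                 # hypothetical CLT threshold
            var = sum((s - mean) ** 2 for s in samples) / (m - 1)
            if z * math.sqrt(var / m) < rel_err * mean:
                return mean / alpha              # Y_m * (1/alpha) = cost_alpha
```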

3.5.5 Determining The Best Cost

We now address the second issue: how to compute costα for all values of α from 0 to 1 so as to minimize the cost over all α.

Calculating costα for all possible values of α is prohibitively expensive and hence not feasible in practice. Fortunately, we do not have to evaluate all values of α to determine the best one. This is due to the following interesting fact: if we plot all possible values of α against their associated costs, we observe that the graph is not linear, but exhibits a certain convexity. The bottom of this convex region represents a sweet spot and constitutes the feasible region for the best cost.

As an example, consider Figure 3-11, which shows the plot of the cost function for various fractions α for one of the experimental data sets from Section 3.6.


Figure 3-11. Convexity of cost function illustration (estimated number of distance computations vs. % of buffer, for k = 20).

Given this fact, we identify the feasible region by evaluating costαi for a small number, k, of αi values. Given k, the number of allowed cutpoints, the fraction α1 can be determined as follows:

α1 = r^(1/k) / r

where r = (tend − tstart) is the time range described by the buffer (the above formula assumes that r is greater than one; if not, then the time range should be scaled accordingly).

In the general case, the fraction of the buffer considered by any αi (1 ≤ i ≤ k) is given by:

αi = (r^(1/k))^i / r

Note that since the growth rate of each subsequent αi is exponential, we can cover the entire buffer with just a small k and still guarantee that we will consider some value of αi that is within a factor of r^(1/k) of the optimal. After computing α1, α2, ..., αk, we successively evaluate the increasing buffer fractions α1, α2, α3, and so on, and determine their associated costs. From these k costs, we determine the αi with the minimum cost. (A small sketch of this cutpoint schedule appears below.)

Note that if we choose α based on the evaluation of a small k, then it is possible that the optimal choice of α lies outside the feasible region. However, there is a simple approach to solving this issue: after an initial evaluation of k granularities, consider just the region starting before and ending after the best αi, and recursively reapply the evaluation described above within this region.


Figure 3-12. Iteratively evaluating k cut points

For instance, assume we chose αi after evaluating k cutpoints in the time range r. To further tune this αi, we consider the time range defined between the adjacent cutpoints αi−1 and αi+1 and recursively apply the cost estimation in this interval (i.e., we evaluate k points in the time range (tstart + αi−1 · r, tstart + αi+1 · r)). Figure 3-12 illustrates the idea. This approach is simple and very effective at considering a large number of choices of α; a recursive sketch follows.

    3.5.6 Speeding Up the Estimation

Restricting the number of candidate cut points helps keep the time required to find a suitable value for α manageable. However, if the estimation is not implemented carefully, the time required to evaluate the cost at each of the k possible time periods can still be significant.

The most obvious method for estimating costα for each of the k granularities would be to simply loop through each of the associated time periods. For each time period, we would build bounding boxes around the trajectories of the objects in P, and then sample objects from Q as described in Section 3.5.4 until the cost was estimated with sufficient accuracy.

    However, this simple algorithm results in a good deal of repeated work for each time

    period, and can actually decrease the overall speed of the adaptive plane-sweep compared

    to the layered plane-sweep. A more intelligent implementation can speed the optimization

    process considerably.

In our implementation, we maintain a table of all the objects in P and Q, organized on the ID of each object. Each entry in the table points to a linked list that contains a


chain of MBRs for the associated object. Each MBR in the list bounds the trajectory of the object for one of the k time periods considered during the optimization, and the MBRs in each list are sorted from the coarsest of the k granularities to the finest. The data structure is depicted in Figure 3-13.

Given this structure, we can estimate costα for each of the k values of α in parallel, with only a small hit in performance associated with an increased value of k. Any object pair (p ∈ P, q ∈ Q) that needs to be evaluated during the sampling process described in Section 3.5.4 is first evaluated at the coarsest granularity, corresponding to αk. If the two MBRs are within distance d of one another, then the cost estimate for αk is updated, and the evaluation is repeated at the second coarsest granularity, αk−1. If there is again a match, then the cost estimate for αk−1 is updated as well. The process is repeated until there is no match. As soon as we find a granularity at which the MBRs for p and q are not within distance d of one another, we can stop the process, because if the MBRs for p and q are not within distance d for the time period associated with αi, then they cannot be within this distance for any time period αj where j < i.
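A sketch of this early-terminating scan over one object pair's MBR chains is given below; box_distance is as in the earlier layered-sweep sketch, and the counter layout is an assumption.

```python
def tally_pair(p_chain, q_chain, d, hits, box_distance):
    """Evaluate one (p, q) pair against all k granularities at once.

    p_chain / q_chain hold each object's MBRs ordered from the coarsest
    granularity (index 0, i.e., alpha_k) to the finest (index k-1, i.e.,
    alpha_1); hits is a list of per-granularity match counters. Since
    each finer MBR is contained in every coarser one, the first
    non-match ends the scan: finer MBRs can only be farther apart.
    """
    for level, (pb, qb) in enumerate(zip(p_chain, q_chain)):
        if box_distance(pb, qb) > d:
            break                      # no finer granularity can match
        hits[level] += 1               # this granularity incurs the pair
```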

The benefit of this approach is that in cases where the data are well-behaved and the optimization process tends to choose a value for α that causes the entire buffer to be processed at once, a quick check of the distance between the outermost MBRs of p and q is the only geometric computation needed to process p and q, no matter what value of k is chosen.

The bounding box approximations themselves can be formed while the system buffer is being filled with data from disk. As trajectory data are read from disk, we grow the MBRs for each αi progressively. Since each αi represents a fraction of the buffer, the updates to its MBR can be stopped as soon as that fraction of the buffer has been filled. Similar logic can be used to shrink the MBRs when some fraction of the buffer is consumed, and to expand them when the buffer is refilled.


Figure 3-13. Speeding up the Optimizer

    3.5.7 Putting It All Together

In our implementation of the adaptive plane-sweep, data are fetched in blocks and stored in the system buffer. Then an optimizer routine is called which evaluates k granularities and returns the granularity with the minimum cost. The data in the granularity chosen by the optimizer are then evaluated using the LayeredPlaneSweep routine (the procedure described in Section 3.4.3). When the LayeredPlaneSweep routine returns, the buffer is refilled and the process is repeated. The techniques described in the previous Section are used to make the optimizer implementation fast and efficient.

    3.6 Benchmarking

This section presents experimental results comparing the various methods discussed so far for computing a spatiotemporal CPA Join: an R-tree join, a simple plane-sweep, a layered plane-sweep, and an adaptive plane-sweep with several parameter settings. The Section is organized as follows. First, a description of the three, three-dimensional temporal data sets used to test the algorithms is given. This is followed by the actual experimental results and a detailed discussion analyzing the experimental data.

    3.6.1 Test Data Sets

The first two data sets that we use to test the various algorithms result from two physics-based, N-body simulations. In both data sets, constituent records occupy 80B on disk (80B is the storage required to record the object ID, time information, as well as the


Figure 3-15. Collision data set at time tick 1,500

strong gravitational interaction. A small sample of the galaxies in the simulation is depicted in Figure 3-14, at time tick 1,500.

In addition, we test a third data set created using a simple, 3-dimensional random walk. We call this the Synthetic data set (this data set was again about 50GB in size). The speed of the various objects varies considerably during the walk. The purpose of including this data set is to rigorously test the adaptability of the adaptive plane-sweep, by creating a synthetic data set in which there are significant fluctuations in the amount of interaction among objects as a function of time.

    3.6.2 Methodology and Results

All experiments were conducted on a 2.4GHz Intel Xeon PC with 1GB of RAM. The experimental data sets were each stored on an 80GB, 15,000 RPM Seagate SCSI disk. For all three of the data sets, we tested an R-tree-based CPA Join (implemented as described in Section 3.3; we used the STR R-tree packing algorithm [80] to construct an R-tree for each input relation), a simple plane-sweep (implemented as described in Section 3.4.1), and a layered plane-sweep (implemented as described in Section 3.4.3).

We also tested the adaptive plane-sweep algorithm, implemented as described in Section 3.5. For the adaptive plane-sweep, we also wanted to test the effect of the


[Figure: CPA-Join over the Injection data set, plotting time taken against % of join completed for the R-Tree, Simple Sweep, and Layered Sweep methods]