


ICDE-2006

Subramanian Arumugam, Christopher Jermaine

Department of Computer Science, University of Florida

22nd International Conference on Data Engineering

Closest-Point-of-Approach Join for Moving Object Histories


CPA-Join Is Useful For Analysis Of Spatiotemporal Data

Relations: commercial airliners R, single engine planes S

“Find all commercial airliners that approached within 1000 vertical feet and 0.5 miles of a single engine plane in the BOS/JFK/EWR/LGA corridor C in the first three months of last year”

SELECT distinct (r, s)
FROM R as r, S as s, TIME t
WHERE dist (r, s, t) < 0.5
  AND (r(t).altd - s(t).altd) ≥ -1000
  AND (r(t).altd - s(t).altd) ≤ 1000
  AND s(t) ∈ C AND r(t) ∈ C
  AND t ≥ 'JAN-1-2005' AND t ≤ 'MAR-31-2005'


Challenges
• 3-dimensional space + time
• Large # of objects
• Massive amount of data


CPA Illustration for Straight Line Trajectories

[Figure: straight-line trajectories of Object p and Object q]

CPA: the position at which two dynamically moving objects attain their closest possible distance
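For two objects that each move in a straight line at constant speed over the same time interval, the CPA has a closed form: the relative position is linear in time, so the squared distance is a quadratic whose minimizer can be clamped to the interval. The following is a minimal sketch under that assumption; the function name `cpa` and its argument layout are illustrative and not taken from the slides.

```python
import math

def cpa(p0, p1, q0, q1, t0, t1):
    """Closest point of approach of two objects moving linearly over [t0, t1].

    p0/p1 and q0/q1 are the positions (tuples of equal dimension) of
    objects p and q at times t0 and t1. Returns (t_cpa, distance_at_t_cpa).
    """
    dt = t1 - t0
    # Relative position at t0 and relative velocity of p with respect to q.
    w = [a - b for a, b in zip(p0, q0)]
    v = [((a1 - a0) - (b1 - b0)) / dt for a0, a1, b0, b1 in zip(p0, p1, q0, q1)]
    vv = sum(c * c for c in v)
    # |w + v*s| is minimized at s = -(w.v)/(v.v); clamp s to [0, dt].
    # If the relative velocity is zero, the distance is constant.
    if vv == 0:
        s = 0.0
    else:
        s = min(max(-sum(a * b for a, b in zip(w, v)) / vv, 0.0), dt)
    dist = math.sqrt(sum((a + b * s) ** 2 for a, b in zip(w, v)))
    return t0 + s, dist
```

For example, `cpa((0, 0), (10, 0), (5, 5), (5, -5), 0, 1)` describes two objects whose paths cross at (5, 0) halfway through the interval. The same code works for 3-dimensional positions because it iterates over coordinates.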


[Figure: two x-y plots of the trajectories of Object P and Object Q at times 0 to 5. Left: moving object trajectories with sampled positions. Right: polyline approximation, with the CPA distance (dist_cpa) marked. An accompanying table has columns Time, Object P, Object Q; sampled coordinates as extracted: 40,32 38,18 51,27 49,12 5,32 6,26 15,39 59,18 27,38 11,49 5,32 24,65]


Simple CPA-Join

Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that CPA-distance (p,q) ≤ d

Procedure CPA (Object P, Object Q, distance d)
    List result = {};
    for each pair of segments (p ∈ P, q ∈ Q)
        if CPA_distance (p,q) ≤ d
            result += (p,q);
    return result;

Need to compare only those segments whose time intervals overlap

Plane sweep
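The nested-loop procedure can be sketched in runnable form as follows. This is an illustration under assumptions not stated on the slide: each relation is a dict from object id to a list of segments, a segment is a tuple `(t0, t1, start, end)` with `t1 > t0`, and the helper names `lerp`, `cpa_distance` and `simple_cpa_join` are hypothetical. Segments are clipped to their common time interval before the distance is computed, and pairs with no time overlap are skipped.

```python
import math

def lerp(seg, t):
    """Position of segment (t0, t1, start, end) at time t, by linear interpolation."""
    t0, t1, a, b = seg
    f = (t - t0) / (t1 - t0)
    return tuple(x + f * (y - x) for x, y in zip(a, b))

def cpa_distance(p, q):
    """Minimum distance of segments p and q over their common time interval.

    Returns None if the two time intervals do not overlap.
    """
    lo, hi = max(p[0], q[0]), min(p[1], q[1])
    if lo > hi:
        return None
    w = [a - b for a, b in zip(lerp(p, lo), lerp(q, lo))]  # relative position at lo
    e = [a - b for a, b in zip(lerp(p, hi), lerp(q, hi))]  # relative position at hi
    d = [b - a for a, b in zip(w, e)]                      # change in relative position
    dd = sum(c * c for c in d)
    # Relative position is w + f*d for f in [0, 1]; clamp the minimizer to that range.
    f = 0.0 if dd == 0 else min(max(-sum(a * b for a, b in zip(w, d)) / dd, 0.0), 1.0)
    return math.sqrt(sum((a + f * b) ** 2 for a, b in zip(w, d)))

def simple_cpa_join(P, Q, d):
    """Compare every segment pair; return the set of qualifying (p_id, q_id) pairs."""
    result = set()
    for pid, psegs in P.items():
        for qid, qsegs in Q.items():
            for p in psegs:
                for q in qsegs:
                    dist = cpa_distance(p, q)
                    if dist is not None and dist <= d:
                        result.add((pid, qid))
    return result
```

The quadratic number of `cpa_distance` calls is what the plane-sweep variants try to avoid.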


CPA-Join using Simple Plane Sweep
- First sort the segments in P and Q along the time dimension (external sort)
- While there is still some unprocessed data
  - Read in enough segments from P and Q to fill the main memory buffer
  - Next, sweep a vertical line along the time dimension
  - Maintain a sweepline data structure which keeps track of all active segments that intersect the sweep line
  - As the sweep line progresses, the sweepline data structure is updated with insertions (new segments that become active) and deletions (segments whose time period has expired)
  - During updates to the sweepline structure, an all-pairs comparison returns the valid results
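The in-memory part of the sweep can be sketched as below. It is a simplified illustration, not the authors' implementation: segments are tuples `(oid, t0, t1, start, end)`, the sweepline structure is one min-heap per relation keyed by end time, and `cpa_distance(p, q)` is a caller-supplied function assumed to return the CPA distance of two time-overlapping segments. External sorting and buffer management are left out.

```python
import heapq

def sweep_join(P, Q, d, cpa_distance):
    """Plane sweep along time over segment lists P and Q.

    Returns the set of (p_oid, q_oid) pairs with CPA distance <= d.
    """
    # Tag each segment with its relation (0 = P, 1 = Q); order by start time.
    events = sorted([(s[1], 0, s) for s in P] + [(s[1], 1, s) for s in Q],
                    key=lambda e: e[0])
    active = ([], [])   # sweepline structure: one heap of active segments per relation
    result = set()
    for n, (t, side, seg) in enumerate(events):
        other = active[1 - side]
        # Deletions: drop segments of the other relation that ended before t.
        while other and other[0][0] < t:
            heapq.heappop(other)
        # Compare the newly active segment with every active segment
        # of the other relation; all of these overlap it in time.
        for _, _, o in other:
            p, q = (seg, o) if side == 0 else (o, seg)
            if cpa_distance(p, q) <= d:
                result.add((p[0], q[0]))
        # Insertion: the new segment becomes active (n breaks ties in the heap).
        heapq.heappush(active[side], (seg[2], n, seg))
    return result
```

Because expired segments are removed before each comparison round, a segment is only ever compared with segments whose time interval overlaps its own.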


CPA-Join using Plane Sweep

Sweep line has to pause at every new sample point encountered. Processing a multi-gigabyte data set can take a long time.

[Figure: plane sweep illustration; labels: memory, disk]


Group segments using a bounding box approximation

In the best case, just 1 comparison is needed

[Figure: bounding box illustration; labels: memory, disk]


Algorithm: Layered Plane Sweep

While there is still some unprocessed data on disk
  - Read in data from relations P and Q to fill the buffer
  - Construct an MBR for the trajectory of every object in the buffer
  - Sort the MBRs along one of the spatial dimensions and do a plane sweep along it to identify qualifying MBR pairs
  - Expand the MBRs to obtain the individual segments
  - Sort the segments along the time dimension and do a plane sweep along time to obtain the actual results
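The two layers can be sketched for a single buffer as follows. This is a simplified illustration under stated assumptions: relations are dicts from object id to a list of segments `(t0, t1, start, end)`, `cpa_distance(p, q)` is supplied by the caller and returns `None` when two segments do not overlap in time, and the second layer uses an all-pairs loop over the surviving objects instead of the time-ordered sweep named on the slide, to keep the example short. The names `mbr`, `boxes_may_qualify` and `layered_sweep` are hypothetical.

```python
def mbr(segs):
    """Bounding box (lo, hi) over (time, x, y, ...) of a list of segments."""
    pts = [(s[0],) + tuple(s[2]) for s in segs] + [(s[1],) + tuple(s[3]) for s in segs]
    return tuple(map(min, zip(*pts))), tuple(map(max, zip(*pts)))

def boxes_may_qualify(a, b, d):
    """True if two MBRs overlap in time (dim 0) and lie within d in every spatial dim."""
    (alo, ahi), (blo, bhi) = a, b
    if alo[0] > bhi[0] or blo[0] > ahi[0]:
        return False
    return all(alo[i] - d <= bhi[i] and blo[i] - d <= ahi[i]
               for i in range(1, len(alo)))

def layered_sweep(P, Q, d, cpa_distance):
    """Filter object pairs by MBR, then compare segments of surviving pairs."""
    # Layer 1: sort MBRs by their lower bound in the first spatial dimension.
    boxes = sorted([(mbr(s), 0, oid) for oid, s in P.items()] +
                   [(mbr(s), 1, oid) for oid, s in Q.items()],
                   key=lambda b: b[0][0][1])
    active, result = ([], []), set()
    for box, side, oid in boxes:
        x_lo = box[0][1]
        other = active[1 - side]
        # Keep only boxes that still reach within d of the sweep position.
        other[:] = [o for o in other if o[0][1][1] + d >= x_lo]
        for obox, ooid in other:
            if not boxes_may_qualify(box, obox, d):
                continue
            pid, qid = (oid, ooid) if side == 0 else (ooid, oid)
            # Layer 2: expand the MBR pair into its segments and compare them.
            for p in P[pid]:
                for q in Q[qid]:
                    dist = cpa_distance(p, q)
                    if dist is not None and dist <= d:
                        result.add((pid, qid))
        active[side].append((box, oid))
    return result
```

When few MBR pairs qualify, most objects are dismissed with a single box comparison and no segment-level distance is computed for them.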


Layered Plane-Sweep Example

But one size doesn’t fit all!


A Note on Indexing
- Indexes can be used to do CPA-Join
- But (almost) all indexes use MBR approximation
- And MBRs impose predefined granularities

[Figure: objects p and q with axes x, y, z]


Layered Plane Sweep: what is the problem?
• Layered Plane Sweep always processes the entire fraction of data held in the memory buffers
• When objects interact heavily, such an approach may lead to no pruning at all

In the best case, just one comparison is needed

Though less of the buffer is processed initially, overall efficiency can be better.
Efficiency of the layered technique is not tied to the amount of data processed, but to choosing a granularity that minimizes the # of distance computations.


Cost to Process Data in Memory Buffer
• Cost can be approximated as a function of distance computations (which dominate execution time)

  $\text{cost} = (n_{seg} + n_{MBR})$

  where $n_{seg}$ is the # of segment level comparisons and $n_{MBR}$ is the # of bounding box comparisons

• In general, cost for a fraction $\alpha$ ($0 \le \alpha \le 1$) of the buffer

  $\text{cost} = (n_{seg} + n_{MBR}) \cdot (1/\alpha)$
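A small worked example of the cost formula follows. One reading, which is an assumption here, is that the comparison counts are those incurred while processing the fraction α, and that the factor 1/α scales them to a full buffer's worth of data so that different fractions can be compared. The counts used below are made up for illustration.

```python
def buffer_cost(n_seg, n_mbr, alpha):
    """Cost of processing a fraction alpha of the buffer, scaled by 1/alpha.

    n_seg: number of segment-level comparisons for that fraction.
    n_mbr: number of bounding-box comparisons for that fraction.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    return (n_seg + n_mbr) / alpha

# Hypothetical counts: the whole buffer versus a quarter of it.
whole = buffer_cost(n_seg=900, n_mbr=100, alpha=1.0)     # 1000.0
quarter = buffer_cost(n_seg=100, n_mbr=50, alpha=0.25)   # 600.0
```

With these invented counts the quarter-buffer granularity is cheaper per unit of data even though it processes less at a time, which is the situation an adaptive choice of α is meant to exploit.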


What we have

Layered Plane Sweep
• processes large fraction (α is large)
• good when there is light interaction
• bad when there is heavy interaction

Simple Plane Sweep
• processes tiny fraction (α is small)
• good when there is heavy interaction
• bad when there is light interaction

What we want

An Adaptive Algorithm
• processes a fraction that maximizes performance (α varies)
• tunes to the characteristics of the underlying data
• provides superior performance under all scenarios


Algorithm: Adaptive Plane Sweep

While there is still some unprocessed data on disk
  - Read in data from relations P and Q to fill the buffer
  - Choose a fraction α of the data that maximizes performance
  - Process the chosen fraction of data using Layered Plane Sweep


Goal: Choose a fraction α of data that maximizes performance

“Evaluate increasing buffer fractions from 0 to 1 and choose the fraction with the minimum cost”

How many fractions should we consider?

How to estimate the cost for a given fraction α?


How to estimate Cost for a given fraction α
• Exact cost is known only after the fact!
• To know the cost associated with a given α, we need to actually execute the join (layered plane sweep) at that granularity

Estimate cost using a simple online sampling algorithm [HH97]


How to estimate Cost for a given fraction α (Contd.)

Cost Estimation through sampling

Given: Relations P and Q, and α
Consider segments within α
Construct MBRs for the objects in P
Until the estimate of cost is accurate to within +/- 10%
  – Pick randomly an object q1 from Q and construct an MBR for its trajectory
  – Join q1 with all objects in P
  – Compute nMBR,q1 and nseg,q1
  – Estimate cost
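The sampling loop can be sketched as follows. The slides do not spell out the stopping rule of [HH97], so this sketch substitutes a standard one and should be read as an assumption: it keeps a running mean and variance of the per-object cost, scales the mean by |Q|, and stops once a normal-approximation confidence half-width is within 10% of the mean. The parameters `z`, `min_samples` and `seed`, and the callback `per_object_cost`, are hypothetical.

```python
import math
import random

def estimate_cost(q_ids, per_object_cost, rel_err=0.10, z=1.96,
                  min_samples=30, seed=0):
    """Estimate the total cost over all objects in Q from a random sample.

    per_object_cost(q) is assumed to join q with all objects in P and
    return n_MBR,q + n_seg,q. Returns (estimated total, objects sampled).
    """
    order = list(q_ids)
    random.Random(seed).shuffle(order)   # random order = sampling without replacement
    n, mean, m2 = 0, 0.0, 0.0
    for q in order:
        c = per_object_cost(q)
        n += 1
        delta = c - mean
        mean += delta / n
        m2 += delta * (c - mean)         # Welford's running variance update
        if n >= min_samples and mean > 0:
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width <= rel_err * mean:
                break                    # estimate is within the target accuracy
    return len(order) * mean, n
```

If the per-object costs vary little, only `min_samples` objects are joined before the loop stops; with highly skewed costs the loop may end up sampling most of Q.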


How many fractions to consider?

“Evaluate increasing buffer fractions from 0 to 1 and choose the fraction with the minimum cost”

– Computing the cost for all α is not practical: it would offset any benefit that we gain from the adaptive technique. We need a strategy to limit the # of fractions that we process.


How many fractions to consider?

The α vs. cost graph is not linear; it exhibits convexity

[Figure: cost (millions, axis ticks 21 to 91) vs. fraction α considered (0 to 1)]

Convex region represents the candidate region with the minimum cost

We can get away with evaluating the cost for a small number k of fractions α


How to choose the k fractions?

K = 10; tstart = 32; tend = 53

[Figure: cost (millions, axis ticks 21 to 91) vs. fraction α (0 to 1)]

Fraction       Time range      Cost
α1 = 0.11      [32-33.27]      90
α2 = 0.14      [32-33.61]      71
α3 = 0.18      [32-34.05]      52
α4 = 0.23      [32-34.60]      37
α5 = 0.30      [32-35.31]      31
α6 = 0.38      [32-36.21]      35
α7 = 0.48      [32-37.35]      41
α8 = 0.61      [32-38.80]      52
α9 = 0.78      [32-40.65]      59
α10 = 1.0      [32.0-53.0]     71

[Slide annotation: Acceptable candidates]

$r = t_{end} - t_{start}$

$\alpha_1 = r^{1/k} / r$

$\alpha_i = (r \cdot \alpha_1)^i / r$

Fraction chosen can be fine-tuned through recursive calls
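Since r·α1 = r^(1/k), the formulas give αi = r^(i/k) / r, a geometric progression that ends at αk = 1. The sketch below computes these fractions and picks the one with the lowest cost; the function names are illustrative. The fractions and time-range widths listed in the table are reproduced by r = 11 (for example, 11^(1/10) ≈ 1.27 and 1.27/11 ≈ 0.115), not by the r = 21 implied by tstart = 32 and tend = 53, so the sketch takes r as an explicit input rather than deriving it from those two values.

```python
def candidate_fractions(r, k):
    """alpha_i = r**(i/k) / r for i = 1..k; geometrically spaced, last one is 1.0."""
    return [r ** (i / k) / r for i in range(1, k + 1)]

def best_fraction(fractions, costs):
    """Return the fraction with the smallest (estimated) cost."""
    return min(zip(costs, fractions))[1]

fractions = candidate_fractions(r=11, k=10)
# Costs copied from the table on the slide, in the same order as the fractions.
costs = [90, 71, 52, 37, 31, 35, 41, 52, 59, 71]
chosen = best_fraction(fractions, costs)   # the fifth fraction, about 0.30
```

Geometric spacing places more candidates at small α, where the table shows the cost changing fastest. The recursive fine-tuning mentioned on the slide would repeat the same procedure inside the interval around the chosen fraction.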


Putting it all together

Input: Relation R, S; distance d; Parameter k

1. Fill Buffer: read from relations R and S
2. Optimizer: evaluate k fractions, choose best
3. Layered Plane Sweep: process join on best fraction
4. More data? If so, return to step 1


Benchmarking
• Code: Implemented and tested the various alternatives in C/C++
  – R-Trees, Simple Sweep, Layered Sweep, Adaptive Sweep with various parameter settings
• Workload: 2 relations, 100,000 objects (50 GB)
  – Physics-based simulation data set
  – Synthetic data set
• Hardware: Linux, 2.4 GHz Pentium Xeon, 1 GB main memory, 2 IDE drives, 15,000 rpm
• Setup: 64 KB page size, buffer size 10,000 pages


Collision Data Set

100,000 objects, collision occurs during time range [1500 - 2500]

Snapshot at timetick 1500


Results - Execution Time for different Strategies

[Figure: execution time (seconds, 0 to 12000) vs. % of join completed (0 to 100) for R-tree, simple sweep, layered sweep, and adaptive sweep with K=5, K=10 and K=20]


Buffer Choices made by the optimizer

[Figure: fraction of buffer chosen (axis ticks 30 to 100) vs. virtual time line in the data set (0 to 2800)]


Discussion
• R-trees couldn’t do enough pruning to make a difference
• Simple plane-sweep works well when there is heavy interaction among objects
• Layered plane-sweep works well when there is light interaction
• Adaptive version transitions smoothly between these extremes
• Recursive call to fine-tune candidate region doesn’t seem to help much


Conclusion
• CPA-Join for spatiotemporal relations
• Proposed a novel adaptive join algorithm for moving object histories based on an extension of the plane-sweep
• Many practical applications


Questions?

Thank You!

Subramanian (subi@ufl.edu)
