ICDE-2006
Subramanian Arumugam Christopher Jermaine
Department of Computer Science, University of Florida
22nd International Conference on Data Engineering
Closest-Point-of-Approach Join for Moving Object Histories
CPA-Join Is Useful For Analysis Of Spatiotemporal Data
Relations: commercial airliners R, single-engine planes S
“Find all commercial airliners that approached within 1000 vertical feet and 0.5 miles of a single-engine plane in the BOS/JFK/EWR/LGA corridor C in the first three months of last year”
SELECT DISTINCT (r, s)
FROM R AS r, S AS s, TIME t
WHERE dist (r, s, t) < 0.5
  AND (r(t).altd - s(t).altd) ≥ -1000
  AND (r(t).altd - s(t).altd) ≤ 1000
  AND s(t) ∈ C AND r(t) ∈ C
  AND t ≥ 'JAN-1-2005' AND t ≤ 'MAR-31-2005'
Challenges
• 3-dimensional space + time
• Large # of objects
• Massive amount of data
CPA Illustration for Straight Line Trajectories
[Figure: straight-line trajectories of Object p and Object q]
CPA: the position at which two dynamically moving objects attain their closest possible distance
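For straight-line motion the CPA has a closed form: the squared distance between the two objects is a quadratic in time, so its minimum can be found directly. The following is a minimal sketch under that assumption; the function name `cpa` and its parameters (start positions, constant velocities, and an optional time window) are illustrative choices, not taken from the slides.

```python
import math

def cpa(p0, vp, q0, vq, t0=0.0, t1=math.inf):
    """Return (t_cpa, dist_cpa) for two points moving at constant velocity.

    p0, q0 are the positions at time t0; vp, vq are the velocities.
    The CPA time is clamped to the window [t0, t1].
    """
    w = [a - b for a, b in zip(p0, q0)]    # relative position at t0
    dv = [a - b for a, b in zip(vp, vq)]   # relative velocity
    dv2 = sum(c * c for c in dv)
    if dv2 == 0.0:
        t = t0                             # same velocity: distance never changes
    else:
        # Minimizer of |w + (t - t0) * dv|^2, then clamp to the window.
        t = t0 - sum(a * b for a, b in zip(w, dv)) / dv2
        t = min(max(t, t0), t1)
    dist = math.sqrt(sum((a + (t - t0) * b) ** 2 for a, b in zip(w, dv)))
    return t, dist
```

Clamping matters for moving object histories, because each straight-line piece of a trajectory is only valid over its own time interval.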
[Figure: moving object trajectories of Object P and Object Q in the x-y plane, sampled at times 0 to 5, with a polyline approximation through the sampled positions and the CPA distance (dist_cpa) marked]
Sampled positions (x,y) listed on the slide for Object P and Object Q at times 0 to 5: 40,32 38,18 51,27 49,12 5,32 6,26 15,39 59,18 27,38 11,49 5,32 24,65
Simple CPA-Join
Find all object pairs (p ∈ P, q ∈ Q) from relations P and Q such that CPA-distance (p,q) ≤ d
Procedure CPA (Object P, Object Q, distance d)
1. List result = {};
2. for each pair of segments (p ∈ P, q ∈ Q)
3.   if CPA_distance (p,q) ≤ d
4.     result += (p,q);
5. return result;
Need to compare only those segments whose time intervals overlap: use a plane sweep
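The nested-loop procedure can be sketched as runnable code. This is an illustration under stated assumptions, not the authors' implementation: the `Segment` type, the helper names, and the choice to treat segments whose time intervals do not overlap as non-matching are all hypothetical. Each segment is assumed to have nonzero duration (t1 > t0) and to move linearly between its end positions.

```python
import math
from typing import NamedTuple, Tuple

class Segment(NamedTuple):
    obj: str                 # object identifier (hypothetical field)
    t0: float                # start time
    t1: float                # end time, assumed > t0
    a: Tuple[float, ...]     # position at t0
    b: Tuple[float, ...]     # position at t1

def position(s, t):
    """Linearly interpolated position of segment s at time t."""
    f = (t - s.t0) / (s.t1 - s.t0)
    return tuple(x + f * (y - x) for x, y in zip(s.a, s.b))

def cpa_distance(p, q):
    """Closest distance of p and q over their common time interval, or None."""
    lo, hi = max(p.t0, q.t0), min(p.t1, q.t1)
    if lo > hi:
        return None          # time intervals do not overlap
    # The relative position is linear in time on [lo, hi].
    w0 = [x - y for x, y in zip(position(p, lo), position(q, lo))]
    w1 = [x - y for x, y in zip(position(p, hi), position(q, hi))]
    d = [b - a for a, b in zip(w0, w1)]
    dd = sum(c * c for c in d)
    f = 0.0
    if dd > 0.0:
        f = min(1.0, max(0.0, -sum(a * c for a, c in zip(w0, d)) / dd))
    return math.sqrt(sum((a + f * c) ** 2 for a, c in zip(w0, d)))

def simple_cpa_join(P, Q, d):
    """All segment pairs (p, q) whose CPA distance is at most d."""
    result = []
    for p in P:
        for q in Q:
            dist = cpa_distance(p, q)
            if dist is not None and dist <= d:
                result.append((p, q))
    return result
```

The loop performs |P| x |Q| distance computations, which is what the time-overlap restriction is meant to reduce.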
CPA-Join using Simple Plane Sweep
- First sort the segments in P and Q along the time dimension (external sort)
- While there is still some unprocessed data
  - Read in enough segments from P and Q to fill the main memory buffer
  - Next, sweep a vertical line along the time dimension
  - Maintain a sweepline data structure which keeps track of all active segments that intersect the sweep line
  - As the sweep line progresses, the sweepline data structure is updated with insertions (new segments that became active) and deletions (segments whose time period has expired)
  - During updates to the sweepline structure, an all-pairs comparison returns valid results
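The sweep over one in-memory buffer can be sketched as follows. This is a simplified illustration, assuming segments are given as `(t0, t1, payload)` tuples and that the distance test is supplied by the caller as a hypothetical predicate `close(p, q)`; plain lists stand in for the sweepline data structure, and disk I/O is left out.

```python
def plane_sweep_join(P, Q, close):
    """Return pairs (p, q) with overlapping time intervals for which close(p, q) holds.

    P, Q: lists of (t0, t1, payload) tuples; close is a caller-supplied predicate.
    """
    # One insertion event per segment, ordered by start time.
    events = [(s[0], 0, s) for s in P] + [(s[0], 1, s) for s in Q]
    events.sort(key=lambda e: (e[0], e[1]))

    active = ([], [])   # active segments from P and from Q
    result = []
    for t, side, seg in events:
        # Deletions: drop segments whose time period has expired.
        for lst in active:
            lst[:] = [s for s in lst if s[1] >= t]
        # Compare the new segment against active segments of the other relation.
        for other in active[1 - side]:
            p, q = (seg, other) if side == 0 else (other, seg)
            if close(p, q):
                result.append((p, q))
        # Insertion: the new segment becomes active.
        active[side].append(seg)
    return result
```

Each time-overlapping pair is compared exactly once, at the moment the later-starting segment is inserted.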
CPA-Join using Plane Sweep
The sweep line has to pause at every new sample point encountered, so processing a multi-gigabyte data set can take a long time
[Figure: data moving from disk into the memory buffer]
Group segments using a bounding box approximation
In the best case, just 1 comparison is needed
[Figure: bounding boxes over the segments held in memory and on disk]
Algorithm: Layered Plane Sweep
While there is still some unprocessed data on disk
  Read in data from relations P and Q to fill the buffer
  Construct an MBR for the trajectory of every object in the buffer
  Sort MBRs along one of the spatial dimensions and do a plane sweep along it to identify qualifying MBR pairs
  Expand the MBRs to obtain the individual segments
  Sort segments along the time dimension and do a plane sweep along time to obtain the actual results
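The first layer of the algorithm (building MBRs and sweeping along one spatial dimension) can be sketched as below. This covers only the MBR filtering step; the segment-level sweep along time is not included. The function names, the dict-based input, and the use of purely spatial boxes (time is ignored here) are assumptions made for illustration.

```python
import math

def build_mbr(points):
    """Minimum bounding rectangle of sampled positions: (lower corner, upper corner)."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi

def mbr_gap(a, b):
    """Minimum Euclidean distance between two boxes (0 if they overlap)."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))
    return math.sqrt(sum(g * g for g in gaps))

def qualifying_mbr_pairs(P, Q, d):
    """Sweep along the first spatial dimension; return (p_id, q_id) pairs
    whose MBRs come within distance d of each other.

    P, Q: dicts mapping an object id to its list of sampled positions.
    """
    boxes = [(build_mbr(pts), 0, oid) for oid, pts in P.items()]
    boxes += [(build_mbr(pts), 1, oid) for oid, pts in Q.items()]
    boxes.sort(key=lambda e: e[0][0][0])          # order by lower x bound

    active = ([], [])
    pairs = []
    for box, side, oid in boxes:
        x = box[0][0]
        # A box can no longer qualify once its upper x bound plus d is behind the sweep.
        for lst in active:
            lst[:] = [e for e in lst if e[0][1][0] + d >= x]
        for other_box, other_id in active[1 - side]:
            if mbr_gap(box, other_box) <= d:
                pairs.append((oid, other_id) if side == 0 else (other_id, oid))
        active[side].append((box, oid))
    return pairs
```

Only the object pairs returned here would need to be expanded into segments for the second, time-ordered sweep.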
Layered Plane-Sweep Example
But one size doesn’t fit all!
A Note on Indexing
- Indexes can be used to do CPA-Join
- But (almost) all indexes use MBR approximation
- And MBRs impose predefined granularities
[Figure: objects p and q, with axes labeled x, y, z]
Layered Plane Sweep: what is the problem?
• Layered Plane Sweep always processes the entire fraction of data held in memory buffers
• When objects interact heavily, such an approach may lead to no pruning at all
In the best case, just one comparison is needed
Though less of the buffer is processed initially, overall efficiency can be better
Efficiency of the layered technique is not tied to the amount of data processed, but to choosing a granularity that minimizes the # of distance computations
Cost to Process Data in Memory Buffer
• Cost can be approximated as a function of distance computations (which dominate execution time)
  cost = (nseg + nMBR), where
  nseg is the # of segment-level comparisons
  nMBR is the # of bounding box comparisons
• In general, cost for a fraction α (0 ≤ α ≤ 1) of the buffer
  cost = (nseg + nMBR) * (1/α)
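The 1/fraction factor scales the comparison count of a partial run up to the whole buffer, so that runs over different fractions can be compared on equal terms. A minimal sketch follows, with the buffer fraction written as `alpha`; the function name is hypothetical, and the fraction is required to be strictly positive here because the formula divides by it.

```python
def normalized_cost(n_seg, n_mbr, alpha):
    """Cost of processing a fraction alpha of the buffer, scaled to the full buffer.

    n_seg: number of segment-level comparisons
    n_mbr: number of bounding-box comparisons
    alpha: fraction of the buffer processed, 0 < alpha <= 1
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return (n_seg + n_mbr) / alpha
```

With made-up counts, a run over half the buffer that needs 300 segment comparisons and 100 box comparisons has a normalized cost of 800, so it beats a full-buffer run only if that run needs more than 800 comparisons in total.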
What we have
Layered Plane Sweep
• processes a large fraction (α is large)
• good when there is light interaction
• bad when there is heavy interaction
Simple Plane Sweep
• processes a tiny fraction (α is small)
• good when there is heavy interaction
• bad when there is light interaction
What we want
An Adaptive Algorithm
• processes a fraction that maximizes performance (α varies)
• tunes to the characteristics of the underlying data
• provides superior performance under all scenarios
Algorithm: Adaptive Plane Sweep
While there is still some unprocessed data on disk
  Read in data from relations P and Q to fill the buffer
  Choose a fraction of the data that maximizes performance
  Process the chosen fraction of data using Layered Plane Sweep
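The control flow of the adaptive loop can be sketched as a small driver. This is a structural illustration only: the time-ordered input is modeled as a plain iterable, the fraction is applied by item count rather than by time range, and `estimate_cost` and `process` are hypothetical callables standing in for the cost estimator and for Layered Plane Sweep.

```python
import math
from itertools import islice

def adaptive_plane_sweep(stream, capacity, fractions, estimate_cost, process):
    """Repeatedly fill a buffer, pick the cheapest fraction, and process it.

    stream:        iterable of time-ordered items (stand-in for data on disk)
    capacity:      buffer size in items
    fractions:     candidate buffer fractions, each in (0, 1]
    estimate_cost: callable (buffer, fraction) -> estimated cost
    process:       callable (items) -> list of results
    """
    it = iter(stream)
    buffer, out = [], []
    while True:
        # Fill the buffer; unprocessed leftovers from the last round stay in place.
        buffer.extend(islice(it, capacity - len(buffer)))
        if not buffer:
            return out
        # Choose the fraction with the minimum estimated cost.
        alpha = min(fractions, key=lambda a: estimate_cost(buffer, a))
        n = max(1, math.ceil(alpha * len(buffer)))
        # Process only the chosen fraction, then drop it from the buffer.
        out.extend(process(buffer[:n]))
        del buffer[:n]
```

Items that are not processed in one round remain buffered and are reconsidered after the refill, so a small fraction in a dense region does not lose data.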
Goal: Choose a fraction of data that maximizes performance
“Evaluate increasing buffer fractions from 0 to 1 and choose the fraction with the minimum cost”
How many fractions should we consider?
How to estimate the cost for a given fraction α?
How to estimate Cost for a given fraction
• Exact cost is known only after the fact!
• To know the cost associated with a given α, we need to actually execute the join (layered plane sweep) at that granularity
Estimate cost using a simple online sampling algorithm [HH97]
How to estimate Cost for a given fraction (Contd.)
Cost Estimation through sampling
Given: Relations P and Q and α
Consider segments within α
Construct MBRs for the objects in P
Until the estimate of cost is accurate to within +/- 10%
– Pick randomly an object q1 from Q and construct an MBR for its trajectory
– Join q1 with all objects in P
– Compute nMBR,q1 and nseg,q1
– Estimate cost
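The sampling loop can be sketched as follows. The stopping rule used here, a normal-approximation confidence interval on the mean per-object cost, is an assumption chosen for illustration and is not necessarily the estimator of [HH97]. The callable `cost_of(q)` is hypothetical: it stands for joining one sampled object q with all objects in P and returning nMBR,q + nseg,q.

```python
import math
import random
import statistics

def estimate_cost(Q, cost_of, rel_err=0.10, z=1.96,
                  min_samples=30, max_samples=10_000, rng=random):
    """Estimate the total join cost by sampling objects from Q with replacement.

    Q:        non-empty list of objects
    cost_of:  callable q -> comparison count for joining q with all of P
    rel_err:  target relative half-width of the confidence interval (+/- 10%)
    z:        normal quantile for the confidence level (1.96 is about 95%)
    """
    samples = []
    while len(samples) < max_samples:
        samples.append(cost_of(rng.choice(Q)))
        if len(samples) >= min_samples:
            mean = statistics.fmean(samples)
            half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
            if half_width <= rel_err * mean:
                break       # estimate is within the requested accuracy
    # Scale the mean per-object cost up to the whole relation.
    return len(Q) * statistics.fmean(samples)
```

The estimator touches only a sample of Q, which is what makes it cheap enough to run once per candidate fraction.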
How many fractions to consider?
– Computing the cost for all α is not practical: it would offset any benefit that we gain from the adaptive technique. We need a strategy to limit the # of fractions that we process.
“Evaluate increasing buffer fractions from 0 to 1 and choose the fraction with the minimum cost”
How many fractions to consider?
The α vs. cost graph is not linear; it exhibits convexity
Convex region represents the candidate region with the minimum cost
We can get away with evaluating the cost for a small number k of fractions α
[Figure: cost (millions, axis ticks 21 to 91) vs. fraction considered (0 to 1)]
How to choose the k fractions?
k = 10; tstart = 32; tend = 53
[Figure: cost vs. fraction considered (0 to 1), with the acceptable candidates marked]

Fraction      Time range      Cost
α1 = 0.11     [32-33.27]      90
α2 = 0.14     [32-33.61]      71
α3 = 0.18     [32-34.05]      52
α4 = 0.23     [32-34.60]      37
α5 = 0.30     [32-35.31]      31
α6 = 0.38     [32-36.21]      35
α7 = 0.48     [32-37.35]      41
α8 = 0.61     [32-38.80]      52
α9 = 0.78     [32-40.65]      59
α10 = 1.0     [32.0-53.0]     71

r = tend - tstart
α_1 = r^(1/k) / r
α_i = (r · α_1)^i / r
Fraction chosen can be fine-tuned through recursive calls
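The progression of candidate fractions can be generated directly from these formulas: since r · α_1 = r^(1/k), successive fractions grow by the constant factor r^(1/k) and the k-th fraction equals 1. A minimal sketch, in which the function name is hypothetical and r ≥ 1 is assumed so that every fraction stays at or below 1:

```python
def candidate_fractions(t_start, t_end, k):
    """Return the k geometrically spaced buffer fractions alpha_1 .. alpha_k.

    r = t_end - t_start is the time range covered by the buffer (assumed >= 1).
    alpha_1 = r**(1/k) / r and alpha_i = (r * alpha_1)**i / r.
    """
    r = t_end - t_start
    if r < 1 or k < 1:
        raise ValueError("requires t_end - t_start >= 1 and k >= 1")
    alpha_1 = r ** (1.0 / k) / r
    return [(r * alpha_1) ** i / r for i in range(1, k + 1)]
```

A fraction `a` from this list corresponds to the time range `[t_start, t_start + a * r]`. The geometric spacing places more candidates at small fractions, where a change in granularity has the largest relative effect.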
Putting it all together
Input: relations R, S; distance d; parameter k
1. Fill Buffer: read from relations R and S
2. Optimizer: evaluate k fractions, choose best
3. Layered Plane Sweep: process join on best fraction
4. More data? If yes, go back to Fill Buffer
Benchmarking
• Code: Implemented and tested the various alternatives in C/C++
  – R-Trees, Simple Sweep, Layered Sweep, Adaptive Sweep with various parameter settings
• Workload: 2 relations, 100,000 objects (50 GB)
  – Physics-based simulation data set
  – Synthetic data set
• Hardware: Linux, 2.4 GHz Pentium Xeon, 1 GB main memory, 2 IDE drives (15,000 rpm)
• Setup: 64 KB page size, buffer size 10,000 pages
Collision Data Set
100,000 objects, collision occurs during time range [1500 - 2500]
Snapshot at timetick 1500
Results - Execution Time for different Strategies
[Figure: execution time (seconds, 0 to 12000) vs. % of join completed (0 to 100) for R-tree, simple sweep, layered sweep, and adaptive sweep with K=20, K=10, K=5]
Buffer Choices made by the optimizer
[Figure: fraction of buffer chosen (axis ticks 30 to 100) vs. virtual time line in the data set (0 to 2800)]
Discussion
• R-trees couldn’t do enough pruning to make a difference
• Simple plane-sweep works well when there is heavy interaction among objects
• Layered plane-sweep works well when there is light interaction
• Adaptive version transitions smoothly between these extremes
• Recursive call to fine-tune candidate region doesn’t seem to help much
Conclusion
• CPA-Join for spatiotemporal relations
• Proposed a novel adaptive join algorithm for moving object histories based on an extension of the plane sweep
• Many practical applications