University of Massachusetts Amherst · Department of Computer Science Optimizing Probabilistic Query Processing on Continuous Uncertain Data Liping Peng Yanlei Diao Anna Liu VLDB 2011 Seattle WA, US


Page 1

University of Massachusetts Amherst · Department of Computer Science

Optimizing Probabilistic Query Processing on Continuous Uncertain Data

Liping Peng, Yanlei Diao, Anna Liu

VLDB 2011, Seattle, WA, US

Page 2

Applications of Uncertain Data Management


Page 3

Motivating Application – Sloan Digital Sky Survey

Q1:
SELECT *
FROM Galaxy AS G
WHERE G.r < 22
  AND G.q_r^2 + G.u_r^2 > 0.25

Q2:
SELECT *
FROM Galaxy AS G1, Galaxy AS G2
WHERE G1.OBJ_ID < G2.OBJ_ID
  AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
  AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
  AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4

name | type | description
OBJ_ID | bigint | SDSS identifier
… | … | …
(rowc, rowc_err) | real | (row center position, error term)
(colc, colc_err) | real | (column center position, error term)
(q_u, qErr_u) | real | (Stokes Q parameter, error term)
(u_u, uErr_u) | real | (Stokes U parameter, error term)
(ra, dec, ra_err, dec_err, ra_dec_corr) | real | (right ascension, declination, error in ra, error in dec, ra/dec correlation)
… | … | …

Continuous uncertain data

Complex selection and join predicates

Return answers of high confidence efficiently

Page 4

Previously Proposed Data Model

[Tran et al. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010; Tran et al. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations. PVLDB 2010]

Gaussian Mixture Models (GMMs) for continuous uncertain attributes

• Flexible
• Succinct
• Computation efficiency

Tuple model:
Object_ID | Speed | X | Y | TEP
MA123456 | [distribution plot] | [distribution plot] | [distribution plot] | 0.7

Page 5

Scope of Problem

Continuous uncertain data: Gaussian Mixture Models (GMMs)

Select-Project-Join (SPJ) queries with threshold λ

Results with tuple existence probability (TEP) > λ

Example (λ=0.7):
SELECT *
FROM Galaxy AS G
WHERE G.r > 22

Input tuples:
rid | TEP
1 | 1
2 | 1

After the selection:
rid | TEP
1 | 0.8
2 | 0.5

Probabilistic threshold query processing and optimization
• Avoid expensive operations for non-viable tuples
• Find efficient plans based on predicates and distributions
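The threshold semantics on this slide can be sketched for a single Gaussian attribute. This is a minimal illustration, not the authors' implementation: it assumes the output TEP is the input TEP multiplied by Pr[r > 22] (existence independent of the attribute value), and the tuple means and standard deviations below are hypothetical values chosen only to land near the slide's TEPs of 0.8 and 0.5.

```python
from statistics import NormalDist

def threshold_select(tuples, cutoff, lam):
    """Keep tuples whose output TEP = input TEP * Pr[r > cutoff] exceeds lam."""
    results = []
    for rid, tep_in, mu, sigma in tuples:
        # Probability that the Gaussian attribute r satisfies r > cutoff.
        p_pred = 1.0 - NormalDist(mu, sigma).cdf(cutoff)
        tep_out = tep_in * p_pred
        if tep_out > lam:
            results.append((rid, round(tep_out, 3)))
    return results

# Hypothetical tuples: (rid, input TEP, mean of r, std of r).
galaxies = [(1, 1.0, 22.84, 1.0), (2, 1.0, 22.0, 1.0)]
print(threshold_select(galaxies, cutoff=22, lam=0.7))
```

With λ=0.7, only the first tuple survives; the second has probability 0.5 of satisfying the predicate and is dropped.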

Page 6

Outline

Motivation

Optimize Probabilistic Threshold Selections

Optimize Probabilistic Threshold Joins

Per-tuple Based Planning and Execution

Evaluation

Page 7

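Probabilistic Threshold Selections

SELECT *
FROM Galaxy AS G
WHERE G.q_r^2 + G.u_r^2 < 0.25

Return tuples with TEP > 0.8 (λ)

Continuous uncertain attributes: S = {q_r, u_r}

Selection condition θ: the WHERE predicate

Selection region Rθ: the set of (q_r, u_r) values that satisfy θ

Given a tuple with distribution f, the probability to satisfy θ must exceed λ: $\int_{R_\theta} f(x)\,dx > \lambda$

[Figure: density f over the (q_r, u_r) plane with the selection region Rθ]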

Page 8

Probabilistic Threshold Selections

Q2:
SELECT *
FROM Galaxy AS G1, Galaxy AS G2
WHERE G1.OBJ_ID < G2.OBJ_ID
  AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
  AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
  AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4

Return tuples with TEP > 0.8 (λ)

Continuous uncertain attributes: S = {G1.u, G1.g, G1.r, G1.rowc, G1.colc, G2.u, G2.g, G2.r, G2.rowc, G2.colc}

Selection condition θ, selection region Rθ

Given a tuple with distribution f, the probability to satisfy θ must exceed λ: $\int_{R_\theta} f(x)\,dx > \lambda$

A high-dimensional integral for each tuple!

Page 9

Applying Fast Filters to Avoid Integrals

Derive an upper bound (Ũ) for the integral at a low cost
• If Ũ < λ, filter tuples without computing integrals
• Otherwise, still integrate to compute the true probability

A general approach to derive an upper bound
• Given a tuple X, define a (multi-dim) Chebyshev region Rλ(X)
• Test the overlap of Rλ(X) with predicate region Rθ
• If Rλ(X) and Rθ are disjoint, filter the tuple

A geometric intersection problem
• Constrained optimization generally. Can use techniques like Lagrange multipliers

[Figure: Chebyshev region Rλ(X) and predicate region Rθ for |u| < 0.2 and |g| < 0.2; axes u and g, with ticks at -0.2 and 0.2]

Fast filters for common predicates
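A filter of this kind can be sketched for the simplest case. The slides do not give the formula for the Chebyshev region, so the sketch below substitutes the standard multivariate Chebyshev bound, Pr[M² ≥ d²] ≤ n/d², where M² is the squared Mahalanobis distance from the mean. It further assumes uncorrelated attributes (diagonal covariance) and an axis-aligned box predicate, so the closest point of Rθ to the mean is found by clamping. The function names are hypothetical.

```python
def chebyshev_upper_bound(mu, sigma, box):
    """Upper-bound Pr[X in box] from means and standard deviations only.

    mu, sigma: per-attribute mean and std (attributes assumed uncorrelated).
    box: list of (low, high) intervals, one per attribute.
    """
    d2 = 0.0
    for m, s, (lo, hi) in zip(mu, sigma, box):
        nearest = min(max(m, lo), hi)  # closest point of the interval to the mean
        d2 += ((nearest - m) / s) ** 2
    if d2 == 0.0:
        return 1.0  # the mean lies inside the box, so the bound is vacuous
    # Every point of the box is at squared Mahalanobis distance >= d2.
    return min(1.0, len(mu) / d2)

def passes_filter(mu, sigma, box, lam):
    """False means the tuple can be dropped without computing the integral."""
    return chebyshev_upper_bound(mu, sigma, box) >= lam

# Predicate from the slide: |u| < 0.2 and |g| < 0.2.
box = [(-0.2, 0.2), (-0.2, 0.2)]
far_tuple = ([1.0, 1.0], [0.1, 0.1])   # hypothetical tuple far from the box
near_tuple = ([0.1, 0.0], [0.1, 0.1])  # hypothetical tuple centred inside the box
print(passes_filter(*far_tuple, box, lam=0.7))
print(passes_filter(*near_tuple, box, lam=0.7))
```

A tuple that passes the filter still needs the exact integral; the bound only rules tuples out.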

Page 10

Reducing Dimensionality of Integration

σθ : n-dim space
• region: Rθ
• distribution: fX(x)
• integral: $\int_{R_\theta} f_X(x)\,dx$

Linear transformation (LT): Y = BX + b

σθ' : m-dim space
• region: R'θ = {y | y = Bx + b, x ∈ Rθ}
• distribution: fY(y)
• integral: $\int_{R'_\theta} f_Y(y)\,dy$

An algorithm to find a transformation matrix B of size m×n, m ≤ n
• if m < n, LT helps to reduce dimensionality
• if m = n, LT does not help

Let X ~ N(μ, Σ) be n-dimensional and Y = BX + b, with B of size m×n and b of size m×1; then Y ~ N(Bμ + b, BΣB^T) is m-dimensional
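The Gaussian transformation rule above can be illustrated with a small sketch. It uses plain Python lists as matrices and a hypothetical predicate |u − g| < 0.05 over two correlated attributes, so that B = [1, −1] turns a 2-dimensional integral into a 1-dimensional one. The means and covariance are made-up values, and the slides' algorithm for finding B is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist

def transform_gaussian(B, b, mu, Sigma):
    """Return (B mu + b, B Sigma B^T) for matrices given as nested lists."""
    m, n = len(B), len(mu)
    new_mu = [sum(B[i][k] * mu[k] for k in range(n)) + b[i] for i in range(m)]
    # BS = B * Sigma (m x n), then new_Sigma = BS * B^T (m x m).
    BS = [[sum(B[i][k] * Sigma[k][j] for k in range(n)) for j in range(n)]
          for i in range(m)]
    new_Sigma = [[sum(BS[i][k] * B[j][k] for k in range(n)) for j in range(m)]
                 for i in range(m)]
    return new_mu, new_Sigma

# Hypothetical tuple: (u, g) jointly Gaussian with some positive correlation.
mu = [1.20, 1.18]
Sigma = [[0.010, 0.002],
         [0.002, 0.010]]

# Y = u - g, so the predicate |u - g| < 0.05 becomes -0.05 < Y < 0.05.
mu_y, Sigma_y = transform_gaussian([[1.0, -1.0]], [0.0], mu, Sigma)
y = NormalDist(mu_y[0], sqrt(Sigma_y[0][0]))
prob = y.cdf(0.05) - y.cdf(-0.05)
print(mu_y, Sigma_y, prob)
```

Here m = 1 < n = 2, so the exact selection needs only two evaluations of a 1-D cdf instead of a 2-D integral.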

Page 11

Outline

Motivation

Optimize Probabilistic Threshold Selections

Optimize Probabilistic Threshold Joins

Per-tuple Based Planning and Execution

Evaluation

Page 12

Probabilistic Threshold Joins

A probabilistic threshold join of relations R and S returns the tuple pairs (r,s), with r in R and s in S, whose probability of satisfying the join condition θ exceeds λ

True match: tuple pair (r,s) such that Pr[θ(r,s)] > λ

Large numbers of intermediate tuples!

Key idea: filtered cross-product using indexes
• For each tuple r, the index returns a subset of S to pair with r
• (r,s) pairs returned by the index include all true matches
  • A necessary condition for a true match
  • "Tight" enough; a sufficient and necessary condition if possible

Page 13

Designing an Index

Build an index on S for a < R.A - S.A < b

case | search key | query region
Deterministic | S.A | Instantiate with a deterministic value of R.A. E.g. when R.A=5, 5-b < S.A < 5-a
Probabilistic | Quantities concerning S | Instantiate with quantities concerning R

Probabilistic case: a distribution instead of a deterministic value! The index encodes a necessary condition for a true match.

Page 14

Band Join of GMMs (a < R.A - S.A < b)

r.A: Xr, μr, σr²

s.A: Xs, μs, σs²

Z = Xr - Xs follows a GMM with μz = μr - μs and σz² = σr² + σs²

Theorem 1:

Necessary condition:

Search key:

Query region:

Overlap test of RQ1 and RI [x1,x2;y1,y2]: RI overlaps with RQ1 if its upper left vertex (x1,y2) is in RQ1

[Figure: index space with axes x and y, a boundary labelled μr-a, and index regions R1 to R7]
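The difference rule for Z = Xr − Xs can be checked with a brute-force sketch that computes the exact band-join probability for one tuple pair. This is the quantity the index is meant to avoid computing for every pair, not the index itself. It assumes Xr and Xs are independent and represents each 1-D GMM as a hypothetical list of (weight, mean, std) components; each pair of components contributes a Gaussian with mean μr − μs and variance σr² + σs².

```python
from math import sqrt
from statistics import NormalDist

def band_join_probability(gmm_r, gmm_s, a, b):
    """Pr[a < Xr - Xs < b] for independent 1-D GMMs of (weight, mean, std)."""
    total = 0.0
    for wr, mur, sr in gmm_r:
        for ws, mus, ss in gmm_s:
            # Component of Z: mean mur - mus, variance sr^2 + ss^2.
            z = NormalDist(mur - mus, sqrt(sr * sr + ss * ss))
            total += wr * ws * (z.cdf(b) - z.cdf(a))
    return total

# Hypothetical single-component tuples, so Z ~ N(0.1, 0.5^2).
r = [(1.0, 5.0, 0.3)]
s = [(1.0, 4.9, 0.4)]
p = band_join_probability(r, s, a=-1.0, b=1.0)
print(p, p > 0.7)  # the pair is a true match when p exceeds the threshold
```

A nested loop over R and S with this test is the unfiltered cross-product; the indexes on these slides aim to return only a candidate subset of S for each r.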

Page 15

Band Join of Gaussians (a < R.A - S.A < b)

Z = Xr - Xs

Theorem 2: Given Z ~ N(μ,σ²), Pr[a<Z<b] > λ iff there exists an … such that …
(the condition involves the inverse of the standard normal cdf)

Gaussian properties offer a sufficient and necessary condition

Search key:

Query region:

Overlap test: Requires math derivation; can be implemented efficiently

[Figure: index space with axes x' and y']

Page 16

Outline

Motivation

Optimize Probabilistic Threshold Selections

Optimize Probabilistic Threshold Joins

Per-tuple Based Planning and Execution

Evaluation

Page 17

Query Planning

Logical Operators

Physical Operators
• Faster filters based on inequalities
• Filtered cross-product using indexes
• Exact selection using integrals (with LT)

How to arrange operators to get an efficient plan?

Page 18

Per-tuple Based Planning

Q1: SELECT * FROM Galaxy WHERE r < 24 AND q_r^2 + u_r^2 > 0.25
(θ1: r < 24; θ2: q_r^2 + u_r^2 > 0.25)

Tuple Attributes
id | r | q_r | u_r
1 | N(27, 2.2) | N(1, 2.2) | N(0.1, 1.1)
2 | N(21, 0.1) | N(0, 0.1) | N(-0.1, 0.1)

Predicate Selectivities
tuple | θ1 | θ2
t1 | 0.08 | 0.95
t2 | 1 | 0.0002

[Figure: r axis with ticks 20, 25, 30 and the cutoff 24]

Consider both selectivity and cost like the traditional planner

Differences
• Exact selections are expensive due to the use of integrals
• Selectivity should be defined on a per-tuple basis

=> The optimal order varies on a per-tuple basis

Optimal plan for t1: θ1, then θ2

Optimal plan for t2: θ2, then θ1
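The per-tuple reordering can be sketched as follows. This is an illustration under several assumptions: the second parameter of N(·,·) is read as a standard deviation (the slides do not say whether it is a standard deviation or a variance), the θ2 selectivities are copied from the slide rather than computed, and both predicates are treated as equally expensive, so the more selective predicate simply goes first.

```python
from statistics import NormalDist

def selectivity_r_less_than(mu, sigma, cutoff=24.0):
    """Per-tuple probability that r < cutoff for r ~ N(mu, sigma^2)."""
    return NormalDist(mu, sigma).cdf(cutoff)

# Distribution of r per tuple, as (mean, assumed std), from the slide.
r_params = {"t1": (27.0, 2.2), "t2": (21.0, 0.1)}
# Selectivities of theta2 copied from the slide, not recomputed here.
theta2_from_slide = {"t1": 0.95, "t2": 0.0002}

def plan_for(tid):
    """Evaluate the predicate with the lower per-tuple selectivity first."""
    s1 = selectivity_r_less_than(*r_params[tid])
    s2 = theta2_from_slide[tid]
    return ["theta1", "theta2"] if s1 <= s2 else ["theta2", "theta1"]

for tid in ("t1", "t2"):
    print(tid, plan_for(tid))
```

Under these assumptions the computed θ1 selectivity for t1 is close to, but not identical to, the 0.08 shown on the slide; the resulting orders differ between the two tuples, which is the point of per-tuple planning.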

Page 19

Tuple-based Query Planning and Execution

Tuple t1 from R needs to go through three selection predicates and five join predicates

To-process tuple pool: t1, t2, t3, t4

Selection predicates: σθ1 σθ2 σθ3

Join predicates: θ4 θ5 θ6 θ7 θ8

Predicates on R | σθ1 | σθ2 | σθ3
Est. cost | 100 | 300 | 10^4
Selectivity | 0.8 | 0.2 | 0.1
Rank | 2 | 1 | 3

Join R with | S, T
Predicate | θ4 | θ5 | θ6 | θ7 | θ8
Est. cost | 500 | 300 | 100 | 10^4 | 50
Has index | Y | Y | N | Y | N
#candidates | 10 4 105 1021
Choose | ✓

Step 1: Estimate selectivities and rank selection predicates
Step 2: Execute filters first, then exact selections
Step 3: Choose a relation to join with
Step 4: Execute the (filtered) cross-product

selection: θ4 θ5 θ6 join: θ7 θ8
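Step 1 can be sketched with the classical predicate-ordering metric, rank = (1 − selectivity) / cost, evaluated highest first. The slides do not state which metric is used, so this formula is an assumption. The inputs are the slide's estimated costs for σθ1 to σθ3 (100, 300, 10^4) and a reading of its fused selectivity and rank row as selectivities 0.8, 0.2, 0.1, which is also an assumption.

```python
def rank_predicates(preds):
    """Order predicates by (1 - selectivity) / cost, most beneficial first."""
    return sorted(preds,
                  key=lambda p: (1.0 - p["sel"]) / p["cost"],
                  reverse=True)

# Costs from the slide; selectivities as read from the garbled table row.
preds = [
    {"name": "theta1", "cost": 100, "sel": 0.8},
    {"name": "theta2", "cost": 300, "sel": 0.2},
    {"name": "theta3", "cost": 10**4, "sel": 0.1},
]
order = [p["name"] for p in rank_predicates(preds)]
print(order)
```

The metric favours predicates that drop many tuples per unit of cost, so a cheap but unselective predicate can still be evaluated before an expensive, highly selective one.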

Page 20

Outline

Motivation

Optimize Probabilistic Threshold Selections

Optimize Probabilistic Threshold Joins

Per-tuple Based Planning and Execution

Evaluation using Data and Queries from SDSS

Page 21

Fast Filters for Selections

General filter vs. exact integration

SELECT * FROM Galaxy WHERE 100<rowc<100+δ AND 100<colc<100+δ (λ=0.7)

Data characteristics
• Gaussians (from SDSS)

Parameters
• δ: predicate range
• λ: probability threshold

Metrics
• Time per tuple

Results
• Without filters, constant high cost for all ranges tested
• With filters, per-tuple cost is very low for small predicate ranges
• More improvement for larger λ values tested

Page 22

Indexes for Band Joins (stream)

SELECT * FROM Galaxy AS R, Galaxy AS S WHERE |R.u-S.u|<δ (λ=0.7, W=500)

Xbound join index [R. Cheng et al. VLDB 2004 & CIKM 2006]
• Given a distribution f and [l,u], store x% quantiles from both ends
• A loose necessary condition for true matches

[Figure: xbound vs GaussJoin in efficiency]

[Figure: xbound vs GaussJoin in filtering power]

• Our index for Gaussians returns exactly the true match set
• Xbound returns more candidates
• Our index significantly outperforms xbound in efficiency

Page 23

Tuple Based Planning and Execution

Q1:
SELECT *
FROM Galaxy AS G
WHERE G.r < δ1                      (θ1)
  AND G.q_r^2 + G.u_r^2 > δ2^2      (θ2)

Static query planning [Y. Qi et al. SIGMOD 2010]
• A fixed plan for each query based on the selectivities of predicates over the entire data set

Dynamic query planning
• Rank predicates for each tuple

Optimal query planning
• Generate the best plan for each tuple offline and load it into memory before execution

δ1 | δ2 | static order | static time (ms) | dynamic time (ms) | performance gain | optimal time (ms)
20 | 0.2 | [1 2] | 0.6 | 0.181 | 70% | 0.177
20 | 0.5 | [1 2] | 0.6 | 0.068 | 89% | 0.067
20 | 1 | [2 1] | 9.6 | 0.050 | 99% | 0.048
22 | 0.2 | [2 1] | 18.2 | 7.216 | 60% | 7.007
22 | 0.5 | [2 1] | 13.9 | 1.515 | 89% | 1.482
22 | 1 | [2 1] | 9.6 | 0.351 | 96% | 0.348
24 | 0.2 | [2 1] | 18.2 | 15.613 | 14% | 15.287
24 | 0.5 | [2 1] | 14.4 | 6.390 | 56% | 6.334
24 | 1 | [2 1] | 9.6 | 2.264 | 76% | 2.236

Over 50% gains in most cases

Very close to the optimal in all cases
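The slides do not say how the performance gain column is computed. The short check below assumes gain = (static time − dynamic time) / static time and compares the rounded result with the reported percentage for every row of the table.

```python
def performance_gain(static_ms, dynamic_ms):
    """Relative time saved by dynamic planning over static planning."""
    return (static_ms - dynamic_ms) / static_ms

# (delta1, delta2, static ms, dynamic ms, reported gain in %) from the table.
rows = [
    (20, 0.2, 0.6, 0.181, 70), (20, 0.5, 0.6, 0.068, 89), (20, 1, 9.6, 0.050, 99),
    (22, 0.2, 18.2, 7.216, 60), (22, 0.5, 13.9, 1.515, 89), (22, 1, 9.6, 0.351, 96),
    (24, 0.2, 18.2, 15.613, 14), (24, 0.5, 14.4, 6.390, 56), (24, 1, 9.6, 2.264, 76),
]

computed = [round(100 * performance_gain(s, d)) for _, _, s, d, _ in rows]
reported = [g for *_, g in rows]
print(computed == reported)
print(sum(g > 50 for g in reported), "of", len(rows), "rows exceed 50%")
```

Under that assumption the computed values agree with the reported column, and eight of the nine settings show gains above 50%.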

Page 24

Conclusions

Optimize probabilistic threshold selections
• Fast filters to avoid integrals
• Reducing dimensionality of integration by linear transformation

Optimize probabilistic threshold joins
• Filtered cross-product using new indexes

Dynamic, per-tuple based planning

Evaluation
• Significant performance gains over the state-of-the-art indexing technique and query optimizer

Future work
• Extend to a larger class of queries including group-by aggregates
• Support user-defined functions
• Query optimization with correlated tuples

Page 25

Thank you!

Q & A

Optimizing Probabilistic Query Processing on Continuous Uncertain Data

Liping Peng, Yanlei Diao, Anna Liu

http://claro.cs.umass.edu/