
University of Massachusetts Amherst · Department of Computer Science

Optimizing Probabilistic Query Processing on Continuous Uncertain Data

Liping Peng, Yanlei Diao, Anna Liu

VLDB 2011, Seattle, WA, US


Applications of Uncertain Data Management



Motivating Application – Sloan Digital Sky Survey

Q1:
    SELECT *
    FROM Galaxy AS G
    WHERE G.r < 22
      AND G.q_r^2 + G.u_r^2 > 0.25

Q2:
    SELECT *
    FROM Galaxy AS G1, Galaxy AS G2
    WHERE G1.OBJ_ID < G2.OBJ_ID
      AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
      AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
      AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4

name                                      type     description
OBJ_ID                                    bigint   SDSS identifier
…                                         …
(rowc, rowc_err)                          real     (row center position, error term)
(colc, colc_err)                          real     (column center position, error term)
(q_u, qErr_u)                             real     (Stokes Q parameter, error term)
(u_u, uErr_u)                             real     (Stokes U parameter, error term)
(ra, dec, ra_err, dec_err, ra_dec_corr)   real     (right ascension, declination, error in ra, error in dec, ra/dec correlation)
…                                         …

Continuous uncertain data

Complex selection and join predicates

Return answers of high confidence efficiently


Previously Proposed Data Model

[Tran et al. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010;
Tran et al. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations. PVLDB 2010]

Gaussian Mixture Models (GMMs) for continuous uncertain attributes

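Example tuple: Object_ID MA123456 with continuous uncertain attributes Speed, X, Y

• Flexible
• Succinct
• Computationally efficient

Tuple model: each tuple carries a tuple existence probability (TEP), e.g. 0.7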
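As a small illustration of why GMMs are computationally convenient, the sketch below evaluates the cumulative distribution of a one-dimensional mixture as a weighted sum of Gaussian CDFs. The function names and the example mixture for Speed are hypothetical and are not taken from the slides.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gmm_cdf(x, components):
    """CDF of a 1-D Gaussian mixture given as (weight, mean, std_dev) triples."""
    return sum(w * normal_cdf(x, mu, sigma) for w, mu, sigma in components)

def gmm_interval_prob(lo, hi, components):
    """Pr[lo < X < hi] for a 1-D Gaussian mixture."""
    return gmm_cdf(hi, components) - gmm_cdf(lo, components)

# Hypothetical Speed attribute: two components with weights 0.6 and 0.4.
speed = [(0.6, 50.0, 2.0), (0.4, 60.0, 3.0)]
prob_speed_above_55 = 1.0 - gmm_cdf(55.0, speed)
```

Because each component has a closed-form CDF, a range predicate on a single GMM attribute needs no numerical integration.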


Scope of Problem

Continuous uncertain data
• Gaussian Mixture Models (GMMs)

Select-Project-Join (SPJ) queries with threshold λ
• Results with tuple existence probability (TEP) > λ

Probabilistic threshold query processing and optimization
• Avoid expensive operations for non-viable tuples
• Find efficient plans based on predicates and distributions

Example (λ=0.7): SELECT * FROM Galaxy AS G WHERE G.r > 22
Input tuples rid 1 and rid 2 both have TEP 1; after the selection, rid 1 has TEP 0.8 and rid 2 has TEP 0.5.


Outline

Motivation

Optimize Probabilistic Threshold Selections

Optimize Probabilistic Threshold Joins

Per-tuple Based Planning and Execution

Evaluation


Probabilistic Threshold Selections

SELECT * FROM Galaxy AS G WHERE G.q_r^2 + G.u_r^2 < 0.25
Return tuples with TEP > 0.8 (λ)

Continuous uncertain attributes: S={q_r, u_r}

Selection condition θ, with selection region Rθ in the (q_r, u_r) plane

Given a tuple with distribution f, the probability to satisfy θ is $\int_{R_\theta} f(x)\,dx$; the tuple is returned if this probability is > λ.
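To make the definition concrete, the sketch below estimates the probability mass inside the selection region of q_r^2 + u_r^2 < 0.25 by Monte Carlo sampling. This is a stand-in for the exact integration the slides refer to, not the authors' method, and it assumes independent Gaussian attributes; the function names and parameters are hypothetical.

```python
import random

def estimate_tep(mu_q, sd_q, mu_u, sd_u, radius, n_samples=20000, seed=0):
    """Estimate Pr[q_r^2 + u_r^2 < radius^2] for independent Gaussian q_r and u_r."""
    rng = random.Random(seed)          # fixed seed keeps the estimate repeatable
    r2 = radius * radius
    hits = 0
    for _ in range(n_samples):
        q = rng.gauss(mu_q, sd_q)
        u = rng.gauss(mu_u, sd_u)
        if q * q + u * u < r2:         # the sample falls inside the selection region
            hits += 1
    return hits / n_samples

def passes_threshold(tep, lam):
    """A tuple is returned only if its probability strictly exceeds the threshold."""
    return tep > lam

# Hypothetical tuple concentrated near the origin; 0.25 = 0.5^2, so radius = 0.5.
tep = estimate_tep(0.0, 0.1, 0.0, 0.1, radius=0.5)
keep = passes_threshold(tep, 0.8)
```

For a zero-mean isotropic Gaussian the true value is 1 - exp(-radius^2 / (2 sd^2)), which gives a quick sanity check on the estimate.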


Probabilistic Threshold Selections

Q2:
    SELECT *
    FROM Galaxy AS G1, Galaxy AS G2
    WHERE G1.OBJ_ID < G2.OBJ_ID
      AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
      AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
      AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4
Return tuples with TEP > 0.8 (λ)

Continuous uncertain attributes: S={G1.u, G1.g, G1.r, G1.rowc, G1.colc, G2.u, G2.g, G2.r, G2.rowc, G2.colc}

Selection condition θ, with selection region Rθ

Given a tuple with distribution f, the probability to satisfy θ is $\int_{R_\theta} f(x)\,dx$; the tuple is returned if this probability is > λ.

A high-dimensional integral for each tuple!


Applying Fast Filters to Avoid Integrals

Derive an upper bound (Ũ) for the integral at a low cost
• If Ũ<λ, filter tuples without computing integrals
• Otherwise, still integrate to compute the true probability

A general approach to derive an upper bound
• Given a tuple X, define a (multi-dim) Chebyshev region Rλ(X)
• Test the overlap of Rλ(X) with predicate region Rθ
• If Rλ(X) and Rθ are disjoint, filter the tuple

A geometric intersection problem
• Constrained optimization generally. Can use techniques like Lagrange multipliers

Fast filters for common predicates

[Figure: Chebyshev region Rλ(X) and predicate region Rθ for |u|<0.2 and |g|<0.2, drawn in the (u, g) plane with Rθ spanning -0.2 to 0.2 on both axes]
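The filter idea can be sketched for one simple case: independent attributes (diagonal covariance) and an axis-aligned box predicate such as |u|<0.2 and |g|<0.2. The bound used here is the multi-dimensional Chebyshev inequality, Pr[d²(X) ≥ k²] ≤ n/k², where d² is the squared Mahalanobis distance from the mean. The slides do not spell out how Rλ(X) is constructed, so this construction and all function names are assumptions.

```python
def chebyshev_upper_bound(mu, sigma, lo, hi):
    """Upper bound on Pr[X in box] for independent attributes.

    mu, sigma: per-attribute means and standard deviations.
    lo, hi: per-attribute bounds of the axis-aligned predicate box.
    """
    n = len(mu)
    d2 = 0.0
    for m, s, l, h in zip(mu, sigma, lo, hi):
        nearest = min(max(m, l), h)        # closest point of [l, h] to the mean
        d2 += ((nearest - m) / s) ** 2     # squared Mahalanobis distance to the box
    if d2 <= n:
        return 1.0                         # the bound n / d2 would be vacuous
    return n / d2

def can_filter(mu, sigma, lo, hi, lam):
    """True if the tuple can be dropped without computing the integral."""
    return chebyshev_upper_bound(mu, sigma, lo, hi) < lam

# Hypothetical tuple far from the box |u| < 0.2 and |g| < 0.2.
far_tuple_filtered = can_filter([1.0, 1.0], [0.1, 0.1], [-0.2, -0.2], [0.2, 0.2], lam=0.7)
```

With a diagonal covariance the nearest point of the box can be found coordinate by coordinate, so the constrained optimization mentioned above reduces to a clamp per attribute. A tuple whose mean lies inside the box always gets the bound 1 and proceeds to exact integration.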


Reducing Dimensionality of Integration

σθ : n-dim space
• region: Rθ
• distribution: fX(x)
• integral: $\int_{R_\theta} f_X(x)\,dx$

Linear transformation (LT): Y=BX+b

σθ’ : m-dim space
• region: R’θ = {y | y=Bx+b, x ∈ Rθ}
• distribution: fY(y)
• integral: $\int_{R'_\theta} f_Y(y)\,dy$

An algorithm to find a transformation matrix B (m×n, m≤n)
• if m<n, LT helps to reduce dimensionality
• if m=n, LT does not help

Let X ~ N(μ,Σ) be n-dimensional and Y=BX+b with B of size m×n and b of size m×1; then Y ~ N(Bμ+b, BΣBᵀ) is m-dimensional.
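As a worked example of the transformation, the predicate |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05 involves four uncertain attributes but only one linear combination of them, so B = [1, -1, -1, 1] reduces a 4-dimensional integral to a 1-dimensional one. The sketch below applies Y ~ N(Bμ+b, BΣBᵀ) directly; the means and variances are made-up values, and the helper names are hypothetical.

```python
import math

def linear_transform_gaussian(B, b, mu, Sigma):
    """Mean and covariance of Y = B X + b when X ~ N(mu, Sigma)."""
    m, n = len(B), len(mu)
    mean = [sum(B[i][k] * mu[k] for k in range(n)) + b[i] for i in range(m)]
    # B * Sigma, then (B * Sigma) * B^T
    BS = [[sum(B[i][k] * Sigma[k][j] for k in range(n)) for j in range(n)]
          for i in range(m)]
    cov = [[sum(BS[i][k] * B[j][k] for k in range(n)) for j in range(m)]
           for i in range(m)]
    return mean, cov

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# X = (G1.u, G1.g, G2.u, G2.g); independent attributes with variance 0.01 (hypothetical).
B = [[1.0, -1.0, -1.0, 1.0]]
b = [0.0]
mu = [1.0, 0.5, 1.0, 0.5]
Sigma = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]

mean, cov = linear_transform_gaussian(B, b, mu, Sigma)
sd = math.sqrt(cov[0][0])
# Pr[-0.05 < Y < 0.05] is now a one-dimensional Gaussian interval probability.
prob = normal_cdf(0.05, mean[0], sd) - normal_cdf(-0.05, mean[0], sd)
```

Here m=1 and n=4, so the exact probability comes from two CDF evaluations instead of a four-dimensional numerical integration.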




Probabilistic Threshold Joins

A probabilistic threshold join of relations R and S returns the tuple pairs (r,s), with r in R and s in S, whose probability of satisfying the join condition θ exceeds λ.

True match: tuple pair (r,s) such that $\Pr[\theta(r,s)] > \lambda$

Large numbers of intermediate tuples!

Key idea: filtered cross-product using indexes
• For each tuple r, the index returns a subset of S to pair with r
• (r,s) pairs returned by the index include all true matches
• The index tests a necessary condition for a true match
• “Tight” enough: a sufficient and necessary condition if possible


Designing an Index

Build an index on S for a<R.A-S.A<b

Deterministic
• Search key: S.A
• Query region: instantiate with a deterministic value of R.A. E.g. when R.A=5, 5-b<S.A<5-a

Probabilistic
• A distribution instead of a deterministic value!
• Search key: quantities concerning S
• Query region: instantiate with quantities concerning R
• The index tests a necessary condition for a true match
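The deterministic case above can be sketched with a sorted list standing in for the index on S.A: probing with a value of R.A turns the band condition a<R.A-S.A<b into the range R.A-b<S.A<R.A-a. The function names are hypothetical, and a production system would typically use a B-tree rather than a Python list.

```python
from bisect import bisect_left, bisect_right

def build_index(s_values):
    """Index on S.A: here simply the sorted list of attribute values."""
    return sorted(s_values)

def probe(index, r_value, a, b):
    """Return all S.A with a < r_value - S.A < b, i.e. r_value - b < S.A < r_value - a."""
    lo = bisect_right(index, r_value - b)   # first entry strictly greater than r_value - b
    hi = bisect_left(index, r_value - a)    # first entry greater than or equal to r_value - a
    return index[lo:hi]

index = build_index([4, 1, 6, 3, 2, 5])
matches = probe(index, r_value=5, a=1, b=3)   # S.A must lie strictly between 2 and 4
```

In the probabilistic case the probe value is a distribution rather than a number, which is why the search key and query region have to be redefined in terms of distribution parameters.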


Band Join of GMMs (a<R.A-S.A<b)

r.A: Xr, μr, σr²
s.A: Xs, μs, σs²

Z=Xr-Xs follows a GMM with μz=μr-μs and σz²=σr²+σs²

Theorem 1 (necessary condition):

Search key:

Query region:

Overlap test of RQ1 and RI [x1,x2;y1,y2]: RI overlaps with RQ1 if its upper left vertex (x1,y2) is in RQ1

[Figure: query region RQ1 in the (x, y) plane, with boundary label μr-a, and index rectangles R1, R2, R3 … R7]


Band Join of Gaussians (a<R.A-S.A<b)

Z=Xr-Xs

Theorem 2: Given Z~N(μ,σ²), Pr[a<Z<b] > λ iff there exists an … such that … (the condition uses the inverse of the standard normal cdf)

Search key:

Query region:

Gaussian properties offer a sufficient and necessary condition

Overlap test: requires math derivation; can be implemented efficiently

[Figure: query region in the (x’, y’) plane]
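The quantity that the index has to decide can be computed directly for a single pair. Assuming independent Gaussian attributes, Z=Xr-Xs is Gaussian with mean μr-μs and variance σr²+σs², so Pr[a<Z<b] is a difference of two CDF values. The sketch below is the exact per-pair check, not the index of Theorem 2; the names and numbers are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def band_join_prob(mu_r, sd_r, mu_s, sd_s, a, b):
    """Pr[a < Xr - Xs < b] for independent Gaussians Xr and Xs."""
    mu_z = mu_r - mu_s
    sd_z = math.hypot(sd_r, sd_s)          # sqrt(sd_r^2 + sd_s^2)
    return normal_cdf(b, mu_z, sd_z) - normal_cdf(a, mu_z, sd_z)

def is_true_match(mu_r, sd_r, mu_s, sd_s, a, b, lam):
    """A pair is a true match if its probability strictly exceeds the threshold."""
    return band_join_prob(mu_r, sd_r, mu_s, sd_s, a, b) > lam

# Hypothetical pair: Xr ~ N(20, 0.3^2), Xs ~ N(19, 0.4^2), so Z ~ N(1, 0.5^2).
p_wide = band_join_prob(20.0, 0.3, 19.0, 0.4, a=0.0, b=2.0)
p_off = band_join_prob(20.0, 0.3, 19.0, 0.4, a=1.5, b=2.5)
```

Running this check on every (r,s) pair is the full cross-product that the index is designed to avoid.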




Query Planning

Logical operators map to physical operators:
• Fast filters based on inequalities
• Exact selection using integrals (with LT)
• Filtered cross-product using indexes

How to arrange operators to get an efficient plan?


Per-tuple Based Planning

Q1: SELECT * FROM Galaxy WHERE r < 24 AND q_r^2+u_r^2 > 0.25
θ1: r < 24; θ2: q_r^2+u_r^2 > 0.25

Tuple attributes
id   r            q_r         u_r
1    N(27, 2.2)   N(1, 2.2)   N(0.1, 1.1)
2    N(21, 0.1)   N(0, 0.1)   N(-0.1, 0.1)

Consider both selectivity and cost like the traditional planner. Differences:
• Exact selections are expensive due to the use of integrals
• Selectivity should be defined on a per-tuple basis
=> The optimal order varies on a per-tuple basis

Predicate selectivities
      θ1     θ2
t1    0.08   0.95
t2    1      0.0002

Optimal plan for t1: θ1, then θ2
Optimal plan for t2: θ2, then θ1
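The per-tuple selectivity of θ1 (r < 24) can be reproduced with a Gaussian CDF. The sketch below assumes that the second parameter of N(·,·) on the slide is a standard deviation, which yields roughly 0.09 for tuple 1 and 1 for tuple 2; the θ2 selectivities (0.95 and 0.0002) are copied from the slide as read from the flattened table, both predicates are treated as equally costly, and the function names are hypothetical.

```python
from statistics import NormalDist

def selectivity_less_than(mean, std_dev, bound):
    """Per-tuple selectivity of `attr < bound` for a Gaussian attribute."""
    return NormalDist(mean, std_dev).cdf(bound)

def order_by_selectivity(selectivities):
    """Predicate names, most selective (lowest pass probability) first."""
    return sorted(selectivities, key=selectivities.get)

# theta1 is computed; theta2 values are taken from the slide (assumed reading).
t1 = {"theta1": selectivity_less_than(27, 2.2, 24), "theta2": 0.95}
t2 = {"theta1": selectivity_less_than(21, 0.1, 24), "theta2": 0.0002}

plan_t1 = order_by_selectivity(t1)   # theta1 first for tuple 1
plan_t2 = order_by_selectivity(t2)   # theta2 first for tuple 2
```

The two tuples get opposite predicate orders even though they belong to the same query, which is the motivation for planning per tuple rather than per query.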


Tuple-based Query Planning and Execution

Tuple t1 from R needs to go through three selection predicates (σθ1, σθ2, σθ3) and five join predicates (θ4, θ5, θ6, θ7, θ8)

To-process tuple pool: t1, t2, t3, t4

Predicates on R   σθ1    σθ2    σθ3
Est. cost         100    300    10^4
Selectivity       0.8    0.2    0.1
Rank              2      1      3

Join R with S or T
Predicate     θ4     θ5     θ6     θ7      θ8
Est. cost     500    300    100    10^4    50
Has index     Y      Y      N      Y       N
#candidates, Choose: 10 4 105 1021 ✓

Step 1: Estimate selectivities and rank selection predicates
Step 2: Execute filters first, then exact selections
Step 3: Choose a relation to join with
Step 4: Execute the (filtered) cross-product

selection: θ4 θ5 θ6; join: θ7 θ8
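Step 1 can be illustrated with the classic rank metric for ordering expensive predicates, (selectivity - 1) / cost, evaluated in ascending order. The slides do not state which rank formula is used, so this metric is an assumption; it does reproduce the ranks 2, 1, 3 when the flattened table is read as costs 100, 300, 10^4 and selectivities 0.8, 0.2, 0.1. The function names are hypothetical.

```python
def rank_key(selectivity, cost):
    """Classic ordering metric: more negative means run earlier."""
    return (selectivity - 1.0) / cost

def order_predicates(preds):
    """preds maps a predicate name to (selectivity, estimated cost)."""
    return sorted(preds, key=lambda name: rank_key(*preds[name]))

# Values for tuple t1 as read from the slide (assumed reading of the table).
preds = {
    "theta1": (0.8, 100.0),
    "theta2": (0.2, 300.0),
    "theta3": (0.1, 1e4),
}
order = order_predicates(preds)
```

Under this metric θ3 runs last despite being the most selective predicate, because its estimated cost is two orders of magnitude higher than the others.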


Evaluation using Data and Queries from SDSS


Fast Filters for Selections

General filter vs. exact integration

SELECT * FROM Galaxy WHERE 100<rowc<100+δ AND 100<colc<100+δ (λ=0.7)

Data characteristics
• Gaussians (from SDSS)
Parameters
• δ: predicate range
• λ: probability threshold
Metrics
• Time per tuple

• Without filters, constant high cost for all ranges tested
• With filters, per-tuple cost is very low for small predicate ranges
• More improvement for larger λ values tested


Indexes for Band Joins (stream)

SELECT * FROM Galaxy AS R, Galaxy AS S WHERE |R.u-S.u|<δ (λ=0.7, W=500)

Xbound join index [R. Cheng et al. VLDB 2004 & CIKM 2006]
• Given a distribution f and [l,u], store x% quantiles from both ends
• A loose necessary condition for true matches

[Plots: xbound vs GaussJoin in efficiency; xbound vs GaussJoin in filtering power]

• Our index for Gaussians returns exactly the true match set
• Xbound returns more candidates
• Our index outperforms xbound in efficiency significantly


Tuple Based Planning and Execution

Q1:
    SELECT *
    FROM Galaxy AS G
    WHERE G.r < δ1
      AND G.q_r^2 + G.u_r^2 > δ2^2
θ1: G.r < δ1; θ2: G.q_r^2 + G.u_r^2 > δ2^2

Optimal query planning
• Generate the best plan for each tuple offline and load it into memory before execution

Static query planning [Y. Qi et al. SIGMOD 2010]
• A fixed plan for each query based on the selectivities of predicates over the entire data set

Dynamic query planning
• Rank predicates for each tuple

δ1   δ2    static order   static time (ms)   dynamic time (ms)   performance gain   optimal time (ms)
20   0.2   [1 2]          0.6                0.181               70%                0.177
20   0.5   [1 2]          0.6                0.068               89%                0.067
20   1     [2 1]          9.6                0.050               99%                0.048
22   0.2   [2 1]          18.2               7.216               60%                7.007
22   0.5   [2 1]          13.9               1.515               89%                1.482
22   1     [2 1]          9.6                0.351               96%                0.348
24   0.2   [2 1]          18.2               15.613              14%                15.287
24   0.5   [2 1]          14.4               6.390               56%                6.334
24   1     [2 1]          9.6                2.264               76%                2.236

• Over 50% gains in most cases
• Very close to the optimal in all cases
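The performance-gain column is consistent with the relative time saved by dynamic planning over static planning, (static - dynamic) / static. The slides do not state this formula, so it is an inference from the reported numbers; the short check below recomputes three of the rows.

```python
def performance_gain(static_ms, dynamic_ms):
    """Fraction of the static plan's per-tuple time saved by the dynamic plan."""
    return (static_ms - dynamic_ms) / static_ms

# (delta1, delta2, static time in ms, dynamic time in ms), copied from the table.
rows = [
    (20, 0.2, 0.6, 0.181),
    (22, 0.2, 18.2, 7.216),
    (24, 0.2, 18.2, 15.613),
]
gains_percent = [round(100 * performance_gain(s, d)) for _, _, s, d in rows]
```

The gain shrinks as δ1 grows for a fixed δ2, dropping from 70% at δ1=20 to 14% at δ1=24 in these rows.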


Conclusions

Optimize probabilistic threshold selections
• Fast filters to avoid integrals
• Reducing dimensionality of integration by linear transformation

Optimize probabilistic threshold joins
• Filtered cross-product using new indexes

Dynamic, per-tuple based planning

Evaluation
• Significant performance gains over the state-of-the-art indexing technique and query optimizer

Future work
• Extend to a larger class of queries including group-by aggregates
• Support user-defined functions
• Query optimization with correlated tuples


Thank you!

Q & A

Optimizing Probabilistic Query Processing on Continuous Uncertain Data

Liping Peng, Yanlei Diao, Anna Liu

http://claro.cs.umass.edu/
